Transforming Data¶

Example Data¶

```from datatable import dt, f, update, ifelse from datetime import date source = {"dates" : [date(2000, 1, 5), date(2010, 11, 23), date(2020, 2, 29), None], "integers" : range(1, 5), "floats" : [10.0, 11.5, 12.3, -13], "strings" : ['A', 'B', None, 'D'] } DT = dt.Frame(source) DT ```
datesintegersfloatsstrings
date32int32float64str32
02000-01-05110A
12010-11-23211.5B
22020-02-29312.3NA
3NA4-13D

Column Transformation¶

Operations on columns occur in the j section of `DT[i, j, by/sort/...]`, and involve the use of f-expressions:

```DT[:, f.integers * 2] ```
C0
int32
02
14
26
38
```DT[:, 'pre_' + f.strings] ```
C0
str32
0pre_A
1pre_B
2NA
3pre_D

Operation between columns is also possible:

```DT[:, f.integers + f.floats] ```
C0
float64
011
113.5
215.3
3-9
```DT[:, f.integers +'_' + f.strings] ```
C0
str32
01_A
12_B
2NA
34_D

Note

Operations between columns create new columns, which names are autogenerated, unless otherwise specified.

Various datatable functions can be applied to the columns:

```DT[:, dt.math.pow(f.floats, 3)] ```
C0
float64
01000
11520.88
21860.87
3-2197
```DT[:, dt.time.year(f.dates)] ```
dates
int32
02000
12010
22020
3NA

Most of the datatable functions operate column-wise:

```DT[:, dt.sum(f['integers':'floats'])] ```
integersfloats
int64float64
01020.8

while the ones with the row prefix operate row-wise:

```DT[:, dt.rowsum(f['integers':'floats'])] ```
C0
float64
011
113.5
215.3
3-9

Transformation of a column based on a condition is possible via `ifelse()`, which operates similarly to Python’s if-else idiom:

```DT[:, ifelse(f.integers % 2 == 0, 'even', 'odd')] ```
C0
str32
0odd
1even
2odd
3even

Transforming a column via a boolean condition in the i section is possible either; note that this could result in a reduced number of rows:

```DT[f.dates < dt.time.ymd(2020,1,1), f.integers ** 4] ```
C0
float64
01
116

Column type can be changed with the `as_type()` function

• Single column:

```DT[:, dt.as_type(f.integers, str)] ```
integers
str32
01
12
23
34
```DT[:, dt.as_type(f.integers, dt.Type.str32)] ```
integers
str32
01
12
23
34
• Multiple columns:

```DT[:, dt.as_type(f['integers', 'floats'], str)] ```
integersfloats
str32str32
0110.0
1211.5
2312.3
34-13.0

Values across multiple columns can be replaced with the `dt.Frame.replace()` method; this works on the entire frame, and is an in-place operation.

The `dt.Frame.replace()` syntax is `frame.replace(replace_what, replace_with)`, where replace_what values in the frame are replaced with replaced_with:

```DT.replace(1, -1) DT ```
datesintegersfloatsstrings
date32int32float64str32
02000-01-05-110A
12010-11-23211.5B
22020-02-29312.3NA
3NA4-13D

For multiple values, a list or a dictionary can be used.

If a list is used, the number of entries in the replace_what list must match the number of entries in the replace_with list, or the number of entries in replace_with must be exactly 1:

```DT.replace(['A', 10.0], ['A_pre', 30.0]) DT ```
datesintegersfloatsstrings
date32int32float64str32
02000-01-05-130A_pre
12010-11-23211.5B
22020-02-29312.3NA
3NA4-13D
```DT.replace([2, 3], 20) DT ```
datesintegersfloatsstrings
date32int32float64str32
02000-01-05-130A_pre
12010-11-232011.5B
22020-02-292012.3NA
3NA4-13D

If a dictionary is used, the replace_what values serve as the keys, while the replace_with values are the values in the dictionary:

```DT.replace({4: 24, 'B': 'BBB'}) DT ```
datesintegersfloatsstrings
date32int32float64str32
02000-01-05-130A_pre
12010-11-23211.5BBB
22020-02-29312.3NA
3NA24-13D

The replace_what values, that are not found in the frame, are ignored:

```DT.replace({355:26}) DT ```
datesintegersfloatsstrings
date32int32float64str32
02000-01-05-130A_pre
12010-11-23211.5BBB
22020-02-29312.3NA
3NA24-13D

Have a look at `dt.Frame.replace()` for more options, especially when replacing null values.

Iteration on a Frame¶

Iterating through a `Frame` allows access to the individual columns; in this case, each column gets converted to one-column frame:

```print(*[col for col in DT], sep="\n") ```
dates
date32
02000-01-05
12010-11-23
22020-02-29
3NA
integers
int32
0-1
120
220
324
floats
float64
030
111.5
212.3
3-13
strings
str32
0A_pre
1BBB
2NA
3D

With iteration, different operations can be applied to different columns:

```outcome = [frame.mean() if frame.type.is_numeric else frame[0, :] for frame in DT] outcome[0] ```
dates
date32
02000-01-05
```outcome[1] ```
integers
float64
015.75
```outcome[2] ```
floats
float64
010.2
```outcome[3] ```
strings
str32
0A_pre
```dt.cbind(outcome) ```
datesintegersfloatsstrings
date32float64float64str32
02000-01-0515.7510.2A_pre

Sorting a Frame¶

A frame can be sorted via the `sort()` function, or the `dt.Frame.sort()` method:

```DT[:, :, dt.sort('dates')] ```
datesintegersfloatsstrings
date32int32float64str32
0NA24-13D
12000-01-05-130A_pre
22010-11-232011.5BBB
32020-02-292012.3NA
```DT.sort('dates') ```
datesintegersfloatsstrings
date32int32float64str32
0NA24-13D
12000-01-05-130A_pre
22010-11-232011.5BBB
32020-02-292012.3NA

Sorting is also possible via f-expressions:

``` DT[:, :, dt.sort(f.floats)] ```
datesintegersfloatsstrings
date32int32float64str32
0NA24-13D
12010-11-232011.5BBB
22020-02-292012.3NA
32000-01-05-130A_pre
```DT.sort(f.strings) ```
datesintegersfloatsstrings
date32int32float64str32
02020-02-292012.3NA
12000-01-05-130A_pre
22010-11-232011.5BBB
3NA24-13D

The default sorting order is ascending; if there are any nulls in the sorting columns, they go to the top.

The sorting order and the position of nulls can be changed in a number of ways:

• Sorting can be done in descending order via the reverse parameter:

```DT[:, :, dt.sort('integers', reverse = True)] ```
datesintegersfloatsstrings
date32int32float64str32
0NA24-13D
12010-11-232011.5BBB
22020-02-292012.3NA
32000-01-05-130A_pre

Note

The `reverse` parameter is available only in the `sort()` function

• Sorting in descending order is also possible by negating the f-expressions within the `sort()` function, or the `dt.Frame.sort()` method:

```DT[:, :, dt.sort(-f.integers)] ```
datesintegersfloatsstrings
date32int32float64str32
0NA24-13D
12010-11-232011.5BBB
22020-02-292012.3NA
32000-01-05-130A_pre
```DT.sort(-f.integers) ```
datesintegersfloatsstrings
date32int32float64str32
0NA24-13D
12010-11-232011.5BBB
22020-02-292012.3NA
32000-01-05-130A_pre
• The position of null values within the sorting column can be controlled with the `na_position` parameter:

```DT[:, :, dt.sort('dates', na_position = 'last')] ```
datesintegersfloatsstrings
date32int32float64str32
02000-01-05-130A_pre
12010-11-232011.5BBB
22020-02-292012.3NA
3NA24-13D
• Rows with null values can be removed by passing remove to the `na_position` parameter:

```# only the row where date is null is removed DT[:, :, dt.sort('dates', na_position='remove')] ```
datesintegersfloatsstrings
date32int32float64str32
02000-01-05-130A_pre
12010-11-232011.5BBB
22020-02-292012.3NA

Note

The na_position parameter is available only in the `sort()` function.

Note

The default value for `na_position` is first.

Sorting is possible on multiple columns:

```DT[:, :, dt.sort('dates', 'integers')] ```
datesintegersfloatsstrings
date32int32float64str32
0NA24-13D
12000-01-05-130A_pre
22010-11-232011.5BBB
32020-02-292012.3NA
```DT.sort('dates', 'integers') ```
datesintegersfloatsstrings
date32int32float64str32
0NA24-13D
12000-01-05-130A_pre
22010-11-232011.5BBB
32020-02-292012.3NA
```DT[:, :, dt.sort(-f.integers, f.dates)] ```
datesintegersfloatsstrings
date32int32float64str32
0NA24-13D
12010-11-232011.5BBB
22020-02-292012.3NA
32000-01-05-130A_pre
```DT.sort(-f.integers, f.dates) ```
datesintegersfloatsstrings
date32int32float64str32
0NA24-13D
12010-11-232011.5BBB
22020-02-292012.3NA
32000-01-05-130A_pre

Column Assignment¶

Transformed columns can be assigned to new columns, or replace existing columns via direct assignment and the `update()` function. A third option, extend, applies only when creating new columns.

Direct Assignment¶

• Single column:

```DT['months'] = DT[:, dt.time.month(f.dates)] DT ```
datesintegersfloatsstringsmonths
date32int32float64str32int32
02000-01-05-130A_pre1
12010-11-232011.5BBB11
22020-02-292012.3NA2
3NA24-13DNA
• Multiple columns:

```DT[:, ['months', 'int_squared']] = DT[:, [dt.time.month(f.dates), f.integers**2]] DT ```
datesintegersfloatsstringsmonthsint_squared
date32int32float64str32int32float64
02000-01-05-130A_pre11
12010-11-232011.5BBB11400
22020-02-292012.3NA2400
3NA24-13DNA576
• Update existing column:

```DT['strings'] = DT[:, f.strings[:1]] DT ```
datesintegersfloatsstringsmonthsint_squared
date32int32float64str32int32float64
02000-01-05-130A11
12010-11-232011.5B11400
22020-02-292012.3NA2400
3NA24-13DNA576
• You can assign an f-expressions to create a new column or update an existing column:

```DT['integers'] = f.integers * 3 DT ```
datesintegersfloatsstringsmonthsint_squared
date32int32float64str32int32float64
02000-01-05-330A11
12010-11-236011.5B11400
22020-02-296012.3NA2400
3NA72-13DNA576

`update()`¶

`update()` is an in-place operation, and as such, a direct assignment is not required.

`update()` uses keyword arguments, where the key in the dictionary is the name of the new column, while the value is the f-expressions to be computed.

• Single column:

```DT[:, update(year = dt.time.year(f.dates))] DT ```
datesintegersfloatsstringsmonthsint_squaredyear
date32int32float64str32int32float64int32
02000-01-05-330A112000
12010-11-236011.5B114002010
22020-02-296012.3NA24002020
3NA72-13DNA576NA
• Multiple columns:

```DT[:, update(year = dt.time.year(f.dates), float_doubled = f.floats * 2)] DT ```
datesintegersfloatsstringsmonthsint_squaredyearfloat_doubled
date32int32float64str32int32float64int32float64
02000-01-05-330A11200060
12010-11-236011.5B11400201023
22020-02-296012.3NA2400202024.6
3NA72-13DNA576NA-26
• Update existing column:

```DT[:, update(year = f.year / 12)] DT ```
datesintegersfloatsstringsmonthsint_squaredyearfloat_doubled
date32int32float64str32int32float64float64float64
02000-01-05-330A11166.66760
12010-11-236011.5B11400167.523
22020-02-296012.3NA2400168.33324.6
3NA72-13DNA576NA-26

Extend¶

The extend method works via f-expressions to create new columns; it does not update existing columns.

The extend method uses a dictionary to create the new columns, where the key in the dictionary is the name of the new column, while the value is the f-expressions to be computed.

First, let’s trim the `DT` frame:

```DT = DT[:, :4] DT ```
datesintegersfloatsstrings
date32int32float64str32
02000-01-05-310A
12010-11-236011.5B
22020-02-296012.3NA
3NA72-13D
• Add one column:

```DT = DT[:, f[:].extend({"months" : dt.time.month(f.dates)})] DT ```
datesintegersfloatsstringsmonths
date32int32float64str32int32
02000-01-05-330A1
12010-11-236011.5B11
22020-02-296012.3NA2
3NA72-13DNA
• Add multiple columns:

```DT = DT[:, f[:].extend({"year" : dt.time.year(f.dates), "int_squared" : f.integers ** 2})] DT ```
datesintegersfloatsstringsmonthsyearint_squared
date32int32float64str32int32int32float64
02000-01-05-330A120009
12010-11-236011.5B1120103600
22020-02-296012.3NA220203600
3NA72-13DNANA5184

Note

In contrast to `update()`, the result has to be assigned to DT.