Transforming Data¶

Example Data¶

from datatable import dt, f, update, ifelse
from datetime import date

source = {"dates" : [date(2000, 1, 5), date(2010, 11, 23), date(2020, 2, 29), None],
          "integers" : range(1, 5),
          "floats" : [10.0, 11.5, 12.3, -13],
          "strings" : ['A', 'B', None, 'D']
          }
DT = dt.Frame(source)
DT
datesintegersfloatsstrings
date32int32float64str32
02000-01-05110A
12010-11-23211.5B
22020-02-29312.3NA
3NA4-13D
4 rows × 4 columns

Column Transformation¶

Operations on columns occur in the j section of DT[i, j, by/sort/...], and involve the use of f-expressions:

DT[:, f.integers * 2]
C0
int32
02
14
26
38
4 rows × 1 column
DT[:, 'pre_' + f.strings]
C0
str32
0pre_A
1pre_B
2NA
3pre_D
4 rows × 1 column

Operation between columns is also possible:

DT[:, f.integers + f.floats]
C0
float64
011
113.5
215.3
3-9
4 rows × 1 column
DT[:, f.integers +'_' + f.strings]
C0
str32
01_A
12_B
2NA
34_D
4 rows × 1 column

Note

Operations between columns create new columns, which names are autogenerated, unless otherwise specified.

Various datatable functions can be applied to the columns:

DT[:, dt.math.pow(f.floats, 3)]
C0
float64
01000
11520.88
21860.87
3-2197
4 rows × 1 column
DT[:, dt.time.year(f.dates)]
dates
int32
02000
12010
22020
3NA
4 rows × 1 column

Most of the datatable functions operate column-wise:

DT[:, dt.sum(f['integers':'floats'])]
integersfloats
int64float64
01020.8
1 row × 2 columns

while the ones with the row prefix operate row-wise:

DT[:, dt.rowsum(f['integers':'floats'])]
C0
float64
011
113.5
215.3
3-9
4 rows × 1 column

Transformation of a column based on a condition is possible via ifelse(), which operates similarly to Python’s if-else idiom:

DT[:, ifelse(f.integers % 2 == 0, 'even', 'odd')]
C0
str32
0odd
1even
2odd
3even
4 rows × 1 column

Transforming a column via a boolean condition in the i section is possible either; note that this could result in a reduced number of rows:

DT[f.dates < dt.time.ymd(2020,1,1), f.integers ** 4]
C0
float64
01
116
2 rows × 1 column

Column type can be changed with the as_type() function

Single column:

DT[:, dt.as_type(f.integers, str)]
integers
str32
0 1
1 2
2 3
3 4
4 rows × 1 column

DT[:, dt.as_type(f.integers, dt.Type.str32)]
integers
str32
0 1
1 2
2 3
3 4
4 rows × 1 column

Multiple columns:

DT[:, dt.as_type(f['integers', 'floats'], str)]
integers floats
str32 str32
0 1 10.0
1 2 11.5
2 3 12.3
3 4 -13.0
4 rows × 2 columns

Values across multiple columns can be replaced with the dt.Frame.replace() method; this works on the entire frame, and is an in-place operation.

The dt.Frame.replace() syntax is frame.replace(replace_what, replace_with), where replace_what values in the frame are replaced with replaced_with:

DT.replace(1, -1)
DT
datesintegersfloatsstrings
date32int32float64str32
02000-01-05-110A
12010-11-23211.5B
22020-02-29312.3NA
3NA4-13D
4 rows × 4 columns

For multiple values, a list or a dictionary can be used.

If a list is used, the number of entries in the replace_what list must match the number of entries in the replace_with list, or the number of entries in replace_with must be exactly 1:

DT.replace(['A', 10.0], ['A_pre', 30.0])
DT
datesintegersfloatsstrings
date32int32float64str32
02000-01-05-130A_pre
12010-11-23211.5B
22020-02-29312.3NA
3NA4-13D
4 rows × 4 columns
DT.replace([2, 3], 20)
DT
datesintegersfloatsstrings
date32int32float64str32
02000-01-05-130A_pre
12010-11-232011.5B
22020-02-292012.3NA
3NA4-13D
4 rows × 4 columns

If a dictionary is used, the replace_what values serve as the keys, while the replace_with values are the values in the dictionary:

DT.replace({4: 24, 'B': 'BBB'})
DT
datesintegersfloatsstrings
date32int32float64str32
02000-01-05-130A_pre
12010-11-23211.5BBB
22020-02-29312.3NA
3NA24-13D
4 rows × 4 columns

The replace_what values, that are not found in the frame, are ignored:

DT.replace({355:26})
DT
datesintegersfloatsstrings
date32int32float64str32
02000-01-05-130A_pre
12010-11-23211.5BBB
22020-02-29312.3NA
3NA24-13D
4 rows × 4 columns

Have a look at dt.Frame.replace() for more options, especially when replacing null values.

Iteration on a Frame¶

Iterating through a Frame allows access to the individual columns; in this case, each column gets converted to one-column frame:

print(*[col for col in DT], sep="\n")
dates
date32
02000-01-05
12010-11-23
22020-02-29
3NA
4 rows × 1 column
integers
int32
0-1
120
220
324
4 rows × 1 column
floats
float64
030
111.5
212.3
3-13
4 rows × 1 column
strings
str32
0A_pre
1BBB
2NA
3D
4 rows × 1 column

With iteration, different operations can be applied to different columns:

outcome = [frame.mean() if frame.type.is_numeric else frame[0, :] for frame in DT]
outcome[0]
dates
date32
02000-01-05
1 row × 1 column
outcome[1]
integers
float64
015.75
1 row × 1 column
outcome[2]
floats
float64
010.2
1 row × 1 column
outcome[3]
strings
str32
0A_pre
1 row × 1 column
dt.cbind(outcome)
datesintegersfloatsstrings
date32float64float64str32
02000-01-0515.7510.2A_pre
1 row × 4 columns

Sorting a Frame¶

A frame can be sorted via the sort() function, or the dt.Frame.sort() method:

DT[:, :, dt.sort('dates')]
datesintegersfloatsstrings
date32int32float64str32
0NA24-13D
12000-01-05-130A_pre
22010-11-232011.5BBB
32020-02-292012.3NA
4 rows × 4 columns
DT.sort('dates')
datesintegersfloatsstrings
date32int32float64str32
0NA24-13D
12000-01-05-130A_pre
22010-11-232011.5BBB
32020-02-292012.3NA
4 rows × 4 columns

Sorting is also possible via f-expressions:

 DT[:, :, dt.sort(f.floats)]
datesintegersfloatsstrings
date32int32float64str32
0NA24-13D
12010-11-232011.5BBB
22020-02-292012.3NA
32000-01-05-130A_pre
4 rows × 4 columns
DT.sort(f.strings)
datesintegersfloatsstrings
date32int32float64str32
02020-02-292012.3NA
12000-01-05-130A_pre
22010-11-232011.5BBB
3NA24-13D
4 rows × 4 columns

The default sorting order is ascending; if there are any nulls in the sorting columns, they go to the top.

The sorting order and the position of nulls can be changed in a number of ways:

Sorting can be done in descending order via the reverse parameter:

DT[:, :, dt.sort('integers', reverse = True)]
dates integers floats strings
date32 int32 float64 str32
0 NA 24 -13 D
1 2010-11-23 20 11.5 BBB
2 2020-02-29 20 12.3 NA
3 2000-01-05 -1 30 A_pre
4 rows × 4 columns

Note

The reverse parameter is available only in the sort() function

Sorting in descending order is also possible by negating the f-expressions within the sort() function, or the dt.Frame.sort() method:

DT[:, :, dt.sort(-f.integers)]
dates integers floats strings
date32 int32 float64 str32
0 NA 24 -13 D
1 2010-11-23 20 11.5 BBB
2 2020-02-29 20 12.3 NA
3 2000-01-05 -1 30 A_pre
4 rows × 4 columns

DT.sort(-f.integers)
dates integers floats strings
date32 int32 float64 str32
0 NA 24 -13 D
1 2010-11-23 20 11.5 BBB
2 2020-02-29 20 12.3 NA
3 2000-01-05 -1 30 A_pre
4 rows × 4 columns
The position of null values within the sorting column can be controlled with the na_position parameter:

DT[:, :, dt.sort('dates', na_position = 'last')]
dates integers floats strings
date32 int32 float64 str32
0 2000-01-05 -1 30 A_pre
1 2010-11-23 20 11.5 BBB
2 2020-02-29 20 12.3 NA
3 NA 24 -13 D
4 rows × 4 columns
Rows with null values can be removed by passing remove to the na_position parameter:

# only the row where date is null is removed DT[:, :, dt.sort('dates', na_position='remove')]
dates integers floats strings
date32 int32 float64 str32
0 2000-01-05 -1 30 A_pre
1 2010-11-23 20 11.5 BBB
2 2020-02-29 20 12.3 NA
3 rows × 4 columns

Note

The na_position parameter is available only in the sort() function.

Note

The default value for na_position is first.

Sorting is possible on multiple columns:

DT[:, :, dt.sort('dates', 'integers')]
datesintegersfloatsstrings
date32int32float64str32
0NA24-13D
12000-01-05-130A_pre
22010-11-232011.5BBB
32020-02-292012.3NA
4 rows × 4 columns
DT.sort('dates', 'integers')
datesintegersfloatsstrings
date32int32float64str32
0NA24-13D
12000-01-05-130A_pre
22010-11-232011.5BBB
32020-02-292012.3NA
4 rows × 4 columns
DT[:, :, dt.sort(-f.integers, f.dates)]
datesintegersfloatsstrings
date32int32float64str32
0NA24-13D
12010-11-232011.5BBB
22020-02-292012.3NA
32000-01-05-130A_pre
4 rows × 4 columns
DT.sort(-f.integers, f.dates)
datesintegersfloatsstrings
date32int32float64str32
0NA24-13D
12010-11-232011.5BBB
22020-02-292012.3NA
32000-01-05-130A_pre
4 rows × 4 columns

Column Assignment¶

Transformed columns can be assigned to new columns, or replace existing columns via direct assignment and the update() function. A third option, extend, applies only when creating new columns.

Direct Assignment¶

Single column:

DT['months'] = DT[:, dt.time.month(f.dates)] DT
dates integers floats strings months
date32 int32 float64 str32 int32
0 2000-01-05 -1 30 A_pre 1
1 2010-11-23 20 11.5 BBB 11
2 2020-02-29 20 12.3 NA 2
3 NA 24 -13 D NA
4 rows × 5 columns
Multiple columns:

DT[:, ['months', 'int_squared']] = DT[:, [dt.time.month(f.dates), f.integers**2]] DT
dates integers floats strings months int_squared
date32 int32 float64 str32 int32 float64
0 2000-01-05 -1 30 A_pre 1 1
1 2010-11-23 20 11.5 BBB 11 400
2 2020-02-29 20 12.3 NA 2 400
3 NA 24 -13 D NA 576
4 rows × 6 columns
Update existing column:

DT['strings'] = DT[:, f.strings[:1]] DT
dates integers floats strings months int_squared
date32 int32 float64 str32 int32 float64
0 2000-01-05 -1 30 A 1 1
1 2010-11-23 20 11.5 B 11 400
2 2020-02-29 20 12.3 NA 2 400
3 NA 24 -13 D NA 576
4 rows × 6 columns
You can assign an f-expressions to create a new column or update an existing column:

DT['integers'] = f.integers * 3 DT
dates integers floats strings months int_squared
date32 int32 float64 str32 int32 float64
0 2000-01-05 -3 30 A 1 1
1 2010-11-23 60 11.5 B 11 400
2 2020-02-29 60 12.3 NA 2 400
3 NA 72 -13 D NA 576
4 rows × 6 columns

`update()`¶

update() is an in-place operation, and as such, a direct assignment is not required.

update() uses keyword arguments, where the key in the dictionary is the name of the new column, while the value is the f-expressions to be computed.

Single column:

DT[:, update(year = dt.time.year(f.dates))] DT
dates integers floats strings months int_squared year
date32 int32 float64 str32 int32 float64 int32
0 2000-01-05 -3 30 A 1 1 2000
1 2010-11-23 60 11.5 B 11 400 2010
2 2020-02-29 60 12.3 NA 2 400 2020
3 NA 72 -13 D NA 576 NA
4 rows × 7 columns

Multiple columns:

DT[:, update(year = dt.time.year(f.dates),
             float_doubled = f.floats * 2)]
DT
datesintegersfloatsstringsmonthsint_squaredyearfloat_doubled
date32int32float64str32int32float64int32float64
02000-01-05-330A11200060
12010-11-236011.5B11400201023
22020-02-296012.3NA2400202024.6
3NA72-13DNA576NA-26
4 rows × 8 columns

Update existing column:

DT[:, update(year = f.year / 12)]
DT
datesintegersfloatsstringsmonthsint_squaredyearfloat_doubled
date32int32float64str32int32float64float64float64
02000-01-05-330A11166.66760
12010-11-236011.5B11400167.523
22020-02-296012.3NA2400168.33324.6
3NA72-13DNA576NA-26
4 rows × 8 columns

Extend¶

The extend method works via f-expressions to create new columns; it does not update existing columns.

The extend method uses a dictionary to create the new columns, where the key in the dictionary is the name of the new column, while the value is the f-expressions to be computed.

First, let’s trim the DT frame:

DT = DT[:, :4]
DT
datesintegersfloatsstrings
date32int32float64str32
02000-01-05-310A
12010-11-236011.5B
22020-02-296012.3NA
3NA72-13D
4 rows × 4 columns

Add one column:

DT = DT[:, f[:].extend({"months" : dt.time.month(f.dates)})] DT
dates integers floats strings months
date32 int32 float64 str32 int32
0 2000-01-05 -3 30 A 1
1 2010-11-23 60 11.5 B 11
2 2020-02-29 60 12.3 NA 2
3 NA 72 -13 D NA
4 rows × 5 columns
Add multiple columns:

DT = DT[:, f[:].extend({"year" : dt.time.year(f.dates), "int_squared" : f.integers ** 2})] DT
dates integers floats strings months year int_squared
date32 int32 float64 str32 int32 int32 float64
0 2000-01-05 -3 30 A 1 2000 9
1 2010-11-23 60 11.5 B 11 2010 3600
2 2020-02-29 60 12.3 NA 2 2020 3600
3 NA 72 -13 D NA NA 5184
4 rows × 7 columns

Note

In contrast to update(), the result has to be assigned to DT.

	dates	integers	floats	strings
	date32	int32	float64	str32
0	2000-01-05	1	10	A
1	2010-11-23	2	11.5	B
2	2020-02-29	3	12.3	NA
3	NA	4	-13	D

	C0
	str32
0	pre_A
1	pre_B
2	NA
3	pre_D

	C0
	float64
0	11
1	13.5
2	15.3
3	-9

	C0
	str32
0	1_A
1	2_B
2	NA
3	4_D

	C0
	float64
0	1000
1	1520.88
2	1860.87
3	-2197

Transforming Data¶

Example Data¶

Column Transformation¶

Iteration on a Frame¶

Sorting a Frame¶

Column Assignment¶

Direct Assignment¶

update()¶

Extend¶

`update()`¶