Transforming Data¶
Example Data¶
from datatable import dt, f, update, ifelse
from datetime import date
source = {"dates" : [date(2000, 1, 5), date(2010, 11, 23), date(2020, 2, 29), None],
"integers" : range(1, 5),
"floats" : [10.0, 11.5, 12.3, -13],
"strings" : ['A', 'B', None, 'D']
}
DT = dt.Frame(source)
DT
| dates | integers | floats | strings | ||
|---|---|---|---|---|---|
| date32 | int32 | float64 | str32 | ||
| 0 | 2000-01-05 | 1 | 10 | A | |
| 1 | 2010-11-23 | 2 | 11.5 | B | |
| 2 | 2020-02-29 | 3 | 12.3 | NA | |
| 3 | NA | 4 | -13 | D |
Column Transformation¶
Operations on columns occur in the j section of DT[i, j, by/sort/...], and involve the use of f-expressions:
DT[:, f.integers * 2]
| C0 | ||
|---|---|---|
| int32 | ||
| 0 | 2 | |
| 1 | 4 | |
| 2 | 6 | |
| 3 | 8 |
DT[:, 'pre_' + f.strings]
| C0 | ||
|---|---|---|
| str32 | ||
| 0 | pre_A | |
| 1 | pre_B | |
| 2 | NA | |
| 3 | pre_D |
Operation between columns is also possible:
DT[:, f.integers + f.floats]
| C0 | ||
|---|---|---|
| float64 | ||
| 0 | 11 | |
| 1 | 13.5 | |
| 2 | 15.3 | |
| 3 | -9 |
DT[:, f.integers +'_' + f.strings]
| C0 | ||
|---|---|---|
| str32 | ||
| 0 | 1_A | |
| 1 | 2_B | |
| 2 | NA | |
| 3 | 4_D |
Note
Operations between columns create new columns, which names are autogenerated, unless otherwise specified.
Various datatable functions can be applied to the columns:
DT[:, dt.math.pow(f.floats, 3)]
| C0 | ||
|---|---|---|
| float64 | ||
| 0 | 1000 | |
| 1 | 1520.88 | |
| 2 | 1860.87 | |
| 3 | -2197 |
DT[:, dt.time.year(f.dates)]
| dates | ||
|---|---|---|
| int32 | ||
| 0 | 2000 | |
| 1 | 2010 | |
| 2 | 2020 | |
| 3 | NA |
Most of the datatable functions operate column-wise:
DT[:, dt.sum(f['integers':'floats'])]
| integers | floats | ||
|---|---|---|---|
| int64 | float64 | ||
| 0 | 10 | 20.8 |
while the ones with the row prefix operate row-wise:
DT[:, dt.rowsum(f['integers':'floats'])]
| C0 | ||
|---|---|---|
| float64 | ||
| 0 | 11 | |
| 1 | 13.5 | |
| 2 | 15.3 | |
| 3 | -9 |
Transformation of a column based on a condition is possible via ifelse(), which operates similarly to Python’s if-else idiom:
DT[:, ifelse(f.integers % 2 == 0, 'even', 'odd')]
| C0 | ||
|---|---|---|
| str32 | ||
| 0 | odd | |
| 1 | even | |
| 2 | odd | |
| 3 | even |
Transforming a column via a boolean condition in the i section is possible either; note that this could result in a reduced number of rows:
DT[f.dates < dt.time.ymd(2020,1,1), f.integers ** 4]
| C0 | ||
|---|---|---|
| float64 | ||
| 0 | 1 | |
| 1 | 16 |
Column type can be changed with the as_type() function
Single column:
DT[:, dt.as_type(f.integers, str)]
integers str32 0 1 1 2 2 3 3 4 DT[:, dt.as_type(f.integers, dt.Type.str32)]
integers str32 0 1 1 2 2 3 3 4
Multiple columns:
DT[:, dt.as_type(f['integers', 'floats'], str)]integers floats str32 str32 0 1 10.0 1 2 11.5 2 3 12.3 3 4 -13.0
Values across multiple columns can be replaced with the dt.Frame.replace() method; this works on the entire frame, and is an in-place operation.
The dt.Frame.replace() syntax is frame.replace(replace_what, replace_with), where replace_what values in the frame are replaced with replaced_with:
DT.replace(1, -1)
DT
| dates | integers | floats | strings | ||
|---|---|---|---|---|---|
| date32 | int32 | float64 | str32 | ||
| 0 | 2000-01-05 | -1 | 10 | A | |
| 1 | 2010-11-23 | 2 | 11.5 | B | |
| 2 | 2020-02-29 | 3 | 12.3 | NA | |
| 3 | NA | 4 | -13 | D |
For multiple values, a list or a dictionary can be used.
If a list is used, the number of entries in the replace_what list must match the number of entries in the replace_with list, or the number of entries in replace_with must be exactly 1:
DT.replace(['A', 10.0], ['A_pre', 30.0])
DT
| dates | integers | floats | strings | ||
|---|---|---|---|---|---|
| date32 | int32 | float64 | str32 | ||
| 0 | 2000-01-05 | -1 | 30 | A_pre | |
| 1 | 2010-11-23 | 2 | 11.5 | B | |
| 2 | 2020-02-29 | 3 | 12.3 | NA | |
| 3 | NA | 4 | -13 | D |
DT.replace([2, 3], 20)
DT
| dates | integers | floats | strings | ||
|---|---|---|---|---|---|
| date32 | int32 | float64 | str32 | ||
| 0 | 2000-01-05 | -1 | 30 | A_pre | |
| 1 | 2010-11-23 | 20 | 11.5 | B | |
| 2 | 2020-02-29 | 20 | 12.3 | NA | |
| 3 | NA | 4 | -13 | D |
If a dictionary is used, the replace_what values serve as the keys, while the replace_with values are the values in the dictionary:
DT.replace({4: 24, 'B': 'BBB'})
DT
| dates | integers | floats | strings | ||
|---|---|---|---|---|---|
| date32 | int32 | float64 | str32 | ||
| 0 | 2000-01-05 | -1 | 30 | A_pre | |
| 1 | 2010-11-23 | 2 | 11.5 | BBB | |
| 2 | 2020-02-29 | 3 | 12.3 | NA | |
| 3 | NA | 24 | -13 | D |
The replace_what values, that are not found in the frame, are ignored:
DT.replace({355:26})
DT
| dates | integers | floats | strings | ||
|---|---|---|---|---|---|
| date32 | int32 | float64 | str32 | ||
| 0 | 2000-01-05 | -1 | 30 | A_pre | |
| 1 | 2010-11-23 | 2 | 11.5 | BBB | |
| 2 | 2020-02-29 | 3 | 12.3 | NA | |
| 3 | NA | 24 | -13 | D |
Have a look at dt.Frame.replace() for more options, especially when replacing null values.
Iteration on a Frame¶
Iterating through a Frame allows access to the individual columns; in this case, each column gets
converted to one-column frame:
print(*[col for col in DT], sep="\n")
| dates | ||
|---|---|---|
| date32 | ||
| 0 | 2000-01-05 | |
| 1 | 2010-11-23 | |
| 2 | 2020-02-29 | |
| 3 | NA |
| integers | ||
|---|---|---|
| int32 | ||
| 0 | -1 | |
| 1 | 20 | |
| 2 | 20 | |
| 3 | 24 |
| floats | ||
|---|---|---|
| float64 | ||
| 0 | 30 | |
| 1 | 11.5 | |
| 2 | 12.3 | |
| 3 | -13 |
| strings | ||
|---|---|---|
| str32 | ||
| 0 | A_pre | |
| 1 | BBB | |
| 2 | NA | |
| 3 | D |
With iteration, different operations can be applied to different columns:
outcome = [frame.mean() if frame.type.is_numeric else frame[0, :] for frame in DT]
outcome[0]
| dates | ||
|---|---|---|
| date32 | ||
| 0 | 2000-01-05 |
outcome[1]
| integers | ||
|---|---|---|
| float64 | ||
| 0 | 15.75 |
outcome[2]
| floats | ||
|---|---|---|
| float64 | ||
| 0 | 10.2 |
outcome[3]
| strings | ||
|---|---|---|
| str32 | ||
| 0 | A_pre |
dt.cbind(outcome)
| dates | integers | floats | strings | ||
|---|---|---|---|---|---|
| date32 | float64 | float64 | str32 | ||
| 0 | 2000-01-05 | 15.75 | 10.2 | A_pre |
Sorting a Frame¶
A frame can be sorted via the sort() function, or the dt.Frame.sort() method:
DT[:, :, dt.sort('dates')]
| dates | integers | floats | strings | ||
|---|---|---|---|---|---|
| date32 | int32 | float64 | str32 | ||
| 0 | NA | 24 | -13 | D | |
| 1 | 2000-01-05 | -1 | 30 | A_pre | |
| 2 | 2010-11-23 | 20 | 11.5 | BBB | |
| 3 | 2020-02-29 | 20 | 12.3 | NA |
DT.sort('dates')
| dates | integers | floats | strings | ||
|---|---|---|---|---|---|
| date32 | int32 | float64 | str32 | ||
| 0 | NA | 24 | -13 | D | |
| 1 | 2000-01-05 | -1 | 30 | A_pre | |
| 2 | 2010-11-23 | 20 | 11.5 | BBB | |
| 3 | 2020-02-29 | 20 | 12.3 | NA |
Sorting is also possible via f-expressions:
DT[:, :, dt.sort(f.floats)]
| dates | integers | floats | strings | ||
|---|---|---|---|---|---|
| date32 | int32 | float64 | str32 | ||
| 0 | NA | 24 | -13 | D | |
| 1 | 2010-11-23 | 20 | 11.5 | BBB | |
| 2 | 2020-02-29 | 20 | 12.3 | NA | |
| 3 | 2000-01-05 | -1 | 30 | A_pre |
DT.sort(f.strings)
| dates | integers | floats | strings | ||
|---|---|---|---|---|---|
| date32 | int32 | float64 | str32 | ||
| 0 | 2020-02-29 | 20 | 12.3 | NA | |
| 1 | 2000-01-05 | -1 | 30 | A_pre | |
| 2 | 2010-11-23 | 20 | 11.5 | BBB | |
| 3 | NA | 24 | -13 | D |
The default sorting order is ascending; if there are any nulls in the sorting columns, they go to the top.
The sorting order and the position of nulls can be changed in a number of ways:
Sorting can be done in descending order via the reverse parameter:
DT[:, :, dt.sort('integers', reverse = True)]dates integers floats strings date32 int32 float64 str32 0 NA 24 -13 D 1 2010-11-23 20 11.5 BBB 2 2020-02-29 20 12.3 NA 3 2000-01-05 -1 30 A_pre
Note
The reverse parameter is available only in the sort() function
Sorting in descending order is also possible by negating the f-expressions within the
sort()function, or thedt.Frame.sort()method:DT[:, :, dt.sort(-f.integers)]dates integers floats strings date32 int32 float64 str32 0 NA 24 -13 D 1 2010-11-23 20 11.5 BBB 2 2020-02-29 20 12.3 NA 3 2000-01-05 -1 30 A_pre DT.sort(-f.integers)dates integers floats strings date32 int32 float64 str32 0 NA 24 -13 D 1 2010-11-23 20 11.5 BBB 2 2020-02-29 20 12.3 NA 3 2000-01-05 -1 30 A_pre The position of null values within the sorting column can be controlled with the
na_positionparameter:DT[:, :, dt.sort('dates', na_position = 'last')]dates integers floats strings date32 int32 float64 str32 0 2000-01-05 -1 30 A_pre 1 2010-11-23 20 11.5 BBB 2 2020-02-29 20 12.3 NA 3 NA 24 -13 D Rows with null values can be removed by passing remove to the
na_positionparameter:# only the row where date is null is removed DT[:, :, dt.sort('dates', na_position='remove')]dates integers floats strings date32 int32 float64 str32 0 2000-01-05 -1 30 A_pre 1 2010-11-23 20 11.5 BBB 2 2020-02-29 20 12.3 NA
Note
The na_position parameter is available only in the sort() function.
Note
The default value for na_position is first.
Sorting is possible on multiple columns:
DT[:, :, dt.sort('dates', 'integers')]
| dates | integers | floats | strings | ||
|---|---|---|---|---|---|
| date32 | int32 | float64 | str32 | ||
| 0 | NA | 24 | -13 | D | |
| 1 | 2000-01-05 | -1 | 30 | A_pre | |
| 2 | 2010-11-23 | 20 | 11.5 | BBB | |
| 3 | 2020-02-29 | 20 | 12.3 | NA |
DT.sort('dates', 'integers')
| dates | integers | floats | strings | ||
|---|---|---|---|---|---|
| date32 | int32 | float64 | str32 | ||
| 0 | NA | 24 | -13 | D | |
| 1 | 2000-01-05 | -1 | 30 | A_pre | |
| 2 | 2010-11-23 | 20 | 11.5 | BBB | |
| 3 | 2020-02-29 | 20 | 12.3 | NA |
DT[:, :, dt.sort(-f.integers, f.dates)]
| dates | integers | floats | strings | ||
|---|---|---|---|---|---|
| date32 | int32 | float64 | str32 | ||
| 0 | NA | 24 | -13 | D | |
| 1 | 2010-11-23 | 20 | 11.5 | BBB | |
| 2 | 2020-02-29 | 20 | 12.3 | NA | |
| 3 | 2000-01-05 | -1 | 30 | A_pre |
DT.sort(-f.integers, f.dates)
| dates | integers | floats | strings | ||
|---|---|---|---|---|---|
| date32 | int32 | float64 | str32 | ||
| 0 | NA | 24 | -13 | D | |
| 1 | 2010-11-23 | 20 | 11.5 | BBB | |
| 2 | 2020-02-29 | 20 | 12.3 | NA | |
| 3 | 2000-01-05 | -1 | 30 | A_pre |
Column Assignment¶
Transformed columns can be assigned to new columns, or replace existing columns via direct assignment and the update() function. A third option, extend, applies only when creating new columns.
Direct Assignment¶
Single column:
DT['months'] = DT[:, dt.time.month(f.dates)] DTdates integers floats strings months date32 int32 float64 str32 int32 0 2000-01-05 -1 30 A_pre 1 1 2010-11-23 20 11.5 BBB 11 2 2020-02-29 20 12.3 NA 2 3 NA 24 -13 D NA Multiple columns:
DT[:, ['months', 'int_squared']] = DT[:, [dt.time.month(f.dates), f.integers**2]] DTdates integers floats strings months int_squared date32 int32 float64 str32 int32 float64 0 2000-01-05 -1 30 A_pre 1 1 1 2010-11-23 20 11.5 BBB 11 400 2 2020-02-29 20 12.3 NA 2 400 3 NA 24 -13 D NA 576 Update existing column:
DT['strings'] = DT[:, f.strings[:1]] DTdates integers floats strings months int_squared date32 int32 float64 str32 int32 float64 0 2000-01-05 -1 30 A 1 1 1 2010-11-23 20 11.5 B 11 400 2 2020-02-29 20 12.3 NA 2 400 3 NA 24 -13 D NA 576 You can assign an f-expressions to create a new column or update an existing column:
DT['integers'] = f.integers * 3 DTdates integers floats strings months int_squared date32 int32 float64 str32 int32 float64 0 2000-01-05 -3 30 A 1 1 1 2010-11-23 60 11.5 B 11 400 2 2020-02-29 60 12.3 NA 2 400 3 NA 72 -13 D NA 576
update()¶
update() is an in-place operation, and as such, a direct assignment is not required.
update() uses keyword arguments, where the key in the dictionary is the name of the new column, while the value is the f-expressions to be computed.
Single column:
DT[:, update(year = dt.time.year(f.dates))] DTdates integers floats strings months int_squared year date32 int32 float64 str32 int32 float64 int32 0 2000-01-05 -3 30 A 1 1 2000 1 2010-11-23 60 11.5 B 11 400 2010 2 2020-02-29 60 12.3 NA 2 400 2020 3 NA 72 -13 D NA 576 NA Multiple columns:
DT[:, update(year = dt.time.year(f.dates), float_doubled = f.floats * 2)] DTdates integers floats strings months int_squared year float_doubled date32 int32 float64 str32 int32 float64 int32 float64 0 2000-01-05 -3 30 A 1 1 2000 60 1 2010-11-23 60 11.5 B 11 400 2010 23 2 2020-02-29 60 12.3 NA 2 400 2020 24.6 3 NA 72 -13 D NA 576 NA -26 Update existing column:
DT[:, update(year = f.year / 12)] DTdates integers floats strings months int_squared year float_doubled date32 int32 float64 str32 int32 float64 float64 float64 0 2000-01-05 -3 30 A 1 1 166.667 60 1 2010-11-23 60 11.5 B 11 400 167.5 23 2 2020-02-29 60 12.3 NA 2 400 168.333 24.6 3 NA 72 -13 D NA 576 NA -26
Extend¶
The extend method works via f-expressions to create new columns; it does not update existing columns.
The extend method uses a dictionary to create the new columns, where the key in the dictionary is the name of the new column, while the value is the f-expressions to be computed.
First, let’s trim the DT frame:
DT = DT[:, :4]
DT
| dates | integers | floats | strings | ||
|---|---|---|---|---|---|
| date32 | int32 | float64 | str32 | ||
| 0 | 2000-01-05 | -3 | 10 | A | |
| 1 | 2010-11-23 | 60 | 11.5 | B | |
| 2 | 2020-02-29 | 60 | 12.3 | NA | |
| 3 | NA | 72 | -13 | D |
Add one column:
DT = DT[:, f[:].extend({"months" : dt.time.month(f.dates)})] DTdates integers floats strings months date32 int32 float64 str32 int32 0 2000-01-05 -3 30 A 1 1 2010-11-23 60 11.5 B 11 2 2020-02-29 60 12.3 NA 2 3 NA 72 -13 D NA Add multiple columns:
DT = DT[:, f[:].extend({"year" : dt.time.year(f.dates), "int_squared" : f.integers ** 2})] DTdates integers floats strings months year int_squared date32 int32 float64 str32 int32 int32 float64 0 2000-01-05 -3 30 A 1 2000 9 1 2010-11-23 60 11.5 B 11 2010 3600 2 2020-02-29 60 12.3 NA 2 2020 3600 3 NA 72 -13 D NA NA 5184
Note
In contrast to update(), the result has to be assigned to DT.