Comparison with R’s data.table¶
datatable
is closely related to R’s data.table and attempts to mimic
its API; however, there are differences due to language constraints.
This page shows how to perform similar basic operations in R’s data.table
versus datatable
.
Subsetting Rows¶
The examples used here are from the examples data in R’s data.table
.
library(data.table)
DT = data.table(x=rep(c("b","a","c"),each=3),
y=c(1,3,6), v=1:9)
from datatable import dt, f, g, by, update, join, sort
DT = dt.Frame(x = ["b"]*3 + ["a"]*3 + ["c"]*3,
y = [1, 3, 6] * 3,
v = range(1, 10))
DT
x | y | v | ||
---|---|---|---|---|
str32 | int32 | int32 | ||
0 | b | 1 | 1 | |
1 | b | 3 | 2 | |
2 | b | 6 | 3 | |
3 | a | 1 | 4 | |
4 | a | 3 | 5 | |
5 | a | 6 | 6 | |
6 | c | 1 | 7 | |
7 | c | 3 | 8 | |
8 | c | 6 | 9 |
Action |
data.table |
datatable |
---|---|---|
Select 2nd row |
|
|
Select 2nd and 3rd row |
|
|
Select 3rd and 2nd row |
|
|
Select 2nd and 5th rows |
|
|
Select all rows from 2nd to 5th |
|
|
Select rows in reverse from 5th to the 1st |
|
|
Select the last row |
|
|
All rows where |
|
|
Compound logical expressions |
|
|
All rows other than rows 2,3,4 |
|
|
Sort by column |
|
DT.sort("x") orDT[:, :, sort("x")] |
Sort by column |
|
DT.sort(-f.x) orDT[:, :, sort(-f.x)] |
Sort by column |
|
DT.sort(x, -f.y) orDT[:, :, sort(f.x, -f.y)] |
Note
Note the use of the f
symbol when performing computations or
sorting in descending order. You can read more about f-expressions.
Note
In R, DT[2]
would mean 2nd row, whereas in python DT[2]
would
select the 3rd column.
In data.table
, when selecting rows you do not need to indicate the columns.
So, something like the code below works fine:
# data.table
DT[y==3]
x y v
1: b 3 2
2: a 3 5
3: c 3 8
In datatable
, however, when selecting rows there has to be a column
selector, or you get an error:
DT[f.y == 3]
The code above fails because datatable
only allows single-column selection
using the style above:
DT['y']
y | ||
---|---|---|
int32 | ||
0 | 1 | |
1 | 3 | |
2 | 6 | |
3 | 1 | |
4 | 3 | |
5 | 6 | |
6 | 1 | |
7 | 3 | |
8 | 6 |
As such, when datatable
sees an f-expressions, it thinks you are
selecting a column, and appropriately errors out.
Since, in this case, we are selecting all columns, we can use either a colon
(:
) or the Ellipsis symbol(...
):
DT[f.y==3, :]
DT[f.y==3, ...]
Selecting columns¶
Action |
data.table |
datatable |
---|---|---|
Select column |
|
|
Select multiple columns |
|
|
Rename and select column |
|
|
Sum column |
|
|
Return two columns, |
|
|
Select the second column |
|
|
Select last column |
|
|
Select columns |
|
|
Exclude columns |
|
DT[:, [name not in ("x","y") for name in DT.names]] orDT[:, f[:].remove(f['x':'y'])] |
Select columns that start with |
|
DT[:, [name.startswith(("x", "v")) for name in DT.names]] |
In data.table
, you can select a column by using a variable name with the
double dots prefix:
col = 'v'
DT[, ..col]
In datatable
, you do not need the prefix:
col = 'v'
DT[:, col] # or DT[col]
v | ||
---|---|---|
float64 | ||
0 | 1 | |
1 | 1.41421 | |
2 | 1.73205 | |
3 | 2 | |
4 | 2.23607 | |
5 | 2.44949 | |
6 | 2.64575 | |
7 | 2.82843 | |
8 | 3 |
If the column names are stored in a character vector, the double dots prefix also works:
cols = c('v', 'y')
DT[, ..cols]
In datatable
, you can store the list/tuple of column names in a variable
cols = ['v', 'y']
DT[:, cols]
v | y | ||
---|---|---|---|
float64 | float64 | ||
0 | 1 | 1 | |
1 | 1.41421 | 1.73205 | |
2 | 1.73205 | 2.44949 | |
3 | 2 | 1 | |
4 | 2.23607 | 1.73205 | |
5 | 2.44949 | 2.44949 | |
6 | 2.64575 | 1 | |
7 | 2.82843 | 1.73205 | |
8 | 3 | 2.44949 |
Subset rows and Select/Aggregate¶
Action |
data.table |
datatable |
---|---|---|
Sum column |
|
|
Same as above, new column name |
|
|
Filter in |
|
|
Same as above, return as scalar |
|
|
In R indexing starts at 1 and when slicing, the first and the last items are both included. However, in Python, indexing starts at 0, and when slicing all items except the last are included.
Some SD
(Subset of Data) operations can be replicated in datatable
Aggregate several columns¶
DT[, lapply(.SD, mean), .SDcols = c("y","v")]
y v
1: 3.333333 5
DT[:, dt.mean([f.y,f.v])]
y | v | ||
---|---|---|---|
float64 | float64 | ||
0 | 3.33333 | 5 |
Modify columns using a condition¶
DT[, .SD - 1, .SDcols = is.numeric]
y v
1: 0 0
2: 2 1
3: 5 2
4: 0 3
5: 2 4
6: 5 5
7: 0 6
8: 2 7
9: 5 8
DT[:, f[int] - 1]
C0 | C1 | ||
---|---|---|---|
int32 | int32 | ||
0 | 0 | 0 | |
1 | 2 | 1 | |
2 | 5 | 2 | |
3 | 0 | 3 | |
4 | 2 | 4 | |
5 | 5 | 5 | |
6 | 0 | 6 | |
7 | 2 | 7 | |
8 | 5 | 8 |
Modify several columns and keep others unchanged¶
DT[, c("y", "v") := lapply(.SD, sqrt),
.SDcols = c("y", "v")]
x y v
1: b 1.000000 1.000000
2: b 1.732051 1.414214
3: b 2.449490 1.732051
4: a 1.000000 2.000000
5: a 1.732051 2.236068
6: a 2.449490 2.449490
7: c 1.000000 2.645751
8: c 1.732051 2.828427
9: c 2.449490 3.000000
# there is a square root function the datatable math module
DT[:, update(**{name:f[name]**0.5 for name in ("y","v")})]
DT
x | y | v | ||
---|---|---|---|---|
str32 | float64 | float64 | ||
0 | b | 1 | 1 | |
1 | b | 1.73205 | 1.41421 | |
2 | b | 2.44949 | 1.73205 | |
3 | a | 1 | 2 | |
4 | a | 1.73205 | 2.23607 | |
5 | a | 2.44949 | 2.44949 | |
6 | c | 1 | 2.64575 | |
7 | c | 1.73205 | 2.82843 | |
8 | c | 2.44949 | 3 |
Grouping with by()
¶
Action |
data.table |
datatable |
---|---|---|
Get the sum of column |
|
|
Get sum of |
|
|
Number of rows per group |
|
|
Select first row of |
|
|
Get row count and sum columns |
|
|
Expressions in |
|
|
Get row per group where column |
|
|
First 2 rows of each group |
|
|
Last 2 rows of each group |
|
|
In R’s data.table
, the order of the groupings is preserved; in
datatable
, the returned dataframe is sorted on the grouping column.
DT[, sum(v), keyby=x]
in data.table returns a dataframe ordered by
column x
.
In data.table
, i
is executed before the grouping, while in
datatable
, i
is executed after the grouping.
Also, in datatable
, f-expressions in the i
section of a
groupby is not yet implemented, hence the chaining method to get the sum of
column v
where x!=a
.
Multiple aggregations within a group can be executed in R’s data.table
with the syntax below:
DT[, list(MySum=sum(v),
MyMin=min(v),
MyMax=max(v)),
by=.(x, y%%2)]
The same can be replicated in datatable
by using a dictionary:
DT[:, {'MySum': dt.sum(f.v),
'MyMin': dt.min(f.v),
'MyMax': dt.max(f.v)},
by(f.x, f.y%2)]
Add/Update/Delete Columns¶
Action |
data.table |
datatable |
---|---|---|
Add new column |
|
DT[:, update(z=42)] orDT['z'] = 42 orDT[:, 'z'] = 42 orDT = DT[:, f[:].extend({"z":42})] |
Add multiple columns |
|
DT[:, update(sv = dt.sum(f.v), mv = "X")] orDT[:, f[:].extend({"sv": dt.sum(f.v), "mv": "X"})] |
Remove column |
|
del DT['z'] ordel DT[:, 'z'] orDT = DT[:, f[:].remove(f.z)] |
Subassign to existing |
|
DT[f.x=="a", update(v=42)] orDT[f.x=="a", 'v'] = 42 |
Subassign to new column (NA padded) |
|
DT[f.x=="b", update(v2=84)] orDT[f.x=='b', 'v2'] = 84 |
Add new column, assigning values group-wise |
|
DT[:, update(m=dt.mean(f.v)), by("x")] |
In data.table
, you can create a new column with a variable
col = 'rar'
DT[, ..col:=4242]
Similar operation for the above in datatable
:
col = 'rar'
DT[col] = 4242
# or DT[:, update(col = 4242)]
Joins¶
At the moment, only the left outer join is implemented in datatable
.
Another aspect is that the dataframe being joined must be keyed, the column or
columns to be keyed must not have duplicates, and the joining column has to
have the same name in both dataframes. You can read more about the
join()
API and have a look at the Tutorial on join operators.
Left join in R’s data.table
:
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
X = data.table(x=c("c","b"), v=8:7, foo=c(4,2))
X[DT, on="x"]
x v foo y i.v
1: b 7 2 1 1
2: b 7 2 3 2
3: b 7 2 6 3
4: a NA NA 1 4
5: a NA NA 3 5
6: a NA NA 6 6
7: c 8 4 1 7
8: c 8 4 3 8
9: c 8 4 6 9
Join in datatable
:
DT = dt.Frame(x = ["b"]*3 + ["a"]*3 + ["c"]*3,
y = [1, 3, 6] * 3,
v = range(1, 10))
X = dt.Frame({"x":('c','b'),
"v":(8,7),
"foo":(4,2)})
X.key = "x"
DT[:, :, join(X)]
x | y | v | v.0 | foo | ||
---|---|---|---|---|---|---|
str32 | int32 | int32 | int32 | int32 | ||
0 | b | 1 | 1 | 7 | 2 | |
1 | b | 3 | 2 | 7 | 2 | |
2 | b | 6 | 3 | 7 | 2 | |
3 | a | 1 | 4 | NA | NA | |
4 | a | 3 | 5 | NA | NA | |
5 | a | 6 | 6 | NA | NA | |
6 | c | 1 | 7 | 8 | 4 | |
7 | c | 3 | 8 | 8 | 4 | |
8 | c | 6 | 9 | 8 | 4 |
An inner join could be simulated by removing the nulls. Again, a join()
only works if the joining dataframe is keyed.
DT[X, on="x", nomatch=NULL]
x y v i.v foo
1: c 1 7 8 4
2: c 3 8 8 4
3: c 6 9 8 4
4: b 1 1 7 2
5: b 3 2 7 2
6: b 6 3 7 2
DT[g[-1] != None, :, join(X)] # g refers to the joining dataframe X
x | y | v | v.0 | foo | ||
---|---|---|---|---|---|---|
str32 | int32 | int32 | int32 | int32 | ||
0 | b | 1 | 1 | 7 | 2 | |
1 | b | 3 | 2 | 7 | 2 | |
2 | b | 6 | 3 | 7 | 2 | |
3 | c | 1 | 7 | 8 | 4 | |
4 | c | 3 | 8 | 8 | 4 | |
5 | c | 6 | 9 | 8 | 4 |
A not join can be simulated as well:
DT[!X, on="x"]
x y v
1: a 1 4
2: a 3 5
3: a 6 6
DT[g[-1]==None, f[:], join(X)]
x | y | v | ||
---|---|---|---|---|
str32 | int32 | int32 | ||
0 | a | 1 | 4 | |
1 | a | 3 | 5 | |
2 | a | 6 | 6 |
Select the first row for each group:
DT[X, on="x", mult="first"]
x y v i.v foo
1: c 1 7 8 4
2: b 1 1 7 2
DT[g[-1] != None, :, join(X)][0, :, by('x')] # chaining comes in handy here
x | y | v | v.0 | foo | ||
---|---|---|---|---|---|---|
str32 | int32 | int32 | int32 | int32 | ||
0 | b | 1 | 1 | 7 | 2 | |
1 | c | 1 | 7 | 8 | 4 |
Select the last row for each group:
DT[X, on="x", mult="last"]
x y v i.v foo
1: c 6 9 8 4
2: b 6 3 7 2
DT[g[-1]!=None, :, join(X)][-1, :, by('x')]
x | y | v | v.0 | foo | ||
---|---|---|---|---|---|---|
str32 | int32 | int32 | int32 | int32 | ||
0 | b | 6 | 3 | 7 | 2 | |
1 | c | 6 | 9 | 8 | 4 |
Join and evaluate j
for each row in i
:
DT[X, sum(v), by=.EACHI, on="x"]
x V1
1: c 24
2: b 6
DT[g[-1]!=None, :, join(X)][:, dt.sum(f.v), by("x")]
x | v | ||
---|---|---|---|
str32 | int64 | ||
0 | b | 6 | |
1 | c | 24 |
Aggregate on columns from both dataframes in j
:
DT[X, sum(v)*foo, by=.EACHI, on="x"]
x V1
1: c 96
2: b 12
DT[:, dt.sum(f.v*g.foo), join(X), by(f.x)][f[-1]!=0, :]
x | C0 | ||
---|---|---|---|
str32 | int64 | ||
0 | b | 12 | |
1 | c | 96 |
Aggregate on columns with same name from both dataframes in j
:
DT[X, sum(v)*i.v, by=.EACHI, on="x"]
x V1
1: c 192
2: b 42
DT[:, dt.sum(f.v*g.v), join(X), by(f.x)][f[-1]!=0, :]
x | C0 | ||
---|---|---|---|
str32 | int64 | ||
0 | b | 42 | |
1 | c | 192 |
Expect significant improvement in join functionality, with more concise syntax, as well as additions of more features in the future.
Functions in R/data.table not yet implemented¶
This is a list of some functions in data.table
that do not have an
equivalent in datatable
yet, that we would likely implement
Reshaping functions
Convenience functions for filtering and subsetting
Duplicate functions
Aggregation functions
Missing values functions
Also, at the moment, custom aggregations in the j
section are not supported
in datatable
– we intend to implement that at some point.
There are no datetime functions in datatable
, and string operations are
limited as well.
If there are any functions that you would like to see in datatable
, please
head over to github and raise a feature request.