Comparison with SQL

This page provides some examples of how various SQL operations can be performed in datatable. The datatable library is still growing; as such, not all functions in SQL can be replicated yet. If there is a feature you would love to have in datatable, please make a feature request on the github issues page.

Most of the examples will be based on the famous iris dataset. SQLite will be the flavour of SQL used in the comparison.

Let’s import datatable and read in the data using its fread() function:

from datatable import dt, f, g, by, join, sort, update, fread iris = fread('https://raw.githubusercontent.com/h2oai/datatable/main/docs/_static/iris.csv') iris
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
float64float64float64float64str32
05.13.51.40.2setosa
14.931.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
453.61.40.2setosa
55.43.91.70.4setosa
64.63.41.40.3setosa
753.41.50.2setosa
84.42.91.40.2setosa
94.93.11.50.1setosa
105.43.71.50.2setosa
114.83.41.60.2setosa
124.831.40.1setosa
134.331.10.1setosa
145.841.20.2setosa
1456.735.22.3virginica
1466.32.551.9virginica
1476.535.22virginica
1486.23.45.42.3virginica
1495.935.11.8virginica

Loading data into an SQL table is a bit more involved, where you need to create the structure of the table (a schema), before importing the csv file. Have a look at SQLite import tutorial for an example on loading data into a SQLite datatabase.

SELECT

In SQL, you can select a subset of the columns with the SELECT clause:

SELECT sepal_length, sepal_width, petal_length FROM iris LIMIT 5;

In datatable, columns are selected in the j section:

iris[:5, ['sepal_length', 'sepal_width', 'petal_length']]
sepal_lengthsepal_widthpetal_length
float64float64float64
05.13.51.4
14.931.4
24.73.21.3
34.63.11.5
453.61.4

In SQL, you can select all columns with the * symbol:

SELECT * FROM iris LIMIT 5;

In datatable, all columns can be selected with a simple “select-all” slice :, or with f-expressions:

iris[:5, :]
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
float64float64float64float64str32
05.13.51.40.2setosa
14.931.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
453.61.40.2setosa

If you are selecting a single column, datatable allows you to access just the j section within the square brackets; you do not need to include the i section: DT[j]

SELECT sepal_length FROM iris LIMIT 5;
# datatable iris['sepal_length'].head(5)
sepal_length
float64
05.1
14.9
24.7
34.6
45

How about adding new columns? In SQL, this is done also in the SELECT clause:

SELECT *, sepal_length*2 AS sepal_length_doubled FROM iris LIMIT 5;

In datatable, addition of new columns occurs in the j section:

iris[:5, f[:].extend({"sepal_length_doubled": f.sepal_length * 2})]
sepal_lengthsepal_widthpetal_lengthpetal_widthspeciessepal_length_doubled
float64float64float64float64str32float64
05.13.51.40.2setosa10.2
14.931.40.2setosa9.8
24.73.21.30.2setosa9.4
34.63.11.50.2setosa9.2
453.61.40.2setosa10

The update() function can also be used to add new columns. The operation occurs in-place; reassignment is not required:

iris[:, update(sepal_length_doubled = f.sepal_length * 2)] iris[:5, :]
sepal_lengthsepal_widthpetal_lengthpetal_widthspeciessepal_length_doubled
float64float64float64float64str32float64
05.13.51.40.2setosa10.2
14.931.40.2setosa9.8
24.73.21.30.2setosa9.4
34.63.11.50.2setosa9.2
453.61.40.2setosa10

WHERE

Filtering in SQL is done via the WHERE clause.

SELECT * FROM iris WHERE species = 'virginica' LIMIT 5;

In datatable, filtration is done in the i section:

iris[f.species=="virginica", :].head(5)
sepal_lengthsepal_widthpetal_lengthpetal_widthspeciessepal_length_doubled
float64float64float64float64str32float64
06.33.362.5virginica12.6
15.82.75.11.9virginica11.6
27.135.92.1virginica14.2
36.32.95.61.8virginica12.6
46.535.82.2virginica13

Note that in SQL, equality comparison is done with the = symbol, whereas in python, it is with the == operator. You can filter with multple conditions too:

SELECT * FROM iris WHERE species = 'setosa' AND sepal_length = 5;

In datatable each condition is wrapped in parentheses; the & operator is the equivalent of AND, while | is the equivalent of OR:

iris[(f.species=="setosa") & (f.sepal_length==5), :]
sepal_lengthsepal_widthpetal_lengthpetal_widthspeciessepal_length_doubled
float64float64float64float64str32float64
053.61.40.2setosa10
153.41.50.2setosa10
2531.60.2setosa10
353.41.60.4setosa10
453.21.20.2setosa10
553.51.30.3setosa10
653.51.60.6setosa10
753.31.40.2setosa10

Now suppose you have a frame where some values are missing (NA):

null_data = dt.Frame(""" a b c 1 2 3 1 NaN 4 2 1 3 1 2 2""") null_data
abc
int32float64int32
0123
11NA4
2213
3122

In SQL you could filter out those values like this:

SELECT * FROM null_data WHERE b is NOT NULL;

In datatable, the NOT operator is replicated with the != symbol:

null_data[f.b!=None, :]
abc
int32float64int32
0123
1213
2122

You could also use isna function with the ~ operator which inverts boolean expressions:

null_data[~dt.math.isna(f.b), :]
abc
int32float64int32
0123
1213
2122

Keeping the null rows is easily achievable; it is simply the inverse of the above code:

SELECT * FROM null_data WHERE b is NULL;
null_data[dt.isna(f.b), :]
abc
int32float64int32
01NA4
null_data[dt.isna(f.b), :]

Note

SQL has the IN operator, which does not have an equivalent in datatable yet.

ORDER BY

In SQL, sorting is executed with the ORDER BY clause, while in datatable it is handled by the sort() function.

SELECT * FROM iris ORDER BY sepal_length ASC limit 5;
iris[:5, :, sort('sepal_length')]
sepal_lengthsepal_widthpetal_lengthpetal_widthspeciessepal_length_doubled
float64float64float64float64str32float64
04.331.10.1setosa8.6
14.42.91.40.2setosa8.8
24.431.30.2setosa8.8
34.43.21.30.2setosa8.8
44.52.31.30.3setosa9

Sorting in descending order in SQL is with the DESC.

SELECT * FROM iris ORDER BY sepal_length DESC limit 5;

In datatable, this can be achieved in two ways:

iris[:5, :, sort('sepal_length', reverse=True)]
sepal_lengthsepal_widthpetal_lengthpetal_widthspeciessepal_length_doubled
float64float64float64float64str32float64
07.93.86.42virginica15.8
17.73.86.72.2virginica15.4
27.72.66.92.3virginica15.4
37.72.86.72virginica15.4
47.736.12.3virginica15.4

or, you could negate the sorting column; datatable will correctly interprete the negation(-) as descending order:

iris[:5, :, sort(-f.sepal_length)]
sepal_lengthsepal_widthpetal_lengthpetal_widthspeciessepal_length_doubled
float64float64float64float64str32float64
07.93.86.42virginica15.8
17.73.86.72.2virginica15.4
27.72.66.92.3virginica15.4
37.72.86.72virginica15.4
47.736.12.3virginica15.4

GROUP BY

SQL’s GROUP BY operations can be performed in datatable with the by() function. Have a look at the by() API, as well as the Grouping with by() user guide.

Let’s look at some common grouping operations in SQL, and their equivalents in datatable.

Single aggregation per group

SELECT species, COUNT() AS N FROM iris GROUP BY species;
iris[:, dt.count(), by('species')]
speciescount
str32int64
0setosa50
1versicolor50
2virginica50

Multiple aggregations per group

SELECT species, COUNT() AS N, AVG(sepal_length) AS mean_sepal_length FROM iris GROUP BY species;
iris[:, {"mean_sepal_length": dt.mean(f.sepal_length), "N": dt.count()}, by('species')]
speciesmean_sepal_lengthN
str32float64int64
0setosa5.00650
1versicolor5.93650
2virginica6.58850

Grouping on multiple columns

fruits_data
FruitDateNameNumber
str32str32str32int32
0Apples10/6/2016Bob7
1Apples10/6/2016Bob8
2Apples10/6/2016Mike9
3Apples10/7/2016Steve10
4Apples10/7/2016Bob1
5Oranges10/7/2016Bob2
6Oranges10/6/2016Tom15
7Oranges10/6/2016Mike57
8Oranges10/6/2016Bob65
9Oranges10/7/2016Tony1
10Grapes10/7/2016Bob1
11Grapes10/7/2016Tom87
12Grapes10/7/2016Bob22
13Grapes10/7/2016Bob12
14Grapes10/7/2016Tony15
SELECT fruit, name, SUM(number) AS sum_num FROM fruits_data GROUP BY fruit, name;
fruits_data[:, {"sum_num": dt.sum(f.Number)}, by('Fruit', 'Name')]
FruitNamesum_num
str32str32int64
0ApplesBob16
1ApplesMike9
2ApplesSteve10
3GrapesBob35
4GrapesTom87
5GrapesTony15
6OrangesBob67
7OrangesMike57
8OrangesTom15
9OrangesTony1

WHERE with GROUP BY

SELECT species, AVG(sepal_length) AS avg_sepal_length FROM iris WHERE sepal_width > 3 GROUP BY species;
iris[f.sepal_width >=3, :][:, {"avg_sepal_length": dt.mean(f.sepal_length)}, by('species')]
speciesavg_sepal_length
str32float64
0setosa5.02917
1versicolor6.21875
2virginica6.76897

HAVING with GROUP BY

SELECT fruit, name, SUM(number) AS sum_num FROM fruits_data GROUP BY fruit, name HAVING sum_num > 50;
fruits_data[:, {'sum_num': dt.sum(f.Number)}, by('Fruit','Name')][f.sum_num > 50, :]
FruitNamesum_num
str32str32int64
0GrapesTom87
1OrangesBob67
2OrangesMike57

Grouping on a condition

SELECT sepal_width >=3 AS width_larger_than_3, AVG(sepal_length) AS avg_sepal_length FROM iris GROUP BY sepal_width>=3;
iris[:, {"avg_sepal_length": dt.mean(f.sepal_length)}, by(f.sepal_width >= 3)]
C0avg_sepal_length
bool8float64
005.95263
115.77634

At the moment, names cannot be assigned in the by section.

LEFT OUTER JOIN

We will compare the left outer join, as that is the only join currently implemented in datatable. Another aspect is that the frame being joined must be keyed, the column or columns to be keyed must not have duplicates, and the joining column has to have the same name in both frames. You can read more about the join() API and have a look at the join(…).

Example data:

DT = dt.Frame(x = ["b"]*3 + ["a"]*3 + ["c"]*3, y = [1, 3, 6] * 3, v = range(1, 10)) X = dt.Frame({"x":('c','b'), "v":(8,7), "foo":(4,2)})

A left outer join in SQL:

SELECT DT.x, DT.y, DT.v, X.foo FROM DT left JOIN X ON DT.x = X.x

A left outer join in datatable:

X.key = 'x' DT[:, [f.x, f.y, f.v, g.foo], join(X)]
xyvfoo
str32int32int32int32
0b112
1b322
2b632
3a14NA
4a35NA
5a66NA
6c174
7c384
8c694

UNION

The UNION ALL clause in SQL can be replicated in datatable with rbind().

SELECT x, v FROM DT UNION ALL SELECT x, v FROM x

In datatable, rbind() takes a list/tuple of frames and lumps into one:

dt.rbind([DT[:, ('x','v')], X[:, ('x', 'v')]])
xv
str32int32
0b1
1b2
2b3
3a4
4a5
5a6
6c7
7c8
8c9
9b7
10c8

SQL’s UNION removes duplicate rows after combining the results of the individual queries; there is no built-in function in datatable yet that handles duplicates.

SQL’s WINDOW functions

Some SQL window functions can be replicated in datatable (rank is one of the windows function not currently implemented in datatable) :

  • TOP n rows per group

SELECT * from (SELECT *, ROW_NUMBER() OVER(PARTITION BY species ORDER BY sepal_length DESC) AS row_num FROM iris) WHERE row_num < 3;
iris[:3, :, by('species'), sort(-f.sepal_length)]
speciessepal_lengthsepal_widthpetal_lengthpetal_width
str32float64float64float64float64
0setosa5.841.20.2
1setosa5.74.41.50.4
2setosa5.73.81.70.3
3versicolor73.24.71.4
4versicolor6.93.14.91.5
5versicolor6.82.84.81.4
6virginica7.93.86.42
7virginica7.73.86.72.2
8virginica7.72.66.92.3

Filter for rows above the mean sepal length:

SELECT sepal_length, sepal_width, petal_length, petal_width, species FROM (SELECT *, AVG(sepal_length) OVER (PARTITION BY species) AS avg_sepal_length FROM iris) WHERE sepal_length > avg_sepal_length LIMIT 5;
iris[:, update(temp = f.sepal_length > dt.mean(f.sepal_length)), by('species')] iris[f.temp == 1, f[:-1]].head(5)
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
float64float64float64float64str32
05.13.51.40.2setosa
15.43.91.70.4setosa
25.43.71.50.2setosa
35.841.20.2setosa
45.74.41.50.4setosa

Lead and lag

SELECT name, destination, dep_date, LEAD(dep_date) OVER (ORDER BY dep_date, name) AS lead1, LEAD(dep_date, 2) OVER (ORDER BY dep_date, name) AS lead2, LAG(dep_date) OVER (ORDER BY dep_date, name) AS lag1, LAG(dep_date, 3) OVER (ORDER BY dep_date, name) AS lag3 FROM source_data;
source_data = dt.Frame({'name': ['Ann', 'Ann', 'Ann', 'Bob', 'Bob'], 'destination': ['Japan', 'Korea', 'Switzerland', 'USA', 'Switzerland'], 'dep_date': ['2019-02-02', '2019-01-01', '2020-01-11', '2019-05-05', '2020-01-11'], 'duration': [7, 21, 14, 10, 14]}) source_data[:, f[:].extend({"lead1": dt.shift(f.dep_date, -1), "lead2": dt.shift(f.dep_date, -2), "lag1": dt.shift(f.dep_date), "lag3": dt.shift(f.dep_date,3) }), sort('dep_date','name')]
namedestinationdep_datedurationlead1lead2lag1lag3
str32str32str32int32str32str32str32str32
0AnnKorea2019-01-01212019-02-022019-05-05NANA
1AnnJapan2019-02-0272019-05-052020-01-112019-01-01NA
2BobUSA2019-05-05102020-01-112020-01-112019-02-02NA
3AnnSwitzerland2020-01-11142020-01-11NA2019-05-052019-01-01
4BobSwitzerland2020-01-1114NANA2020-01-112019-02-02

The equivalent of SQL’s LAG is shift() with a positive number, while SQL’s LEAD is shift() with a negative number.

Note

datatable does not natively support datetimes yet.

Total sum and the proportions:

proportions = dt.Frame({"t": [1, 2, 3]}) proportions
t
int32
01
12
23
SELECT t, SUM(t) OVER () AS sum, CAST(t as FLOAT)/SUM(t) OVER () AS pct FROM proportions;
proportions[:, f[:].extend({"sum": dt.sum(f.t), "pct": f.t/dt.sum(f.t)})]
tsumpct
int32int64float64
0160.166667
1260.333333
2360.5