Comparison with SQL¶

This page provides some examples of how various SQL operations can be performed in datatable. The datatable library is still growing; as such, not all functions in SQL can be replicated yet. If there is a feature you would love to have in datatable, please make a feature request on the github issues page.

Most of the examples will be based on the famous iris dataset. SQLite will be the flavour of SQL used in the comparison.

Let’s import datatable and read in the data using its fread() function:

from datatable import dt, f, g, by, join, sort, update, fread

iris = fread('https://raw.githubusercontent.com/h2oai/datatable/main/docs/_static/iris.csv')
iris
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
float64float64float64float64str32
13.51.40.2setosa
931.40.2setosa
73.21.30.2setosa
63.11.50.2setosa
61.40.2setosa
43.91.70.4setosa
63.41.40.3setosa
41.50.2setosa
42.91.40.2setosa
93.11.50.1setosa
43.71.50.2setosa
83.41.60.2setosa
831.40.1setosa
331.10.1setosa
841.20.2setosa
………………
735.22.3virginica
32.551.9virginica
535.22virginica
23.45.42.3virginica
935.11.8virginica
rows × 5 columns

Loading data into an SQL table is a bit more involved, where you need to create the structure of the table (a schema), before importing the csv file. Have a look at SQLite import tutorial for an example on loading data into a SQLite datatabase.

SELECT¶

In SQL, you can select a subset of the columns with the SELECT clause:

SELECT sepal_length,
       sepal_width,
       petal_length
FROM iris
LIMIT 5;

In datatable, columns are selected in the j section:

iris[:5, ['sepal_length', 'sepal_width', 'petal_length']]
sepal_lengthsepal_widthpetal_length
float64float64float64
05.13.51.4
14.931.4
24.73.21.3
34.63.11.5
453.61.4
5 rows × 3 columns

In SQL, you can select all columns with the * symbol:

SELECT *
FROM iris
LIMIT 5;

In datatable, all columns can be selected with a simple “select-all” slice :, or with f-expressions:

iris[:5, :]
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
float64float64float64float64str32
05.13.51.40.2setosa
14.931.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
453.61.40.2setosa
5 rows × 5 columns

If you are selecting a single column, datatable allows you to access just the j section within the square brackets; you do not need to include the i section: DT[j]

SELECT sepal_length
FROM iris
LIMIT 5;

# datatable
iris['sepal_length'].head(5)
sepal_length
float64
05.1
14.9
24.7
34.6
45
5 rows × 1 column

How about adding new columns? In SQL, this is done also in the SELECT clause:

SELECT *,
       sepal_length*2 AS sepal_length_doubled
FROM iris
LIMIT 5;

In datatable, addition of new columns occurs in the j section:

iris[:5,
     f[:].extend({"sepal_length_doubled": f.sepal_length * 2})]
sepal_lengthsepal_widthpetal_lengthpetal_widthspeciessepal_length_doubled
float64float64float64float64str32float64
05.13.51.40.2setosa10.2
14.931.40.2setosa9.8
24.73.21.30.2setosa9.4
34.63.11.50.2setosa9.2
453.61.40.2setosa10
5 rows × 6 columns

The update() function can also be used to add new columns. The operation occurs in-place; reassignment is not required:

iris[:, update(sepal_length_doubled = f.sepal_length * 2)]
iris[:5, :]
sepal_lengthsepal_widthpetal_lengthpetal_widthspeciessepal_length_doubled
float64float64float64float64str32float64
05.13.51.40.2setosa10.2
14.931.40.2setosa9.8
24.73.21.30.2setosa9.4
34.63.11.50.2setosa9.2
453.61.40.2setosa10
5 rows × 6 columns

WHERE¶

Filtering in SQL is done via the WHERE clause.

SELECT *
FROM iris
WHERE species = 'virginica'
LIMIT 5;

In datatable, filtration is done in the i section:

iris[f.species=="virginica", :].head(5)
sepal_lengthsepal_widthpetal_lengthpetal_widthspeciessepal_length_doubled
float64float64float64float64str32float64
06.33.362.5virginica12.6
15.82.75.11.9virginica11.6
27.135.92.1virginica14.2
36.32.95.61.8virginica12.6
46.535.82.2virginica13
5 rows × 6 columns

Note that in SQL, equality comparison is done with the = symbol, whereas in python, it is with the == operator. You can filter with multple conditions too:

SELECT *
FROM iris
WHERE species = 'setosa' AND sepal_length = 5;

In datatable each condition is wrapped in parentheses; the & operator is the equivalent of AND, while | is the equivalent of OR:

iris[(f.species=="setosa") & (f.sepal_length==5), :]
sepal_lengthsepal_widthpetal_lengthpetal_widthspeciessepal_length_doubled
float64float64float64float64str32float64
61.40.2setosa10
41.50.2setosa10
60.2setosa10
41.60.4setosa10
21.20.2setosa10
51.30.3setosa10
51.60.6setosa10
31.40.2setosa10
rows × 6 columns

Now suppose you have a frame where some values are missing (NA):

null_data = dt.Frame(""" a    b    c
                         1    2    3
                         1    NaN  4
                         2    1    3
                         1    2    2""")
null_data
abc
int32float64int32
0123
11NA4
2213
3122
4 rows × 3 columns

In SQL you could filter out those values like this:

SELECT *
FROM null_data
WHERE b is NOT NULL;

In datatable, the NOT operator is replicated with the != symbol:

null_data[f.b!=None, :]
abc
int32float64int32
0123
1213
2122
3 rows × 3 columns

You could also use isna function with the ~ operator which inverts boolean expressions:

null_data[~dt.math.isna(f.b), :]
abc
int32float64int32
0123
1213
2122
3 rows × 3 columns

Keeping the null rows is easily achievable; it is simply the inverse of the above code:

SELECT *
FROM null_data
WHERE b is NULL;

null_data[dt.isna(f.b), :]
abc
int32float64int32
01NA4
1 row × 3 columns

null_data[dt.isna(f.b), :]

Note

SQL has the IN operator, which does not have an equivalent in datatable yet.

ORDER BY¶

In SQL, sorting is executed with the ORDER BY clause, while in datatable it is handled by the sort() function.

SELECT *
FROM iris
ORDER BY sepal_length ASC
limit 5;

iris[:5, :, sort('sepal_length')]
sepal_lengthsepal_widthpetal_lengthpetal_widthspeciessepal_length_doubled
float64float64float64float64str32float64
04.331.10.1setosa8.6
14.42.91.40.2setosa8.8
24.431.30.2setosa8.8
34.43.21.30.2setosa8.8
44.52.31.30.3setosa9
5 rows × 6 columns

Sorting in descending order in SQL is with the DESC.

SELECT *
FROM iris
ORDER BY sepal_length DESC
limit 5;

In datatable, this can be achieved in two ways:

iris[:5, :, sort('sepal_length', reverse=True)]
sepal_lengthsepal_widthpetal_lengthpetal_widthspeciessepal_length_doubled
float64float64float64float64str32float64
07.93.86.42virginica15.8
17.73.86.72.2virginica15.4
27.72.66.92.3virginica15.4
37.72.86.72virginica15.4
47.736.12.3virginica15.4
5 rows × 6 columns

or, you could negate the sorting column; datatable will correctly interprete the negation(-) as descending order:

iris[:5, :, sort(-f.sepal_length)]
sepal_lengthsepal_widthpetal_lengthpetal_widthspeciessepal_length_doubled
float64float64float64float64str32float64
07.93.86.42virginica15.8
17.73.86.72.2virginica15.4
27.72.66.92.3virginica15.4
37.72.86.72virginica15.4
47.736.12.3virginica15.4
5 rows × 6 columns

GROUP BY¶

SQL’s GROUP BY operations can be performed in datatable with the by() function. Have a look at the by() API, as well as the Grouping with by() user guide.

Let’s look at some common grouping operations in SQL, and their equivalents in datatable.

Single aggregation per group¶

SELECT species,
       COUNT() AS N
FROM iris
GROUP BY species;

iris[:, dt.count(), by('species')]
speciescount
str32int64
0setosa50
1versicolor50
2virginica50
3 rows × 2 columns

Multiple aggregations per group¶

SELECT species,
       COUNT() AS N,
       AVG(sepal_length) AS mean_sepal_length
FROM iris
GROUP BY species;

iris[:,
    {"mean_sepal_length": dt.mean(f.sepal_length),
    "N": dt.count()},
    by('species')]
speciesmean_sepal_lengthN
str32float64int64
0setosa5.00650
1versicolor5.93650
2virginica6.58850
3 rows × 3 columns

Grouping on multiple columns¶

fruits_data
FruitDateNameNumber
str32str32str32int32
0Apples10/6/2016Bob7
1Apples10/6/2016Bob8
2Apples10/6/2016Mike9
3Apples10/7/2016Steve10
4Apples10/7/2016Bob1
5Oranges10/7/2016Bob2
6Oranges10/6/2016Tom15
7Oranges10/6/2016Mike57
8Oranges10/6/2016Bob65
9Oranges10/7/2016Tony1
10Grapes10/7/2016Bob1
11Grapes10/7/2016Tom87
12Grapes10/7/2016Bob22
13Grapes10/7/2016Bob12
14Grapes10/7/2016Tony15
15 rows × 4 columns

SELECT fruit,
       name,
       SUM(number) AS sum_num
FROM fruits_data
GROUP BY fruit, name;

fruits_data[:,
            {"sum_num": dt.sum(f.Number)},
            by('Fruit', 'Name')]
FruitNamesum_num
str32str32int64
0ApplesBob16
1ApplesMike9
2ApplesSteve10
3GrapesBob35
4GrapesTom87
5GrapesTony15
6OrangesBob67
7OrangesMike57
8OrangesTom15
9OrangesTony1
10 rows × 3 columns

WHERE with GROUP BY¶

SELECT species,
       AVG(sepal_length) AS avg_sepal_length
FROM iris
WHERE sepal_width > 3
GROUP BY species;

iris[f.sepal_width >=3, :][:,
                          {"avg_sepal_length": dt.mean(f.sepal_length)},
                          by('species')]
speciesavg_sepal_length
str32float64
0setosa5.02917
1versicolor6.21875
2virginica6.76897
3 rows × 2 columns

HAVING with GROUP BY¶

SELECT fruit,
       name,
       SUM(number) AS sum_num
FROM fruits_data
GROUP BY fruit, name
HAVING sum_num > 50;

fruits_data[:,
           {'sum_num': dt.sum(f.Number)},
           by('Fruit','Name')][f.sum_num > 50, :]
FruitNamesum_num
str32str32int64
0GrapesTom87
1OrangesBob67
2OrangesMike57
3 rows × 3 columns

Grouping on a condition¶

SELECT sepal_width >=3 AS width_larger_than_3,
       AVG(sepal_length) AS avg_sepal_length
FROM iris
GROUP BY sepal_width>=3;

iris[:,
     {"avg_sepal_length": dt.mean(f.sepal_length)},
     by(f.sepal_width >= 3)]
C0avg_sepal_length
bool8float64
005.95263
115.77634
2 rows × 2 columns

At the moment, names cannot be assigned in the by section.

LEFT OUTER JOIN¶

We will compare the left outer join, as that is the only join currently implemented in datatable. Another aspect is that the frame being joined must be keyed, the column or columns to be keyed must not have duplicates, and the joining column has to have the same name in both frames. You can read more about the join() API and have a look at the join(…).

Example data:

DT = dt.Frame(x = ["b"]*3 + ["a"]*3 + ["c"]*3,
              y = [1, 3, 6] * 3,
              v = range(1, 10))

X = dt.Frame({"x":('c','b'),
              "v":(8,7),
              "foo":(4,2)})

A left outer join in SQL:

SELECT DT.x,
       DT.y,
       DT.v,
       X.foo
FROM DT
left JOIN X
ON DT.x = X.x

A left outer join in datatable:

X.key = 'x'
DT[:, [f.x, f.y, f.v, g.foo], join(X)]
xyvfoo
str32int32int32int32
0b112
1b322
2b632
3a14NA
4a35NA
5a66NA
6c174
7c384
8c694
9 rows × 4 columns

UNION¶

The UNION ALL clause in SQL can be replicated in datatable with rbind().

SELECT x, v
FROM DT
UNION ALL
SELECT x, v
FROM x

In datatable, rbind() takes a list/tuple of frames and lumps into one:

dt.rbind([DT[:, ('x','v')], X[:, ('x', 'v')]])
xv
str32int32
0b1
1b2
2b3
3a4
4a5
5a6
6c7
7c8
8c9
9b7
10c8
11 rows × 2 columns

SQL’s UNION removes duplicate rows after combining the results of the individual queries; there is no built-in function in datatable yet that handles duplicates.

SQL’s WINDOW functions¶

Some SQL window functions can be replicated in datatable (rank is one of the windows function not currently implemented in datatable) :

TOP n rows per group

SELECT * from
(SELECT *,
       ROW_NUMBER() OVER(PARTITION BY species ORDER BY sepal_length DESC) AS row_num
 FROM iris)
WHERE row_num < 3;

iris[:3, :, by('species'), sort(-f.sepal_length)]
speciessepal_lengthsepal_widthpetal_lengthpetal_width
str32float64float64float64float64
0setosa5.841.20.2
1setosa5.74.41.50.4
2setosa5.73.81.70.3
3versicolor73.24.71.4
4versicolor6.93.14.91.5
5versicolor6.82.84.81.4
6virginica7.93.86.42
7virginica7.73.86.72.2
8virginica7.72.66.92.3
9 rows × 5 columns

Filter for rows above the mean sepal length:

SELECT sepal_length,
       sepal_width,
       petal_length,
       petal_width,
       species
FROM
(SELECT *,
AVG(sepal_length) OVER (PARTITION BY species) AS avg_sepal_length
FROM iris)
WHERE sepal_length > avg_sepal_length
LIMIT 5;

iris[:,
     update(temp = f.sepal_length > dt.mean(f.sepal_length)),
     by('species')]

iris[f.temp == 1, f[:-1]].head(5)
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
float64float64float64float64str32
05.13.51.40.2setosa
15.43.91.70.4setosa
25.43.71.50.2setosa
35.841.20.2setosa
45.74.41.50.4setosa
5 rows × 5 columns

Lead and lag

SELECT name,
       destination,
       dep_date,
       LEAD(dep_date) OVER (ORDER BY dep_date, name) AS lead1,
       LEAD(dep_date, 2) OVER (ORDER BY dep_date, name) AS lead2,
       LAG(dep_date) OVER (ORDER BY dep_date, name) AS lag1,
       LAG(dep_date, 3) OVER (ORDER BY dep_date, name) AS lag3
FROM source_data;

source_data = dt.Frame({'name': ['Ann', 'Ann', 'Ann', 'Bob', 'Bob'],
                        'destination': ['Japan', 'Korea', 'Switzerland',
                                        'USA', 'Switzerland'],
                        'dep_date': ['2019-02-02', '2019-01-01',
                                     '2020-01-11', '2019-05-05',
                                     '2020-01-11'],
                        'duration': [7, 21, 14, 10, 14]})
source_data[:,
            f[:].extend({"lead1": dt.shift(f.dep_date, -1),
                         "lead2": dt.shift(f.dep_date, -2),
                         "lag1": dt.shift(f.dep_date),
                         "lag3": dt.shift(f.dep_date,3)
                         }),
            sort('dep_date','name')]
namedestinationdep_datedurationlead1lead2lag1lag3
str32str32str32int32str32str32str32str32
0AnnKorea2019-01-01212019-02-022019-05-05NANA
1AnnJapan2019-02-0272019-05-052020-01-112019-01-01NA
2BobUSA2019-05-05102020-01-112020-01-112019-02-02NA
3AnnSwitzerland2020-01-11142020-01-11NA2019-05-052019-01-01
4BobSwitzerland2020-01-1114NANA2020-01-112019-02-02
5 rows × 8 columns

The equivalent of SQL’s LAG is shift() with a positive number, while SQL’s LEAD is shift() with a negative number.

Note

datatable does not natively support datetimes yet.

Total sum and the proportions:

proportions = dt.Frame({"t": [1, 2, 3]})
proportions
t
int32
01
12
23
3 rows × 1 column

SELECT  t,
        SUM(t) OVER () AS sum,
        CAST(t as FLOAT)/SUM(t) OVER () AS pct
FROM proportions;

proportions[:,
            f[:].extend({"sum": dt.sum(f.t),
                        "pct": f.t/dt.sum(f.t)})]
tsumpct
int32int64float64
0160.166667
1260.333333
2360.5
3 rows × 3 columns

	sepal_length	sepal_width	petal_length	petal_width	species
	float64	float64	float64	float64	str32
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5	3.6	1.4	0.2	setosa
5	5.4	3.9	1.7	0.4	setosa
6	4.6	3.4	1.4	0.3	setosa
7	5	3.4	1.5	0.2	setosa
8	4.4	2.9	1.4	0.2	setosa
9	4.9	3.1	1.5	0.1	setosa
10	5.4	3.7	1.5	0.2	setosa
11	4.8	3.4	1.6	0.2	setosa
12	4.8	3	1.4	0.1	setosa
13	4.3	3	1.1	0.1	setosa
14	5.8	4	1.2	0.2	setosa
…	…	…	…	…	…
145	6.7	3	5.2	2.3	virginica
146	6.3	2.5	5	1.9	virginica
147	6.5	3	5.2	2	virginica
148	6.2	3.4	5.4	2.3	virginica
149	5.9	3	5.1	1.8	virginica

	species	mean_sepal_length	N
	str32	float64	int64
0	setosa	5.006	50
1	versicolor	5.936	50
2	virginica	6.588	50

	x	y	v	foo
	str32	int32	int32	int32
0	b	1	1	2
1	b	3	2	2
2	b	6	3	2
3	a	1	4	NA
4	a	3	5	NA
5	a	6	6	NA
6	c	1	7	4
7	c	3	8	4
8	c	6	9	4

	Fruit	Date	Name	Number
	str32	str32	str32	int32
0	Apples	10/6/2016	Bob	7
1	Apples	10/6/2016	Bob	8
2	Apples	10/6/2016	Mike	9
3	Apples	10/7/2016	Steve	10
4	Apples	10/7/2016	Bob	1
5	Oranges	10/7/2016	Bob	2
6	Oranges	10/6/2016	Tom	15
7	Oranges	10/6/2016	Mike	57
8	Oranges	10/6/2016	Bob	65
9	Oranges	10/7/2016	Tony	1
10	Grapes	10/7/2016	Bob	1
11	Grapes	10/7/2016	Tom	87
12	Grapes	10/7/2016	Bob	22
13	Grapes	10/7/2016	Bob	12
14	Grapes	10/7/2016	Tony	15

	Fruit	Name	sum_num
	str32	str32	int64
0	Apples	Bob	16
1	Apples	Mike	9
2	Apples	Steve	10
3	Grapes	Bob	35
4	Grapes	Tom	87
5	Grapes	Tony	15
6	Oranges	Bob	67
7	Oranges	Mike	57
8	Oranges	Tom	15
9	Oranges	Tony	1

	Fruit	Name	sum_num
	str32	str32	int64
0	Grapes	Tom	87
1	Oranges	Bob	67
2	Oranges	Mike	57

	C0	avg_sepal_length
	bool8	float64
0	0	5.95263
1	1	5.77634