Grouping with by()

The by() modifier splits a dataframe into groups, either via the provided column(s) or f-expressions, and then applies i and j within each group. This split-apply-combine strategy allows for a number of operations:

  • Aggregations per group,

  • Transformation of a column or columns, where the shape of the dataframe is maintained,

  • Filtration, where some data are kept and the others discarded, based on a condition or conditions.

Aggregation

The aggregate function is applied in the j section.

Group by one column:

from datatable import (dt, f, by, ifelse, update, sort, count, min, max, mean, sum, rowsum) df = dt.Frame("""Fruit Date Name Number Apples 10/6/2016 Bob 7 Apples 10/6/2016 Bob 8 Apples 10/6/2016 Mike 9 Apples 10/7/2016 Steve 10 Apples 10/7/2016 Bob 1 Oranges 10/7/2016 Bob 2 Oranges 10/6/2016 Tom 15 Oranges 10/6/2016 Mike 57 Oranges 10/6/2016 Bob 65 Oranges 10/7/2016 Tony 1 Grapes 10/7/2016 Bob 1 Grapes 10/7/2016 Tom 87 Grapes 10/7/2016 Bob 22 Grapes 10/7/2016 Bob 12 Grapes 10/7/2016 Tony 15""") df[:, sum(f.Number), by('Fruit')]
FruitNumber
str32int64
0Apples35
1Grapes137
2Oranges140

Group by multiple columns:

df[:, sum(f.Number), by('Fruit', 'Name')]
FruitNameNumber
str32str32int64
0ApplesBob16
1ApplesMike9
2ApplesSteve10
3GrapesBob35
4GrapesTom87
5GrapesTony15
6OrangesBob67
7OrangesMike57
8OrangesTom15
9OrangesTony1

By column position:

df[:, sum(f.Number), by(f[0])]
FruitNumber
str32int64
0Apples35
1Grapes137
2Oranges140

By boolean expression:

df[:, sum(f.Number), by(f.Fruit == "Apples")]
C0Number
bool8int64
00277
1135

Combination of column and boolean expression:

df[:, sum(f.Number), by(f.Name, f.Fruit == "Apples")]
NameC0Number
str32bool8int64
0Bob0102
1Bob116
2Mike057
3Mike19
4Steve110
5Tom0102
6Tony016

The grouping column can be excluded from the final output:

df[:, sum(f.Number), by('Fruit', add_columns=False)]
Number
int64
035
1137
2140

Note

  • The resulting dataframe has the grouping column(s) as the first column(s).

  • The grouping columns are excluded from j, unless explicitly included.

  • The grouping columns are sorted in ascending order.

Apply multiple aggregate functions to a column in the j section:

df[:, {"min": min(f.Number), "max": max(f.Number)}, by('Fruit','Date')]
FruitDateminmax
str32str32int32int32
0Apples10/6/201679
1Apples10/7/2016110
2Grapes10/7/2016187
3Oranges10/6/20161565
4Oranges10/7/201612

Functions can be applied across a columnset. Task : Get sum of col3 and col4, grouped by col1 and col2:

df = dt.Frame(""" col1 col2 col3 col4 col5 a c 1 2 f a c 1 2 f a d 1 2 f b d 1 2 g b e 1 2 g b e 1 2 g""") df[:, sum(f["col3":"col4"]), by('col1', 'col2')]
col1col2col3col4
str32str32int64int64
0ac24
1ad12
2bd12
3be24

Apply different aggregate functions to different columns:

df[:, [max(f.col3), min(f.col4)], by('col1', 'col2')]
col1col2col3col4
str32str32int8int32
0ac12
1ad12
2bd12
3be12

Nested aggregations in j. Task : Group by column idx and get the row sum of A and B, C and D:

df = dt.Frame(""" idx A B C D cat J 1 2 3 1 x K 4 5 6 2 x L 7 8 9 3 y M 1 2 3 4 y N 4 5 6 5 z O 7 8 9 6 z""") df[:, {"AB" : sum(rowsum(f['A':'B'])), "CD" : sum(rowsum(f['C':'D']))}, by('cat') ]
catABCD
str32int64int64
0x1212
1y1819
2z2426

Computation between aggregated columns. Task: get the difference between the largest and smallest value within each group:

df = dt.Frame("""GROUP VALUE 1 5 2 2 1 10 2 20 1 7""") df[:, max(f.VALUE) - min(f.VALUE), by('GROUP')]
GROUPC0
int32int32
015
1218

Null values are not excluded from the grouping column:

df = dt.Frame(""" a b c 1 2.0 3 1 NaN 4 2 1.0 3 1 2.0 2""") df[:, sum(f[:]), by('b')]
bac
float64int64int64
0NA14
1123
2225

If you wish to ignore null values, first filter them out:

df[f.b != None, :][:, sum(f[:]), by('b')]
bac
float64int64int64
0123
1225

Filtration

This occurs in the i section of the groupby, where only a subset of the data per group is needed; selection is limited to integers or slicing.

Note

  • i is applied after the grouping, not before.

  • f-expressions in the i section is not yet implemented for groupby.

Select the first row per group:

df = dt.Frame("""A B 1 10 1 20 2 30 2 40 3 10""") # passing 0 as index gets the first row after the grouping # note that python's index starts from 0, not 1 df[0, :, by('A')]
AB
int32int32
0110
1230
2310

Select the last row per group:

df[-1, :, by('A')]
AB
int32int32
0120
1240
2310

Select the nth row per group. Task : select the second row per group:

df[1, :, by('A')]
AB
int32int32
0120
1240

Note

Filtering this way can be used to drop duplicates; you can decide to keep the first or last non-duplicate.

Select the latest entry per group:

df = dt.Frame(""" id product date 220 6647 2014-09-01 220 6647 2014-09-03 220 6647 2014-10-16 826 3380 2014-11-11 826 3380 2014-12-09 826 3380 2015-05-19 901 4555 2014-09-01 901 4555 2014-10-05 901 4555 2014-11-01""") df[-1, :, by('id'), sort('date')]
idproductdate
int32int32str32
022066472014-10-16
182633802015-05-19
290145552014-11-01

Note

If sort and by modifiers are present, the sorting occurs after the grouping, and occurs within each group.

Replicate SQL’s HAVING clause. Task: Filter for groups where the length/count is greater than 1:

df = dt.Frame([[1, 1, 5], [2, 3, 6]], names=['A', 'B']) df
AB
int32int32
012
113
256
# Get the count of each group, # and assign to a new column, using the update method # note that the update operation is in-place; # there is no need to assign back to the dataframe df[:, update(filter_col = count()), by('A')] # The new column will be added to the end # We use an f-expression to return rows # in each group where the count is greater than 1 df[f.filter_col > 1, f[:-1]]
AB
int32int32
012
113

Keep only rows per group where diff is the minimum:

df = dt.Frame(""" item diff otherstuff 1 2 1 1 1 2 1 3 7 2 -1 0 2 1 3 2 4 9 2 -6 2 3 0 0 3 2 9""") df[:, #get boolean for rows where diff column is minimum for each group update(filter_col = f.diff == min(f.diff)), by('item')] df[f.filter_col == 1, :-1]
itemdiffotherstuff
int32int32int32
0112
12-62
2300

Keep only entries where make has both 0 and 1 in sales:

df = dt.Frame(""" make country other_columns sale honda tokyo data 1 honda hirosima data 0 toyota tokyo data 1 toyota hirosima data 0 suzuki tokyo data 0 suzuki hirosima data 0 ferrari tokyo data 1 ferrari hirosima data 0 nissan tokyo data 1 nissan hirosima data 0""") df[:, update(filter_col = sum(f.sale)), by('make')] df[f.filter_col == 1, :-1]
makecountryother_columnssale
str32str32str32bool8
0hondatokyodata1
1hondahirosimadata0
2toyotatokyodata1
3toyotahirosimadata0
4ferraritokyodata1
5ferrarihirosimadata0
6nissantokyodata1
7nissanhirosimadata0

Transformation

This is when a function is applied to a column after a groupby and the resulting column is appended back to the dataframe. The number of rows of the dataframe is unchanged. The update() method makes this possible and easy.

Get the minimum and maximum of column c per group, and append to dataframe:

df = dt.Frame(""" c y 9 0 8 0 3 1 6 2 1 3 2 3 5 3 4 4 0 4 7 4""") # Assign the new columns via the update method df[:, update(min_col = min(f.c), max_col = max(f.c)), by('y')] df
cymin_colmax_col
int32int32int32int32
09089
18089
23133
36266
41315
52315
65315
74407
80407
97407

Fill missing values by group mean:

df = dt.Frame({'value' : [1, None, None, 2, 3, 1, 3, None, 3], 'name' : ['A','A', 'B','B','B','B', 'C','C','C']}) df
valuename
float64str32
01A
1NAA
2NAB
32B
43B
51B
63C
7NAC
83C
# This uses a combination of update and ifelse methods: df[:, update(value = ifelse(f.value == None, mean(f.value), f.value)), by('name')] df
valuename
float64str32
01A
11A
22B
32B
43B
51B
63C
73C
83C

Transform and Aggregate on multiple columns

Task: Get the sum of the aggregate of column a and b, grouped by c and d and append to dataframe:

df = dt.Frame({'a' : [1,2,3,4,5,6], 'b' : [1,2,3,4,5,6], 'c' : ['q', 'q', 'q', 'q', 'w', 'w'], 'd' : ['z','z','z','o','o','o']}) df
abcd
int32int32str32str32
011qz
122qz
233qz
344qo
455wo
566wo
df[:, update(e = sum(f.a) + sum(f.b)), by('c', 'd') ] df
abcde
int32int32str32str32int64
011qz12
122qz12
233qz12
344qo8
455wo22
566wo22

Replicate R’s groupby mutate

Task : Get ratio by dividing column c by the product of column c and d, grouped by a and b:

df = dt.Frame(dict(a = (1,1,0,1,0), b = (1,0,0,1,0), c = (10,5,1,5,10), d = (3,1,2,1,2)) ) df
abcd
int8int8int32int32
011103
11051
20012
31151
400102
df[:, update(ratio = f.c / sum(f.c * f.d)), by('a', 'b') ] df
abcdratio
int8int8int32int32float64
0111030.285714
110511
200120.0454545
311510.142857
4001020.454545

Groupby on boolean expressions

Conditional sum with groupby

Task: sum data1 column, grouped by key1 and rows where key2 == "one":

df = dt.Frame("""data1 data2 key1 key2 0.361601 0.375297 a one 0.069889 0.809772 a two 1.468194 0.272929 b one -1.138458 0.865060 b two -0.268210 1.250340 a one""")
>>> >>>
df[:, sum(f.data1), by(f.key2 == "one", f.key1)][f.C0 == 1, 1:]
key1data1
str32float64
0a0.093391
1b1.46819

Conditional sums based on various criteria

df = dt.Frame(""" A_id B C a1 "up" 100 a2 "down" 102 a3 "up" 100 a3 "up" 250 a4 "left" 100 a5 "right" 102""") df[:, {"sum_up": sum(f.B == "up"), "sum_down" : sum(f.B == "down"), "over_200_up" : sum((f.B == "up") & (f.C > 200)) }, by('A_id')]
A_idsum_upsum_downover_200_up
str32int64int64int64
0a1100
1a2010
2a3201
3a4000
4a5000

More Examples

Aggregation on values in a column

Task: group by Day and find minimum Data_Value for elements of type TMIN and maximum Data_Value for elements of type TMAX:

df = dt.Frame(""" Day Element Data_Value 01-01 TMAX 112 01-01 TMAX 101 01-01 TMIN 60 01-01 TMIN 0 01-01 TMIN 25 01-01 TMAX 113 01-01 TMAX 115 01-01 TMAX 105 01-01 TMAX 111 01-01 TMIN 44 01-01 TMIN 83 01-02 TMAX 70 01-02 TMAX 79 01-02 TMIN 0 01-02 TMIN 60 01-02 TMAX 73 01-02 TMIN 31 01-02 TMIN 26 01-02 TMAX 71 01-02 TMIN 26""") df[:, {"TMAX": max(ifelse(f.Element=="TMAX", f.Data_Value, None)), "TMIN": min(ifelse(f.Element=="TMIN", f.Data_Value, None))}, by(f.Day)]
DayTMAXTMIN
str32int32int32
001-011150
101-02790

Group-by and conditional sum and add back to data frame

Task: sum the Count value for each ID, when Num is (17 or 12) and Letter is 'D' and then add the calculation back to the original data frame as column 'Total':

df = dt.Frame(""" ID Num Letter Count 1 17 D 1 1 12 D 2 1 13 D 3 2 17 D 4 2 12 A 5 2 16 D 1 3 16 D 1""") expression = ((f.Num==17) | (f.Num==12)) & (f.Letter == "D") df[:, update(Total = sum(expression * f.Count)), by(f.ID)] df
IDNumLetterCountTotal
int32int32str32int32int64
0117D13
1112D23
2113D33
3217D44
4212A54
5216D14
6316D10

Indexing with multiple min and max in one aggregate

Task : find col1 where col2 is max, col2 where col3 is min and col1 where col3 is max:

df = dt.Frame({ "id" : [1, 1, 1, 2, 2, 2, 2, 3, 3, 3], "col1" : [1, 3, 5, 2, 5, 3, 6, 3, 67, 7], "col2" : [4, 6, 8, 3, 65, 3, 5, 4, 4, 7], "col3" : [34, 64, 53, 5, 6, 2, 4, 6, 4, 67], }) df
idcol1col2col3
int32int32int32int32
011434
113664
215853
32235
425656
52332
62654
73346
836744
937767
df[:, {'col1' : max(ifelse(f.col2 == max(f.col2), f.col1, None)), 'col2' : max(ifelse(f.col3 == min(f.col3), f.col2, None)), 'col3' : max(ifelse(f.col3 == max(f.col3), f.col1, None)) }, by('id')]
idcol1col2col3
int32int32int32int32
01543
12535
23747

Filter rows based on aggregate value

Task: for every word find the tag that has the most count:

df = dt.Frame("""word tag count a S 30 the S 20 a T 60 an T 5 the T 10""") # The solution builds on the knowledge that sorting # while grouping sorts within each group. df[0, :, by('word'), sort(-f.count)]
wordtagcount
str32str32int32
0aT60
1anT5
2theS20

Get the rows where the value column is minimum, and rename columns:

df = dt.Frame({"category": ["A"]*3 + ["B"]*3, "date": ["9/6/2016", "10/6/2016", "11/6/2016", "9/7/2016", "10/7/2016", "11/7/2016"], "value": [7,8,9,10,1,2]}) df
categorydatevalue
str32str32int32
0A9/6/20167
1A10/6/20168
2A11/6/20169
3B9/7/201610
4B10/7/20161
5B11/7/20162
df[0, {"value_date": f.date, "value_min": f.value}, by("category"), sort('value')]
categoryvalue_datevalue_min
str32str32int32
0A9/6/20167
1B10/7/20161

Get the rows where the value column is maximum, and rename columns:

df[0, {"value_date": f.date, "value_max": f.value}, by("category"), sort(-f.value)]
categoryvalue_datevalue_max
str32str32int32
0A11/6/20169
1B9/7/201610

Get the average of the last three instances per group:

import random random.seed(3) df = dt.Frame({"Student": ["Bob", "Bill", "Bob", "Bob", "Bill","Joe", "Joe", "Bill", "Bob", "Joe",], "Score": random.sample(range(10,30), 10)}) df
StudentScore
str32int32
0Bob17
1Bill28
2Bob27
3Bob14
4Bill21
5Joe24
6Joe19
7Bill29
8Bob20
9Joe23
df[-3:, mean(f[:]), f.Student]
StudentScore
str32float64
0Bill26
1Bob20.3333
2Joe22

Group by on a condition

Get the sum of Amount for Number in range (1 to 4) and (5 and above):

df = dt.Frame("""Number, Amount 1, 5 2, 10 3, 11 4, 3 5, 5 6, 8 7, 9 8, 6""") df[:, sum(f.Amount), by(ifelse(f.Number>=5, "B", "A"))]
C0Amount
str32int64
0A29
1B28