Grouping with `by()`¶

The by() modifier splits a dataframe into groups, either via the provided column(s) or f-expressions, and then applies i and j within each group. This split-apply-combine strategy allows for a number of operations:

Aggregations per group,
Transformation of a column or columns, where the shape of the dataframe is maintained,
Filtration, where some data are kept and the others discarded, based on a condition or conditions.

Aggregation¶

The aggregate function is applied in the j section.

Group by one column:

from datatable import (dt, f, by, ifelse, update, sort,
                       count, min, max, mean, sum, rowsum)

df =  dt.Frame("""Fruit   Date       Name  Number
                  Apples  10/6/2016  Bob     7
                  Apples  10/6/2016  Bob     8
                  Apples  10/6/2016  Mike    9
                  Apples  10/7/2016  Steve  10
                  Apples  10/7/2016  Bob     1
                  Oranges 10/7/2016  Bob     2
                  Oranges 10/6/2016  Tom    15
                  Oranges 10/6/2016  Mike   57
                  Oranges 10/6/2016  Bob    65
                  Oranges 10/7/2016  Tony    1
                  Grapes  10/7/2016  Bob     1
                  Grapes  10/7/2016  Tom    87
                  Grapes  10/7/2016  Bob    22
                  Grapes  10/7/2016  Bob    12
                  Grapes  10/7/2016  Tony   15""")

df[:, sum(f.Number), by('Fruit')]
FruitNumber
str32int64
0Apples35
1Grapes137
2Oranges140
3 rows × 2 columns

Group by multiple columns:

df[:, sum(f.Number), by('Fruit', 'Name')]
FruitNameNumber
str32str32int64
0ApplesBob16
1ApplesMike9
2ApplesSteve10
3GrapesBob35
4GrapesTom87
5GrapesTony15
6OrangesBob67
7OrangesMike57
8OrangesTom15
9OrangesTony1
10 rows × 3 columns

By column position:

df[:, sum(f.Number), by(f[0])]
FruitNumber
str32int64
0Apples35
1Grapes137
2Oranges140
3 rows × 2 columns

By boolean expression:

df[:, sum(f.Number), by(f.Fruit == "Apples")]
C0Number
bool8int64
00277
1135
2 rows × 2 columns

Combination of column and boolean expression:

df[:, sum(f.Number), by(f.Name, f.Fruit == "Apples")]
NameC0Number
str32bool8int64
0Bob0102
1Bob116
2Mike057
3Mike19
4Steve110
5Tom0102
6Tony016
7 rows × 3 columns

The grouping column can be excluded from the final output:

df[:, sum(f.Number), by('Fruit', add_columns=False)]
Number
int64
035
1137
2140
3 rows × 1 column

Note

The resulting dataframe has the grouping column(s) as the first column(s).
The grouping columns are excluded from j, unless explicitly included.
The grouping columns are sorted in ascending order.

Apply multiple aggregate functions to a column in the j section:

df[:, {"min": min(f.Number),
       "max": max(f.Number)},
   by('Fruit','Date')]
FruitDateminmax
str32str32int32int32
0Apples10/6/201679
1Apples10/7/2016110
2Grapes10/7/2016187
3Oranges10/6/20161565
4Oranges10/7/201612
5 rows × 4 columns

Functions can be applied across a columnset. Task : Get sum of col3 and col4, grouped by col1 and col2:

df = dt.Frame(""" col1   col2   col3   col4   col5
                  a      c      1      2      f
                  a      c      1      2      f
                  a      d      1      2      f
                  b      d      1      2      g
                  b      e      1      2      g
                  b      e      1      2      g""")

df[:, sum(f["col3":"col4"]), by('col1', 'col2')]
col1col2col3col4
str32str32int64int64
0ac24
1ad12
2bd12
3be24
4 rows × 4 columns

Apply different aggregate functions to different columns:

df[:, [max(f.col3), min(f.col4)], by('col1', 'col2')]
col1col2col3col4
str32str32int8int32
0ac12
1ad12
2bd12
3be12
4 rows × 4 columns

Nested aggregations in j. Task : Group by column idx and get the row sum of A and B, C and D:

df = dt.Frame(""" idx  A   B   C   D   cat
                   J   1   2   3   1   x
                   K   4   5   6   2   x
                   L   7   8   9   3   y
                   M   1   2   3   4   y
                   N   4   5   6   5   z
                   O   7   8   9   6   z""")

df[:,
    {"AB" : sum(rowsum(f['A':'B'])),
     "CD" : sum(rowsum(f['C':'D']))},
   by('cat')
   ]
catABCD
str32int64int64
0x1212
1y1819
2z2426
3 rows × 3 columns

Computation between aggregated columns. Task: get the difference between the largest and smallest value within each group:

df = dt.Frame("""GROUP VALUE
                  1     5
                  2     2
                  1     10
                  2     20
                  1     7""")

df[:, max(f.VALUE) - min(f.VALUE), by('GROUP')]
GROUPC0
int32int32
015
1218
2 rows × 2 columns

Null values are not excluded from the grouping column:

df = dt.Frame("""  a    b    c
                   1    2.0  3
                   1    NaN  4
                   2    1.0  3
                   1    2.0  2""")

df[:, sum(f[:]), by('b')]
bac
float64int64int64
0NA14
1123
2225
3 rows × 3 columns

If you wish to ignore null values, first filter them out:

df[f.b != None, :][:, sum(f[:]), by('b')]
bac
float64int64int64
0123
1225
2 rows × 3 columns

Filtration¶

This occurs in the i section of the groupby, where only a subset of the data per group is needed; selection is limited to integers or slicing.

Note

i is applied after the grouping, not before.
f-expressions in the i section is not yet implemented for groupby.

Select the first row per group:

df = dt.Frame("""A   B
                 1  10
                 1  20
                 2  30
                 2  40
                 3  10""")

# passing 0 as index gets the first row after the grouping
# note that python's index starts from 0, not 1

df[0, :, by('A')]
AB
int32int32
0110
1230
2310
3 rows × 2 columns

Select the last row per group:

df[-1, :, by('A')]
AB
int32int32
0120
1240
2310
3 rows × 2 columns

Select the nth row per group. Task : select the second row per group:

df[1, :, by('A')]
AB
int32int32
0120
1240
2 rows × 2 columns

Note

Filtering this way can be used to drop duplicates; you can decide to keep the first or last non-duplicate.

Select the latest entry per group:

df =  dt.Frame("""   id    product  date
                    220    6647     2014-09-01
                    220    6647     2014-09-03
                    220    6647     2014-10-16
                    826    3380     2014-11-11
                    826    3380     2014-12-09
                    826    3380     2015-05-19
                    901    4555     2014-09-01
                    901    4555     2014-10-05
                    901    4555     2014-11-01""")

df[-1, :, by('id'), sort('date')]
idproductdate
int32int32str32
022066472014-10-16
182633802015-05-19
290145552014-11-01
3 rows × 3 columns

Note

If sort and by modifiers are present, the sorting occurs after the grouping, and occurs within each group.

Replicate SQL’s HAVING clause. Task: Filter for groups where the length/count is greater than 1:

df = dt.Frame([[1, 1, 5], [2, 3, 6]], names=['A', 'B'])
df
AB
int32int32
012
113
256
3 rows × 2 columns
# Get the count of each group,
# and assign to a new column, using the update method
# note that the update operation is in-place;
# there is no need to assign back to the dataframe
df[:, update(filter_col = count()), by('A')]

# The new column will be added to the end
# We use an f-expression to return rows
# in each group where the count is greater than 1
df[f.filter_col > 1, f[:-1]]
AB
int32int32
012
113
2 rows × 2 columns

Keep only rows per group where diff is the minimum:

df = dt.Frame(""" item    diff   otherstuff
                    1       2            1
                    1       1            2
                    1       3            7
                    2      -1            0
                    2       1            3
                    2       4            9
                    2      -6            2
                    3       0            0
                    3       2            9""")

df[:,
   #get boolean for rows where diff column is minimum for each group
   update(filter_col = f.diff == min(f.diff)),
   by('item')]

df[f.filter_col == 1, :-1]
itemdiffotherstuff
int32int32int32
0112
12-62
2300
3 rows × 3 columns

Keep only entries where make has both 0 and 1 in sales:

df  =  dt.Frame(""" make    country  other_columns   sale
                    honda    tokyo       data          1
                    honda    hirosima    data          0
                    toyota   tokyo       data          1
                    toyota   hirosima    data          0
                    suzuki   tokyo       data          0
                    suzuki   hirosima    data          0
                    ferrari  tokyo       data          1
                    ferrari  hirosima    data          0
                    nissan   tokyo       data          1
                    nissan   hirosima    data          0""")

df[:,
   update(filter_col = sum(f.sale)),
   by('make')]

df[f.filter_col == 1, :-1]
makecountryother_columnssale
str32str32str32bool8
0hondatokyodata1
1hondahirosimadata0
2toyotatokyodata1
3toyotahirosimadata0
4ferraritokyodata1
5ferrarihirosimadata0
6nissantokyodata1
7nissanhirosimadata0
8 rows × 4 columns

Transformation¶

This is when a function is applied to a column after a groupby and the resulting column is appended back to the dataframe. The number of rows of the dataframe is unchanged. The update() method makes this possible and easy.

Get the minimum and maximum of column c per group, and append to dataframe:

df = dt.Frame(""" c     y
                  9     0
                  8     0
                  3     1
                  6     2
                  1     3
                  2     3
                  5     3
                  4     4
                  0     4
                  7     4""")

# Assign the new columns via the update method
df[:,
   update(min_col = min(f.c),
          max_col = max(f.c)),
  by('y')]
df
cymin_colmax_col
int32int32int32int32
09089
18089
23133
36266
41315
52315
65315
74407
80407
97407
10 rows × 4 columns

Fill missing values by group mean:

df = dt.Frame({'value' : [1, None, None, 2, 3, 1, 3, None, 3],
               'name' : ['A','A', 'B','B','B','B', 'C','C','C']})

df
valuename
float64str32
01A
1NAA
2NAB
32B
43B
51B
63C
7NAC
83C
9 rows × 2 columns
# This uses a combination of update and ifelse methods:
df[:,
   update(value = ifelse(f.value == None,
                         mean(f.value),
                         f.value)),
   by('name')]

df
valuename
float64str32
01A
11A
22B
32B
43B
51B
63C
73C
83C
9 rows × 2 columns

Transform and Aggregate on multiple columns¶

Task: Get the sum of the aggregate of column a and b, grouped by c and d and append to dataframe:

df = dt.Frame({'a' : [1,2,3,4,5,6],
               'b' : [1,2,3,4,5,6],
               'c' : ['q', 'q', 'q', 'q', 'w', 'w'],
               'd' : ['z','z','z','o','o','o']})
df
abcd
int32int32str32str32
011qz
122qz
233qz
344qo
455wo
566wo
6 rows × 4 columns
df[:,
   update(e = sum(f.a) + sum(f.b)),
   by('c', 'd')
   ]

df
abcde
int32int32str32str32int64
011qz12
122qz12
233qz12
344qo8
455wo22
566wo22
6 rows × 5 columns

Replicate R’s groupby mutate ¶

Task : Get ratio by dividing column c by the product of column c and d, grouped by a and b:

df = dt.Frame(dict(a = (1,1,0,1,0),
                   b = (1,0,0,1,0),
                   c = (10,5,1,5,10),
                   d = (3,1,2,1,2))
              )
df
abcd
int8int8int32int32
011103
11051
20012
31151
400102
5 rows × 4 columns
df[:,
   update(ratio = f.c / sum(f.c * f.d)),
   by('a', 'b')
   ]
df
abcdratio
int8int8int32int32float64
0111030.285714
110511
200120.0454545
311510.142857
4001020.454545
5 rows × 5 columns

Groupby on boolean expressions¶

Conditional sum with groupby¶

Task: sum data1 column, grouped by key1 and rows where key2 == "one":

df = dt.Frame("""data1        data2     key1  key2
                 0.361601    0.375297     a    one
                 0.069889    0.809772     a    two
                 1.468194    0.272929     b    one
                -1.138458    0.865060     b    two
                -0.268210    1.250340     a    one""")
>>>
>>>
df[:,
   sum(f.data1),
   by(f.key2 == "one", f.key1)][f.C0 == 1, 1:]
key1data1
str32float64
0a0.093391
1b1.46819
2 rows × 2 columns

Conditional sums based on various criteria¶

df = dt.Frame(""" A_id       B       C
                    a1      "up"     100
                    a2     "down"    102
                    a3      "up"     100
                    a3      "up"     250
                    a4     "left"    100
                    a5     "right"   102""")

df[:,
   {"sum_up": sum(f.B == "up"),
    "sum_down" : sum(f.B == "down"),
    "over_200_up" : sum((f.B == "up") & (f.C > 200))
    },
   by('A_id')]
A_idsum_upsum_downover_200_up
str32int64int64int64
0a1100
1a2010
2a3201
3a4000
4a5000
5 rows × 4 columns

More Examples¶

Aggregation on values in a column¶

Task: group by Day and find minimum Data_Value for elements of type TMIN and maximum Data_Value for elements of type TMAX:

df = dt.Frame("""  Day    Element  Data_Value
                  01-01   TMAX    112
                  01-01   TMAX    101
                  01-01   TMIN    60
                  01-01   TMIN    0
                  01-01   TMIN    25
                  01-01   TMAX    113
                  01-01   TMAX    115
                  01-01   TMAX    105
                  01-01   TMAX    111
                  01-01   TMIN    44
                  01-01   TMIN    83
                  01-02   TMAX    70
                  01-02   TMAX    79
                  01-02   TMIN    0
                  01-02   TMIN    60
                  01-02   TMAX    73
                  01-02   TMIN    31
                  01-02   TMIN    26
                  01-02   TMAX    71
                  01-02   TMIN    26""")
df[:,
   {"TMAX": max(ifelse(f.Element=="TMAX", f.Data_Value, None)),
    "TMIN": min(ifelse(f.Element=="TMIN", f.Data_Value, None))},
   by(f.Day)]
DayTMAXTMIN
str32int32int32
001-011150
101-02790
2 rows × 3 columns

Group-by and conditional sum and add back to data frame¶

Task: sum the Count value for each ID, when Num is (17 or 12) and Letter is 'D' and then add the calculation back to the original data frame as column 'Total':

df = dt.Frame(""" ID  Num  Letter  Count
                  1   17   D       1
                  1   12   D       2
                  1   13   D       3
                  2   17   D       4
                  2   12   A       5
                  2   16   D       1
                  3   16   D       1""")
expression = ((f.Num==17) | (f.Num==12)) & (f.Letter == "D")
df[:, update(Total = sum(expression * f.Count)),
      by(f.ID)]
df
IDNumLetterCountTotal
int32int32str32int32int64
0117D13
1112D23
2113D33
3217D44
4212A54
5216D14
6316D10
7 rows × 5 columns

Indexing with multiple min and max in one aggregate¶

Task : find col1 where col2 is max, col2 where col3 is min and col1 where col3 is max:

df = dt.Frame({
               "id" : [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
               "col1" : [1, 3, 5, 2, 5, 3, 6, 3, 67, 7],
               "col2" : [4, 6, 8, 3, 65, 3, 5, 4, 4, 7],
               "col3" : [34, 64, 53, 5, 6, 2, 4, 6, 4, 67],
               })

df
idcol1col2col3
int32int32int32int32
011434
113664
215853
32235
425656
52332
62654
73346
836744
937767
10 rows × 4 columns
df[:,
   {'col1' : max(ifelse(f.col2 == max(f.col2), f.col1, None)),
    'col2' : max(ifelse(f.col3 == min(f.col3), f.col2, None)),
    'col3' : max(ifelse(f.col3 == max(f.col3), f.col1, None))
    },
   by('id')]
idcol1col2col3
int32int32int32int32
01543
12535
23747
3 rows × 4 columns

Filter rows based on aggregate value¶

Task: for every word find the tag that has the most count:

df = dt.Frame("""word  tag count
                  a     S    30
                  the   S    20
                  a     T    60
                  an    T    5
                  the   T    10""")

# The solution builds on the knowledge that sorting
# while grouping sorts within each group.
df[0, :, by('word'), sort(-f.count)]
wordtagcount
str32str32int32
0aT60
1anT5
2theS20
3 rows × 3 columns

Get the rows where the value column is minimum, and rename columns:

df = dt.Frame({"category": ["A"]*3 + ["B"]*3,
               "date": ["9/6/2016", "10/6/2016",
                        "11/6/2016", "9/7/2016",
                        "10/7/2016", "11/7/2016"],
               "value": [7,8,9,10,1,2]})

df
categorydatevalue
str32str32int32
0A9/6/20167
1A10/6/20168
2A11/6/20169
3B9/7/201610
4B10/7/20161
5B11/7/20162
6 rows × 3 columns
df[0,
   {"value_date": f.date,
    "value_min":  f.value},
  by("category"),
  sort('value')]
categoryvalue_datevalue_min
str32str32int32
0A9/6/20167
1B10/7/20161
2 rows × 3 columns

Get the rows where the value column is maximum, and rename columns:

df[0,
   {"value_date": f.date,
    "value_max":  f.value},
  by("category"),
  sort(-f.value)]
categoryvalue_datevalue_max
str32str32int32
0A11/6/20169
1B9/7/201610
2 rows × 3 columns

Get the average of the last three instances per group:

import random
random.seed(3)

df = dt.Frame({"Student": ["Bob", "Bill",
                           "Bob", "Bob",
                           "Bill","Joe",
                           "Joe", "Bill",
                           "Bob", "Joe",],
               "Score": random.sample(range(10,30), 10)})

df
StudentScore
str32int32
0Bob17
1Bill28
2Bob27
3Bob14
4Bill21
5Joe24
6Joe19
7Bill29
8Bob20
9Joe23
10 rows × 2 columns
df[-3:, mean(f[:]), f.Student]
StudentScore
str32float64
0Bill26
1Bob20.3333
2Joe22
3 rows × 2 columns

Group by on a condition¶

Get the sum of Amount for Number in range (1 to 4) and (5 and above):

df = dt.Frame("""Number, Amount
                    1,     5
                    2,     10
                    3,     11
                    4,     3
                    5,     5
                    6,     8
                    7,     9
                    8,     6""")

df[:, sum(f.Amount), by(ifelse(f.Number>=5, "B", "A"))]
C0Amount
str32int64
0A29
1B28
2 rows × 2 columns

	Fruit	Name	Number
	str32	str32	int64
0	Apples	Bob	16
1	Apples	Mike	9
2	Apples	Steve	10
3	Grapes	Bob	35
4	Grapes	Tom	87
5	Grapes	Tony	15
6	Oranges	Bob	67
7	Oranges	Mike	57
8	Oranges	Tom	15
9	Oranges	Tony	1

	Name	C0	Number
	str32	bool8	int64
0	Bob	0	102
1	Bob	1	16
2	Mike	0	57
3	Mike	1	9
4	Steve	1	10
5	Tom	0	102
6	Tony	0	16

	Fruit	Date	min	max
	str32	str32	int32	int32
0	Apples	10/6/2016	7	9
1	Apples	10/7/2016	1	10
2	Grapes	10/7/2016	1	87
3	Oranges	10/6/2016	15	65
4	Oranges	10/7/2016	1	2

	a	b	c	d	ratio
	int8	int8	int32	int32	float64
0	1	1	10	3	0.285714
1	1	0	5	1	1
2	0	0	1	2	0.0454545
3	1	1	5	1	0.142857
4	0	0	10	2	0.454545

	Student	Score
	str32	int32
0	Bob	17
1	Bill	28
2	Bob	27
3	Bob	14
4	Bill	21
5	Joe	24
6	Joe	19
7	Bill	29
8	Bob	20
9	Joe	23

	Number
	int64
0	35
1	137
2	140

	key1	data1
	str32	float64
0	a	0.093391
1	b	1.46819

	category	value_date	value_min
	str32	str32	int32
0	A	9/6/2016	7
1	B	10/7/2016	1

	category	value_date	value_max
	str32	str32	int32
0	A	11/6/2016	9
1	B	9/7/2016	10

Grouping with by()¶

Aggregation¶

Filtration¶

Transformation¶

Transform and Aggregate on multiple columns¶

Replicate R’s groupby mutate¶

Groupby on boolean expressions¶

Conditional sum with groupby¶

Conditional sums based on various criteria¶

More Examples¶

Aggregation on values in a column¶

Group-by and conditional sum and add back to data frame¶

Indexing with multiple min and max in one aggregate¶

Filter rows based on aggregate value¶

Group by on a condition¶

Grouping with `by()`¶

Replicate R’s groupby mutate ¶