Grouping with by
¶
The by()
modifier splits a dataframe into groups, either via the provided column(s) or f-expressions, and then applies i
and j
within each group. This split-apply-combine strategy allows for a number of operations:
Aggregations per group,
Transformation of a column or columns, where the shape of the dataframe is maintained,
Filtration, where some data are kept and the others discarded, based on a condition or conditions.
Aggregation¶
The aggregate function is applied in the j
section.
Group by one column
from datatable import (dt, f, by, ifelse, update, sort,
count, min, max, mean, sum, rowsum)
df = dt.Frame("""Fruit Date Name Number
Apples 10/6/2016 Bob 7
Apples 10/6/2016 Bob 8
Apples 10/6/2016 Mike 9
Apples 10/7/2016 Steve 10
Apples 10/7/2016 Bob 1
Oranges 10/7/2016 Bob 2
Oranges 10/6/2016 Tom 15
Oranges 10/6/2016 Mike 57
Oranges 10/6/2016 Bob 65
Oranges 10/7/2016 Tony 1
Grapes 10/7/2016 Bob 1
Grapes 10/7/2016 Tom 87
Grapes 10/7/2016 Bob 22
Grapes 10/7/2016 Bob 12
Grapes 10/7/2016 Tony 15""")
df[:, sum(f.Number), by('Fruit')]
Fruit Number
0 Apples 35
1 Grapes 137
2 Oranges 140
Group by multiple columns
df[:, sum(f.Number), by('Fruit', 'Name')]
Fruit Name Number
0 Apples Bob 16
1 Apples Mike 9
2 Apples Steve 10
3 Grapes Bob 35
4 Grapes Tom 87
5 Grapes Tony 15
6 Oranges Bob 67
7 Oranges Mike 57
8 Oranges Tom 15
9 Oranges Tony 1
By column position
df[:, sum(f.Number), by(f[0])]
Fruit Number
0 Apples 35
1 Grapes 137
2 Oranges 140
By boolean expression
df[:, sum(f.Number), by(f.Fruit == "Apples")]
C0 Number
0 0 277
1 35
Combination of column and boolean expression
df[:, sum(f.Number), by(f.Name, f.Fruit == "Apples")]
Name C0 Number
0 Bob 0 102
1 Bob 1 16
2 Mike 0 57
3 Mike 1 9
4 Steve 1 10
5 Tom 0 102
6 Tony 0 16
The grouping column can be excluded from the final output
df[:, sum(f.Number), by('Fruit', add_columns=False)]
Number
0 35
1 137
2 140
- Note:
The resulting dataframe has the grouping column(s) as the first column(s).
The grouping columns are excluded from
j
, unless explicitly included.The grouping columns are sorted in ascending order.
Apply multiple aggregate functions to a column in the
j
section
df[:, {"min": min(f.Number),
"max": max(f.Number)},
by('Fruit','Date')]
Fruit Date min max
0 Apples 10/6/2016 7 9
1 Apples 10/7/2016 1 10
2 Grapes 10/7/2016 1 87
3 Oranges 10/6/2016 15 65
4 Oranges 10/7/2016 1 2
Functions can be applied across a columnset
Task : Get sum of
col3
andcol4
, grouped bycol1
andcol2
df = dt.Frame(""" col1 col2 col3 col4 col5
a c 1 2 f
a c 1 2 f
a d 1 2 f
b d 1 2 g
b e 1 2 g
b e 1 2 g""")
df[:, sum(f["col3":"col4"]), by('col1', 'col2')]
col1 col2 col3 col4
0 a c 2 4
1 a d 1 2
2 b d 1 2
3 b e 2 4
Apply different aggregate functions to different columns
df[:, [max(f.col3), min(f.col4)], by('col1', 'col2')]
col1 col2 col3 col4
0 a c 1 2
1 a d 1 2
2 b d 1 2
3 b e 1 2
Nested aggregations in
j
Task : Group by column
idx
and get the row sum ofA
andB
,C
andD
df = dt.Frame(""" idx A B C D cat
J 1 2 3 1 x
K 4 5 6 2 x
L 7 8 9 3 y
M 1 2 3 4 y
N 4 5 6 5 z
O 7 8 9 6 z""")
df[:,
{"AB" : sum(rowsum(f['A':'B'])),
"CD" : sum(rowsum(f['C':'D']))},
by('cat')
]
cat AB CD
0 x 12 12
1 y 18 19
2 z 24 26
Computation between aggregated columns
Task : Get the difference between the largest and smallest value within each group
df = dt.Frame("""GROUP VALUE
1 5
2 2
1 10
2 20
1 7""")
df[:, max(f.VALUE) - min(f.VALUE), by('GROUP')]
GROUP C0
0 1 5
1 2 18
Null values are not excluded from the grouping column
df = dt.Frame(""" a b c
1 2.0 3
1 NaN 4
2 1.0 3
1 2.0 2""")
df[:, sum(f[:]), by('b')]
b a c
0 NA 1 4
1 1 2 3
2 2 2 5
If you wish to ignore null values, first filter them out
df[f.b != None, :][:, sum(f[:]), by('b')]
b a c
0 1 2 3
1 2 2 5
Filtration¶
This occurs in the i
section of the groupby, where only a subset of the data per group is needed; selection is limited to integers or slicing.
- Note:
i
is applied after the grouping, not before.f-expressions in the
i
section is not yet implemented for groupby.
Select the first row per group
df = dt.Frame("""A B
1 10
1 20
2 30
2 40
3 10""")
# passing 0 as index gets the first row after the grouping
# note that python's index starts from 0, not 1
df[0, :, by('A')]
A B
0 1 10
1 2 30
2 3 10
Select the last row per group
df[-1, :, by('A')]
A B
0 1 20
1 2 40
2 3 10
Select the nth row per group
Task : select the second row per group
df[1, :, by('A')]
A B
0 1 20
1 2 40
- Note:
Filtering this way can be used to drop duplicates; you can decide to keep the first or last non-duplicate.
Select the latest entry per group
df = dt.Frame("""id product date
220 6647 2014-09-01
220 6647 2014-09-03
220 6647 2014-10-16
826 3380 2014-11-11
826 3380 2014-12-09
826 3380 2015-05-19
901 4555 2014-09-01
901 4555 2014-10-05
901 4555 2014-11-01""")
df[-1, :, by('id'), sort('date')]
id product date
0 220 6647 2014-10-16
1 826 3380 2015-05-19
2 901 4555 2014-11-01
- Note:
-If
sort
andby
modifiers are present, the sorting occurs after the grouping, and occurs within each group.
Replicate
SQL
’sHAVING
clauseTask: Filter for groups where the length/count is greater than 1
df = dt.Frame([[1, 1, 5], [2, 3, 6]], names=['A', 'B'])
df
A B
0 1 2
1 1 3
2 5 6
# Get the count of each group,
# and assign to a new column, using the update method
# note that the update operation is in-place;
# there is no need to assign back to the dataframe
df[:, update(filter_col = count()), by('A')]
# The new column will be added to the end
# We use an f-expression to return rows
# in each group where the count is greater than 1
df[f.filter_col > 1, f[:-1]]
A B
0 1 2
1 1 3
Keep only rows per group where
diff
is the minimum
df = dt.Frame(""" item diff otherstuff
1 2 1
1 1 2
1 3 7
2 -1 0
2 1 3
2 4 9
2 -6 2
3 0 0
3 2 9""")
df[:,
#get boolean for rows where diff column is minimum for each group
update(filter_col = f.diff == min(f.diff)),
by('item')]
df[f.filter_col == 1, :-1]
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
Keep only entries where
make
has both 0 and 1 insales
df = dt.Frame(""" make country other_columns sale
honda tokyo data 1
honda hirosima data 0
toyota tokyo data 1
toyota hirosima data 0
suzuki tokyo data 0
suzuki hirosima data 0
ferrari tokyo data 1
ferrari hirosima data 0
nissan tokyo data 1
nissan hirosima data 0""")
df[:,
update(filter_col = sum(f.sale)),
by('make')]
df[f.filter_col == 1, :-1]
make country other_columns sale
0 honda tokyo data 1
1 honda hirosima data 0
2 toyota tokyo data 1
3 toyota hirosima data 0
4 ferrari tokyo data 1
5 ferrari hirosima data 0
6 nissan tokyo data 1
7 nissan hirosima data 0
Transformation¶
This is when a function is applied to a column after a groupby and the resulting column is appended back to the dataframe. The number of rows of the dataframe is unchanged. The update()
method makes this possible and easy. Let’s look at a couple of examples:
Get the minimum and maximum of column
c
per group, and append to dataframe
df = dt.Frame(""" c y
9 0
8 0
3 1
6 2
1 3
2 3
5 3
4 4
0 4
7 4""")
# Assign the new columns via the update method
df[:,
update(min_col = min(f.c),
max_col = max(f.c)),
by('y')]
df
c y min_col max_col
0 9 0 8 9
1 8 0 8 9
2 3 1 3 3
3 6 2 6 6
4 1 3 1 5
5 2 3 1 5
6 5 3 1 5
7 4 4 0 7
8 0 4 0 7
9 7 4 0 7
Fill missing values by group mean
df = dt.Frame({'value' : [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
'name' : ['A','A', 'B','B','B','B', 'C','C','C']})
df
value name
0 1 A
1 NA A
2 NA B
3 2 B
4 3 B
5 1 B
6 3 C
7 NA C
8 3 C
# This uses a combination of update and ifelse methods:
df[:,
update(value = ifelse(f.value == None,
mean(f.value),
f.value)),
by('name')]
df
value name
0 1 A
1 1 A
2 2 B
3 2 B
4 3 B
5 1 B
6 3 C
7 3 C
8 3 C
Transform and Aggregate on Multiple Columns
Task: Get the sum of the aggregate of column
a
andb
, grouped byc
andd
and append to dataframe.
df = dt.Frame({'a' : [1,2,3,4,5,6],
'b' : [1,2,3,4,5,6],
'c' : ['q', 'q', 'q', 'q', 'w', 'w'],
'd' : ['z','z','z','o','o','o']})
df
a b c d
0 1 1 q z
1 2 2 q z
2 3 3 q z
3 4 4 q o
4 5 5 w o
5 6 6 w o
df[:,
update(e = sum(f.a) + sum(f.b)),
by('c', 'd')
]
df
a b c d e
0 1 1 q z 12
1 2 2 q z 12
2 3 3 q z 12
3 4 4 q o 8
4 5 5 w o 22
5 6 6 w o 22
Replicate R’s groupby mutate
Task : Get ratio by dividing column
c
by the product of columnc
andd
, grouped bya
andb
df = dt.Frame(dict(a = (1,1,0,1,0),
b = (1,0,0,1,0),
c = (10,5,1,5,10),
d = (3,1,2,1,2))
)
df
a b c d
0 1 1 10 3
1 1 0 5 1
2 0 0 1 2
3 1 1 5 1
4 0 0 10 2
df[:,
update(ratio = f.c / sum(f.c * f.d)),
by('a', 'b')
]
df
a b c d ratio
0 1 1 10 3 0.285714
1 1 0 5 1 1
2 0 0 1 2 0.0454545
3 1 1 5 1 0.142857
4 0 0 10 2 0.454545
Groupby on Boolean Expressions¶
Conditional Sum with groupby
Task : Sum
data1
column, grouped bykey1
and rows wherekey2== "one"
df = dt.Frame("""data1 data2 key1 key2
0.361601 0.375297 a one
0.069889 0.809772 a two
1.468194 0.272929 b one
-1.138458 0.865060 b two
-0.268210 1.250340 a one""")
df[:,
sum(f.data1),
by(f.key2 == "one", f.key1)][f.C0 == 1, 1:]
key1 data1
0 a 0.093391
1 b 1.46819
Conditional Sums based on various Criteria
df = dt.Frame(""" A_id B C
a1 "up" 100
a2 "down" 102
a3 "up" 100
a3 "up" 250
a4 "left" 100
a5 "right" 102""")
df[:,
{"sum_up": sum(f.B == "up"),
"sum_down" : sum(f.B == "down"),
"over_200_up" : sum((f.B == "up") & (f.C > 200))
},
by('A_id')]
A_id sum_up sum_down over_200_up
0 a1 1 0 0
1 a2 0 1 0
2 a3 2 0 1
3 a4 0 0 0
4 a5 0 0 0
More Examples¶
Aggregation on Values in a Column
Task : group by
Day
and find minimumData_Value
forTMIN
and maximumData_Value
forTMAX
df = dt.Frame(""" Day Element Data_Value
01-01 TMAX 112
01-01 TMAX 101
01-01 TMIN 60
01-01 TMIN 0
01-01 TMIN 25
01-01 TMAX 113
01-01 TMAX 115
01-01 TMAX 105
01-01 TMAX 111
01-01 TMIN 44
01-01 TMIN 83
01-02 TMAX 70
01-02 TMAX 79
01-02 TMIN 0
01-02 TMIN 60
01-02 TMAX 73
01-02 TMIN 31
01-02 TMIN 26
01-02 TMAX 71
01-02 TMIN 26""")
df[:,
f.Day.extend({"TMAX" : max(ifelse(f.Element=="TMAX",
f.Data_Value, None)),
"TMIN" : min(ifelse(f.Element=="TMIN",
f.Data_Value, None)}))
]
Day TMAX TMIN
0 01-01 115 0
1 01-02 79 0
Group By and Conditional Sum and add Back to Data Frame
Task: Sum the
Count
value for eachID
, whenNum
is (17 or 12) andLetter
is ‘D’ and also add the calculation back to the original data frame as ‘Total’
df = dt.Frame(""" ID Num Letter Count
1 17 D 1
1 12 D 2
1 13 D 3
2 17 D 4
2 12 A 5
2 16 D 1
3 16 D 1""")
expression = ((f.Num==17) | (f.Num==12)) & (f.Letter == "D")
df[:,
update(Total = sum(ifelse(expression, f.Count, 0))),
by('ID')]
df
ID Num Letter Count Total
0 1 17 D 1 3
1 1 12 D 2 3
2 1 13 D 3 3
3 2 17 D 4 4
4 2 12 A 5 4
5 2 16 D 1 4
6 3 16 D 1 0
Multiple indexing with multiple min and max in one aggregate
Task : find
col1
wherecol2
is max,col2
wherecol3
is min andcol1
wherecol3
is max
df = dt.Frame({
"id" : [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
"col1" : [1, 3, 5, 2, 5, 3, 6, 3, 67, 7],
"col2" : [4, 6, 8, 3, 65, 3, 5, 4, 4, 7],
"col3" : [34, 64, 53, 5, 6, 2, 4, 6, 4, 67],
})
df
id col1 col2 col3
0 1 1 4 34
1 1 3 6 64
2 1 5 8 53
3 2 2 3 5
4 2 5 65 6
5 2 3 3 2
6 2 6 5 4
7 3 3 4 6
8 3 67 4 4
9 3 7 7 67
df[:,
{'col1' : max(ifelse(f.col2 == max(f.col2),
f.col1, None)),
'col2' : max(ifelse(f.col3 == min(f.col3),
f.col2, None)),
'col3' : max(ifelse(f.col3 == max(f.col3),
f.col1, None))
},
by('id')]
id col1 col2 col3
0 1 5 4 3
1 2 5 3 5
2 3 7 4 7
Filter row based on aggregate value
Task : Find, for every
word
, thetag
that has the mostcount
df = dt.Frame("""word tag count
a S 30
the S 20
a T 60
an T 5
the T 10""")
# The solution builds on the knowledge that sorting
# while grouping sorts within each group.
df[0, :, by('word'), sort(-f.count)]
word tag count
0 a T 60
1 an T 5
2 the S 20
Get the rows where the
value
column is minimum, and rename columns
df = dt.Frame({"category": ["A"]*3 + ["B"]*3,
"date": ["9/6/2016", "10/6/2016",
"11/6/2016", "9/7/2016",
"10/7/2016", "11/7/2016"],
"value": [7,8,9,10,1,2]})
df
category date value
0 A 9/6/2016 7
1 A 10/6/2016 8
2 A 11/6/2016 9
3 B 9/7/2016 10
4 B 10/7/2016 1
5 B 11/7/2016 2
df[0,
{"value_date": f.date,
"value_min": f.value},
by("category"),
sort('value')]
category value_date value_min
0 A 9/6/2016 7
1 B 10/7/2016 1
Using the same data in the last example, get the rows where the
value
column is maximum, and rename columns
df[0,
{"value_date": f.date,
"value_max": f.value},
by("category"),
sort(-f.value)]
category value_date value_max
0 A 11/6/2016 9
1 B 9/7/2016 10
Get the average of the last three instances per group
import random
random.seed(3)
df = dt.Frame({"Student": ["Bob", "Bill",
"Bob", "Bob",
"Bill","Joe",
"Joe", "Bill",
"Bob", "Joe",],
"Score": random.sample(range(10,30), 10)})
df
Student Score
0 Bob 17
1 Bill 28
2 Bob 27
3 Bob 14
4 Bill 21
5 Joe 24
6 Joe 19
7 Bill 29
8 Bob 20
9 Joe 23
df[-3:, mean(f[:]), f.Student]
Student Score
0 Bill 26
1 Bob 20.3333
2 Joe 22
Group by on a condition
Get the sum of
Amount
forNumber
in range (1 to 4) and (5 and above)
df = dt.Frame("""Number, Amount
1, 5
2, 10
3, 11
4, 3
5, 5
6, 8
7, 9
8, 6""")
df[:, sum(f.Amount), by(ifelse(f.Number>=5, "B","A"))]
C0 Amount
0 A 29
1 B 28