datatable.by()

Group-by clause for use in Frame’s square-bracket selector.

Whenever a by() object is present inside a DT[i, j, ...] expression, it makes all other expressions to be evaluated in group-by mode. This mode causes the following changes to the evaluation semantics:

  • A “Groupby” object will be computed for the frame DT, grouping it by columns specified as the arguments to the by() call. This object keeps track of which rows of the frame belong to which group.

  • If an i expression is present (row filter), it will be interpreted within each group. For example, if i is a slice, then the slice will be applied separately to each group. Similarly, if i expression contains a formula with reduce functions, then those functions will be evaluated for each group. For example:

    DT[f.A == max(f.A), :, by(f.group_id)]

    will select those rows where column A reaches its peak value within each group (there could be multiple such rows within each group).

  • Before j is evaluated, the by() clause adds all its columns at the start of j (unless add_columns argument is False). If j is a “select-all” slice (i.e. :), then those columns will also be excluded from the list of all columns so that they will be present in the output only once.

  • During evaluation of j, the reducer functions, such as min(), sum(), etc, will be evaluated by-group, that is they will find the minimal value in each group, the sum of values in each group, and so on. If a reducer expression is combined with a regular column expression, then the reduced column will be auto-expanded into a column that is constant within each group.

  • Note that if both i and j contain reducer functions, then those functions will have a slightly different notion of groups: the reducers in i will see each group “in full”, whereas the reducers in j will see each group after it was filtered by the expression in i (and possibly not even see some of the groups at all, if they were filtered out completely).

  • If j contains only reducer expressions, then the final result will be a Frame containing just a single row for each group. This resulting frame will also be keyed by the grouped-by columns.

The by() function expects a single column or a sequence of columns as the argument(s). It accepts either a column name, or an f-expression. In particular, you can perform a group-by on a dynamically computed expression:

DT[:, :, by(dt.math.floor(f.A/100))]

The default behavior of groupby is to sort the groups in the ascending order, with NA values appearing before any other values. As a special case, if you group by an expression -f.A, then it will be treated as if you requested to group by the column “A” sorting it in the descending order. This will work even with column types that are not arithmetic, for example “A” could be a string column here.

Examples

from datatable import dt, f, by df = dt.Frame({"group1": ["A", "A", "B", "B", "A"], "group2": [1, 0, 1, 1, 1], "var1": [343, 345, 567, 345, 212]}) df
group1group2var1
str32int8int32
0A1343
1A0345
2B1567
3B1345
4A1212

Group by a single column:

df[:, dt.count(), by("group1")]
group1count
str32int64
0A3
1B2

Group by multiple columns:

df[:, dt.sum(f.var1), by("group1", "group2")]
group1group2var1
str32int8int64
0A0345
1A1555
2B1912

Return grouping result without the grouping column(s) by setting the add_columns parameter to False:

df[:, dt.sum(f.var1), by("group1", "group2", add_columns=False)]
var1
int64
0345
1555
2912

f-expressions can be passed to by():

df[:, dt.count(), by(f.var1 < 400)]
C0count
bool8int64
001
114

By default, the groups are sorted in ascending order. The inverse is possible by negating the f-expressions in by():

df[:, dt.count(), by(-f.group1)]
group1count
str32int64
0B2
1A3

An integer can be passed to the i section:

df[0, :, by("group1")]
group1group2var1
str32int8int32
0A1343
1B1567

A slice is also acceptable within the i section:

df[-1:, :, by("group1")]
group1group2var1
str32int8int32
0A1212
1B1345

Note

f-expressions is not implemented yet for the i section in a groupby. Also, a sequence cannot be passed to the i section in the presence of by().

See Also