datatable.by()

Group-by clause for use in Frame’s square-bracket selector.

Whenever a by() object is present inside a DT[i, j, ...] expression, it makes all other expressions to be evaluated in group-by mode. This mode causes the following changes to the evaluation semantics:

  • A “Groupby” object will be computed for the frame DT, grouping it by columns specified as the arguments to the by() call. This object keeps track of which rows of the frame belong to which group.

  • If an i expression is present (row filter), it will be interpreted within each group. For example, if i is a slice, then the slice will be applied separately to each group. Similarly, if i expression contains a formula with reduce functions, then those functions will be evaluated for each group. For example:

    DT[f.A == max(f.A), :, by(f.group_id)]
    

    will select those rows where column A reaches its peak value within each group (there could be multiple such rows within each group).

  • Before j is evaluated, the by() clause adds all its columns at the start of j (unless add_columns argument is False). If j is a “select-all” slice (i.e. :), then those columns will also be excluded from the list of all columns so that they will be present in the output only once.

  • During evaluation of j, the reducer functions, such as min(), sum(), etc, will be evaluated by-group, that is they will find the minimal value in each group, the sum of values in each group, and so on. If a reducer expression is combined with a regular column expression, then the reduced column will be auto-expanded into a column that is constant within each group.

  • Note that if both i and j contain reducer functions, then those functions will have slightly different notion of groups: the reducers in i will see each group “in full”, whereas the reducers in j will see each group after it was filtered by the expression in i (and possibly not even see some of the groups at all, if they were filtered out completely).

  • If j contains only reducer expressions, then the final result will be a Frame containing containing just a single row for each group. This resulting frame will also be keyed by the grouped-by columns.

The by() function expects a single column or a sequence of columns as the argument(s). It accepts either a column name, or an f-expression. In particular, you can perform a group-by on a dynamically computed expression:

DT[:, :, by(dt.math.floor(f.A/100))]

The default behavior of groupby is to sort the groups in the ascending order, with NA values appearing before any other values. As a special case, if you group by an expression -f.A, then it will be treated as if you requested to group by the column “A” sorting it in the descending order. This will work even with column types that are not arithmetic, for example “A” could be a string column here.