The main method for accessing data and computing on the frame. Sometimes
we also refer to it as the
DT[i, j, ...] call.
Since Python does not support keyword arguments inside square brackets,
all arguments are positional. The first is the row selector
second is the column selector
j, and the rest are optional. Thus,
DT[i, j] selects rows
i and columns
j from frame
If an additional
by argument is present, then the selectors
j work within groups generated by the
by() expression. The
argument reorders the rows of the frame, and the
join argument allows
performing SQL joins between several frames.
The signature listed here is the most generic. But there are also
DT[i, j] described below.
The row selector.
If this is an integer or a slice, then the behavior is the same as in
Python when working on a list with
elements. In particular, the integer value must be within the range
[-nrows; nrows). On the other hand when
i is a slice, then either
its start or end or both may be safely outside the row-range of the
frame. The trivial slice
: always selects all rows.
i may also be a single-column boolean Frame. It must have the
same number of rows as the current frame, and it serves as a mask for
which rows are to be selected:
True indicates that the row should
be included in the result, while
None skips the row.
i may also be a single-column integer Frame. Such column specifies
directly which row indices are to be selected. This is more flexible
compared to a boolean column: the rows may be repeated, reordered,
omitted, etc. All values in the column
i must be in the range
[0; nrows) or an error will be thrown. In particular, negative
indices are not allowed. Also, if the column contains NA values, then
it would produce an “invalid row”, i.e. a row filled with NAs.
i may also be an expression, which must evaluate into a single
column, either boolean or integer. In this case the result is the
same as described above for a single-column frame.
i can be a list of any of the above (integers, slices, frames,
expressions, etc), in which case each element of the list is evaluated
separately and then all selected rows are put together. The list may
Nones, which will be simply skipped.
This argument may either select columns, or perform computations with the columns.
Select a single column by name. A
dt.exceptions.KeyErroris raised if the column with such a name does not exist.
This is a trivial slice, and it means “select everything”, and is roughly equivalent to SQL’s
*. In the simple case of
DT[i, j]call “selecting everything” means all columns from frame
DT. However, when the
by()clause is added, then
:will now select all columns except those used in the groupby. And if the expression has a
join(), then “selecting everything” will produce all columns from all frames, excluding those that were duplicate during a natural join.
An integer slice can be used to select a subset of columns. The behavior of a slice is exactly the same as in base Python.
A string slice is an expression like
"colA":"colZ". In this case all columns from
"colZ"inclusive are selected. And if
"colA” in the frame, then the returned columns will be in the reverse order.
Both endpoints of the slice must be valid columns (or omitted), or otherwise a
dt.exceptions.KeyErrorwill be raised.
Select only columns of the matching type.
An expression formula is computed within the current evaluation context (i.e. it takes into account the current frame, the filter
i, the presence of groupby/join parameters, etc). The result of this evaluation is used as-if that colum existed in the frame.
jis a list of boolean values, then it must have the length of
.ncols, and it describes which columns are to be selected into the result.
jcan also be a list of elements of any other type listed above, with the only restriction that the items must be homogeneous. For example, you can mix
slice[int]s, but not
Each item in the list will be evaluated separately (as if each was the sole element in
j), and then all the results will be put together.
A dictionary can be used to select columns/expressions similarly to a list, but assigning them explicit names.
As a special case, the
jargument may be the
update()function, which turns the selection operation into an update. That is, instead of returning the chosen rows/columns, they will be updated instead with the user-supplied values.
by() clause is present in the square brackets, the rest of the
computations are carried out within the “context of a groupby”. This
should generally be equivalent to (a) splitting the frame into separate
sub-frames corresponding to each group, (b) applying
separately within each group, (c) row-binding the results for each
group. In practice the following operations are affected:
all reduction operators such as
dt.sum()now work separately within each group. Thus, instead of computing sum over the entire column, it is computed separately within each group in
by(), and the resulting column will have as many rows as the number of groups.
iexpressions are re-interpreted as being applied within each group. For example, if
iis an integer or a slice, then it will now be selecting row(s) within each group.
certain functions (such as
dt.shift()) are also “group-aware”, and produce results that take into account the groupby context. Check documentation for each individual function to find out whether it has special treatment for groupby contexts.
by() also affects the order of columns in the output
frame. Specifically, all columns listed as the groupby keys will be
automatically placed at the front of the resulting frame, and also
This argument can be used to rearrange rows in the resulting frame.
sort() for details.
Performs a JOIN operation with another frame. The
join() clause will calculate how the rows
of the current frame match against the rows of the joined frame, and
allow you to refer to the columns of the joined frame within
by. In order to access columns of the joined frame use
This parameter may be listed multiple times if you need to join with
several frames. However, for the moment,
g. can only access
columns of the first joined frame.
The order of evaluation of expressions is that first the
are computed, creating a mapping between the rows of the current frame and
the joined frame(s). After that we evaluate
sort. Next, the
filter is applied creating the final index of rows that will be selected.
Lastly, we evaluate the
j part, taking into account the current groupby
and row index(es).
j, it is essentially converted into a tree (DAG) of
expressions, where each expression is evaluated from the bottom up. That
is, we start evaluating from the leaf nodes (which are usually column
selectors such as
f), and then at each convert the set of columns
into a new set. Importantly, each subexpression node may produce columns
of 3 types: “scalar”, “grouped”, and “full-size”. Whenever subexpressions
of different levels are mixed together, they are upgraded to the highest
level. Thus, a scalar may be reused for each group, and a grouped column
can interoperate with a regular column by auto-expanding in such a way
that it becomes constant within each group.
If, after the
j is fully evaluated, it produces a column set of type
“grouped”, then the resulting frame will have as many rows as there are
groups. If, on the other hand, the column set is “full-size”, then the
resulting frame will have as many rows as the original frame.
Extract a single column
j from the frame.
The single-argument version of
DT[i, j] works only for
either an integer (indicating column index) or a string (column name).
If you need any other way of addressing column(s) of the frame, use the
DT[:, j] form.
The index or name of a column to retrieve.
Single-column frame containing the column at the specified index or with the given name.