Selecting Data

Selecting Data – Columns

Column selection is via the j section in the DT[i, j, ...] syntax. First, let’s construct a simple Frame:

from datatable import dt, f from datetime import date source = {"dates" : [date(2000, 1, 5), date(2010, 11, 23), date(2020, 2, 29), None], "integers" : range(1, 5), "floats" : [10.0, 11.5, 12.3, -13], "strings" : ['A', 'B', None, 'D'] } DT = dt.Frame(source) DT
datesintegersfloatsstrings
date32int32float64str32
02000-01-05110A
12010-11-23211.5B
22020-02-29312.3NA
3NA4-13D

Column selection is possible via a number of options:

By column name

DT[:, 'dates']
dates
date32
02000-01-05
12010-11-23
22020-02-29
3NA

When selecting all rows, the i section can also be ....

By position

DT[..., 2] # 3rd column
floats
float64
010
111.5
212.3
3-13

With position, you can select with a negative number – the column will be selected from the end; this is similar to indexing a python list:

DT[:, -2] # 2nd column from the end
floats
float64
010
111.5
212.3
3-13

For a single column, it is possible to skip the : in the i section and pass the column name or position only

DT['dates']
dates
date32
02000-01-05
12010-11-23
22020-02-29
3NA
DT[0]
dates
date32
02000-01-05
12010-11-23
22020-02-29
3NA

When selecting via column name or position, an error is returned if the name or position does not exist:

DT[:, 5]
ValueError: Column index 5 is invalid for a Frame with 4 columns
DT[:, 'categoricals']
KeyError: Column categoricals does not exist in the Frame

By data type

Column selection is possible by using python’s built-in types that correspond to one of the datatable’s types:

DT[:, int]
integers
int32
01
12
23
34

Or datatable’s Type:

DT[:, dt.Type.float64]
floats
float64
010
111.5
212.3
3-13
DT[:, dt.Type.date32]
dates
date32
02000-01-05
12010-11-23
22020-02-29
3NA

A list of types can be selected as well:

DT[:, [date, str]]
datesstrings
date32str32
02000-01-05A
12010-11-23B
22020-02-29NA
3NAD

By list

Using a list allows for selection of multiple columns:

DT[:, ['integers', 'strings']]
integersstrings
int32str32
01A
12B
23NA
34D

A tuple of selectors is also allowed, although not recommended from stylistic perspective:

DT[:, (-3, 2, 3)]
integersfloatsstrings
int32float64str32
0110A
1211.5B
2312.3NA
34-13D

Selection via list comprehension/generator expression is possible:

DT[:, [num for num in range(DT.ncols) if num % 2 == 0]]
datesfloats
date32float64
02000-01-0510
12010-11-2311.5
22020-02-2912.3
3NA-13

Selecting columns via a mix of column names and positions (integers) is not allowed:

DT[:, ['dates', 2]]
TypeError: Mixed selector types are not allowed. Element 1 is of type integer, whereas the previous element(s) were of type string

Via slicing

When slicing with strings, both the start and end column names are included in the returned frame:

DT[:, 'dates':'strings']
datesintegersfloatsstrings
date32int32float64str32
02000-01-05110A
12010-11-23211.5B
22020-02-29312.3NA
3NA4-13D

However, when slicing via position, the columns are returned up to, but not including the final position; this is similar to the slicing pattern for Python’s sequences:

DT[:, 1:3]
integersfloats
int32float64
0110
1211.5
2312.3
34-13
DT[:, ::-1]
stringsfloatsintegersdates
str32float64int32date32
0A1012000-01-05
1B11.522010-11-23
2NA12.332020-02-29
3D-134NA

It is possible to select columns via slicing, even if the indices are not in the Frame:

DT[:, 3:10] # there are only four columns in the Frame
strings
str32
0A
1B
2NA
3D

Unlike with integer slicing, providing a name of the column that is not in the Frame will result in an error:

DT[:, "integers" : "categoricals"]
KeyError: Column categoricals does not exist in the Frame

Slicing is also possible with the standard slice function:

DT[:, slice('integers', 'strings')]
integersfloatsstrings
int32float64str32
0110A
1211.5B
2312.3NA
34-13D

With the slice function, multiple slicing on the columns is possible:

DT[:, [slice("dates", "integers"), slice("floats", "strings")]]
datesintegersfloatsstrings
date32int32float64str32
02000-01-05110A
12010-11-23211.5B
22020-02-29312.3NA
3NA4-13D
DT[:, [slice("integers", "dates"), slice("strings", "floats")]]
integersdatesstringsfloats
int32date32str32float64
012000-01-05A10
122010-11-23B11.5
232020-02-29NA12.3
34NAD-13

Slicing on strings can be combined with column names during selection:

DT[:, [slice("integers", "dates"), "strings"]]
integersdatesstrings
int32date32str32
012000-01-05A
122010-11-23B
232020-02-29NA
34NAD

But not with integers:

DT[:, [slice("integers", "dates"), 1]]
TypeError: Mixed selector types are not allowed. Element 1 is of type integer, whereas the previous element(s) were of type string

Slicing on position can be combined with column position:

DT[:, [slice(1, 3), 0]]
integersfloatsdates
int32float64date32
01102000-01-05
1211.52010-11-23
2312.32020-02-29
34-13NA

But not with strings:

DT[:, [slice(1, 3), "dates"]]
TypeError: Mixed selector types are not allowed. Element 1 is of type string, whereas the previous element(s) were of type integer

Via booleans

When selecting via booleans, the sequence length must be equal to the number of columns in the frame:

DT[:, [True, True, False, False]]
datesintegers
date32int32
02000-01-051
12010-11-232
22020-02-293
3NA4

Booleans generated from a list comprehension/generator expression allow for nifty selections:

DT[:, ["i" in name for name in DT.names]]
integersstrings
int32str32
01A
12B
23NA
34D

In this example we want to select columns that are numeric (integers or floats) and whose average is greater than 3:

DT[:, [column.type.is_numeric and column.mean1() > 3 for column in DT]]
floats
float64
010
111.5
212.3
3-13

Via f-expressions

All the selection options above (except boolean) are also possible via f-expressions:

DT[:, f.dates]
dates
date32
02000-01-05
12010-11-23
22020-02-29
3NA
DT[:, f[-1]]
strings
str32
0A
1B
2NA
3D
DT[:, f['integers':'strings']]
integersfloatsstrings
int32float64str32
0110A
1211.5B
2312.3NA
34-13D
DT[:, f['integers':]]
integersfloatsstrings
int32float64str32
0110A
1211.5B
2312.3NA
34-13D
DT[:, f[1::-1]]
integersdates
int32date32
012000-01-05
122010-11-23
232020-02-29
34NA
DT[:, f[date, int, float]]
datesintegersfloats
date32int32float64
02000-01-05110
12010-11-23211.5
22020-02-29312.3
3NA4-13
DT[:, f["dates":"integers", "floats":"strings"]]
datesintegersfloatsstrings
date32int32float64str32
02000-01-05110A
12010-11-23211.5B
22020-02-29312.3NA
3NA4-13D

Note

If the columns names are python keywords (def, del, …), the dot notation is not possible with f-expressions; you have to use the brackets notation to access these columns.

Note

Selecting columns with DT[:, f[None]] returns an empty Frame. This is different from DT[:, None], which currently returns all the columns. The behavior of DT[:, None] may change in the future:

DT[:, None]
datesintegersfloatsstrings
date32int32float64str32
02000-01-05110A
12010-11-23211.5B
22020-02-29312.3NA
3NA4-13D
DT[:, f[None]]
0
1
2
3

Selecting Data – Rows

There are a number of ways to select rows of data via the i section.

Note

The index labels in a Frame are just for aesthetics; they serve no actual purpose during selection.

By Position

Only integer values are acceptable:

DT[0, :]
datesintegersfloatsstrings
date32int32float64str32
02000-01-05110A
DT[-1, :] # last row
datesintegersfloatsstrings
date32int32float64str32
0NA4-13D

Via Sequence of Positions

Any acceptable sequence of positions is applicable here. Listed below are some of these sequences.

  • List (tuple):

    DT[[1, 2, 3], :]
    datesintegersfloatsstrings
    date32int32float64str32
    02010-11-23211.5B
    12020-02-29312.3NA
    2NA4-13D
  • An integer numpy 1-D Array:

    DT[np.arange(3), :]
    datesintegersfloatsstrings
    date32int32float64str32
    02000-01-05110A
    12010-11-23211.5B
    22020-02-29312.3NA
  • A one column integer Frame:

    DT[dt.Frame([1, 2, 3]), :]
    datesintegersfloatsstrings
    date32int32float64str32
    02010-11-23211.5B
    12020-02-29312.3NA
    2NA4-13D
  • An integer pandas Series:

    DT[pd.Series([1, 2, 3]), :]
    datesintegersfloatsstrings
    date32int32float64str32
    02010-11-23211.5B
    12020-02-29312.3NA
    2NA4-13D
  • A python range:

    DT[range(1, 3), :]
    datesintegersfloatsstrings
    date32int32float64str32
    02010-11-23211.5B
    12020-02-29312.3NA
  • A generator expression:

    DT[(num for num in range(4)), :]
    datesintegersfloatsstrings
    date32int32float64str32
    02000-01-05110A
    12010-11-23211.5B
    22020-02-29312.3NA
    3NA4-13D

If the position passed to i does not exist, an error is raised

DT[(num for num in range(7)), :]
ValueError: Index 4 is invalid for a Frame with 4 rows

The set sequence is not acceptable in the i or j sections.

Except for lists/tuples, all the other sequence types passed into the i section can only contain positive integers.

Via booleans

When selecting rows via boolean sequence, the length of the sequence must be the same as the number of rows:

DT[[True, True, False, False], :]
datesintegersfloatsstrings
date32int32float64str32
02000-01-05110A
12010-11-23211.5B
DT[(n%2 == 0 for n in range(DT.nrows)), :]
datesintegersfloatsstrings
date32int32float64str32
02000-01-05110A
12020-02-29312.3NA

Via slicing

Slicing works similarly to slicing a python list:

DT[1:3, :]
datesintegersfloatsstrings
date32int32float64str32
02010-11-23211.5B
12020-02-29312.3NA
DT[::-1, :]
datesintegersfloatsstrings
date32int32float64str32
0NA4-13D
12020-02-29312.3NA
22010-11-23211.5B
32000-01-05110A
DT[-1:-3:-1, :]
datesintegersfloatsstrings
date32int32float64str32
0NA4-13D
12020-02-29312.3NA

Slicing is also possible with the slice function:

DT[slice(1, 3), :]
datesintegersfloatsstrings
date32int32float64str32
02010-11-23211.5B
12020-02-29312.3NA

It is possible to select rows with multiple slices. Let’s increase the number of rows in the Frame:

DT = dt.repeat(DT, 3) DT
datesintegersfloatsstrings
date32int32float64str32
02000-01-05110A
12010-11-23211.5B
22020-02-29312.3NA
3NA4-13D
42000-01-05110A
52010-11-23211.5B
62020-02-29312.3NA
7NA4-13D
82000-01-05110A
92010-11-23211.5B
102020-02-29312.3NA
11NA4-13D
DT[[slice(1, 3), slice(5, 8)], :]
datesintegersfloatsstrings
date32int32float64str32
02010-11-23211.5B
12020-02-29312.3NA
22010-11-23211.5B
32020-02-29312.3NA
4NA4-13D
DT[[slice(5, 8), 1, 3, slice(10, 12)], :]
datesintegersfloatsstrings
date32int32float64str32
02010-11-23211.5B
12020-02-29312.3NA
2NA4-13D
32010-11-23211.5B
4NA4-13D
52020-02-29312.3NA
6NA4-13D

Via f-expressions

f-expressions return booleans that can be used to filter/select the appropriate rows:

DT[f.dates < dt.Frame([date(2020,1,1)]), :]
datesintegersfloatsstrings
date32int32float64str32
02000-01-05110A
12010-11-23211.5B
DT[f.integers % 2 != 0, :]
datesintegersfloatsstrings
date32int32float64str32
02000-01-05110A
12020-02-29312.3NA
DT[(f.integers == 3) & (f.strings == None), ...]
datesintegersfloatsstrings
date32int32float64str32
02020-02-29312.3NA
12020-02-29312.3NA
22020-02-29312.3NA

Selection is possible via the data types:

DT[f[float] < 1, :]
datesintegersfloatsstrings
date32int32float64str32
0NA4-13D
1NA4-13D
2NA4-13D
DT[dt.rowsum(f[int, float]) > 12, :]
datesintegersfloatsstrings
date32int32float64str32
02010-11-23211.5B
12020-02-29312.3NA
22010-11-23211.5B
32020-02-29312.3NA
42010-11-23211.5B
52020-02-29312.3NA

Select rows and columns

Specific selections can occur in rows and columns simultaneously:

DT[0, slice(1, 3)]
integersfloats
int32float64
0110
DT[2 : 6, ["i" in name for name in DT.names]]
integersstrings
int32str32
03NA
14D
21A
32B
DT[f.integers > dt.mean(f.floats) - 3, f['strings' : 'integers']]
stringsfloatsintegers
str32float64int32
0NA12.33
1D-134
2NA12.33
3D-134
4NA12.33
5D-134

Single value access

Passing single integers into the i and j sections returns a scalar value:

DT[0, 0]
datetime.date(2000, 1, 5)
DT[0, 2]
10.0
DT[-3, 'strings']
'B'

Deselect rows/columns

Deselection of rows/columns is possible via list comprehension/generator expression

  • Deselect a single column/row:

    # The list comprehension returns the specific column names DT[:, [name for name in DT.names if name != "integers"]]
    datesfloatsstrings
    date32float64str32
    02000-01-0510A
    12010-11-2311.5B
    22020-02-2912.3NA
    3NA-13D
    42000-01-0510A
    52010-11-2311.5B
    62020-02-2912.3NA
    7NA-13D
    82000-01-0510A
    92010-11-2311.5B
    102020-02-2912.3NA
    11NA-13D
    # A boolean sequence is returned in the list comprehension DT[[num != 5 for num in range(DT.nrows)], 'dates']
    dates
    date32
    02000-01-05
    12010-11-23
    22020-02-29
    3NA
    42000-01-05
    52020-02-29
    6NA
    72000-01-05
    82010-11-23
    92020-02-29
    10NA
  • Deselect multiple columns/rows:

    DT[:, [name not in ("integers", "dates") for name in DT.names]]
    floatsstrings
    float64str32
    010A
    111.5B
    212.3NA
    3-13D
    410A
    511.5B
    612.3NA
    7-13D
    810A
    911.5B
    1012.3NA
    11-13D
    DT[(num not in range(3, 8) for num in range(DT.nrows)), ['integers', 'floats']]
    integersfloats
    int32float64
    0110
    1211.5
    2312.3
    3110
    4211.5
    5312.3
    64-13
    DT[:, [num not in (2, 3) for num in range(DT.ncols)]]
    datesintegers
    date32int32
    02000-01-051
    12010-11-232
    22020-02-293
    3NA4
    42000-01-051
    52010-11-232
    62020-02-293
    7NA4
    82000-01-051
    92010-11-232
    102020-02-293
    11NA4
    # an alternative to the previous example DT[:, [num not in (2, 3) for num, _ in enumerate(DT.names)]]
    datesintegers
    date32int32
    02000-01-051
    12010-11-232
    22020-02-293
    3NA4
    42000-01-051
    52010-11-232
    62020-02-293
    7NA4
    82000-01-051
    92010-11-232
    102020-02-293
    11NA4
  • Deselect by data type:

    # This selects columns that are not numeric DT[2:7, [not coltype.is_numeric for coltype in DT.types]]
    datesstrings
    date32str32
    02020-02-29NA
    1NAD
    22000-01-05A
    32010-11-23B
    42020-02-29NA

Slicing could be used to exclude rows/columns. The code below excludes rows from position 3 to 6:

DT[[slice(None, 3), slice(7, None)], :]
datesintegersfloatsstrings
date32int32float64str32
02000-01-05110A
12010-11-23211.5B
22020-02-29312.3NA
3NA4-13D
42000-01-05110A
52010-11-23211.5B
62020-02-29312.3NA
7NA4-13D

Columns can also be deselected via the remove() method, where the column name, column position, or data type is passed to the f symbol:

DT[:, f[:].remove(f.dates)]
integersfloatsstrings
int32float64str32
0110A
1211.5B
2312.3NA
34-13D
4110A
5211.5B
6312.3NA
74-13D
8110A
9211.5B
10312.3NA
114-13D
DT[:, f[:].remove(f[0])]
integersfloatsstrings
int32float64str32
0110A
1211.5B
2312.3NA
34-13D
4110A
5211.5B
6312.3NA
74-13D
8110A
9211.5B
10312.3NA
114-13D
DT[:, f[:].remove(f[1:3])]
datesstrings
date32str32
02000-01-05A
12010-11-23B
22020-02-29NA
3NAD
42000-01-05A
52010-11-23B
62020-02-29NA
7NAD
82000-01-05A
92010-11-23B
102020-02-29NA
11NAD
DT[:, f[:].remove(f['strings':'integers'])]
dates
date32
02000-01-05
12010-11-23
22020-02-29
3NA
42000-01-05
52010-11-23
62020-02-29
7NA
82000-01-05
92010-11-23
102020-02-29
11NA
DT[:, f[:].remove(f[int, float])]
datesstrings
date32str32
02000-01-05A
12010-11-23B
22020-02-29NA
3NAD
42000-01-05A
52010-11-23B
62020-02-29NA
7NAD
82000-01-05A
92010-11-23B
102020-02-29NA
11NAD
DT[:, f[:].remove(f[:])]
0
1
2
3
4
5
6
7
8
9
10
11

Delete rows/columns

To actually delete a row (or a column), use the del statement; this is an in-place operation, and as such no reassignment is needed

  • Delete multiple rows:

    del DT[3:7, :] DT
    datesintegersfloatsstrings
    date32int32float64str32
    02000-01-05110A
    12010-11-23211.5B
    22020-02-29312.3NA
    3NA4-13D
    42000-01-05110A
    52010-11-23211.5B
    62020-02-29312.3NA
    7NA4-13D
  • Delete a single row:

    del DT[3, :] DT
    datesintegersfloats
    date32int32float64
    02000-01-05110
    12010-11-23211.5
    22020-02-29NANA
    32000-01-05NANA
    42010-11-23211.5
    52020-02-29312.3
    6NA4-13
  • Delete a column:

    del DT['strings'] DT
    datesintegersfloats
    date32int32float64
    02000-01-05110
    12010-11-23211.5
    22020-02-29312.3
    3NA4-13
    42000-01-05110
    52010-11-23211.5
    62020-02-29312.3
    7NA4-13
  • Delete multiple columns:

    del DT[:, ['dates', 'floats']] DT
    integers
    int32
    01
    12
    2NA
    3NA
    42
    53
    64