Selecting Data¶

Selecting Data – Columns¶

Column selection is via the j section in the DT[i, j, ...] syntax. First, let’s construct a simple Frame:

from datatable import dt, f
from datetime import date

source = {"dates" : [date(2000, 1, 5), date(2010, 11, 23), date(2020, 2, 29), None],
          "integers" : range(1, 5),
          "floats" : [10.0, 11.5, 12.3, -13],
          "strings" : ['A', 'B', None, 'D']
          }
DT = dt.Frame(source)
DT
dates integers floats strings
date32 int32 float64 str32
0 2000-01-05 1 10 A
1 2010-11-23 2 11.5 B
2 2020-02-29 3 12.3 NA
3 NA 4 -13 D
4 rows × 4 columns

Column selection is possible via a number of options:

By column name¶

DT[:, 'dates']
dates
date32
0 2000-01-05
1 2010-11-23
2 2020-02-29
3 NA
4 rows × 1 column

When selecting all rows, the i section can also be ... (Python's Ellipsis).

By position¶

DT[..., 2]  # 3rd column
floats
float64
0 10
1 11.5
2 12.3
3 -13
4 rows × 1 column

With position, you can select with a negative number – the column will be selected from the end; this is similar to indexing a python list:

DT[:, -2]  # 2nd column from the end
floats
float64
0 10
1 11.5
2 12.3
3 -13
4 rows × 1 column

For a single column, it is possible to skip the : in the i section and pass only the column name or position:

DT['dates']
dates
date32
0 2000-01-05
1 2010-11-23
2 2020-02-29
3 NA
4 rows × 1 column

DT[0]
dates
date32
0 2000-01-05
1 2010-11-23
2 2020-02-29
3 NA
4 rows × 1 column

When selecting via column name or position, an error is raised if the name or position does not exist:

DT[:, 5]
ValueError: Column index 5 is invalid for a Frame with 4 columns
DT[:, 'categoricals']
KeyError: Column categoricals does not exist in the Frame
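
The position rules described above can be modelled in plain Python. This is only a sketch, not datatable's implementation: `resolve_position` is a hypothetical helper, the tuple `names` stands in for `DT.names`, and the bounds for negative positions are assumed to follow the Python list rules that the text refers to.

```python
def resolve_position(pos: int, ncols: int) -> int:
    """Return the non-negative position that `pos` refers to."""
    if not -ncols <= pos < ncols:
        # Wording mirrors the error message quoted above.
        raise ValueError(
            f"Column index {pos} is invalid for a Frame with {ncols} columns"
        )
    return pos + ncols if pos < 0 else pos


names = ("dates", "integers", "floats", "strings")  # stand-in for DT.names

third = names[resolve_position(2, len(names))]
second_from_end = names[resolve_position(-2, len(names))]
print(third, second_from_end)  # floats floats

try:
    resolve_position(5, len(names))
except ValueError as err:
    print(err)
```

Checking a position this way before indexing is one option for avoiding the error when the number of columns is not known in advance.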

By data type¶

Column selection is possible by using Python’s built-in types that correspond to one of datatable’s types:

DT[:, int]
integers
int32
0 1
1 2
2 3
3 4
4 rows × 1 column

Or datatable’s Type:

DT[:, dt.Type.float64]
floats
float64
0 10
1 11.5
2 12.3
3 -13
4 rows × 1 column
DT[:, dt.Type.date32]
dates
date32
0 2000-01-05
1 2010-11-23
2 2020-02-29
3 NA
4 rows × 1 column

A list of types can be selected as well:

DT[:, [date, str]]
dates strings
date32 str32
0 2000-01-05 A
1 2010-11-23 B
2 2020-02-29 NA
3 NA D
4 rows × 2 columns

By list¶

Using a list allows for selection of multiple columns:

DT[:, ['integers', 'strings']]
integers strings
int32 str32
0 1 A
1 2 B
2 3 NA
3 4 D
4 rows × 2 columns

A tuple of selectors is also allowed, although it is not recommended from a stylistic perspective:

DT[:, (-3, 2, 3)]
integers floats strings
int32 float64 str32
0 1 10 A
1 2 11.5 B
2 3 12.3 NA
3 4 -13 D
4 rows × 3 columns

Selection via list comprehension/generator expression is possible:

DT[:, [num for num in range(DT.ncols) if num % 2 == 0]]
dates floats
date32 float64
0 2000-01-05 10
1 2010-11-23 11.5
2 2020-02-29 12.3
3 NA -13
4 rows × 2 columns

Selecting columns via a mix of column names and positions (integers) is not allowed:

DT[:, ['dates', 2]]
TypeError: Mixed selector types are not allowed. Element 1 is of type integer, whereas the previous element(s) were of type string
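
One way to work around the restriction on mixed selectors is to convert everything to column names before selecting. The sketch below is plain Python; `to_names` is a hypothetical helper and the tuple `names` stands in for `DT.names`.

```python
def to_names(selectors, names):
    """Convert a mix of column names and integer positions to names only."""
    resolved = []
    for sel in selectors:
        if isinstance(sel, str):
            if sel not in names:
                raise KeyError(f"Column {sel} does not exist")
            resolved.append(sel)
        elif isinstance(sel, int) and not isinstance(sel, bool):
            resolved.append(names[sel])  # IndexError if out of range
        else:
            raise TypeError(f"Unsupported selector: {sel!r}")
    return resolved


names = ("dates", "integers", "floats", "strings")  # stand-in for DT.names

columns = to_names(["dates", 2], names)
print(columns)  # ['dates', 'floats']
```

The resulting list contains only strings, so it could then be passed as `DT[:, columns]`.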

Via slicing¶

When slicing with strings, both the start and end column names are included in the returned frame:

DT[:, 'dates':'strings']
dates integers floats strings
date32 int32 float64 str32
0 2000-01-05 1 10 A
1 2010-11-23 2 11.5 B
2 2020-02-29 3 12.3 NA
3 NA 4 -13 D
4 rows × 4 columns

However, when slicing via position, the columns are returned up to, but not including the final position; this is similar to the slicing pattern for Python’s sequences:

DT[:, 1:3]
integers floats
int32 float64
0 1 10
1 2 11.5
2 3 12.3
3 4 -13
4 rows × 2 columns
DT[:, ::-1]
strings floats integers dates
str32 float64 int32 date32
0 A 10 1 2000-01-05
1 B 11.5 2 2010-11-23
2 NA 12.3 3 2020-02-29
3 D -13 4 NA
4 rows × 4 columns
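
The difference between the two slicing styles (name slices include both end points, position slices exclude the stop) can be illustrated with a plain-Python model. The helper names are hypothetical, the tuple `names` stands in for `DT.names`, and only forward name slices are modelled here.

```python
names = ("dates", "integers", "floats", "strings")  # stand-in for DT.names


def by_name_slice(names, start, stop):
    """Model of a name slice: both end points are included."""
    i, j = names.index(start), names.index(stop)
    return list(names[i:j + 1])


def by_position_slice(names, start, stop):
    """Model of a position slice: the stop position is excluded."""
    return list(names[start:stop])


print(by_name_slice(names, "integers", "floats"))  # ['integers', 'floats']
print(by_position_slice(names, 1, 3))              # ['integers', 'floats']
print(by_position_slice(names, 1, 2))              # ['integers']
```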

It is possible to select columns via slicing, even if the indices are not in the Frame:

DT[:, 3:10]  # there are only four columns in the Frame
strings
str32
0 A
1 B
2 NA
3 D
4 rows × 1 column

Unlike with integer slicing, providing a name of the column that is not in the Frame will result in an error:

DT[:, "integers" : "categoricals"]
KeyError: Column categoricals does not exist in the Frame
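
The contrast between the two cases above matches how Python itself treats sequences: a position slice is clamped to the available range, while looking up a missing name has nothing to clamp to. A minimal plain-Python sketch, where `names` stands in for `DT.names` and `name_position` is a hypothetical helper:

```python
names = ("dates", "integers", "floats", "strings")  # stand-in for DT.names

# A position slice is clamped to the length of the sequence.
start, stop, step = slice(3, 10).indices(len(names))
print(start, stop, step)  # 3 4 1
print(names[3:10])        # ('strings',)


def name_position(names, name):
    """Look up a column name; a missing name cannot be clamped."""
    try:
        return names.index(name)
    except ValueError:
        raise KeyError(f"Column {name} does not exist in the Frame") from None


try:
    name_position(names, "categoricals")
except KeyError as err:
    print(err)
```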

Slicing is also possible with the standard slice function:

DT[:, slice('integers', 'strings')]
integers floats strings
int32 float64 str32
0 1 10 A
1 2 11.5 B
2 3 12.3 NA
3 4 -13 D
4 rows × 3 columns

With the slice function, multiple slicing on the columns is possible:

DT[:, [slice("dates", "integers"), slice("floats", "strings")]]
dates integers floats strings
date32 int32 float64 str32
0 2000-01-05 1 10 A
1 2010-11-23 2 11.5 B
2 2020-02-29 3 12.3 NA
3 NA 4 -13 D
4 rows × 4 columns
DT[:, [slice("integers", "dates"), slice("strings", "floats")]]
integers dates strings floats
int32 date32 str32 float64
0 1 2000-01-05 A 10
1 2 2010-11-23 B 11.5
2 3 2020-02-29 NA 12.3
3 4 NA D -13
4 rows × 4 columns

Slicing on strings can be combined with column names during selection:

DT[:, [slice("integers", "dates"), "strings"]]
integers dates strings
int32 date32 str32
0 1 2000-01-05 A
1 2 2010-11-23 B
2 3 2020-02-29 NA
3 4 NA D
4 rows × 3 columns

But not with integers:

DT[:, [slice("integers", "dates"), 1]]
TypeError: Mixed selector types are not allowed. Element 1 is of type integer, whereas the previous element(s) were of type string

Slicing on position can be combined with column position:

DT[:, [slice(1, 3), 0]]
integers floats dates
int32 float64 date32
0 1 10 2000-01-05
1 2 11.5 2010-11-23
2 3 12.3 2020-02-29
3 4 -13 NA
4 rows × 3 columns

But not with strings:

DT[:, [slice(1, 3), "dates"]]
TypeError: Mixed selector types are not allowed. Element 1 is of type string, whereas the previous element(s) were of type integer

Via booleans¶

When selecting via booleans, the sequence length must be equal to the number of columns in the frame:

DT[:, [True, True, False, False]]
dates integers
date32 int32
0 2000-01-05 1
1 2010-11-23 2
2 2020-02-29 3
3 NA 4
4 rows × 2 columns

Booleans generated from a list comprehension/generator expression allow for nifty selections:

DT[:, ["i" in name for name in DT.names]]
integers strings
int32 str32
0 1 A
1 2 B
2 3 NA
3 4 D
4 rows × 2 columns
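
Because a boolean selector must have exactly one entry per column, it can help to build the mask from the column names and check it before use. The following is a plain-Python sketch; `name_mask` is a hypothetical helper and `names` stands in for `DT.names`.

```python
from itertools import compress


def name_mask(names, predicate):
    """Build one boolean per column name, in column order."""
    return [bool(predicate(name)) for name in names]


names = ("dates", "integers", "floats", "strings")  # stand-in for DT.names

mask = name_mask(names, lambda name: "i" in name)
print(mask)  # [False, True, False, True]

# The columns such a mask would keep:
kept = list(compress(names, mask))
print(kept)  # ['integers', 'strings']
```

Since the mask is derived from `names` itself, its length always matches the number of columns.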

In this example we want to select columns that are numeric (integers or floats) and whose average is greater than 3:

DT[:, [column.type.is_numeric
       and column.mean1() > 3
       for column in DT]]
floats
float64
0 10
1 11.5
2 12.3
3 -13
4 rows × 1 column

Via f-expressions¶

All the selection options above (except boolean) are also possible via f-expressions:

DT[:, f.dates]
dates
date32
0 2000-01-05
1 2010-11-23
2 2020-02-29
3 NA
4 rows × 1 column
DT[:, f[-1]]
strings
str32
0 A
1 B
2 NA
3 D
4 rows × 1 column
DT[:, f['integers':'strings']]
integers floats strings
int32 float64 str32
0 1 10 A
1 2 11.5 B
2 3 12.3 NA
3 4 -13 D
4 rows × 3 columns
DT[:, f['integers':]]
integers floats strings
int32 float64 str32
0 1 10 A
1 2 11.5 B
2 3 12.3 NA
3 4 -13 D
4 rows × 3 columns
DT[:, f[1::-1]]
integers dates
int32 date32
0 1 2000-01-05
1 2 2010-11-23
2 3 2020-02-29
3 4 NA
4 rows × 2 columns
DT[:, f[date, int, float]]
dates integers floats
date32 int32 float64
0 2000-01-05 1 10
1 2010-11-23 2 11.5
2 2020-02-29 3 12.3
3 NA 4 -13
4 rows × 3 columns
DT[:, f["dates":"integers", "floats":"strings"]]
dates integers floats strings
date32 int32 float64 str32
0 2000-01-05 1 10 A
1 2010-11-23 2 11.5 B
2 2020-02-29 3 12.3 NA
3 NA 4 -13 D
4 rows × 4 columns

Note

If the column names are Python keywords (def, del, …), the dot notation is not possible with f-expressions; you have to use the bracket notation to access these columns.
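
Whether a given column name can be used with the dot notation can be checked with the standard library. The sketch below is illustrative only: `needs_brackets` is a hypothetical helper, and it additionally flags names that are not valid Python identifiers (such as names containing spaces), since attribute access cannot spell those either.

```python
import keyword


def needs_brackets(name: str) -> bool:
    """True if `name` cannot be written after a dot in Python source."""
    return keyword.iskeyword(name) or not name.isidentifier()


for name in ("dates", "def", "del", "unit price"):
    spelling = f'f["{name}"]' if needs_brackets(name) else f"f.{name}"
    print(spelling)
```

For the four names above this prints `f.dates`, `f["def"]`, `f["del"]` and `f["unit price"]`.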

Note

Selecting columns with DT[:, f[None]] returns an empty Frame. This is different from DT[:, None], which currently returns all the columns. The behavior of DT[:, None] may change in the future:

DT[:, None]
dates integers floats strings
date32 int32 float64 str32
0 2000-01-05 1 10 A
1 2010-11-23 2 11.5 B
2 2020-02-29 3 12.3 NA
3 NA 4 -13 D
4 rows × 4 columns
DT[:, f[None]]
0
1
2
3
4 rows × 0 columns

Selecting Data – Rows¶

There are a number of ways to select rows of data via the i section.

Note

The index labels in a Frame are just for aesthetics; they serve no actual purpose during selection.

By Position¶

Only integer values are acceptable:

DT[0, :]
dates integers floats strings
date32 int32 float64 str32
0 2000-01-05 1 10 A
1 row × 4 columns
DT[-1, :]  # last row
dates integers floats strings
date32 int32 float64 str32
0 NA 4 -13 D
1 row × 4 columns

Via Sequence of Positions¶

Any acceptable sequence of positions is applicable here. Listed below are some of these sequences.

List (tuple):

DT[[1, 2, 3], :]
dates integers floats strings
date32 int32 float64 str32
0 2010-11-23 2 11.5 B
1 2020-02-29 3 12.3 NA
2 NA 4 -13 D
3 rows × 4 columns

An integer numpy 1-D Array:

import numpy as np

DT[np.arange(3), :]
dates integers floats strings
date32 int32 float64 str32
0 2000-01-05 1 10 A
1 2010-11-23 2 11.5 B
2 2020-02-29 3 12.3 NA
3 rows × 4 columns

A one column integer Frame:

DT[dt.Frame([1, 2, 3]), :]
dates integers floats strings
date32 int32 float64 str32
0 2010-11-23 2 11.5 B
1 2020-02-29 3 12.3 NA
2 NA 4 -13 D
3 rows × 4 columns

An integer pandas Series:

import pandas as pd

DT[pd.Series([1, 2, 3]), :]
dates integers floats strings
date32 int32 float64 str32
0 2010-11-23 2 11.5 B
1 2020-02-29 3 12.3 NA
2 NA 4 -13 D
3 rows × 4 columns

A Python range:

DT[range(1, 3), :]
dates integers floats strings
date32 int32 float64 str32
0 2010-11-23 2 11.5 B
1 2020-02-29 3 12.3 NA
2 rows × 4 columns

A generator expression:

DT[(num for num in range(4)), :]
dates integers floats strings
date32 int32 float64 str32
0 2000-01-05 1 10 A
1 2010-11-23 2 11.5 B
2 2020-02-29 3 12.3 NA
3 NA 4 -13 D
4 rows × 4 columns

If a position passed to i does not exist, an error is raised:

DT[(num for num in range(7)), :]
ValueError: Index 4 is invalid for a Frame with 4 rows

A set is not acceptable in the i or j sections.

Except for lists/tuples, all the other sequence types passed into the i section can only contain non-negative integers.
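
When row positions counted from the end need to go into one of the sequence types that does not accept negative values, they can be converted first. This is a plain-Python sketch; `to_non_negative` is a hypothetical helper, and its error message simply mirrors the one quoted above.

```python
def to_non_negative(positions, nrows):
    """Translate positions counted from the end into positions from the start."""
    converted = []
    for pos in positions:
        if not -nrows <= pos < nrows:
            raise ValueError(
                f"Index {pos} is invalid for a Frame with {nrows} rows"
            )
        converted.append(pos % nrows)
    return converted


rows = to_non_negative([-1, 0, -3], nrows=4)
print(rows)  # [3, 0, 1]
```

The converted list could then be wrapped in a numpy array, a pandas Series, or a one-column Frame before being passed to the i section.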

Via booleans¶

When selecting rows via boolean sequence, the length of the sequence must be the same as the number of rows:

DT[[True, True, False, False], :]
dates integers floats strings
date32 int32 float64 str32
0 2000-01-05 1 10 A
1 2010-11-23 2 11.5 B
2 rows × 4 columns
DT[(n%2 == 0 for n in range(DT.nrows)), :]
dates integers floats strings
date32 int32 float64 str32
0 2000-01-05 1 10 A
1 2020-02-29 3 12.3 NA
2 rows × 4 columns

Via slicing¶

Slicing works similarly to slicing a python list:

DT[1:3, :]
dates integers floats strings
date32 int32 float64 str32
0 2010-11-23 2 11.5 B
1 2020-02-29 3 12.3 NA
2 rows × 4 columns
DT[::-1, :]
dates integers floats strings
date32 int32 float64 str32
0 NA 4 -13 D
1 2020-02-29 3 12.3 NA
2 2010-11-23 2 11.5 B
3 2000-01-05 1 10 A
4 rows × 4 columns
DT[-1:-3:-1, :]
dates integers floats strings
date32 int32 float64 str32
0 NA 4 -13 D
1 2020-02-29 3 12.3 NA
2 rows × 4 columns

Slicing is also possible with the slice function:

DT[slice(1, 3), :]
dates integers floats strings
date32 int32 float64 str32
0 2010-11-23 2 11.5 B
1 2020-02-29 3 12.3 NA
2 rows × 4 columns

It is possible to select rows with multiple slices. Let’s increase the number of rows in the Frame:

DT = dt.repeat(DT, 3)
DT
dates integers floats strings
date32 int32 float64 str32
0 2000-01-05 1 10 A
1 2010-11-23 2 11.5 B
2 2020-02-29 3 12.3 NA
3 NA 4 -13 D
4 2000-01-05 1 10 A
5 2010-11-23 2 11.5 B
6 2020-02-29 3 12.3 NA
7 NA 4 -13 D
8 2000-01-05 1 10 A
9 2010-11-23 2 11.5 B
10 2020-02-29 3 12.3 NA
11 NA 4 -13 D
12 rows × 4 columns
DT[[slice(1, 3), slice(5, 8)], :]
dates integers floats strings
date32 int32 float64 str32
0 2010-11-23 2 11.5 B
1 2020-02-29 3 12.3 NA
2 2010-11-23 2 11.5 B
3 2020-02-29 3 12.3 NA
4 NA 4 -13 D
5 rows × 4 columns
DT[[slice(5, 8), 1, 3, slice(10, 12)], :]
dates integers floats strings
date32 int32 float64 str32
0 2010-11-23 2 11.5 B
1 2020-02-29 3 12.3 NA
2 NA 4 -13 D
3 2010-11-23 2 11.5 B
4 NA 4 -13 D
5 2020-02-29 3 12.3 NA
6 NA 4 -13 D
7 rows × 4 columns
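
To see which row positions a list of slices and integers refers to, the selectors can be expanded in plain Python. This is a model of the selection order shown above, not datatable's implementation; `expand_rows` is a hypothetical helper and 12 is the number of rows in the repeated Frame.

```python
def expand_rows(selectors, nrows):
    """Flatten a list of slices and integers into explicit row positions."""
    rows = []
    for sel in selectors:
        if isinstance(sel, slice):
            rows.extend(range(*sel.indices(nrows)))
        else:
            rows.append(sel)
    return rows


print(expand_rows([slice(1, 3), slice(5, 8)], 12))
# [1, 2, 5, 6, 7]
print(expand_rows([slice(5, 8), 1, 3, slice(10, 12)], 12))
# [5, 6, 7, 1, 3, 10, 11]
```

The positions come out in the order the selectors were listed, which is why the second result is not sorted.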

Via f-expressions¶

f-expressions return booleans that can be used to filter/select the appropriate rows:

DT[f.dates < dt.Frame([date(2020,1,1)]), :]
dates integers floats strings
date32 int32 float64 str32
0 2000-01-05 1 10 A
1 2010-11-23 2 11.5 B
2 2000-01-05 1 10 A
3 2010-11-23 2 11.5 B
4 2000-01-05 1 10 A
5 2010-11-23 2 11.5 B
6 rows × 4 columns
DT[f.integers % 2 != 0, :]
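dates integers floats strings
date32 int32 float64 str32
0 2000-01-05 1 10 A
1 2020-02-29 3 12.3 NA
2 2000-01-05 1 10 A
3 2020-02-29 3 12.3 NA
4 2000-01-05 1 10 A
5 2020-02-29 3 12.3 NA
6 rows × 4 columns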
DT[(f.integers == 3) & (f.strings == None), ...]
dates integers floats strings
date32 int32 float64 str32
0 2020-02-29 3 12.3 NA
1 2020-02-29 3 12.3 NA
2 2020-02-29 3 12.3 NA
3 rows × 4 columns

Selection is possible via the data types:

DT[f[float] < 1, :]
dates integers floats strings
date32 int32 float64 str32
0 NA 4 -13 D
1 NA 4 -13 D
2 NA 4 -13 D
3 rows × 4 columns
DT[dt.rowsum(f[int, float]) > 12, :]
dates integers floats strings
date32 int32 float64 str32
0 2010-11-23 2 11.5 B
1 2020-02-29 3 12.3 NA
2 2010-11-23 2 11.5 B
3 2020-02-29 3 12.3 NA
4 2010-11-23 2 11.5 B
5 2020-02-29 3 12.3 NA
6 rows × 4 columns
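
The row-sum filter above can be checked by hand with plain Python lists. This sketch only reproduces the arithmetic for the four distinct rows of the Frame; it does not use datatable and ignores any NA handling.

```python
integers = [1, 2, 3, 4]
floats = [10.0, 11.5, 12.3, -13.0]

# Row-wise sum of the numeric columns, then the comparison with 12.
row_sums = [i + x for i, x in zip(integers, floats)]
mask = [total > 12 for total in row_sums]
print(mask)  # [False, True, True, False]
```

Only the second and third rows pass the filter. The Frame at this point holds three copies of the original four rows, so the pattern repeats three times and six rows are selected.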

Select rows and columns¶

Specific selections can occur in rows and columns simultaneously:

DT[0, slice(1, 3)]
integers floats
int32 float64
0 1 10
1 row × 2 columns
DT[2 : 6, ["i" in name for name in DT.names]]
integers strings
int32 str32
0 3 NA
1 4 D
2 1 A
3 2 B
4 rows × 2 columns
DT[f.integers > dt.mean(f.floats) - 3, f['strings' : 'integers']]
strings floats integers
str32 float64 int32
0 NA 12.3 3
1 D -13 4
2 NA 12.3 3
3 D -13 4
4 NA 12.3 3
5 D -13 4
6 rows × 3 columns

Single value access¶

Passing a single integer into the i section and a single integer or column name into the j section returns a scalar value:

DT[0, 0]
datetime.date(2000, 1, 5)

DT[0, 2]
10.0

DT[-3, 'strings']
'B'

Deselect rows/columns¶

Deselection of rows/columns is possible via list comprehension/generator expression.

Deselect a single column/row:

# The list comprehension returns the specific column names
DT[:, [name for name in DT.names if name != "integers"]]
dates floats strings
date32 float64 str32
0 2000-01-05 10 A
1 2010-11-23 11.5 B
2 2020-02-29 12.3 NA
3 NA -13 D
4 2000-01-05 10 A
5 2010-11-23 11.5 B
6 2020-02-29 12.3 NA
7 NA -13 D
8 2000-01-05 10 A
9 2010-11-23 11.5 B
10 2020-02-29 12.3 NA
11 NA -13 D
12 rows × 3 columns

# A boolean sequence is returned in the list comprehension
DT[[num != 5 for num in range(DT.nrows)], 'dates']
dates
date32
0 2000-01-05
1 2010-11-23
2 2020-02-29
3 NA
4 2000-01-05
5 2020-02-29
6 NA
7 2000-01-05
8 2010-11-23
9 2020-02-29
10 NA
11 rows × 1 column

Deselect multiple columns/rows:

DT[:, [name not in ("integers", "dates") for name in DT.names]]
floats strings
float64 str32
0 10 A
1 11.5 B
2 12.3 NA
3 -13 D
4 10 A
5 11.5 B
6 12.3 NA
7 -13 D
8 10 A
9 11.5 B
10 12.3 NA
11 -13 D
12 rows × 2 columns

DT[(num not in range(3, 8) for num in range(DT.nrows)), ['integers', 'floats']]
integers floats
int32 float64
0 1 10
1 2 11.5
2 3 12.3
3 1 10
4 2 11.5
5 3 12.3
6 4 -13
7 rows × 2 columns

DT[:, [num not in (2, 3) for num in range(DT.ncols)]]
dates integers
date32 int32
0 2000-01-05 1
1 2010-11-23 2
2 2020-02-29 3
3 NA 4
4 2000-01-05 1
5 2010-11-23 2
6 2020-02-29 3
7 NA 4
8 2000-01-05 1
9 2010-11-23 2
10 2020-02-29 3
11 NA 4
12 rows × 2 columns

# an alternative to the previous example
DT[:, [num not in (2, 3) for num, _ in enumerate(DT.names)]]
dates integers
date32 int32
0 2000-01-05 1
1 2010-11-23 2
2 2020-02-29 3
3 NA 4
4 2000-01-05 1
5 2010-11-23 2
6 2020-02-29 3
7 NA 4
8 2000-01-05 1
9 2010-11-23 2
10 2020-02-29 3
11 NA 4
12 rows × 2 columns

Deselect by data type:

# This selects columns that are not numeric
DT[2:7, [not coltype.is_numeric for coltype in DT.types]]
dates strings
date32 str32
0 2020-02-29 NA
1 NA D
2 2000-01-05 A
3 2010-11-23 B
4 2020-02-29 NA
5 rows × 2 columns

Slicing can also be used to exclude rows/columns. The code below excludes the rows at positions 3 to 6:

DT[[slice(None, 3), slice(7, None)], :]
dates integers floats strings
date32 int32 float64 str32
0 2000-01-05 1 10 A
1 2010-11-23 2 11.5 B
2 2020-02-29 3 12.3 NA
3 NA 4 -13 D
4 2000-01-05 1 10 A
5 2010-11-23 2 11.5 B
6 2020-02-29 3 12.3 NA
7 NA 4 -13 D
8 rows × 4 columns
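
The pair of slices used above can be generated for any range to exclude. This is a small plain-Python sketch; `exclude_range` is a hypothetical helper, and the expansion with `slice.indices` is only there to show which positions are kept.

```python
def exclude_range(start, stop):
    """Two slices that together skip positions start .. stop - 1."""
    return [slice(None, start), slice(stop, None)]


nrows = 12
selectors = exclude_range(3, 7)
kept = [pos for sel in selectors for pos in range(*sel.indices(nrows))]
print(kept)  # [0, 1, 2, 7, 8, 9, 10, 11]
```

The list returned by `exclude_range(3, 7)` is the same pair of slices as in the example above, so it could be passed directly as `DT[exclude_range(3, 7), :]`.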

Columns can also be deselected via the remove() method, where the column name, column position, or data type is passed to the f symbol:

DT[:, f[:].remove(f.dates)]
integers floats strings
int32 float64 str32
0 1 10 A
1 2 11.5 B
2 3 12.3 NA
3 4 -13 D
4 1 10 A
5 2 11.5 B
6 3 12.3 NA
7 4 -13 D
8 1 10 A
9 2 11.5 B
10 3 12.3 NA
11 4 -13 D
12 rows × 3 columns
DT[:, f[:].remove(f[0])]
integers floats strings
int32 float64 str32
0 1 10 A
1 2 11.5 B
2 3 12.3 NA
3 4 -13 D
4 1 10 A
5 2 11.5 B
6 3 12.3 NA
7 4 -13 D
8 1 10 A
9 2 11.5 B
10 3 12.3 NA
11 4 -13 D
12 rows × 3 columns
DT[:, f[:].remove(f[1:3])]
dates strings
date32 str32
0 2000-01-05 A
1 2010-11-23 B
2 2020-02-29 NA
3 NA D
4 2000-01-05 A
5 2010-11-23 B
6 2020-02-29 NA
7 NA D
8 2000-01-05 A
9 2010-11-23 B
10 2020-02-29 NA
11 NA D
12 rows × 2 columns
DT[:, f[:].remove(f['strings':'integers'])]
dates
date32
0 2000-01-05
1 2010-11-23
2 2020-02-29
3 NA
4 2000-01-05
5 2010-11-23
6 2020-02-29
7 NA
8 2000-01-05
9 2010-11-23
10 2020-02-29
11 NA
12 rows × 1 column
DT[:, f[:].remove(f[int, float])]
dates strings
date32 str32
0 2000-01-05 A
1 2010-11-23 B
2 2020-02-29 NA
3 NA D
4 2000-01-05 A
5 2010-11-23 B
6 2020-02-29 NA
7 NA D
8 2000-01-05 A
9 2010-11-23 B
10 2020-02-29 NA
11 NA D
12 rows × 2 columns
DT[:, f[:].remove(f[:])]
0
1
2
3
4
5
6
7
8
9
10
11
12 rows × 0 columns

Delete rows/columns¶

To actually delete a row (or a column), use the del statement; this is an in-place operation, and as such no reassignment is needed.

Delete multiple rows:

del DT[3:7, :]
DT
dates integers floats strings
date32 int32 float64 str32
0 2000-01-05 1 10 A
1 2010-11-23 2 11.5 B
2 2020-02-29 3 12.3 NA
3 NA 4 -13 D
4 2000-01-05 1 10 A
5 2010-11-23 2 11.5 B
6 2020-02-29 3 12.3 NA
7 NA 4 -13 D
8 rows × 4 columns

Delete a single row:

del DT[3, :]
DT
dates integers floats strings
date32 int32 float64 str32
0 2000-01-05 1 10 A
1 2010-11-23 2 11.5 B
2 2020-02-29 3 12.3 NA
3 2000-01-05 1 10 A
4 2010-11-23 2 11.5 B
5 2020-02-29 3 12.3 NA
6 NA 4 -13 D
7 rows × 4 columns

Delete a column:

del DT['strings']
DT
dates integers floats
date32 int32 float64
0 2000-01-05 1 10
1 2010-11-23 2 11.5
2 2020-02-29 3 12.3
3 2000-01-05 1 10
4 2010-11-23 2 11.5
5 2020-02-29 3 12.3
6 NA 4 -13
7 rows × 3 columns

Delete multiple columns:

del DT[:, ['dates', 'floats']]
DT
integers
int32
0 1
1 2
2 3
3 1
4 2
5 3
6 4
7 rows × 1 column
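
The difference between deselecting and deleting is analogous to what happens with a Python list: building a filtered copy leaves the original untouched, while the del statement changes it in place. The sketch below uses a plain list of row positions as a stand-in for the 12-row Frame; it is an analogy, not datatable code.

```python
rows = list(range(12))  # stand-in for the positions of a 12-row Frame

# Deselection: a new object is built, the original is unchanged.
subset = [pos for pos in rows if pos not in range(3, 7)]
print(len(rows), len(subset))  # 12 8

# Deletion: the original itself shrinks, no reassignment needed.
del rows[3:7]
print(rows)  # [0, 1, 2, 7, 8, 9, 10, 11]
```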
