
Introduction¶
Data is everywhere. From the smallest photon interactions to galaxy collisions, from mouse movements on a screen to economic developments of countries, we are surrounded by the sea of information. The human mind cannot comprehend this data in all its complexity; since ancient times people found it much easier to reduce the dimensionality, to impose a strict order, to arrange the data points neatly on a rectangular grid: to make a data table.
But once the data has been collected into a table, it has been tamed. It may still need some grooming and exercise, but it is no longer scary. Even if it is really Big Data, with the right tools you can approach it, play with it, bend it to your will, master it.
The Python datatable module is the right tool for the task. It is a library that implements a wide (and growing) range of operators for manipulating two-dimensional data frames. It focuses on big data support and high performance, for both in-memory and out-of-memory datasets, using multi-threaded algorithms. In addition, datatable strives to provide a good user experience, helpful error messages, and a powerful API similar to that of R's data.table.
Getting Started¶
Using datatable¶
This section describes common functionality and commands that you can run in datatable.
Create Frame¶
You can create a Frame from a variety of sources, including numpy arrays, pandas DataFrames, raw Python objects, etc.:
import datatable as dt
import numpy as np
np.random.seed(1)
dt.Frame(np.random.randn(1000000))
              C0
0        1.62435
1      −0.611756
2      −0.528172
3       −1.07297
4       0.865408
5       −2.30154
6        1.74481
7      −0.761207
8       0.319039
9       −0.24937
⋮              ⋮
999995  0.0595784
999996   0.140349
999997  −0.596161
999998    1.18604
999999   0.313398
import pandas as pd
pf = pd.DataFrame({"A": range(1000)})
dt.Frame(pf)
       A
0      0
1      1
2      2
3      3
4      4
5      5
6      6
7      7
8      8
9      9
⋮      ⋮
995  995
996  996
997  997
998  998
999  999
dt.Frame({"n": [1, 3], "s": ["foo", "bar"]})
   n  s
0  1  foo
1  3  bar
Convert a Frame¶
Convert an existing Frame into a numpy array, a pandas DataFrame, or a pure Python object:
nparr = DT.to_numpy()
pddfr = DT.to_pandas()
pyobj = DT.to_list()
Parse Text (csv) Files¶
datatable provides fast and convenient parsing of text (csv) files:
DT = dt.fread("train.csv")
The datatable parser:

- Automatically detects separators, headers, column types, quoting rules, etc.
- Reads from file, URL, shell, raw text, archives, glob
- Provides multi-threaded file reading for maximum speed
- Includes a progress indicator when reading large files
- Reads both RFC4180-compliant and non-compliant files
Write the Frame¶
Write the Frame’s content into a csv file (also multi-threaded):
DT.to_csv("out.csv")
Save a Frame¶
Save a Frame into a binary format on disk, then open it later instantly, regardless of the data size:
DT.to_jay("out.jay")
DT2 = dt.open("out.jay")
Basic Frame Properties¶
Basic Frame properties include:
print(DT.shape) # (nrows, ncols)
print(DT.names) # column names
print(DT.stypes) # column types
Compute Per-Column Summary Stats¶
Compute per-column summary stats using:
DT.sum()
DT.max()
DT.min()
DT.mean()
DT.sd()
DT.mode()
DT.nmodal()
DT.nunique()
Select Subsets of Rows/Columns¶
Select subsets of rows and/or columns using:
DT[:, "A"] # select 1 column
DT[:10, :] # first 10 rows
DT[::-1, "A":"D"] # reverse rows order, columns from A to D
DT[27, 3] # single element in row 27, column 3 (0-based)
Delete Rows/Columns¶
Delete rows and/or columns using:
del DT[:, "D"] # delete column D
del DT[f.A < 0, :] # delete rows where column A has negative values
Filter Rows¶
Filter rows via an expression using the following. In this example, mean, sd, and f are all symbols imported from datatable:
DT[(f.x > mean(f.y) + 2.5 * sd(f.y)) | (f.x < -mean(f.y) - sd(f.y)), :]
Compute Columnar Expressions¶
Compute columnar expressions using:
DT[:, {"x": f.x, "y": f.y, "x+y": f.x + f.y, "x-y": f.x - f.y}]
Append Rows/Columns¶
Append rows or columns to a Frame using Frame.cbind() and Frame.rbind():
DT1.cbind(DT2, DT3)
DT1.rbind(DT4, force=True)
User Guide¶
f-expressions¶
The datatable module exports a special symbol f, which can be used to refer to the columns of a frame currently being operated on. If this sounds cryptic, consider that the most common way to operate on a frame is via the square-bracket call DT[i, j, by, ...]. It is often the case that within this expression you would want to refer to individual columns of the frame: either to create a filter, or a transform, or to specify a grouping variable, etc. In all such cases the f symbol is used, and it is considered to be evaluated within the context of the frame DT.
For example, consider the expression:
f.price
By itself, it just means a column named “price”, in an unspecified frame. This expression becomes concrete, however, when used with a particular frame. For example:
train_dt[f.price > 0, :]
selects all rows in train_dt where the price is positive. Thus, within the call to train_dt[...], the symbol f refers to the frame train_dt.
The standalone f-expression may occasionally be useful too: it can be saved in a variable and then re-applied to several different frames. Each time, f will refer to the frame to which it is being applied:
price_filter = (f.price > 0)
train_filtered = train_dt[price_filter, :]
test_filtered = test_dt[price_filter, :]
The simple expression f.price can be saved in a variable too. In fact, there is a Frame helper method .export_names() which does exactly that: it returns a tuple of variables, one for each column name in the frame, allowing you to omit the f. prefix:
Id, Price, Quantity = DT.export_names()
DT[:, [Id, Price, Quantity, Price * Quantity]]
Single-column selector¶
As you have seen, the expression f.NAME refers to a column called “NAME”. This notation is handy, but not universal. What do you do if the column’s name contains spaces or unicode characters? Or if a column’s name is not known, only its index? Or if the name is stored in a variable? For these purposes f supports the square-bracket selectors:

f[-1]           # select the last column
f["Price ($)"]  # select the column named "Price ($)"
Generally, f[i] means either the column at index i if i is an integer, or the column with name i if i is a string.
Using an integer index follows the standard Python rule for list subscripts:
negative indices are interpreted as counting from the end of the frame, and
requesting a column with an index outside of [-ncols; ncols)
will raise
an error.
This square-bracket form is also useful when you want to access a column
dynamically, i.e. if its name is not known in advance. For example, suppose
there is a frame with columns "2017_01"
, "2017_02"
, …, "2019_12"
.
Then all these columns can be addressed as:
[f["%d_%02d" % (year, month)]
for month in range(1, 13)
for year in [2017, 2018, 2019]]
Multi-column selector¶
In the previous section you have seen that f[i] refers to a single column when i is either an integer or a string. However, we also support the case when i is a slice or a type:
f[:] # select all columns
f[::-1] # select all columns in reverse order
f[:5] # select the first 5 columns
f[3:4] # select the fourth column
f["B":"H"] # select columns from B to H, inclusive
f[int] # select all integer columns
f[float] # select all floating-point columns
f[dt.str32] # select all columns with stype `str32`
f[None] # select no columns (empty columnset)
In all these cases a columnset is returned. This columnset may contain a variable number of columns or even no columns at all, depending on the frame to which this f-expression is applied.
Applying a slice to the symbol f follows the same semantics as if f was a list of columns. Thus f[:10] means the first 10 columns of a frame, or all columns if the frame has fewer than 10. Similarly, f[9:10] selects the 10th column of a frame if it exists, or nothing if the frame has fewer than 10 columns. Compare this to the selector f[9], which also selects the 10th column of a frame if it exists, but throws an exception if it doesn’t.
Besides the usual numeric ranges, you can also use name ranges. These ranges
include the first named column, the last named column, and all columns in
between. It is not possible to mix positional and named columns in a range,
and it is not possible to specify a step. If the range is x:y, yet column x comes after y in the frame, then the columns will be selected in reverse order: first x, then the column preceding x, and so on, until column y is selected last:
f["C1":"C9"] # Select columns from C1 up to C9
f["C9":"C1"] # Select columns C9, C8, C7, ..., C2, C1
f[:"C3"] # Select all columns up to C3
f["C5":] # Select all columns after C5
Finally, you can select all columns of a particular type by using that type
as an f-selector. You can pass either common python types bool
, int
,
float
, str
; or you can pass an stype such as dt.int32
, or an ltype such as
dt.ltype.obj
. You can also pass None to not select any columns. By itself
this may not be very useful, but occasionally you may need this as a fallback
in conditional expressions:
f[int if select_types == "integer" else
float if select_types == "floating" else
None] # otherwise don't select any columns
A columnset can be used in situations where a sequence of columns is expected, such as:
- the j node of DT[i, j, ...];
- within the by() and sort() functions;
- with certain functions that operate on sequences of columns: rowsum(), rowmean(), rowmin(), etc.;
- with many other functions that normally operate on a single column, and which will automatically map over all columns in a columnset:

sum(f[:])       # equivalent to [sum(f[i]) for i in range(DT.ncols)]
f[:3] + f[-3:]  # same as [f[0]+f[-3], f[1]+f[-2], f[2]+f[-1]]
New in version 0.10.0.
Modifying a columnset¶
Columnsets support operations that either add or remove elements from the set. This is done using the methods .extend() and .remove().

The .extend() method takes a columnset as an argument (also a list, dict, or sequence of columns) and produces a new columnset containing both the original and the new columns. The columns need not be unique: the same column may appear multiple times in a columnset. This method also allows you to add transformed columns into the columnset:
f[int].extend(f[float]) # integer and floating-point columns
f[:3].extend(f[-3:]) # the first and the last 3 columns
f.A.extend(f.B) # columns "A" and "B"
f[str].extend(dt.str32(f[int])) # string columns, and also all integer
# columns converted to strings
# All columns, and then one additional column named 'cost', which contains
# column `price` multiplied by `quantity`:
f[:].extend({"cost": f.price * f.quantity})
When a columnset is extended, the order of the elements is preserved. Thus, a columnset is closer in functionality to a python list than to a set. In addition, some of the elements in a columnset can have names, if the columnset is created from a dictionary. The names may be non-unique too.
The .remove() method is the opposite of .extend(): it takes an existing columnset and removes all columns that are passed as the argument:
f[:].remove(f[str]) # all columns except columns of type string
f[:10].remove(f.A) # the first 10 columns without column "A"
f[:].remove(f[3:-3]) # same as `f[:3].extend(f[-3:])`, at least in the
# context of a frame with 6+ columns
Removing a column that is not in the columnset is not considered an error,
similar to how set-difference operates. Thus, f[:].remove(f.A)
may be
safely applied to a frame that doesn’t have column “A”: the columns that cannot
be removed are simply ignored.
If a columnset includes some column several times, and you then request to remove that column, only the first occurrence in the sequence will be removed. Generally, the multiplicity of some column “A” in the columnset cs1.remove(cs2) will be equal to the multiplicity of “A” in cs1 minus the multiplicity of “A” in cs2, or 0 if that difference would be negative. Thus:
f[:].extend(f[int]).remove(f[int])
will have the effect of moving all integer columns to the end of the columnset
(since .remove()
removes the first occurrence of a column it finds).
It is not possible to remove a transformed column from a columnset. An error
will be thrown if the argument of .remove()
contains any transformed
columns.
New in version 0.10.0.
Fread Examples¶
This function is capable of reading data from a variety of input formats (text files, plain text, files embedded in archives, excel files, …), producing a Frame as the result. You can even read in data from the command line.
See fread() for all the available parameters.

Note: if you wish to read in multiple files, use iread(); it returns an iterator of Frames.
Read Data¶
Read from text file:
from datatable import dt, fread

result = fread('iris.csv')
result.head(5)

   sepal_length  sepal_width  petal_length  petal_width  species
0           5.1          3.5           1.4          0.2   setosa
1           4.9          3             1.4          0.2   setosa
2           4.7          3.2           1.3          0.2   setosa
3           4.6          3.1           1.5          0.2   setosa
4           5            3.6           1.4          0.2   setosa
Read text data directly:
data = ('col1,col2,col3\n'
        'a,b,1\n'
        'a,b,2\n'
        'c,d,3')
fread(data)

   col1  col2  col3
0  a     b        1
1  a     b        2
2  c     d        3
Read from a url:
url = "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"
fread(url)
Read from an archive:
If there are multiple files, only the first will be read; you can specify the path to the specific file you are interested in:
fread("data.zip/mtcars.csv")
Note: use iread() if you wish to read in multiple files from an archive; an iterator of Frames is returned.
Read from .xls or .xlsx files:

fread("excel.xlsx")
For excel files, you can specify the sheet to be read:
fread("excel.xlsx/Sheet1")
Read in data from the command line. Simply pass the command-line statement to the cmd parameter:

# https://blog.jpalardy.com/posts/awk-tutorial-part-2/
# Here we filter the data for the year 2015
fread(cmd="""cat netflix.tsv | awk 'NR==1; /^2015-/'""")
The command line can be very handy with large data; you can do some of the preprocessing before reading the data into datatable.
Detect Thousand Separator¶
Fread handles the thousand separator, with the assumption that the separator is a comma:
data = """Name|Salary|Position
James|256,000|evangelist
Ragnar|1,000,000|conqueror
Loki|250360|trickster"""
fread(data)
Name Salary Position
0 James 256000 evangelist
1 Ragnar 1000000 conqueror
2 Loki 250360 trickster
Specify the Delimiter¶
You can specify the delimiter via the sep parameter. Note that the separator must be a single-character string; non-ASCII characters are not allowed as the separator, and neither are any of the characters in ["'`0-9a-zA-Z]:
data = """
1:2:3:4
5:6:7:8
9:10:11:12
"""
fread(data, sep=":")
C0 C1 C2 C3
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
Dealing with Null Values and Blank Rows¶
You can pass a list of values to be treated as null via the na_strings parameter:
data = """
ID|Charges|Payment_Method
634-VHG|28|Cheque
365-DQC|33.5|Credit card
264-PPR|631|--
845-AJO|42.3|
789-KPO|56.9|Bank Transfer
"""
fread(data, na_strings=['--', ''])
ID Charges Payment_Method
0 634-VHG 28 Cheque
1 365-DQC 33.5 Credit card
2 264-PPR 631 NA
3 845-AJO 42.3 NA
4 789-KPO 56.9 Bank Transfer
For rows with fewer values than in other rows, you can set fill=True; fread will fill the missing values with NA:
data = ('a,b,c,d\n'
'1,2,3,4\n'
'5,6,7,8\n'
'9,10,11')
fread(data, fill=True)
a b c d
0 1 2 3 4
1 5 6 7 8
2 9 10 11 NA
You can skip empty lines:
data = ('a,b,c,d\n'
'\n'
'1,2,3,4\n'
'5,6,7,8\n'
'\n'
'9,10,11,12')
fread(data, skip_blank_lines=True)
a b c d
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
Dealing with Column Names¶
If the data has no headers, fread
will assign default column names:
data = ('1,2\n'
'3,4\n')
fread(data)
C0 C1
0 1 2
1 3 4
You can pass in column names via the columns parameter:
fread(data, columns=['A','B'])
A B
0 1 2
1 3 4
You can change column names:
data = ('a,b,c,d\n'
'1,2,3,4\n'
'5,6,7,8\n'
'9,10,11,12')
fread(data, columns=["A","B","C","D"])
A B C D
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
You can change some of the column names via a dictionary:
fread(data, columns={"a":"A", "b":"B"})
A B c d
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
Fread uses heuristics to determine whether the first row is data or not; occasionally it may guess incorrectly, in which case you can set the header parameter to False:
fread(data, header=False)
C0 C1 C2 C3
0 a b c d
1 1 2 3 4
2 5 6 7 8
3 9 10 11 12
You can pass a new list of column names as well:
fread(data, header=False, columns=["A","B","C","D"])
A B C D
0 a b c d
1 1 2 3 4
2 5 6 7 8
3 9 10 11 12
Row Selection¶
Fread has a skip_to_line parameter, where you can specify which line to start reading the data from:
data = ('skip this line\n'
'a,b,c,d\n'
'1,2,3,4\n'
'5,6,7,8\n'
'9,10,11,12')
fread(data, skip_to_line=2)
a b c d
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
You can also skip to a line containing a particular string with the skip_to_string parameter, and start reading data from that line. Note that skip_to_string and skip_to_line cannot be combined; you can use only one:
data = ('skip this line\n'
'a,b,c,d\n'
'first, second, third, last\n'
'1,2,3,4\n'
'5,6,7,8\n'
'9,10,11,12')
fread(data, skip_to_string='first')
first second third last
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
You can set the maximum number of rows to read with the max_nrows parameter:
data = ('a,b,c,d\n'
'1,2,3,4\n'
'5,6,7,8\n'
'9,10,11,12')
fread(data, max_nrows=2)
a b c d
0 1 2 3 4
1 5 6 7 8
data = ('skip this line\n'
'a,b,c,d\n'
'1,2,3,4\n'
'5,6,7,8\n'
'9,10,11,12')
fread(data, skip_to_line=2, max_nrows=2)
a b c d
0 1 2 3 4
1 5 6 7 8
Setting Column Type¶
You can specify the data types via the columns parameter:
data = ('a,b,c,d\n'
'1,2,3,4\n'
'5,6,7,8\n'
'9,10,11,12')
#this is useful when you are interested in only a subset of the columns
fread(data, columns={"a":dt.float32, "b":dt.str32})
You can also pass in the data types by position:

fread(data, columns=(dt.int32, dt.str32, dt.float32))
You can also change all the column data types with a single assignment:
fread(data, columns = dt.float32)
You can change the data type for a slice of the columns:
#this changes the data type to float for the first three columns
fread(data, columns={float:slice(3)})
Note that there is a small number of stypes within datatable: int8, int16, int32, int64, float32, float64, str32, str64, obj64, and bool8.
Selecting Columns¶
There are various ways to select columns in fread:
Select with a dictionary:
data = ('a,b,c,d\n'
        '1,2,3,4\n'
        '5,6,7,8\n'
        '9,10,11,12')

# pass ``Ellipsis : None`` or ``... : None``
# to discard any columns that are not needed
fread(data, columns={"a": "a", ...: None})

   a
0  1
1  5
2  9
Selecting via a dictionary makes more sense when selecting and renaming columns at the same time.
Select columns with a set:
fread(data, columns={"a", "b"})

   a   b
0  1   2
1  5   6
2  9  10
Select range of columns with slice:
# select the second and third columns
fread(data, columns=slice(1, 3))

   b   c
0  2   3
1  6   7
2 10  11

# select the first and third columns,
# i.e. every other column among the first three
fread(data, columns=slice(None, 3, 2))

   a   c
0  1   3
1  5   7
2  9  11
Select range of columns with range:
fread(data, columns=range(1, 3))

   b   c
0  2   3
1  6   7
2 10  11
Boolean Selection:
fread(data, columns=[False, False, True, True])

   c   d
0  3   4
1  7   8
2 11  12
Select with a list comprehension:
fread(data, columns=lambda cols: [col.name in ("a", "c") for col in cols])

   a   c
0  1   3
1  5   7
2  9  11
Exclude columns with None:
fread(data, columns=['a', None, None, 'd'])

   a   d
0  1   4
1  5   8
2  9  12
Exclude columns with list comprehension:
fread(data, columns=lambda cols: [col.name not in ("a", "c") for col in cols])

   b   d
0  2   4
1  6   8
2 10  12
Drop columns by assigning None to the columns via a dictionary:
data = ("A,B,C,D\n"
        "1,3,5,7\n"
        "2,4,6,8\n")
fread(data, columns={"B": None, "D": None})

   A  C
0  1  5
1  2  6
Drop a column and change data type:
fread(data, columns={"B": None, "C": str})

   A  C  D
0  1  5  7
1  2  6  8
Change column name and type, and drop a column:
# pass a tuple, where the first item is the new column name,
# and the second item is the new data type
fread(data, columns={"A": ("first", float), "B": None, "D": None})

   first  C
0      1  5
1      2  6
With list comprehensions, you can dynamically select columns:
#select columns that have length, and species column
fread('iris.csv',
#use a boolean list comprehension to get the required columns
columns = lambda cols : [(col.name=='species')
or ("length" in col.name)
for col in cols],
max_nrows=5)
sepal_length petal_length species
0 5.1 1.4 setosa
1 4.9 1.4 setosa
2 4.7 1.3 setosa
3 4.6 1.5 setosa
4 5 1.4 setosa
#select columns by position
fread('iris.csv',
columns = lambda cols : [ind in (1,4) for ind, col in enumerate(cols)],
max_nrows=5)
sepal_length petal_length petal_width
0 5.1 1.4 0.2
1 4.9 1.4 0.2
2 4.7 1.3 0.2
3 4.6 1.5 0.2
4 5 1.4 0.2
Grouping with by¶
The by() modifier splits a dataframe into groups, either via the provided column(s) or via f-expressions, and then applies i and j within each group. This split-apply-combine strategy allows for a number of operations:
Aggregations per group,
Transformation of a column or columns, where the shape of the dataframe is maintained,
Filtration, where some data are kept and the others discarded, based on a condition or conditions.
Aggregation¶
The aggregate function is applied in the j section.
Group by one column
from datatable import (dt, f, by, ifelse, update, sort,
count, min, max, mean, sum, rowsum)
df = dt.Frame("""Fruit Date Name Number
Apples 10/6/2016 Bob 7
Apples 10/6/2016 Bob 8
Apples 10/6/2016 Mike 9
Apples 10/7/2016 Steve 10
Apples 10/7/2016 Bob 1
Oranges 10/7/2016 Bob 2
Oranges 10/6/2016 Tom 15
Oranges 10/6/2016 Mike 57
Oranges 10/6/2016 Bob 65
Oranges 10/7/2016 Tony 1
Grapes 10/7/2016 Bob 1
Grapes 10/7/2016 Tom 87
Grapes 10/7/2016 Bob 22
Grapes 10/7/2016 Bob 12
Grapes 10/7/2016 Tony 15""")
df[:, sum(f.Number), by('Fruit')]
Fruit Number
0 Apples 35
1 Grapes 137
2 Oranges 140
Group by multiple columns
df[:, sum(f.Number), by('Fruit', 'Name')]
Fruit Name Number
0 Apples Bob 16
1 Apples Mike 9
2 Apples Steve 10
3 Grapes Bob 35
4 Grapes Tom 87
5 Grapes Tony 15
6 Oranges Bob 67
7 Oranges Mike 57
8 Oranges Tom 15
9 Oranges Tony 1
By column position
df[:, sum(f.Number), by(f[0])]
Fruit Number
0 Apples 35
1 Grapes 137
2 Oranges 140
By boolean expression
df[:, sum(f.Number), by(f.Fruit == "Apples")]
   C0  Number
0   0     277
1   1      35
Combination of column and boolean expression
df[:, sum(f.Number), by(f.Name, f.Fruit == "Apples")]
Name C0 Number
0 Bob 0 102
1 Bob 1 16
2 Mike 0 57
3 Mike 1 9
4 Steve 1 10
5 Tom 0 102
6 Tony 0 16
The grouping column can be excluded from the final output
df[:, sum(f.Number), by('Fruit', add_columns=False)]
Number
0 35
1 137
2 140
Note:
- The resulting dataframe has the grouping column(s) as the first column(s).
- The grouping columns are excluded from j, unless explicitly included.
- The grouping columns are sorted in ascending order.
Apply multiple aggregate functions to a column in the j section:
df[:, {"min": min(f.Number),
"max": max(f.Number)},
by('Fruit','Date')]
Fruit Date min max
0 Apples 10/6/2016 7 9
1 Apples 10/7/2016 1 10
2 Grapes 10/7/2016 1 87
3 Oranges 10/6/2016 15 65
4 Oranges 10/7/2016 1 2
Functions can be applied across a columnset.

Task: get the sum of col3 and col4, grouped by col1 and col2:
df = dt.Frame(""" col1 col2 col3 col4 col5
a c 1 2 f
a c 1 2 f
a d 1 2 f
b d 1 2 g
b e 1 2 g
b e 1 2 g""")
df[:, sum(f["col3":"col4"]), by('col1', 'col2')]
col1 col2 col3 col4
0 a c 2 4
1 a d 1 2
2 b d 1 2
3 b e 2 4
Apply different aggregate functions to different columns
df[:, [max(f.col3), min(f.col4)], by('col1', 'col2')]
col1 col2 col3 col4
0 a c 1 2
1 a d 1 2
2 b d 1 2
3 b e 1 2
Nested aggregations in j.

Task: group by column cat and get the row sum of A and B, and of C and D:
df = dt.Frame(""" idx A B C D cat
J 1 2 3 1 x
K 4 5 6 2 x
L 7 8 9 3 y
M 1 2 3 4 y
N 4 5 6 5 z
O 7 8 9 6 z""")
df[:,
{"AB" : sum(rowsum(f['A':'B'])),
"CD" : sum(rowsum(f['C':'D']))},
by('cat')
]
cat AB CD
0 x 12 12
1 y 18 19
2 z 24 26
Computation between aggregated columns.

Task: get the difference between the largest and smallest value within each group:
df = dt.Frame("""GROUP VALUE
1 5
2 2
1 10
2 20
1 7""")
df[:, max(f.VALUE) - min(f.VALUE), by('GROUP')]
GROUP C0
0 1 5
1 2 18
Null values are not excluded from the grouping column
df = dt.Frame(""" a b c
1 2.0 3
1 NaN 4
2 1.0 3
1 2.0 2""")
df[:, sum(f[:]), by('b')]
b a c
0 NA 1 4
1 1 2 3
2 2 2 5
If you wish to ignore null values, first filter them out:
df[f.b != None, :][:, sum(f[:]), by('b')]
b a c
0 1 2 3
1 2 2 5
Filtration¶
This occurs in the i section of the groupby, where only a subset of the data per group is needed; selection is limited to integers or slicing.

Note:
- i is applied after the grouping, not before.
- f-expressions in the i section are not yet implemented for groupby.
Select the first row per group
df = dt.Frame("""A B
1 10
1 20
2 30
2 40
3 10""")
# passing 0 as index gets the first row after the grouping
# note that python's index starts from 0, not 1
df[0, :, by('A')]
A B
0 1 10
1 2 30
2 3 10
Select the last row per group
df[-1, :, by('A')]
A B
0 1 20
1 2 40
2 3 10
Select the nth row per group
Task : select the second row per group
df[1, :, by('A')]
A B
0 1 20
1 2 40
Note: filtering this way can be used to drop duplicates; you can decide to keep the first or the last non-duplicate.
Select the latest entry per group
df = dt.Frame("""id product date
220 6647 2014-09-01
220 6647 2014-09-03
220 6647 2014-10-16
826 3380 2014-11-11
826 3380 2014-12-09
826 3380 2015-05-19
901 4555 2014-09-01
901 4555 2014-10-05
901 4555 2014-11-01""")
df[-1, :, by('id'), sort('date')]
id product date
0 220 6647 2014-10-16
1 826 3380 2015-05-19
2 901 4555 2014-11-01
Note: if both the sort and by modifiers are present, the sorting occurs after the grouping, within each group.
Replicate SQL’s HAVING clause.

Task: filter for groups where the length/count is greater than 1:
df = dt.Frame([[1, 1, 5], [2, 3, 6]], names=['A', 'B'])
df
A B
0 1 2
1 1 3
2 5 6
# Get the count of each group,
# and assign to a new column, using the update method
# note that the update operation is in-place;
# there is no need to assign back to the dataframe
df[:, update(filter_col = count()), by('A')]
# The new column will be added to the end
# We use an f-expression to return rows
# in each group where the count is greater than 1
df[f.filter_col > 1, f[:-1]]
A B
0 1 2
1 1 3
Keep only rows per group where diff is the minimum:
df = dt.Frame(""" item diff otherstuff
1 2 1
1 1 2
1 3 7
2 -1 0
2 1 3
2 4 9
2 -6 2
3 0 0
3 2 9""")
df[:,
#get boolean for rows where diff column is minimum for each group
update(filter_col = f.diff == min(f.diff)),
by('item')]
df[f.filter_col == 1, :-1]
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
Keep only entries where make has both 0 and 1 in sale:
df = dt.Frame(""" make country other_columns sale
honda tokyo data 1
honda hirosima data 0
toyota tokyo data 1
toyota hirosima data 0
suzuki tokyo data 0
suzuki hirosima data 0
ferrari tokyo data 1
ferrari hirosima data 0
nissan tokyo data 1
nissan hirosima data 0""")
df[:,
update(filter_col = sum(f.sale)),
by('make')]
df[f.filter_col == 1, :-1]
make country other_columns sale
0 honda tokyo data 1
1 honda hirosima data 0
2 toyota tokyo data 1
3 toyota hirosima data 0
4 ferrari tokyo data 1
5 ferrari hirosima data 0
6 nissan tokyo data 1
7 nissan hirosima data 0
Transformation¶
This is when a function is applied to a column after a groupby and the resulting column is appended back to the dataframe. The number of rows of the dataframe is unchanged. The update() method makes this possible and easy. Let’s look at a couple of examples:
Get the minimum and maximum of column c per group, and append to the dataframe:
df = dt.Frame(""" c y
9 0
8 0
3 1
6 2
1 3
2 3
5 3
4 4
0 4
7 4""")
# Assign the new columns via the update method
df[:,
update(min_col = min(f.c),
max_col = max(f.c)),
by('y')]
df
c y min_col max_col
0 9 0 8 9
1 8 0 8 9
2 3 1 3 3
3 6 2 6 6
4 1 3 1 5
5 2 3 1 5
6 5 3 1 5
7 4 4 0 7
8 0 4 0 7
9 7 4 0 7
Fill missing values by group mean
df = dt.Frame({'value' : [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
'name' : ['A','A', 'B','B','B','B', 'C','C','C']})
df
value name
0 1 A
1 NA A
2 NA B
3 2 B
4 3 B
5 1 B
6 3 C
7 NA C
8 3 C
# This uses a combination of update and ifelse methods:
df[:,
update(value = ifelse(f.value == None,
mean(f.value),
f.value)),
by('name')]
df
value name
0 1 A
1 1 A
2 2 B
3 2 B
4 3 B
5 1 B
6 3 C
7 3 C
8 3 C
Transform and Aggregate on Multiple Columns

Task: add the per-group sums of columns a and b, grouped by c and d, and append the result to the dataframe:
df = dt.Frame({'a' : [1,2,3,4,5,6],
'b' : [1,2,3,4,5,6],
'c' : ['q', 'q', 'q', 'q', 'w', 'w'],
'd' : ['z','z','z','o','o','o']})
df
a b c d
0 1 1 q z
1 2 2 q z
2 3 3 q z
3 4 4 q o
4 5 5 w o
5 6 6 w o
df[:,
update(e = sum(f.a) + sum(f.b)),
by('c', 'd')
]
df
a b c d e
0 1 1 q z 12
1 2 2 q z 12
2 3 3 q z 12
3 4 4 q o 8
4 5 5 w o 22
5 6 6 w o 22
Replicate R’s groupby mutate.

Task: get a ratio by dividing column c by the group-wise sum of the product of columns c and d, grouped by a and b:
df = dt.Frame(dict(a = (1,1,0,1,0),
b = (1,0,0,1,0),
c = (10,5,1,5,10),
d = (3,1,2,1,2))
)
df
a b c d
0 1 1 10 3
1 1 0 5 1
2 0 0 1 2
3 1 1 5 1
4 0 0 10 2
df[:,
update(ratio = f.c / sum(f.c * f.d)),
by('a', 'b')
]
df
a b c d ratio
0 1 1 10 3 0.285714
1 1 0 5 1 1
2 0 0 1 2 0.0454545
3 1 1 5 1 0.142857
4 0 0 10 2 0.454545
Groupby on Boolean Expressions¶
Conditional Sum with groupby

Task: sum the data1 column, grouped by key1 and the rows where key2 == "one":
df = dt.Frame("""data1 data2 key1 key2
0.361601 0.375297 a one
0.069889 0.809772 a two
1.468194 0.272929 b one
-1.138458 0.865060 b two
-0.268210 1.250340 a one""")
df[:,
sum(f.data1),
by(f.key2 == "one", f.key1)][f.C0 == 1, 1:]
key1 data1
0 a 0.093391
1 b 1.46819
Conditional Sums based on various Criteria
df = dt.Frame(""" A_id B C
a1 "up" 100
a2 "down" 102
a3 "up" 100
a3 "up" 250
a4 "left" 100
a5 "right" 102""")
df[:,
{"sum_up": sum(f.B == "up"),
"sum_down" : sum(f.B == "down"),
"over_200_up" : sum((f.B == "up") & (f.C > 200))
},
by('A_id')]
A_id sum_up sum_down over_200_up
0 a1 1 0 0
1 a2 0 1 0
2 a3 2 0 1
3 a4 0 0 0
4 a5 0 0 0
More Examples¶
Aggregation on Values in a Column

Task: group by Day and find the minimum Data_Value for TMIN and the maximum Data_Value for TMAX:
df = dt.Frame(""" Day Element Data_Value
01-01 TMAX 112
01-01 TMAX 101
01-01 TMIN 60
01-01 TMIN 0
01-01 TMIN 25
01-01 TMAX 113
01-01 TMAX 115
01-01 TMAX 105
01-01 TMAX 111
01-01 TMIN 44
01-01 TMIN 83
01-02 TMAX 70
01-02 TMAX 79
01-02 TMIN 0
01-02 TMIN 60
01-02 TMAX 73
01-02 TMIN 31
01-02 TMIN 26
01-02 TMAX 71
01-02 TMIN 26""")
df[:,
   f.Day.extend({"TMAX": max(ifelse(f.Element == "TMAX", f.Data_Value, None)),
                 "TMIN": min(ifelse(f.Element == "TMIN", f.Data_Value, None))}),
   by('Day', add_columns=False)
]
Day TMAX TMIN
0 01-01 115 0
1 01-02 79 0
Group By and Conditional Sum and add Back to Data Frame

Task: sum the Count value for each ID when Num is 17 or 12 and Letter is “D”, and add the calculation back to the original data frame as “Total”:
df = dt.Frame(""" ID Num Letter Count
1 17 D 1
1 12 D 2
1 13 D 3
2 17 D 4
2 12 A 5
2 16 D 1
3 16 D 1""")
expression = ((f.Num==17) | (f.Num==12)) & (f.Letter == "D")
df[:,
update(Total = sum(ifelse(expression, f.Count, 0))),
by('ID')]
df
ID Num Letter Count Total
0 1 17 D 1 3
1 1 12 D 2 3
2 1 13 D 3 3
3 2 17 D 4 4
4 2 12 A 5 4
5 2 16 D 1 4
6 3 16 D 1 0
Multiple indexing with multiple min and max in one aggregate

Task: find col1 where col2 is max, col2 where col3 is min, and col1 where col3 is max:
df = dt.Frame({
"id" : [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
"col1" : [1, 3, 5, 2, 5, 3, 6, 3, 67, 7],
"col2" : [4, 6, 8, 3, 65, 3, 5, 4, 4, 7],
"col3" : [34, 64, 53, 5, 6, 2, 4, 6, 4, 67],
})
df
id col1 col2 col3
0 1 1 4 34
1 1 3 6 64
2 1 5 8 53
3 2 2 3 5
4 2 5 65 6
5 2 3 3 2
6 2 6 5 4
7 3 3 4 6
8 3 67 4 4
9 3 7 7 67
df[:,
{'col1' : max(ifelse(f.col2 == max(f.col2),
f.col1, None)),
'col2' : max(ifelse(f.col3 == min(f.col3),
f.col2, None)),
'col3' : max(ifelse(f.col3 == max(f.col3),
f.col1, None))
},
by('id')]
id col1 col2 col3
0 1 5 4 3
1 2 5 3 5
2 3 7 4 7
Filter row based on aggregate value
Task: find, for every word, the tag that has the most count.
df = dt.Frame("""word tag count
a S 30
the S 20
a T 60
an T 5
the T 10""")
# The solution builds on the knowledge that sorting
# while grouping sorts within each group.
df[0, :, by('word'), sort(-f.count)]
word tag count
0 a T 60
1 an T 5
2 the S 20
Get the rows where the value column is minimum, and rename columns
df = dt.Frame({"category": ["A"]*3 + ["B"]*3,
"date": ["9/6/2016", "10/6/2016",
"11/6/2016", "9/7/2016",
"10/7/2016", "11/7/2016"],
"value": [7,8,9,10,1,2]})
df
category date value
0 A 9/6/2016 7
1 A 10/6/2016 8
2 A 11/6/2016 9
3 B 9/7/2016 10
4 B 10/7/2016 1
5 B 11/7/2016 2
df[0,
{"value_date": f.date,
"value_min": f.value},
by("category"),
sort('value')]
category value_date value_min
0 A 9/6/2016 7
1 B 10/7/2016 1
Using the same data as in the previous example, get the rows where the value column is maximum, and rename columns
df[0,
{"value_date": f.date,
"value_max": f.value},
by("category"),
sort(-f.value)]
category value_date value_max
0 A 11/6/2016 9
1 B 9/7/2016 10
Get the average of the last three instances per group
import random
random.seed(3)
df = dt.Frame({"Student": ["Bob", "Bill",
"Bob", "Bob",
"Bill","Joe",
"Joe", "Bill",
"Bob", "Joe",],
"Score": random.sample(range(10,30), 10)})
df
Student Score
0 Bob 17
1 Bill 28
2 Bob 27
3 Bob 14
4 Bill 21
5 Joe 24
6 Joe 19
7 Bill 29
8 Bob 20
9 Joe 23
df[-3:, mean(f[:]), f.Student]
Student Score
0 Bill 26
1 Bob 20.3333
2 Joe 22
Group by on a condition
Get the sum of Amount for Number in the ranges (1 to 4) and (5 and above)
df = dt.Frame("""Number, Amount
1, 5
2, 10
3, 11
4, 3
5, 5
6, 8
7, 9
8, 6""")
df[:, sum(f.Amount), by(ifelse(f.Number>=5, "B","A"))]
C0 Amount
0 A 29
1 B 28
Row Functions¶
rowall, rowany, rowcount, rowfirst, rowlast, rowmax, rowmean, rowmin, rowsd, rowsum are functions that aggregate across rows instead of columns and return a single column. These functions are equivalent to Pandas aggregation functions with parameter (axis=1). For example, the sum of columns A through D could be written out as f.A + f.B + f.C + f.D; rowsum makes it easier: dt.rowsum(f['A':'D']).
Rowall, Rowany¶
These work only on Boolean expressions - rowall
checks if all the values in the row are True
, while rowany
checks if any value in the row is True. It is similar to Pandas’ all or any with a parameter of (axis=1)
. A single Boolean column is returned.
from datatable import dt, f, by
df = dt.Frame({'A': [True, True], 'B': [True, False]})
df
A B
0 1 1
1 1 0
# rowall :
df[:, dt.rowall(f[:])]
C0
0 1
1 0
# rowany :
df[:, dt.rowany(f[:])]
C0
0 1
1 1
The single boolean column that is returned can be very handy when filtering in the i
section.
Filter for rows where at least one cell is greater than 0
df = dt.Frame({'a': [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
               'b': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
               'c': [0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0],
               'd': [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
               'e': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
               'f': [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]})
df

    a  b  c  d  e  f
 0  0  0  0  0  0  0
 1  0  0  0  0  0  1
 2  0  0  0  0  0  0
 3  0  0  0  0  0  0
 4  0  0  0  0  0  0
 5  0  0  5  0  0  0
 6  1  0  0  0  0  0
 7  0  0  0  0  0  0
 8  0  0  0  1  0  0
 9  1  0  0  0  0  0
10  0  0  0  0  0  0

df[dt.rowany(f[:] > 0), :]

   a  b  c  d  e  f
0  0  0  0  0  0  1
1  0  0  5  0  0  0
2  1  0  0  0  0  0
3  0  0  0  1  0  0
4  1  0  0  0  0  0
Filter for rows where all the cells are 0
df[dt.rowall(f[:] == 0), :]

   a  b  c  d  e  f
0  0  0  0  0  0  0
1  0  0  0  0  0  0
2  0  0  0  0  0  0
3  0  0  0  0  0  0
4  0  0  0  0  0  0
5  0  0  0  0  0  0
Filter for rows where all the columns’ values are the same
df = dt.Frame("""Name  A1  A2  A3  A4
                 deff   0   0   0   0
                 def1   0   1   0   0
                 def2   0   0   0   0
                 def3   1   0   0   0
                 def4   0   0   0   0""")

# compare the first integer column with the rest,
# use rowall to find rows where all is True
# and filter with the resulting boolean
df[dt.rowall(f[1]==f[1:]), :]

   Name  A1  A2  A3  A4
0  deff   0   0   0   0
1  def2   0   0   0   0
2  def4   0   0   0   0
Filter for rows where the values are increasing
df = dt.Frame({"A": [1, 2, 6, 4],
               "B": [2, 4, 5, 6],
               "C": [3, 5, 4, 7],
               "D": [4, -3, 3, 8],
               "E": [5, 1, 2, 9]})
df

   A  B  C   D  E
0  1  2  3   4  5
1  2  4  5  −3  1
2  6  5  4   3  2
3  4  6  7   8  9

df[dt.rowall(f[1:] >= f[:-1]), :]

   A  B  C  D  E
0  1  2  3  4  5
1  4  6  7  8  9
Rowfirst, Rowlast¶
These look for the first and last non-missing value in a row respectively.
df = dt.Frame({'A':[1, None, None, None],
'B':[None, 3, 4, None],
'C':[2, None, 5, None]})
df
A B C
0 1 NA 2
1 NA 3 NA
2 NA 4 5
3 NA NA NA
# rowfirst :
df[:, dt.rowfirst(f[:])]
C0
0 1
1 3
2 4
3 NA
# rowlast :
df[:, dt.rowlast(f[:])]
C0
0 2
1 3
2 5
3 NA
Get rows where the last value in the row is greater than the first value in the row
df = dt.Frame({'a': [50, 40, 30, 20, 10],
               'b': [60, 10, 40, 0, 5],
               'c': [40, 30, 20, 30, 40]})
df

    a   b   c
0  50  60  40
1  40  10  30
2  30  40  20
3  20   0  30
4  10   5  40

df[dt.rowlast(f[:]) > dt.rowfirst(f[:]), :]

    a  b   c
0  20  0  30
1  10  5  40
Rowmax, Rowmin¶
These get the maximum and minimum values per row, respectively.
df = dt.Frame({"C": [2, 5, 30, 20, 10],
"D": [10, 8, 20, 20, 1]})
df
C D
0 2 10
1 5 8
2 30 20
3 20 20
4 10 1
# rowmax
df[:, dt.rowmax(f[:])]
C0
0 10
1 8
2 30
3 20
4 10
# rowmin
df[:, dt.rowmin(f[:])]
C0
0 2
1 5
2 20
3 20
4 1
Find the difference between the maximum and minimum of each row
df = dt.Frame("""Value1  Value2  Value3  Value4
                      5       4       3       2
                      4       3       2       1
                      3       3       5       1""")

df[:, dt.update(max_min = dt.rowmax(f[:]) - dt.rowmin(f[:]))]
df

   Value1  Value2  Value3  Value4  max_min
0       5       4       3       2        3
1       4       3       2       1        3
2       3       3       5       1        4
Rowsum, Rowmean, Rowcount, Rowsd¶
rowsum
and rowmean
get the sum and mean of rows respectively; rowcount
counts the number of non-missing values in a row, while rowsd
aggregates a row to get the standard deviation
Get the count, sum, mean and standard deviation for each row
df = dt.Frame("""ORD   A    B    C    D
                 198   23   45   NaN  12
                 138   25   NaN  NaN  62
                 625   52   36   49   35
                 457   NaN  NaN  NaN  82
                 626   52   32   39   45""")

df[:, dt.update(rowcount = dt.rowcount(f[:]),
                rowsum = dt.rowsum(f[:]),
                rowmean = dt.rowmean(f[:]),
                rowsd = dt.rowsd(f[:])
                )]
df

   ORD   A   B   C   D  rowcount  rowsum  rowmean    rowsd
0  198  23  45  NA  12         4     278     69.5  86.7583
1  138  25  NA  NA  62         3     225     75    57.6108
2  625  52  36  49  35         5     797    159.4  260.389
3  457  NA  NA  NA  82         2     539    269.5  265.165
4  626  52  32  39  45         5     794    158.8  261.277
Find rows where the number of nulls is greater than 3
df = dt.Frame({'city': ["city1", "city2", "city3", "city4"],
               'state': ["state1", "state2", "state3", "state4"],
               '2005': [144, 205, 123, None],
               '2006': [173, 211, 123, 124],
               '2007': [None, None, None, None],
               '2008': [None, 206, None, None],
               '2009': [None, None, 124, 123],
               '2010': [128, 273, None, None]})
df

   city   state   2005  2006  2007  2008  2009  2010
0  city1  state1   144   173    NA    NA    NA   128
1  city2  state2   205   211    NA   206    NA   273
2  city3  state3   123   123    NA    NA   124    NA
3  city4  state4    NA   124    NA    NA   123    NA

# get columns that are null, then sum on the rows
# and finally filter where the sum is greater than 3
df[dt.rowsum(dt.isna(f[:])) > 3, :]

   city   state   2005  2006  2007  2008  2009  2010
0  city4  state4    NA   124    NA    NA   123    NA
Rowwise sum of the float columns
df = dt.Frame("""ID  W_1  W_2  W_3
                  1  0.1  0.2  0.3
                  1  0.2  0.4  0.5
                  2  0.3  0.3  0.2
                  2  0.1  0.3  0.4
                  2  0.2  0.0  0.5
                  1  0.5  0.3  0.2
                  1  0.4  0.2  0.1""")

df[:, dt.update(sum_floats = dt.rowsum(f[float]))]
df

   ID  W_1  W_2  W_3  sum_floats
0   1  0.1  0.2  0.3         0.6
1   1  0.2  0.4  0.5         1.1
2   2  0.3  0.3  0.2         0.8
3   2  0.1  0.3  0.4         0.8
4   2  0.2  0    0.5         0.7
5   1  0.5  0.3  0.2         1
6   1  0.4  0.2  0.1         0.7
More Examples¶
Divide columns A, B, C, D by the total column, square it and sum rowwise

df = dt.Frame({'A': [2, 3],
               'B': [1, 2],
               'C': [0, 1],
               'D': [1, 0],
               'total': [4, 6]})
df

   A  B  C  D  total
0  2  1  0  1      4
1  3  2  1  0      6

df[:, update(result = dt.rowsum((f[:-1]/f[-1])**2))]
df

   A  B  C  D  total    result
0  2  1  0  1      4  0.375
1  3  2  1  0      6  0.388889
Get the row sum of the COUNT columns

df = dt.Frame("""USER  OBSERVATION  COUNT.1  COUNT.2  COUNT.3
                    A            1        0        1        1
                    A            2        1        1        2
                    A            3        3        0        0""")

columns = [f[column]
           for column in df.names
           if column.startswith("COUNT")]

df[:, update(total = dt.rowsum(columns))]
df

   USER  OBSERVATION  COUNT.1  COUNT.2  COUNT.3  total
0  A               1        0        1        1      2
1  A               2        1        1        2      4
2  A               3        3        0        0      3
Sum selected columns rowwise
df = dt.Frame({'location': ("a", "b", "c", "d"),
               'v1': (3, 4, 3, 3),
               'v2': (4, 56, 3, 88),
               'v3': (7, 6, 2, 9),
               'v4': (7, 6, 1, 9),
               'v5': (4, 4, 7, 9),
               'v6': (2, 8, 4, 6)})
df

   location  v1  v2  v3  v4  v5  v6
0  a          3   4   7   7   4   2
1  b          4  56   6   6   4   8
2  c          3   3   2   1   7   4
3  d          3  88   9   9   9   6

df[:, {"x1": dt.rowsum(f[1:4]), "x2": dt.rowsum(f[4:])}]

    x1  x2
0   14  13
1   66  18
2    8  12
3  100  24
Comparison with R’s data.table¶
datatable
is closely related to R's data.table and attempts to mimic its core algorithms and API; however, there are differences due to language constraints.
This page shows how to perform similar basic operations in R’s data.table versus datatable
.
Subsetting Rows¶
The examples used here are from the examples data in R’s data.table.
data.table
:
library(data.table)
DT = data.table(x=rep(c("b","a","c"),each=3),
y=c(1,3,6), v=1:9)
datatable
:
from datatable import dt, f, g, by, update, join, sort
DT = dt.Frame(x = ["b"]*3 + ["a"]*3 + ["c"]*3,
y = [1, 3, 6] * 3,
v = range(1, 10))
Action |
data.table |
datatable |
---|---|---|
Select 2nd row |
|
|
Select 2nd and 3rd row |
|
|
Select 3rd and 2nd row |
|
|
Select 2nd and 5th rows |
|
|
Select all rows from 2nd to 5th |
|
|
Select rows in reverse from 5th to the 1st |
|
|
Select the last row |
|
|
All rows where |
|
|
Compound logical expressions |
|
|
All rows other than rows 2,3,4 |
|
|
Sort by column |
|
DT.sort("x") or DT[:, :, sort("x")] |
Sort by column |
|
DT.sort(-f.x) or DT[:, :, sort(-f.x)] |
Sort by column |
|
DT.sort(f.x, -f.y) or DT[:, :, sort(f.x, -f.y)] |
Note the use of the f
symbol when performing computations or sorting in descending order. You can read more about f-expressions.
Note: In data.table
, DT[2]
would mean 2nd row
, whereas in datatable
, DT[2]
would select the 3rd column.
Selecting Columns¶
Action |
data.table |
datatable |
---|---|---|
Select column |
|
|
Select multiple columns |
|
|
Rename and select column |
|
|
Sum column |
|
|
Return two columns, |
|
|
Select the second column |
|
|
Select last column |
|
|
Select columns |
|
|
Exclude columns |
|
DT[:, [name not in ("x","y") for name in DT.names]] or DT[:, f[:].remove(f['x':'y'])] |
Select columns that start with |
|
DT[:, [name.startswith(("x", "v")) for name in DT.names]] |
In data.table
, you can select a column by using a variable name with the double dots prefix
cols = 'v'
DT[, ..cols]
In datatable
, you do not need the prefix
cols = 'v'
DT[cols] # or DT[:, cols]
If the column names are stored in a character vector, the double dots prefix also works
cols = c('v', 'y')
DT[, ..cols]
In datatable
, you can store the list/tuple of column names in a variable
cols = ('v', 'y')
DT[:, cols]
Subset rows and Select/Aggregate¶
Action |
data.table |
datatable |
---|---|---|
Sum column |
|
|
Same as above, new column name |
|
|
Filter in |
|
|
Same as above, return as scalar |
|
|
In R, indexing starts at 1 and, when slicing, both the first and the last items are included. In Python, indexing starts at 0 and, when slicing, the first item is included but the last one is excluded.
Some SD
(Subset of Data) operations can be replicated in datatable
Aggregate several columns
# data.table
DT[, lapply(.SD, mean),
.SDcols = c("y","v")]
y v
1: 3.333333 5
# datatable
DT[:, dt.mean([f.y,f.v])]
y v
0 3.33333 5
Modify columns using a condition
# data.table
DT[, .SD - 1,
.SDcols = is.numeric]
y v
1: 0 0
2: 2 1
3: 5 2
4: 0 3
5: 2 4
6: 5 5
7: 0 6
8: 2 7
9: 5 8
# datatable
DT[:, f[int]-1]
C0 C1
0 0 0
1 2 1
2 5 2
3 0 3
4 2 4
5 5 5
6 0 6
7 2 7
8 5 8
Modify several columns and keep others unchanged
#data.table
DT[, c("y", "v") := lapply(.SD, sqrt),
.SDcols = c("y", "v")]
x y v
1: b 1.000000 1.000000
2: b 1.732051 1.414214
3: b 2.449490 1.732051
4: a 1.000000 2.000000
5: a 1.732051 2.236068
6: a 2.449490 2.449490
7: c 1.000000 2.645751
8: c 1.732051 2.828427
9: c 2.449490 3.000000
#datatable
# there is a square root function in the datatable math module
DT[:, update(**{name:f[name]**0.5 for name in ("y","v")})]
x y v
0 b 1 1
1 b 1.73205 1.41421
2 b 2.44949 1.73205
3 a 1 2
4 a 1.73205 2.23607
5 a 2.44949 2.44949
6 c 1 2.64575
7 c 1.73205 2.82843
8 c 2.44949 3
Grouping with by()
¶
Action |
data.table |
datatable |
---|---|---|
Get the sum of column |
|
|
Get sum of |
|
|
Number of rows per group |
|
|
Select first row of |
|
|
Get row count and sum columns |
|
|
Expressions in |
|
|
Get row per group where column |
|
|
First 2 rows of each group |
|
|
Last 2 rows of each group |
|
|
In R’s data.table, the order of the groupings is preserved; in datatable
, the returned dataframe is sorted on the grouping column. DT[, sum(v), keyby=x]
in data.table returns a dataframe ordered by column x
.
In data.table
, i
is executed before the grouping, while in datatable
, i
is executed after the grouping.
Also, in datatable
, f-expressions in the i
section of a groupby are not yet implemented, hence the chaining method to get the sum of column v
where x!=a
.
Multiple aggregations within a group can be executed in R’s data.table with the syntax below
DT[, list(MySum=sum(v),
MyMin=min(v),
MyMax=max(v)),
by=.(x, y%%2)]
The same can be replicated in datatable
by using a dictionary
DT[:, {'MySum': dt.sum(f.v),
'MyMin': dt.min(f.v),
'MyMax': dt.max(f.v)},
by(f.x, f.y%2)]
Add/Update/Delete Columns¶
Action |
data.table |
datatable |
---|---|---|
Add new column |
|
DT[:, update(z=42)] or DT['z'] = 42 or DT[:, 'z'] = 42 or DT = DT[:, f[:].extend({"z":42})] |
Add multiple columns |
|
DT[:, update(sv = dt.sum(f.v), mv = "X")] or DT[:, f[:].extend({"sv": dt.sum(f.v), "mv": "X"})] |
Remove column |
|
del DT['z'] or del DT[:, 'z'] or DT = DT[:, f[:].remove(f.z)] |
Subassign to existing |
|
DT[f.x=="a", update(v=42)] or DT[f.x=="a", 'v'] = 42 |
Subassign to new column (NA padded) |
|
DT[f.x=="b", update(v2=84)] or DT[f.x=='b', 'v2'] = 84 |
Add new column, assigning values group-wise |
|
DT[:, update(m=dt.mean(f.v)), by("x")] |
In data.table
, you can create a new column with a variable
cols = 'rar'
DT[, ..cols:=4242]
Similar operation for the above in datatable
cols = 'rar'
DT[cols] = 4242
# or DT[:, update(cols=4242)]
Note that the update()
function, as well as the del
statement (a Python keyword), operates in-place; there is no need for reassignment. Another advantage of the update()
method is that the row order of the dataframe is not changed, even in a groupby; this comes in handy in a lot of transformation operations.
Joins¶
At the moment, only the left outer join is implemented in datatable
. In addition, the dataframe being joined must be keyed, the key column or columns must not contain duplicates, and the joining column must have the same name in both dataframes. You can read more about the join()
API and have a look at the tutorial on the join operator.
Left join in R’s data.table:
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
X = data.table(x=c("c","b"), v=8:7, foo=c(4,2))
X[DT, on="x"]
x v foo y i.v
1: b 7 2 1 1
2: b 7 2 3 2
3: b 7 2 6 3
4: a NA NA 1 4
5: a NA NA 3 5
6: a NA NA 6 6
7: c 8 4 1 7
8: c 8 4 3 8
9: c 8 4 6 9
Join in datatable
:
DT = dt.Frame(x = ["b"]*3 + ["a"]*3 + ["c"]*3,
y = [1, 3, 6] * 3,
v = range(1, 10))
X = dt.Frame({"x":('c','b'),
"v":(8,7),
"foo":(4,2)})
X.key="x" # key the ``x`` column
DT[:, :, join(X)]
x y v v.0 foo
0 b 1 1 7 2
1 b 3 2 7 2
2 b 6 3 7 2
3 a 1 4 NA NA
4 a 3 5 NA NA
5 a 6 6 NA NA
6 c 1 7 8 4
7 c 3 8 8 4
8 c 6 9 8 4
An inner join could be simulated by removing the nulls. Again, a
join()
only works if the joining dataframe is keyed.
# data.table
DT[X, on="x", nomatch=NULL]
x y v i.v foo
1: c 1 7 8 4
2: c 3 8 8 4
3: c 6 9 8 4
4: b 1 1 7 2
5: b 3 2 7 2
6: b 6 3 7 2
# datatable
DT[g[-1]!=None, :, join(X)] # g refers to the joining dataframe X
x y v v.0 foo
0 b 1 1 7 2
1 b 3 2 7 2
2 b 6 3 7 2
3 c 1 7 8 4
4 c 3 8 8 4
5 c 6 9 8 4
A not-join can be simulated as well.
# data.table
DT[!X, on="x"]
x y v
1: a 1 4
2: a 3 5
3: a 6 6
# datatable
DT[g[-1]==None, f[:], join(X)]
x y v
0 a 1 4
1 a 3 5
2 a 6 6
Select the first row for each group
# data.table
DT[X, on="x", mult="first"]
x y v i.v foo
1: c 1 7 8 4
2: b 1 1 7 2
# datatable
DT[g[-1]!=None, :, join(X)][0, :, by('x')] # chaining comes in handy here
x y v v.0 foo
0 b 1 1 7 2
1 c 1 7 8 4
Select the last row for each group
# data.table
DT[X, on="x", mult="last"]
x y v i.v foo
1: c 6 9 8 4
2: b 6 3 7 2
# datatable
DT[g[-1]!=None, :, join(X)][-1, :, by('x')]
x y v v.0 foo
0 b 6 3 7 2
1 c 6 9 8 4
Join and evaluate
j
for each row ini
# data.table
DT[X, sum(v), by=.EACHI, on="x"]
x V1
1: c 24
2: b 6
# datatable
DT[g[-1]!=None, :, join(X)][:, dt.sum(f.v), by("x")]
x v
0 b 6
1 c 24
Aggregate on columns from both dataframes in
j
# data.table
DT[X, sum(v)*foo, by=.EACHI, on="x"]
x V1
1: c 96
2: b 12
# datatable
DT[:, dt.sum(f.v*g.foo), join(X), by(f.x)][f[-1]!=0, :]
x C0
0 b 12
1 c 96
Aggregate on columns with same name from both dataframes in
j
# data.table
DT[X, sum(v)*i.v, by=.EACHI, on="x"]
x V1
1: c 192
2: b 42
# datatable
DT[:, dt.sum(f.v*g.v), join(X), by(f.x)][f[-1]!=0, :]
x C0
0 b 42
1 c 192
Expect significant improvements in join functionality, with more concise syntax and more features, as datatable
matures.
Functions in R/data.table not yet implemented¶
This is a list of some functions in data.table
that do not have an equivalent in datatable
yet, but that we would likely implement:
- Conditional functions
- Aggregation functions
Also, at the moment, custom aggregations in the j
section are not supported in datatable
- we intend to implement that at some point.
There are no datetime functions in datatable
, and string operations are limited as well.
If there are any functions that you would like to see in datatable
, please head over to GitHub and raise a feature request.
FTRL Model¶
This section describes the FTRL (Follow the Regularized Leader) model as implemented in datatable.
FTRL Model Information¶
The Follow the Regularized Leader (FTRL) model is a datatable implementation of the FTRL-Proximal online learning algorithm for binomial logistic regression. It uses the hashing trick for feature vectorization and the Hogwild approach for parallelization. FTRL for multinomial classification and continuous targets is implemented experimentally.
Create an FTRL Model¶
The FTRL model is implemented as the Ftrl
Python class, which is a part of
datatable.models
, so to use the model you should first do
from datatable.models import Ftrl
and then create a model as
ftrl_model = Ftrl()
FTRL Model Parameters¶
The FTRL model requires a list of parameters for training and making predictions, namely:
alpha
– learning rate, defaults to0.005
.beta
– beta parameter, defaults to1.0
.lambda1
– L1 regularization parameter, defaults to0.0
.lambda2
– L2 regularization parameter, defaults to1.0
.nbins
– the number of bins for the hashing trick, defaults to10**6
.mantissa_nbits
– the number of bits from mantissa to be used for hashing, defaults to10
.nepochs
– the number of epochs to train the model for, defaults to1
.negative_class
– whether to create and train on a “negative” class in the case of multinomial classification, defaults toFalse
.interactions
— a list or a tuple of interactions. In turn, each interaction should be a list or a tuple of feature names, where each feature name is a column name from the training frame. This setting defaults toNone
.model_type
— training mode that can be one of the following: “auto
” to automatically set model type based on the target column data, “binomial
” for binomial classification, “multinomial
” for multinomial classification or “regression
” for continuous targets. Defaults to"auto"
.
If some parameters need to be changed from their default values, this can be done either when creating the model, as
ftrl_model = Ftrl(alpha = 0.1, nbins = 100)
or, if the model already exists, as
ftrl_model.alpha = 0.1
ftrl_model.nbins = 100
If some parameters were not set explicitly, they will be assigned their default values.
Training a Model¶
Use the fit()
method to train a model:
ftrl_model.fit(X_train, y_train)
where X_train
is a frame of shape (nrows, ncols)
to be trained on,
and y_train
is a target frame of shape (nrows, 1)
. The following
datatable column types are supported for the X_train
frame: bool
,
int
, real
and str
.
The FTRL model can also do early stopping if the relative validation error does not improve. For this, the model should be fit as
res = ftrl_model.fit(X_train, y_train, X_validation, y_validation,
nepochs_validation, validation_error,
validation_average_niterations)
where X_train
and y_train
are training and target frames,
respectively, X_validation
and y_validation
are validation frames,
nepochs_validation
specifies how often, in epoch units, validation
error should be checked, validation_error
is the relative
validation error improvement that the model should demonstrate within
nepochs_validation
to continue training, and
validation_average_niterations
is the number of iterations
to average when calculating the validation error. The returned res
tuple contains the epoch at which training stopped and the corresponding loss.
Resetting a Model¶
Use the reset()
method to reset a model:
ftrl_model.reset()
This will reset model weights, but it will not affect learning parameters. To reset parameters to default values, you can do
ftrl_model.params = Ftrl().params
Making Predictions¶
Use the predict()
method to make predictions:
targets = ftrl_model.predict(X)
where X
is a frame of shape (nrows, ncols)
to make predictions for.
X
should have the same number of columns as the training frame.
The predict()
method returns a new frame of shape (nrows, 1)
with
the predicted probability for each row of frame X
.
Feature Importances¶
To estimate feature importances, the overall weight contributions are calculated feature-wise during training and predicting. Feature importances can be accessed as
fi = ftrl_model.feature_importances
where fi
will be a frame of shape (nfeatures, 2)
containing
feature names and their importances, that are normalized to [0; 1] range.
Feature Interactions¶
By default, each column of the training dataset is considered a feature by the FTRL model. The user can provide additional features by specifying a list or a tuple of feature interactions, for instance as
ftrl_model.interactions = [["C0", "C1", "C3"], ["C2", "C5"]]
where C*
are column names from a training dataset. In the above example
two additional features, namely, C0:C1:C3
and C2:C5
, are created.
interactions
should be set before a call to fit()
method, and cannot be
changed once the model is trained.
Further Reading¶
For detailed help, please also refer to help(Ftrl)
.
datatable API¶
Symbols listed here are available for import from the datatable
module.
Submodules¶
Mathematical functions, similar to python’s |
|
A small set of data analysis tools. |
|
Access to some internal details of |
Classes¶
Main “table of data” class. This is the equivalent of pandas’ or Julia’s
|
|
Helper class for computing formulas over a frame. |
|
Helper class for addressing columns in a frame. |
|
Enum of column “storage” types, analogous to numpy’s |
|
Enum of column “logical” types, similar to standard Python notion
of a |
Functions¶
Read CSV/text/XLSX/Jay/other files |
|
Same as |
|
Group-by clause for use in Frame’s square-bracket selector |
|
Join clause for use in Frame’s square-bracket selector |
|
Sort clause for use in Frame’s square-bracket selector |
|
Create new or update existing columns within a frame |
|
Combine frames by columns |
|
Combine frames by rows |
|
Concatenate frame by rows |
|
Ternary if operator |
|
Shift column by a given number of rows |
|
Bin a column into equal-width intervals |
|
Bin a column into equal-population intervals |
|
Split and nhot-encode a single-column frame |
|
Inject datatable’s stylesheets into the Jupyter notebook |
|
Row-wise all() function |
|
Row-wise any() function |
|
Calculate the number of non-missing values per row |
|
Find the first non-missing value row-wise |
|
Find the last non-missing value row-wise |
|
Find the largest element row-wise |
|
Calculate the mean value row-wise |
|
Find the smallest element row-wise |
|
Calculate the standard deviation row-wise |
|
Calculate the sum of all values row-wise |
|
Calculate the set intersection of values in the frames |
|
Calculate the set difference between the frames |
|
Calculate the symmetric difference between the sets of values in the frames |
|
Calculate the union of values in the frames |
|
Find unique values in a frame |
|
Calculate correlation between two columns |
|
Count non-missing values per column |
|
Calculate covariance between two columns |
|
Find the largest element per column |
|
Calculate the mean value per column |
|
Find the median element per column |
|
Find the smallest element per column |
|
Calculate the standard deviation per column |
|
Calculate the sum of all values per column
Other¶
Information about the build of the datatable module. |
|
The datatable module. |
|
The primary namespace used during |
|
Secondary namespace used during |
|
datatable options. |
datatable.internal¶
The functions in this sub-module are considered to be “internal” and
not useful for day-to-day work with the datatable
module.
Compiler used when building datatable. |
|
C pointer to column’s data |
|
Indicators of which columns in the frame are virtual. |
|
Run checks on whether the frame’s state is corrupted. |
|
Get ids of threads spawned by datatable. |
|
Was datatable built in debug mode? |
|
Was datatable built with support for regular expressions? |
datatable.internal.compiler_version()¶
Return the version of the C++ compiler used to compile this module.
Deprecated since version 0.11.0.
datatable.internal.frame_column_data_r()¶
datatable.internal.frame_columns_virtual()¶
datatable.internal.frame_integrity_check()¶
This function performs a range of tests on the frame
to verify
that its internal state is consistent. It returns None on success,
or throws an AssertionError
if any problems were found.
datatable.internal.get_thread_ids()¶
Return system ids of all threads used internally by datatable.
Calling this function will cause the threads to spawn if they haven't already. (This behavior may change in the future.)
List[str]
The list of thread ids used by datatable. The first element in the list is the id of the main thread.
dt.options.nthreads
– global option that controls the number of threads in use.
datatable.internal.in_debug_mode()¶
Return True
if datatable
was compiled in debug mode.
Deprecated since version 0.11.0.
datatable.internal.regex_supported()¶
Was the datatable built with regular expression support?
Deprecated since version 0.11.0.
datatable.math¶
Trigonometric functions¶
Compute \(\sin x\) (the trigonometric sine of |
|
Compute \(\cos x\) (the trigonometric cosine of |
|
Compute \(\tan x\) (the trigonometric tangent of |
|
Compute \(\sin^{-1} x\) (the inverse sine of |
|
Compute \(\cos^{-1} x\) (the inverse cosine of |
|
Compute \(\tan^{-1} x\) (the inverse tangent of |
|
Compute \(\tan^{-1} (x/y)\). |
|
Compute \(\sqrt{x^2 + y^2}\). |
|
Convert an angle measured in degrees into radians. |
|
Convert an angle measured in radians into degrees. |
Hyperbolic functions¶
Compute \(\sinh x\) (the hyperbolic sine of |
|
Compute \(\cosh x\) (the hyperbolic cosine of |
|
Compute \(\tanh x\) (the hyperbolic tangent of |
|
Compute \(\sinh^{-1} x\) (the inverse hyperbolic sine of |
|
Compute \(\cosh^{-1} x\) (the inverse hyperbolic cosine of |
|
Compute \(\tanh^{-1} x\) (the inverse hyperbolic tangent of |
Exponential/logarithmic functions¶
Compute \(e^x\) (the exponent of |
|
Compute \(2^x\). |
|
Compute \(e^x - 1\). |
|
Compute \(\ln x\) (the natural logarithm of |
|
Compute \(\log_{10} x\) (the decimal logarithm of |
|
Compute \(\ln(1 + x)\). |
|
Compute \(\log_{2} x\) (the binary logarithm of |
|
Compute \(\ln(e^x + e^y)\). |
|
Compute \(\log_2(2^x + 2^y)\). |
|
Compute \(\sqrt[3]{x}\) (the cubic root of |
|
Compute \(x^a\). |
|
Compute \(\sqrt{x}\) (the square root of |
|
Compute \(x^2\) (the square of |
Special mathematical functions¶
The error function \(\operatorname{erf}(x)\). |
|
The complementary error function \(1 - \operatorname{erf}(x)\). |
|
Euler gamma function of |
|
Natural logarithm of the Euler gamma function. |
Floating-point functions¶
Absolute value of |
|
The smallest integer not less than |
|
Number with the magnitude of |
|
The absolute value of |
|
The largest integer not greater than |
|
Remainder of a floating-point division |
|
Check whether |
|
Check if |
|
Check if |
|
Check if |
|
Compute \(x\cdot 2^y\). |
|
Round |
|
The sign of |
|
The sign of |
|
The value of |
Mathematical constants¶
Euler’s constant \(e\). |
|
Golden ratio \(\varphi\). |
|
Positive infinity. |
|
Not-a-number. |
|
Mathematical constant \(\pi\). |
|
Mathematical constant \(\tau\). |
Comparison table¶
The set of functions provided by the datatable.math
module is very
similar to the standard Python’s math
module, or
numpy math functions. Below is the comparison table showing which functions
are available:
math |
numpy |
datatable |
---|---|---|
Trigonometric/hyperbolic functions |
||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Exponential/logarithmic/power functions |
||
|
|
|
|
||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||
|
||
|
||
|
|
|
|
|
|
|
||
Special mathematical functions |
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Floating-point functions |
||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||
|
||
|
|
|
|
||
|
||
|
||
|
|
|
Miscellaneous |
||
|
||
|
||
|
||
|
||
|
|
|
|
||
|
||
Mathematical constants |
||
|
|
|
|
|
|
|
|
|
|
|
|
|
datatable.math.abs()¶
Return the absolute value of x
. This function can only be applied
to numeric arguments (i.e. boolean, integer, or real).
This function upcasts columns of types bool8
, int8
and int16
into
int32
; for columns of other types the stype is kept.
datatable.math.arccos()¶
Inverse trigonometric cosine of x
.
In mathematics, this may be written as \(\arccos x\) or \(\cos^{-1}x\).
The returned value is in the interval \([0, \frac12\tau]\),
and NA for the values of x
that lie outside the interval
[-1, 1]
. This function is the inverse of
cos()
in the sense that
cos(arccos(x)) == x
for all x
in the interval [-1, 1]
.
datatable.math.arcosh()¶
datatable.math.arcsin()¶
Inverse trigonometric sine of x
.
In mathematics, this may be written as \(\arcsin x\) or \(\sin^{-1}x\).
The returned value is in the interval \([-\frac14 \tau, \frac14\tau]\),
and NA for the values of x
that lie outside the interval [-1, 1]
.
This function is the inverse of sin()
in the sense
that sin(arcsin(x)) == x
for all x
in the interval [-1, 1]
.
datatable.math.arctan()¶
Inverse trigonometric tangent of x
.
This function satisfies the property that tan(arctan(x)) == x
.
atan2(x, y)
– two-argument inverse tangent function;tan(x)
– the trigonometric tangent function.
datatable.math.arsinh()¶
datatable.math.artanh()¶
datatable.math.atan2()¶
The inverse trigonometric tangent of y/x
, taking into account the signs
of x
and y
to produce the correct result.
If (x,y)
is a point in a Cartesian plane, then arctan2(y, x)
returns
the radian measure of an angle formed by two rays: one starting at the origin
and passing through point (1,0)
, and the other starting at the origin
and passing through point (x,y)
. The angle is assumed positive if the
rotation from the first ray to the second occurs counter-clockwise, and
negative otherwise.
As a special case, arctan2(0, 0) == 0
, and arctan2(0, -1) == tau/2
.
datatable.math.cbrt()¶
Cubic root of x.
datatable.math.ceil()¶
The smallest integer value not less than x
, returned as float.
This function produces a float32
column if the input is of type
float32
, or float64
columns for inputs of all other numeric
stypes.
datatable.math.copysign()¶
datatable.math.cos()¶
Compute the trigonometric cosine of angle x
measured in radians.
This function can only be applied to numeric columns (real, integer, or
boolean), and produces a float64 result, except when the argument x
is
float32, in which case the result is float32 as well.
datatable.math.cosh()¶
datatable.math.deg2rad()¶
Convert angle measured in degrees into radians: \(\operatorname{deg2rad}(x) = x\cdot\frac{\tau}{360}\).
rad2deg(x)
– convert radians into degrees.
datatable.math.e¶
datatable.math.erf()¶
datatable.math.erfc()¶
Complementary error function erfc(x) = 1 - erf(x)
.
The complementary error function is defined as the integral \(\operatorname{erfc}(x) = \frac{2}{\sqrt{\pi}} \int_x^{\infty} e^{-t^2}\, dt\).
Although mathematically erfc(x) = 1-erf(x)
, in practice the RHS
suffers catastrophic loss of precision at large values of x
. This
function, however, does not have such a drawback.
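The precision loss is easy to demonstrate with Python's standard `math` module, which exposes the same pair of functions:

```python
import math

x = 10.0
naive = 1.0 - math.erf(x)  # erf(10) rounds to exactly 1.0 in float64
print(naive)               # 0.0 -- all precision is lost
print(math.erfc(x))        # ~2.09e-45 -- the true tiny value survives
```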
datatable.math.exp()¶
datatable.math.exp2()¶
datatable.math.expm1()¶
datatable.math.floor()¶
The largest integer value not greater than x
, returned as float.
This function produces a float32
column if the input is of type
float32
, or float64
columns for inputs of all other numeric
stypes.
datatable.math.fmod()¶
Floating-point remainder of the division x/y. The result is always
a float, even if the arguments are integers. This function uses
std::fmod()
from the standard C++ library; its convention for
handling negative numbers may differ from Python's.
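The difference in conventions is visible in pure Python: `math.fmod` follows the C/C++ rule (the result takes the sign of the dividend), while Python's `%` operator takes the sign of the divisor:

```python
import math

assert math.fmod(-7.0, 3.0) == -1.0  # C/C++ convention: sign of the dividend
assert -7 % 3 == 2                   # Python convention: sign of the divisor
```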
datatable.math.gamma()¶
Euler Gamma function of x.
The gamma function is defined for all x
except for the non-positive
integers. For positive x
it can be computed via the integral
\(\Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t}\, dt\).
For negative x
it can be computed as
\(\Gamma(x) = \frac{\Gamma(x+k)}{x(x+1)\cdots(x+k-1)}\),
where \(k\) is any integer such that \(x+k\) is positive.
If x
is a positive integer, then \(\Gamma(x) = (x - 1)!\).
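Both properties can be checked with Python's `math.gamma`:

```python
import math

assert math.gamma(5) == 24.0  # Gamma(x) = (x - 1)! for positive integers: 4! = 24
assert math.isclose(math.gamma(0.5), math.sqrt(math.pi))  # classic half-integer value
```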
datatable.math.golden¶
The golden ratio \(\varphi = (1 + \sqrt{5})/2\), also known as golden section. This is a number such that if \(a = \varphi b\), for some non-zero \(a\) and \(b\), then it must also be true that \(a + b = \varphi a\).
The constant is stored with float64
precision, and its value is
1.618033988749895
.
datatable.math.hypot()¶
datatable.math.inf¶
Number representing positive infinity \(\infty\). Write -inf
for
negative infinity.
datatable.math.isclose()¶
Compare two numbers x and y, and return True if they are close within the requested relative/absolute tolerance. This function only returns True/False, never NA.
More specifically, isclose(x, y) is True if either of the following holds:
x == y
(including the case when both x and y are NA);
abs(x - y) <= atol + rtol * abs(y)
and neither x nor y is NA.
The tolerance parameters rtol
, atol
must be positive floats,
and cannot be expressions.
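A minimal pure-Python sketch of the comparison rule described above; the rtol/atol defaults here are illustrative, not necessarily datatable's, and NA is modeled as None:

```python
def isclose(x, y, rtol=1e-5, atol=1e-8):
    # Two NAs compare as close; NA vs a number does not
    if x is None or y is None:
        return x is None and y is None
    # Note the asymmetry: the tolerance is scaled by abs(y)
    return x == y or abs(x - y) <= atol + rtol * abs(y)

assert isclose(None, None)
assert not isclose(None, 1.0)
assert isclose(1.0, 1.0 + 1e-9)
assert not isclose(1.0, 2.0)
```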
datatable.math.isfinite()¶
datatable.math.isinf()¶
Returns True if the argument is +/- infinity, and False otherwise.
Note that isinf(NA) == False
.
datatable.math.isna()¶
Returns True if the argument is NA, and False otherwise.
datatable.math.ldexp()¶
Multiply x by 2 raised to the power y, i.e. compute x * 2**y
.
Column x is expected to be float, and y integer.
datatable.math.log()¶
datatable.math.log10()¶
Decimal (base-10) logarithm of x, which is \(\lg(x)\) or
\(\log_{10} x\). This function is the inverse of
pow(10, x)
.
datatable.math.log1p()¶
datatable.math.log2()¶
datatable.math.logaddexp()¶
The logarithm of the sum of exponents of x and y. This function is
equivalent to log(exp(x) + exp(y))
, but does not suffer from
catastrophic precision loss for small values of x and y.
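The standard stable trick is to factor out the larger exponent; a sketch of the idea (not datatable's implementation):

```python
import math

def logaddexp(x, y):
    # log(exp(x) + exp(y)) == max(x, y) + log1p(exp(-|x - y|))
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))

# Matches the naive formula where the naive formula still works...
assert math.isclose(logaddexp(1.0, 2.0),
                    math.log(math.exp(1.0) + math.exp(2.0)))
# ...and stays finite where exp() would underflow to zero:
assert math.isclose(logaddexp(-1000.0, -1000.0), -1000.0 + math.log(2.0))
```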
datatable.math.logaddexp2()¶
Binary logarithm of the sum of binary exponents of x and y. This
function is equivalent to log2(exp2(x) + exp2(y))
, but does
not suffer from catastrophic precision loss for small values of
x and y.
datatable.math.nan¶
Not-a-number, a special floating-point constant that denotes a missing
number. In most datatable functions you can use None
instead
of nan
.
datatable.math.pi¶
Mathematical constant \(\pi = \frac12\tau\), also known as Archimedes’ constant, equal to the length of a semicircle with radius 1, or equivalently the arc-length of a \(180^\circ\) angle [1].
The constant is stored at float64
precision, and its value is
3.141592653589793
.
datatable.math.pow()¶
Number x raised to the power y. The return value will be float, even if the arguments x and y are integers.
This function is equivalent to x ** y
.
datatable.math.rad2deg()¶
Convert angle measured in radians into degrees: \(\operatorname{rad2deg}(x) = x\cdot\frac{360}{\tau}\).
deg2rad(x)
– convert degrees into radians.
datatable.math.round()¶
Round the values in cols
up to the specified number of the digits
of precision ndigits
. If the number of digits is omitted, rounds
to the nearest integer.
Generally, this operation is equivalent to:
rint(col * 10**ndigits) / 10**ndigits
where function rint()
rounds to the nearest integer.
FExpr
Input data for rounding. This could be an expression yielding
either a single or multiple columns. The round()
function will
apply to each column independently and produce as many columns
in the output as there were in the input.
Only numeric columns are allowed: boolean, integer or float.
An exception will be raised if cols
contains a non-numeric
column.
int
| None
The number of precision digits to retain. This parameter could be either positive or negative (or None). If positive, it gives the number of digits after the decimal point. If negative, the value is rounded to the nearest multiple of the corresponding power of 10.
For example, 123.45
rounded to ndigits=1
is 123.4
, whereas
rounded to ndigits=-1
it becomes 120.0
.
FExpr
f-expression that rounds the values in its first argument to the specified number of precision digits.
Each input column will produce the column of the same stype in
the output; except for the case when ndigits
is None
and
the input is either float32
or float64
, in which case an
int64
column is produced (similarly to python’s round()
).
Values that are exactly half way in between their rounded neighbors
are converted towards their nearest even value. For example, both
7.5
and 8.5
are rounded into 8
, whereas 6.5
is rounded as 6
.
Rounding integer columns may produce unexpected results for values
that are close to the min/max value of that column’s storage type.
For example, when an int8
value 127
is rounded to nearest 10
, it
becomes 130
. However, since 130
cannot be represented as int8
a
wrap-around occurs and the result becomes -126
.
Rounding an integer column to a positive ndigits
is a noop: the
column will be returned unchanged.
Rounding an integer column to a large negative ndigits
will produce
a constant all-0 column.
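Python's built-in `round()` follows the same round-half-to-even rule and the same meaning of negative `ndigits`, so it can be used to check expectations:

```python
# Half-way values go to the nearest even integer:
assert round(7.5) == 8
assert round(8.5) == 8
assert round(6.5) == 6
# Negative ndigits rounds to a multiple of a power of ten:
assert round(123.45, -1) == 120.0
```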
datatable.math.sign()¶
datatable.math.signbit()¶
datatable.math.sin()¶
Compute the trigonometric sine of angle x
measured in radians.
This function can only be applied to numeric columns (real, integer, or
boolean), and produces a float64 result, except when the argument x
is
float32, in which case the result is float32 as well.
datatable.math.sinh()¶
datatable.math.sqrt()¶
The square root of x, same as x ** 0.5
.
datatable.math.square()¶
The square of x, same as x ** 2.0
. As with all other math
functions, the result is floating-point, even if the argument
x is integer.
datatable.math.tan()¶
Compute the trigonometric tangent of x
, which is the ratio
sin(x)/cos(x)
.
This function can only be applied to numeric columns (real, integer, or
boolean), and produces a float64 result, except when the argument x
is
float32, in which case the result is float32 as well.
datatable.math.tanh()¶
datatable.math.tau¶
Mathematical constant \(\tau\), also known as a turn, equal to the circumference of a circle with a unit radius.
The constant is stored at float64
precision, and its value is 6.283185307179586
.
datatable.math.trunc()¶
The nearest integer value not greater than x
in magnitude.
If x is integer or boolean, then trunc() will return this value converted to float64. If x is floating-point, then trunc(x) acts as floor(x) for positive values of x, and as ceil(x) for negative values of x. This rounding mode is known as rounding towards zero.
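Python's `math.trunc` uses the same rounding-towards-zero rule:

```python
import math

# Rounding towards zero: floor for positive x, ceil for negative x
assert math.trunc(2.7) == 2
assert math.trunc(-2.7) == -2
assert math.trunc(2.7) == math.floor(2.7)
assert math.trunc(-2.7) == math.ceil(-2.7)
```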
datatable.models¶
Functions¶
Aggregate a frame. |
|
Perform k-fold split. |
|
Perform randomized k-fold split. |
datatable.models.Ftrl¶
Follow the Regularized Leader (FTRL) model.
FTRL model is a datatable implementation of the FTRL-Proximal online learning algorithm for binomial logistic regression. Multinomial classification and regression for continuous targets are also implemented, though these implementations are experimental. This model is fully parallel and is based on the Hogwild approach for parallelization.
The model supports datasets with both numerical (boolean, integer and float types) and string features. To vectorize features a hashing trick is employed: all values are hashed with a 64-bit hashing function. This function is implemented as follows:
for booleans and integers the hashing function is essentially an identity function;
for floats the hashing function trims the mantissa, taking into account
mantissa_nbits
, and interprets the resulting bit representation as a 64-bit unsigned integer;
for strings the 64-bit Murmur2 hashing function is used.
To compute the final hash x
, the Murmur2-hashed feature name is added
to the hashed feature value, and the result is taken modulo the number of
requested bins, i.e. nbins
.
For each hashed row of data, according to Ad Click Prediction: a View from the Trenches, the following FTRL-Proximal algorithm is employed:

When trained, the model can be used to make predictions, or it can be re-trained on new datasets as many times as needed, improving model weights from run to run.
\(\alpha\) in per-coordinate FTRL-Proximal algorithm. |
|
\(\beta\) in per-coordinate FTRL-Proximal algorithm. |
|
Column names of the training frame, i.e. features. |
|
Hashes of the column names. |
|
An option to control precision of the internal computations. |
|
Feature importances calculated during training. |
|
Feature interactions. |
|
Classification labels. |
|
L1 regularization parameter, \(\lambda_1\) in per-coordinate FTRL-Proximal algorithm. |
|
L2 regularization parameter, \(\lambda_2\) in per-coordinate FTRL-Proximal algorithm. |
|
Number of mantissa bits for hashing floats. |
|
The model’s |
|
A model type |
|
A model type |
|
Number of bins for the hashing trick. |
|
An option to indicate if the “negative” class should be a created for multinomial classification. |
|
Number of training epochs. |
|
All the input model parameters as a named tuple. |
Create a new Ftrl
object.
float
\(\alpha\) in per-coordinate FTRL-Proximal algorithm, should be positive.
float
\(\beta\) in per-coordinate FTRL-Proximal algorithm, should be non-negative.
float
L1 regularization parameter, \(\lambda_1\) in per-coordinate FTRL-Proximal algorithm. It should be non-negative.
float
L2 regularization parameter, \(\lambda_2\) in per-coordinate FTRL-Proximal algorithm. It should be non-negative.
int
Number of bins to be used for the hashing trick, should be positive.
int
Number of mantissa bits to take into account when hashing floats.
It should be non-negative and less than or equal to 52
, that
is a number of mantissa bits allocated for a C++ 64-bit double
.
float
Number of training epochs, should be non-negative. When nepochs
is an integer, the model will train on all the data
provided to the fit()
method nepochs
times. If nepochs
has a fractional part, the model will train on all
the data floor(nepochs)
times, i.e. the integer part of nepochs
, and will then perform one additional training iteration
on the fractional-part share of the data.
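For instance, the split of a fractional nepochs into full and partial passes can be sketched as follows (illustrative helper, not part of the API):

```python
import math

def training_passes(nepochs, nrows):
    full_passes = math.floor(nepochs)                      # integer part: whole-data passes
    partial_rows = round((nepochs - full_passes) * nrows)  # fractional-part share of rows
    return full_passes, partial_rows

assert training_passes(2.5, 100) == (2, 50)  # 2 full passes + one pass over 50 rows
assert training_passes(3.0, 100) == (3, 0)   # integer nepochs: no partial pass
```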
bool
An option to indicate whether double precision, i.e. float64
,
or single precision, i.e. float32
, arithmetic should be used
for computations. It is not guaranteed that setting
double_precision
to True
will automatically improve
the model accuracy. It will, however, roughly double the memory
footprint of the Ftrl
object.
bool
An option to indicate if a “negative” class should be created
in the case of multinomial classification. For the “negative”
class the model will train on all the negatives, and if
a new label is encountered in the target column, its
weights will be initialized to the current “negative” class weights.
If negative_class
is set to False
, the initial weights
become zeros.
List[List[str] | Tuple[str]]
| Tuple[List[str] | Tuple[str]]
A list or a tuple of interactions. In turn, each interaction should be a list or a tuple of feature names, where each feature name is a column name from the training frame. Each interaction should have at least one feature.
"binomial"
| "multinomial"
| "regression"
| "auto"
The model type to be built. When this option is "auto"
then the model type will be automatically chosen based on
the target column stype
.
FtrlParams
Named tuple of the above parameters. One can pass either this tuple, or any combination of the individual parameters to the constructor, but not both at the same time.
ValueError
The exception is raised if both the params
and one of the
individual model parameters are passed at the same time.
\(\alpha\) in per-coordinate FTRL-Proximal algorithm.
float
Current alpha
value.
float
New alpha
value, should be positive.
ValueError
The exception is raised when newalpha
is not positive.
\(\beta\) in per-coordinate FTRL-Proximal algorithm.
float
Current beta
value.
float
New beta
value, should be non-negative.
ValueError
The exception is raised when newbeta
is negative.
Column names of the training frame, i.e. the feature names.
List[str]
A list of the column names.
colname_hashes
– the hashed column names.
An option to indicate whether double precision, i.e. float64
,
or single precision, i.e. float32
, arithmetic should be
used for computations. This option is read-only and can only be set
during the Ftrl
object construction
.
bool
Current double_precision
value.
Feature importances as calculated during the model training and
normalized to [0; 1]
. The normalization is done by dividing
the accumulated feature importances by the maximum value.
Frame
A frame with two columns: feature_name
that has stype str32
,
and feature_importance
that has stype float32
or float64
depending on whether the double_precision
option is False
or True
.
Train FTRL model on a dataset.
Frame
Training frame.
Frame
Target frame having as many rows as X_train
and one column.
Frame
Validation frame having the same number of columns as X_train
.
Frame
Validation target frame of shape (nrows, 1)
.
float
Parameter that specifies how often, in epoch units, validation error should be checked.
float
The improvement of the relative validation error that should be
demonstrated by the model within nepochs_validation
epochs,
otherwise the training will stop.
int
Number of iterations that is used to average the validation error.
Each iteration corresponds to nepochs_validation
epochs.
FtrlFitOutput
FtrlFitOutput
is a Tuple[float, float]
with two fields: epoch
and loss
,
representing the final fitting epoch and the final loss, respectively.
If a validation dataset is not provided, the returned epoch
equals
nepochs
and the loss
is just float('nan')
.
The feature interactions to be used for model training. This option is read-only for a trained model.
Tuple
Current interactions
value.
List[List[str] | Tuple[str]]
| Tuple[List[str] | Tuple[str]]
New interactions
value. Each particular interaction
should be a list or a tuple of feature names, where each feature
name is a column name from the training frame.
ValueError
The exception is raised when
trying to change this option for a model that has already been trained;
one of the interactions has zero features.
TypeError
The exception is raised when newinteractions
has a wrong type.
Classification labels the model was trained on.
Frame
A one-column frame with the classification labels. In the case of the numeric regression the label is the target column name.
L1 regularization parameter, \(\lambda_1\) in per-coordinate FTRL-Proximal algorithm.
float
Current lambda1
value.
float
New lambda1
value, should be non-negative.
ValueError
The exception is raised when newlambda1
is negative.
L2 regularization parameter, \(\lambda_2\) in per-coordinate FTRL-Proximal algorithm.
float
Current lambda2
value.
float
New lambda2
value, should be non-negative.
ValueError
The exception is raised when newlambda2
is negative.
Trained models weights, i.e. z
and n
coefficients
in per-coordinate FTRL-Proximal algorithm.
A type of the model Ftrl
should build:
"binomial"
for binomial classification;"multinomial"
for multinomial classification;"regression"
for numeric regression;"auto"
for automatic model type detection based on the target columnstype
.
This option is read-only for a trained model.
str
Current model_type
value.
"binomial"
| "multinomial"
| "regression"
| "auto"
New model_type
value.
ValueError
The exception is raised when
trying to change this option for a model that has already been trained;
newmodel_type
value is not one of the following: "binomial"
, "multinomial"
, "regression"
 or "auto"
.
.model_type_trained
– the model type Ftrl
has built.
The model type Ftrl
has built.
str
Could be one of the following: "regression"
, "binomial"
,
"multinomial"
or "none"
for untrained model.
.model_type
– the model type Ftrl
should build.
Number of mantissa bits to take into account for hashing floats. This option is read-only for a trained model.
int
Current mantissa_nbits
value.
int
New mantissa_nbits
value, should be non-negative and
less than or equal to 52
, that is a number of
mantissa bits in a C++ 64-bit double
.
ValueError
The exception is raised when
trying to change this option for a model that has already been trained;
newmantissa_nbits
value is negative or larger than 52
.
Number of bins to be used for the hashing trick. This option is read-only for a trained model.
int
Current nbins
value.
int
New nbins
value, should be positive.
ValueError
The exception is raised when
trying to change this option for a model that has already been trained;
newnbins
value is not positive.
An option to indicate if a “negative” class should be created
in the case of multinomial classification. For the “negative”
class the model will train on all the negatives, and if
a new label is encountered in the target column, its
weights are initialized to the current “negative” class weights.
If negative_class
is set to False
, the initial weights
become zeros.
This option is read-only for a trained model.
bool
Current negative_class
value.
bool
New negative_class
value.
ValueError
The exception is raised when trying to change this option for a model that has already been trained.
TypeError
The exception is raised when newnegative_class
is not bool
.
Number of training epochs. When nepochs
is an integer, the model will train on all the data provided to the fit()
method
nepochs
times. If nepochs
has a fractional part,
the model will train on all the data floor(nepochs)
times,
i.e. the integer part of nepochs
, and will then perform one additional
training iteration on the fractional-part share of the data.
float
Current nepochs
value.
float
New nepochs
value, should be non-negative.
ValueError
The exception is raised when newnepochs
value is negative.
Ftrl
model parameters as a named tuple FtrlParams
,
see Ftrl.__init__()
for more details.
This option is read-only for a trained model.
FtrlParams
Current params
value.
FtrlParams
New params
value.
ValueError
The exception is raised when
trying to change this option for a model that has already been trained;
individual parameter values are incompatible with the corresponding setters.
Make predictions for a dataset.
datatable.models.aggregate()¶
Aggregate a frame into clusters. Each cluster consists of a set of members, i.e. a subset of the input frame, and is represented by an exemplar, i.e. one of the members.
For one- and two-column frames the aggregation is based on the standard equal-interval binning for numeric columns, and grouping for string columns.
When the input frame has more columns than two, a parallel one-pass Ad-Hoc algorithm is employed, see description of Aggregator<T>::group_nd() method for more details. This algorithm takes into account the numeric columns only, and all the string columns are ignored.
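Equal-interval binning for the 1D case can be sketched as follows (illustrative only; the real implementation also handles NAs and string grouping):

```python
def bin_index(value, vmin, vmax, n_bins):
    # Map a value in [vmin, vmax] to one of n_bins equal-width bins
    if vmax == vmin:
        return 0
    i = int((value - vmin) / (vmax - vmin) * n_bins)
    return min(i, n_bins - 1)  # the right edge falls into the last bin

assert bin_index(0.0, 0.0, 10.0, 5) == 0
assert bin_index(4.9, 0.0, 10.0, 5) == 2
assert bin_index(10.0, 0.0, 10.0, 5) == 4
```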
Frame
The input frame containing numeric or string columns.
int
Number of bins for 1D aggregation.
int
Number of bins for the first column for 2D aggregation.
int
Number of bins for the second column for 2D aggregation.
int
Maximum number of exemplars for ND aggregation. It is guaranteed
that the ND algorithm will return fewer than nd_max_bins
exemplars,
but the exact number may vary from run to run due to parallelization.
int
Number of columns at which the projection method is used for ND aggregation.
int
Seed to be used for the projection method.
bool
An option to indicate whether double precision, i.e. float64
,
or single precision, i.e. float32
, arithmetic should be used
for computations.
float
Fixed radius for ND aggregation, use it with caution.
If set, nd_max_bins
will have no effect and in the worst
case number of exemplars may be equal to the number of rows
in the data. For big data this may result in extremely large
execution times. Since all the columns are normalized to [0, 1)
,
the fixed_radius
value should be chosen accordingly.
Tuple[Frame, Frame]
The first element in the tuple is the aggregated frame, i.e.
the frame containing exemplars, with the shape of
(nexemplars, frame.ncols + 1)
, where nexemplars
is
the number of gathered exemplars. The first frame.ncols
columns
are the columns from the input frame, and the last column
is the members_count
that has stype int32
containing
the number of members per exemplar.
The second element in the tuple is the members frame with the shape of
(frame.nrows, 1)
, each row in this frame corresponds to the
row with the same id in the input frame
. The only column exemplar_id
has an stype of int32
and contains the exemplar ids a particular
member belongs to. These ids are effectively the ids of
the exemplar’s rows from the input frame.
ValueError
The exception is raised if the input frame is missing.
TypeError
The exception is raised when one of the frame
’s columns has an
unsupported stype, i.e. the column is both non-numeric and non-string.
datatable.models.kfold()¶
Perform k-fold split of data with nrows
rows into nsplits
train/test
subsets. The dataset itself is not passed to this function:
it is sufficient to know only the number of rows in order to decide
how the data should be split.
The range [0; nrows)
is split into nsplits
approximately equal parts,
i.e. folds, and then each i
-th split will use the i
-th fold as a
test part, and all the remaining rows as the train part. Thus, i
-th split is
comprised of:
train rows:
[0; i*nrows/nsplits) + [(i+1)*nrows/nsplits; nrows)
;
test rows:
[i*nrows/nsplits; (i+1)*nrows/nsplits)
.
where integer division is assumed.
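The index arithmetic above can be sketched directly (the real kfold() returns ranges and single-column Frames rather than plain lists):

```python
def kfold(nrows, nsplits):
    splits = []
    for i in range(nsplits):
        lo = i * nrows // nsplits        # fold boundaries use integer division
        hi = (i + 1) * nrows // nsplits
        test = list(range(lo, hi))
        train = list(range(0, lo)) + list(range(hi, nrows))
        splits.append((train, test))
    return splits

splits = kfold(10, 3)
assert [t for _, t in splits] == [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
# Every split partitions the full row range:
assert sorted(splits[1][0] + splits[1][1]) == list(range(10))
```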
int
The number of rows in the frame that is going to be split.
int
Number of folds, must be at least 2
, but not larger than nrows
.
List[Tuple]
This function returns a list of nsplits
tuples (train_rows, test_rows)
,
where each component of the tuple is a rows selector that can be applied
to any frame with nrows
rows to select the desired folds.
Some of these row selectors will be simple python ranges,
others will be single-column Frame objects.
kfold_random()
– Perform randomized k-fold split.
datatable.models.kfold_random()¶
Perform randomized k-fold split of data with nrows
rows into
nsplits
train/test subsets. The dataset itself is not passed to this
function: it is sufficient to know only the number of rows in order to decide
how the data should be split.
The train/test subsets produced by this function will have the following properties:
all test folds will be of approximately the same size nrows/nsplits
;
all observations have an equal ex-ante chance of getting assigned into each fold;
the row indices in all train and test folds will be sorted.
The function uses a single-pass parallelized algorithm to construct the folds.
int
The number of rows in the frame that you want to split.
int
Number of folds, must be at least 2
, but not larger than nrows
.
int
Seed value for the random number generator used by this function. Calling the function several times with the same seed values will produce the same results each time.
datatable.options¶
datatable.options.debug¶
This namespace contains the following debug options:
The number of characters to use per function/method argument. |
|
Switch that enables logging of the debug information. |
|
The custom logger object. |
|
Switch that enables logging of the function/method arguments. |
datatable.options.debug.arg_max_size¶
When the debug.report_args
is
True
, this option will limit the display size of each argument in order
to prevent potentially huge outputs. This option’s value
cannot be less than 10
.
datatable.options.debug.enabled¶
If True
, all calls to datatable core functions will be logged,
together with their timings.
datatable.options.debug.logger¶
The logger object used for reporting calls to datatable core
functions. If None
, then the default (built-in) logger will
be used. This option has no effect if
debug.enabled
is False
.
datatable.options.debug.report_args¶
Controls whether log messages about function and method calls contain information about the arguments of those calls.
datatable.options.display¶
This namespace contains the following display options:
Switch that controls if the unicode characters are allowed. |
|
The number of top rows to display when the frame view is truncated. |
|
Switch that controls if the interactive view is enabled or not. |
|
The threshold for the column’s width to be truncated. |
|
The threshold for the number of rows in a frame to be truncated. |
|
The number of bottom rows to display when the frame view is truncated. |
|
Switch that controls if colors should be used in the console. |
datatable.options.display.allow_unicode¶
If True
, datatable will allow unicode characters (encoded as
UTF-8) to be printed into the output.
If False
, then unicode characters will either be avoided, or
hex-escaped as necessary.
datatable.options.display.head_nrows¶
The number of rows from the top of a frame to be displayed when
the frame’s output is truncated due to the total number of frame’s
rows exceeding display.max_nrows
value.
datatable.options.display.interactive¶
This option controls the behavior of a Frame when it is viewed in a
text console. When True
, the Frame will be shown in the interactive
mode, allowing you to navigate the rows/columns with keyboard.
When False
, the Frame will be shown in regular, non-interactive mode
(you can still call DT.view()
to enter the interactive mode manually).
datatable.options.display.max_column_width¶
A column’s name or values that exceed max_column_width
in size
will be truncated. This option applies both to rendering a frame
in a terminal, and to rendering in a Jupyter notebook. The
smallest allowed max_column_width
is 2
.
Setting the value to None
indicates that the
column’s content should never be truncated.
datatable.options.display.max_nrows¶
A frame with more rows than this will be displayed truncated
when the frame is printed to the console: only its first
display.head_nrows
and last
display.tail_nrows
rows will be printed. It is recommended to have
head_nrows + tail_nrows <= max_nrows
.
Setting this option to None (or a negative value) will cause all
rows in a frame to be printed, which may cause the console to become
unresponsive.
datatable.options.display.tail_nrows¶
The number of rows from the bottom of a frame to be displayed when
the frame’s output is truncated due to the total number of frame’s
rows exceeding display.max_nrows
value.
datatable.options.display.use_colors¶
Whether to use colors when printing various messages into the console. Turn this off if your terminal is unable to display ANSI escape sequences, or if the colors make output not legible.
datatable.options.frame¶
This namespace contains the following Frame
options:
Initial value of the default column name index. |
|
Default column name prefix. |
datatable.options.frame.names_auto_index¶
When Frame needs to auto-name columns, they will be assigned
names C0
, C1
, C2
, etc. by default. This option allows you to
control the starting index in this sequence. For example, setting
dt.options.frame.names_auto_index=1
will cause the columns to be
named C1
, C2
, C3
, etc.
Name mangling – tutorial on name mangling.
datatable.options.frame.names_auto_prefix¶
When Frame needs to auto-name columns, they will be assigned
names C0
, C1
, C2
, etc. by default. This option allows you to
control the prefix used in this sequence. For example, setting
dt.options.frame.names_auto_prefix='Z'
will cause the columns to be
named Z0
, Z1
, Z2
, etc.
Name mangling – tutorial on name mangling.
datatable.options.fread¶
datatable.options.fread.log¶
This property controls the following logging options:
Switch that controls if the logs should be anonymized. |
|
Switch that controls if the unicode characters should be escaped. |
If True
, any snippets of data being read that are printed in the
log will be first anonymized by converting all non-0 digits to 1
,
all lowercase letters to a
, all uppercase letters to A
, and all
unicode characters to U
.
This option is useful in production systems when reading sensitive data that must not accidentally leak into log files or be printed with the error messages.
If True
, all unicode characters in the verbose log will be written
in hexadecimal notation. Use this option if your terminal cannot
print unicode, or if the output gets somehow corrupted because of
the unicode characters.
datatable.options.progress¶
This namespace contains the following progress reporting options:
Switch that controls if the datatable tasks could be interrupted. |
|
A custom progress-reporting function. |
|
Switch that controls if the progress bar is cleared on success. |
|
Switch that controls if the progress reporting is enabled. |
|
The minimum duration of a task to show the progress bar. |
|
The progress bar update frequency. |
datatable.options.progress.allow_interruption¶
If True
, allow datatable to handle the SIGINT
signal to interrupt
long-running tasks.
datatable.options.progress.callback¶
If None
, then the built-in progress-reporting function will be used.
Otherwise, this value specifies a function to be called at each
progress event. The function takes a single parameter p
, which is
a namedtuple with the following fields:
p.progress
is a float in the range 0.0 .. 1.0
;
p.status
is a string, one of ‘running’, ‘finished’, ‘error’ or ‘cancelled’; and
p.message
is a custom string describing the operation currently being performed.
datatable.options.progress.clear_on_success¶
If True
, clear progress bar when job finished successfully.
datatable.options.progress.enabled¶
When False
, progress reporting functionality will be turned off.
This option is True
by default if the stdout
is connected to a
terminal or a Jupyter Notebook, and False otherwise.
datatable.options.progress.min_duration¶
Do not show progress bar if the duration of an operation is smaller than this value. If this setting is non-zero, then the progress bar will only be shown for long-running operations, whose duration (estimated or actual) exceeds this threshold.
datatable.options.progress.updates_per_second¶
Number of times per second the display of the progress bar should be updated.
datatable.options.nthreads¶
The number of threads used by datatable internally.
Many calculations in the datatable module are parallelized. This setting controls how many threads will be used during such calculations.
Initially, this option is set to the value returned by the C++ call std::thread::hardware_concurrency(). This is usually equal to the number of available cores.
You can set nthreads to a value greater or smaller than the initial setting. For example, setting nthreads = 1 will force the library into single-threaded mode. Setting nthreads to 0 will restore the initial value, equal to the number of processor cores. Setting nthreads to a value less than 0 is equivalent to requesting that many fewer threads than the maximum.
datatable.FExpr¶
FExpr is an object that encapsulates computations to be done on a frame. FExpr objects are rarely constructed directly (though it is possible too); more commonly they are created as inputs/outputs of various functions in datatable.
Consider the following example:
math.sin(2 * f.Angle)
Here accessing column "Angle" in namespace f creates an FExpr. Multiplying this FExpr by the python scalar 2 creates a new FExpr. And finally, applying the sine function creates yet another FExpr. The resulting expression can be applied to a frame via the DT[i,j] method, which will compute that expression using the data of that particular frame.
Thus, an FExpr is a stored computation, which can later be applied to a Frame, or to multiple frames.
Because of its delayed nature, an FExpr checks its correctness at the time when it is applied to a frame, not sooner. In particular, it is possible for the same expression to work with one frame, but fail with another. In the example above, the expression may raise an error if there is no column named "Angle" in the frame, or if the column exists but has a non-numeric type.
Most functions in datatable that accept an FExpr as an input return a new FExpr as an output, thus creating a tree of FExprs as the resulting evaluation graph.
Also, all functions that accept FExprs as arguments will also accept certain other python types as an input, essentially converting them into FExprs. Thus, we will sometimes say that a function accepts FExpr-like objects as arguments.
All binary operators op(x, y) listed below work when either x or y, or both, are FExprs.
Construction¶
__init__(e): Create an FExpr.
extend(arg): Append another FExpr.
remove(arg): Remove columns from the FExpr.
Arithmetic operators¶
__add__(x, y): Addition
__sub__(x, y): Subtraction
__mul__(x, y): Multiplication
__truediv__(x, y): Division
__floordiv__(x, y): Integer division
__mod__(x, y): Modulus
__pow__(x, y): Power
__pos__(x): Unary plus
__neg__(x): Unary minus
Bitwise operators¶
__and__(x, y): Bitwise AND
__or__(x, y): Bitwise OR
__xor__(x, y): Bitwise XOR
__invert__(x): Bitwise NOT
__lshift__(x, y): Left shift
__rshift__(x, y): Right shift
Relational operators¶
__eq__(x, y): Equal
__ne__(x, y): Not equal
__lt__(x, y): Less than
__le__(x, y): Less than or equal
__gt__(x, y): Greater than
__ge__(x, y): Greater than or equal
datatable.FExpr.__add__()¶
Add two FExprs together, which corresponds to the python operator +.
If x or y are multi-column expressions, then they must have the same number of columns, and the + operator will be applied to each corresponding pair of columns. If either x or y is single-column while the other is multi-column, then the single-column expression will be repeated to match the number of columns of its opponent.
The result of adding two columns with different stypes will have the following stype:
- max(x.stype, y.stype, int32) if both columns are numeric (i.e. bool, int or float);
- str32/str64 if at least one of the columns is a string. In this case the + operator implements string concatenation, same as in Python.
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x + y.
datatable.FExpr.__and__()¶
Compute bitwise AND of x and y.
If x or y are multi-column expressions, then they must have the same number of columns, and the & operator will be applied to each corresponding pair of columns. If either x or y is single-column while the other is multi-column, then the single-column expression will be repeated to match the number of columns of its opponent.
The AND operator can only be applied to integer or boolean columns. The resulting column will have stype equal to the larger of the stypes of its arguments.
When both x and y are boolean, then the bitwise AND operator is equivalent to logical AND. This can be used to combine several logical conditions into a compound one (since Python doesn't allow overloading of operator and). Beware, however, that & has higher precedence than and, so it is advisable to always use parentheses:
DT[(f.x >= 0) & (f.x <= 1), :]
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x & y.
Warning: Use x & y in order to AND two boolean FExprs. Using the standard Python keyword and will result in an error.
datatable.FExpr.__bool__()¶
Using this operator will result in a TypeError.
The boolean-cast operator is used by Python whenever it wants to know whether the object is equivalent to a single True or False value. This is not applicable to an FExpr, which represents a stored computation on a column or multiple columns. As such, an error is raised.
In order to convert a column into the boolean stype, you can use the type-cast operator dt.bool8(x).
datatable.FExpr.__eq__()¶
Compare whether values in columns x and y are equal.
Like all other FExpr operators, the equality operator is elementwise: it produces a column where each element is the result of the comparison x[i] == y[i].
If x or y are multi-column expressions, then they must have the same number of columns, and the == operator will be applied to each corresponding pair of columns. If either x or y is single-column while the other is multi-column, then the single-column expression will be repeated to match the number of columns of its opponent.
The equality operator can be applied to columns of any type, and the types of x and y are allowed to be different. In the latter case the columns will be converted into a common stype before the comparison. In practice it means, for example, that the comparison 1 == "1" evaluates to True.
Lastly, the comparison x == None is exactly equivalent to the isna() function.
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x == y. The produced column will have stype bool8.
datatable.FExpr.__floordiv__()¶
Perform integer division of two FExprs, i.e. x // y.
The modulus and integer division together satisfy the identity x == (x // y) * y + (x % y) for all non-zero values of y.
If x or y are multi-column expressions, then they must have the same number of columns, and the // operator will be applied to each corresponding pair of columns. If either x or y is single-column while the other is multi-column, then the single-column expression will be repeated to match the number of columns of its opponent.
The integer division operation can only be applied to integer columns. The resulting column will have stype equal to the largest of the stypes of both columns, but at least int32.
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x // y.
datatable.FExpr.__ge__()¶
Compare whether x >= y.
Like all other FExpr operators, the greater-than-or-equal operator is elementwise: it produces a column where each element is the result of the comparison x[i] >= y[i].
If x or y are multi-column expressions, then they must have the same number of columns, and the >= operator will be applied to each corresponding pair of columns. If either x or y is single-column while the other is multi-column, then the single-column expression will be repeated to match the number of columns of its opponent.
The greater-than-or-equal operator can be applied to columns of any type, and the types of x and y are allowed to be different. In the latter case the columns will be converted into a common stype before the comparison.
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x >= y. The produced column will have stype bool8.
datatable.FExpr.__gt__()¶
Compare whether x > y.
Like all other FExpr operators, the greater-than operator is elementwise: it produces a column where each element is the result of the comparison x[i] > y[i].
If x or y are multi-column expressions, then they must have the same number of columns, and the > operator will be applied to each corresponding pair of columns. If either x or y is single-column while the other is multi-column, then the single-column expression will be repeated to match the number of columns of its opponent.
The greater-than operator can be applied to columns of any type, and the types of x and y are allowed to be different. In the latter case the columns will be converted into a common stype before the comparison.
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x > y. The produced column will have stype bool8.
datatable.FExpr.__init__()¶
Create a new FExpr object out of e.
The FExpr serves as a simple wrapper of the underlying object, allowing it to be combined with other FExprs.
This constructor almost never needs to be called manually by the user.
e: None | bool | int | str | float | slice | list | tuple | dict | type | stype | ltype | Generator | FExpr | Frame | range | pd.DataFrame | pd.Series | np.array | np.ma.masked_array
datatable.FExpr.__invert__()¶
Compute bitwise NOT of x, which corresponds to the python operation ~x.
If x is a multi-column expression, then the ~ operator will be applied to each column in turn.
Bitwise NOT can only be applied to integer or boolean columns. The resulting column will have the same stype as its argument.
When the argument x is a boolean column, then ~x is equivalent to logical NOT. This can be used to negate a condition, similar to the python operator not (which is not overloadable).
x: FExpr
    Either an FExpr, or any object that can be converted into an FExpr.
return: FExpr
    An expression that evaluates ~x.
Warning: Use ~x in order to negate a boolean FExpr. Using the standard Python keyword not will result in an error.
datatable.FExpr.__le__()¶
Compare whether x <= y.
Like all other FExpr operators, the less-than-or-equal operator is elementwise: it produces a column where each element is the result of the comparison x[i] <= y[i].
If x or y are multi-column expressions, then they must have the same number of columns, and the <= operator will be applied to each corresponding pair of columns. If either x or y is single-column while the other is multi-column, then the single-column expression will be repeated to match the number of columns of its opponent.
The less-than-or-equal operator can be applied to columns of any type, and the types of x and y are allowed to be different. In the latter case the columns will be converted into a common stype before the comparison.
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x <= y. The produced column will have stype bool8.
datatable.FExpr.__lshift__()¶
Shift x by y bits to the left, i.e. x << y. Mathematically this is equivalent to \(x\cdot 2^y\).
If x or y are multi-column expressions, then they must have the same number of columns, and the << operator will be applied to each corresponding pair of columns. If either x or y is single-column while the other is multi-column, then the single-column expression will be repeated to match the number of columns of its opponent.
The left-shift operator can only be applied to integer columns, and the resulting column will have the same stype as its argument.
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x << y.
See also: __rshift__(x, y) (right shift).
datatable.FExpr.__lt__()¶
Compare whether x < y.
Like all other FExpr operators, the less-than operator is elementwise: it produces a column where each element is the result of the comparison x[i] < y[i].
If x or y are multi-column expressions, then they must have the same number of columns, and the < operator will be applied to each corresponding pair of columns. If either x or y is single-column while the other is multi-column, then the single-column expression will be repeated to match the number of columns of its opponent.
The less-than operator can be applied to columns of any type, and the types of x and y are allowed to be different. In the latter case the columns will be converted into a common stype before the comparison.
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x < y. The produced column will have stype bool8.
datatable.FExpr.__mod__()¶
Compute the remainder of division of two FExprs, i.e. x % y.
The modulus and integer division together satisfy the identity x == (x // y) * y + (x % y) for all non-zero values of y. In addition, the result of x % y is always in the range [0; y) for positive y, and in the range (y; 0] for negative y.
If x or y are multi-column expressions, then they must have the same number of columns, and the % operator will be applied to each corresponding pair of columns. If either x or y is single-column while the other is multi-column, then the single-column expression will be repeated to match the number of columns of its opponent.
The modulus operation can only be applied to integer columns. The resulting column will have stype equal to the largest of the stypes of both columns, but at least int32.
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x % y.
datatable.FExpr.__mul__()¶
Multiply two FExprs together, which corresponds to the python operator *.
If x or y are multi-column expressions, then they must have the same number of columns, and the * operator will be applied to each corresponding pair of columns. If either x or y is single-column while the other is multi-column, then the single-column expression will be repeated to match the number of columns of its opponent.
The multiplication operation can only be applied to numeric columns. The resulting column will have stype equal to the larger of the stypes of its arguments, but at least int32.
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x * y.
datatable.FExpr.__ne__()¶
Compare whether values in columns x and y are not equal.
Like all other FExpr operators, the inequality operator is elementwise: it produces a column where each element is the result of the comparison x[i] != y[i].
If x or y are multi-column expressions, then they must have the same number of columns, and the != operator will be applied to each corresponding pair of columns. If either x or y is single-column while the other is multi-column, then the single-column expression will be repeated to match the number of columns of its opponent.
The inequality operator can be applied to columns of any type, and the types of x and y are allowed to be different. In the latter case the columns will be converted into a common stype before the comparison.
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x != y. The produced column will have stype bool8.
datatable.FExpr.__neg__()¶
Unary minus, which corresponds to the python operation -x.
If x is a multi-column expression, then the - operator will be applied to each column in turn.
Unary minus can only be applied to numeric columns. The resulting column will have the same stype as its argument, but not less than int32.
x: FExpr
    Either an FExpr, or any object that can be converted into an FExpr.
return: FExpr
    An expression that evaluates -x.
datatable.FExpr.__or__()¶
Compute bitwise OR of x and y.
If x or y are multi-column expressions, then they must have the same number of columns, and the | operator will be applied to each corresponding pair of columns. If either x or y is single-column while the other is multi-column, then the single-column expression will be repeated to match the number of columns of its opponent.
The OR operator can only be applied to integer or boolean columns. The resulting column will have stype equal to the larger of the stypes of its arguments.
When both x and y are boolean, then the bitwise OR operator is equivalent to logical OR. This can be used to combine several logical conditions into a compound one (since Python doesn't allow overloading of operator or). Beware, however, that | has higher precedence than or, so it is advisable to always use parentheses:
DT[(f.x < -1) | (f.x > 1), :]
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x | y.
Warning: Use x | y in order to OR two boolean FExprs. Using the standard Python keyword or will result in an error.
datatable.FExpr.__pos__()¶
Unary plus, which corresponds to the python operation +x.
If x is a multi-column expression, then the + operator will be applied to each column in turn.
Unary plus can only be applied to numeric columns. The resulting column will have the same stype as its argument, but not less than int32.
x: FExpr
    Either an FExpr, or any object that can be converted into an FExpr.
return: FExpr
    An expression that evaluates +x.
datatable.FExpr.__pow__()¶
Raise x to the power y, or in math notation \(x^y\).
If x or y are multi-column expressions, then they must have the same number of columns, and the ** operator will be applied to each corresponding pair of columns. If either x or y is single-column while the other is multi-column, then the single-column expression will be repeated to match the number of columns of its opponent.
The power operator can only be applied to numeric columns, and the resulting column will have stype float64 in all cases except when both arguments are float32 (in which case the result is also float32).
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x ** y.
datatable.FExpr.__repr__()¶
Return the string representation of this object. This method is used by Python's built-in function repr().
The returned string has the following format:
"FExpr<...>"
where ... will attempt to match the expression used to construct this FExpr.
datatable.FExpr.__rshift__()¶
Shift x by y bits to the right, i.e. x >> y. Mathematically this is equivalent to \(\lfloor x\cdot 2^{-y} \rfloor\).
If x or y are multi-column expressions, then they must have the same number of columns, and the >> operator will be applied to each corresponding pair of columns. If either x or y is single-column while the other is multi-column, then the single-column expression will be repeated to match the number of columns of its opponent.
The right-shift operator can only be applied to integer columns, and the resulting column will have the same stype as its argument.
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x >> y.
See also: __lshift__(x, y) (left shift).
datatable.FExpr.__sub__()¶
Subtract two FExprs, which corresponds to the python operation x - y.
If x or y are multi-column expressions, then they must have the same number of columns, and the - operator will be applied to each corresponding pair of columns. If either x or y is single-column while the other is multi-column, then the single-column expression will be repeated to match the number of columns of its opponent.
The subtraction operation can only be applied to numeric columns. The resulting column will have stype equal to the larger of the stypes of its arguments, but at least int32.
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x - y.
datatable.FExpr.__truediv__()¶
Divide two FExprs, which corresponds to the python operation x / y.
If x or y are multi-column expressions, then they must have the same number of columns, and the / operator will be applied to each corresponding pair of columns. If either x or y is single-column while the other is multi-column, then the single-column expression will be repeated to match the number of columns of its opponent.
The division operation can only be applied to numeric columns. The resulting column will have stype float64 in all cases except when both arguments have stype float32 (in which case the result is also float32).
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x / y.
datatable.FExpr.__xor__()¶
Compute bitwise XOR of x and y.
If x or y are multi-column expressions, then they must have the same number of columns, and the ^ operator will be applied to each corresponding pair of columns. If either x or y is single-column while the other is multi-column, then the single-column expression will be repeated to match the number of columns of its opponent.
The XOR operator can only be applied to integer or boolean columns. The resulting column will have stype equal to the larger of the stypes of its arguments.
When both x and y are boolean, then the bitwise XOR operator is equivalent to logical XOR. This can be used to combine several logical conditions into a compound one (Python has no logical XOR operator or keyword). Beware, however, that ^ has higher precedence than the comparison operators, so it is advisable to always use parentheses:
DT[(f.x == 0) ^ (f.y == 0), :]
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x ^ y.
datatable.FExpr.extend()¶
Append FExpr arg to the current FExpr.
Each FExpr represents a collection of columns, or a columnset. This method takes two such columnsets and combines them into a single one, similar to cbind().
datatable.FExpr.len()¶
Deprecated since version 0.11.
Return the string length for a string column. This method can only be applied to string columns, and it returns an integer column as a result.
Since version 1.0 this function will be available in the str. module.
datatable.FExpr.re_match()¶
Deprecated since version 0.11.
Test whether values in a string column match a regular expression.
Since version 1.0 this function will be available in the re. submodule.
datatable.FExpr.remove()¶
Remove columns arg from the current FExpr.
Each FExpr represents a collection of columns, or a columnset. Some of those columns are computed while others are specified "by reference", for example f.A, f[:3] or f[int]. This method allows you to remove by-reference columns from an existing FExpr.
datatable.Frame¶
Two-dimensional column-oriented container of data. This is the primary data structure in the datatable module.
A Frame is two-dimensional in the sense that it is comprised of rows and columns of data. Each data cell can be located via a pair of its coordinates: (irow, icol). We do not support frames with more or fewer than two dimensions.
A Frame is column-oriented in the sense that internally the data is stored separately for each column. Each column has its own name and type. Types may be different for different columns but cannot vary within each column.
Thus, the dimensions of a Frame are not symmetrical: a Frame is not a matrix. Internally the class is optimized for the use case when the number of rows significantly exceeds the number of columns.
A Frame can be viewed as a list of columns: the standard Python function len() will return the number of columns in the Frame, and frame[j] will return the column at index j (each "column" will be a Frame with ncols == 1). Similarly, you can iterate over the columns of a Frame in a loop, or use it in a *-expansion:
for column in frame:
...
list_of_columns = [*frame]
A Frame can also be viewed as a dict of columns, where the key associated with each column is its name. Thus, frame[name] will return the column with the requested name. A Frame can also work with the standard python **-expansion:
dict_of_columns = {**frame}
Construction¶
Frame(_data, names=..., stypes=..., **cols): Construct the frame from various Python sources.
dt.fread(src): Read an external file and convert into a Frame.
copy(): Create a copy of the frame.
Frame manipulation¶
DT[i, j, ...]: Primary method for extracting data from a frame.
DT[i, j, ...] = values: Update data within the frame.
del DT[i, j, ...]: Remove rows/columns/values from the frame.
cbind(*frames): Append columns of other frames to this frame.
rbind(*frames): Append other frames at the bottom of the current.
replace(what, with): Search and replace values in the frame.
sort(cols): Sort the frame by the specified columns.
Convert into other formats¶
to_csv(file): Write the frame's data into CSV format.
to_dict(): Convert the frame into a Python dictionary, by columns.
to_jay(file): Store the frame's data into a binary file in Jay format.
to_list(): Return the frame's data as a list of lists, by columns.
to_numpy(): Convert the frame into a numpy array.
to_pandas(): Convert the frame into a pandas DataFrame.
to_tuples(): Return the frame's data as a list of tuples, by rows.
Properties¶
key: The primary key for the Frame, if any.
ltypes: Logical types (ltypes) of all columns.
names: The names of all columns in the frame.
ncols: Number of columns in the frame.
nrows: Number of rows in the frame.
shape: A tuple (number of rows, number of columns).
source: Where this frame was loaded from.
stype: The common stype for all columns of the frame.
stypes: Storage types (stypes) of all columns.
Other methods¶
colindex(name): Find the position of a column in the frame by its name.
export_names(): Create python variables for each column of the frame.
head(n): Return the first few rows of the frame.
materialize(): Make sure all frame's data is physically written to memory.
tail(n): Return the last few rows of the frame.
Special methods¶
These methods are not intended to be called manually; instead they provide a way for datatable to interoperate with other Python modules or builtin functions.
__copy__(): Used by Python module copy.
__deepcopy__(): Used by Python module copy.
__delitem__(): Method that implements the del DT[...] call.
__getitem__(): Method that implements the DT[...] call.
__getstate__(): Used by Python module pickle.
__init__(...): The constructor function.
__iter__(): Used by Python function iter().
__len__(): Used by Python function len().
__repr__(): Used by Python function repr().
__reversed__(): Used by Python function reversed().
__setitem__(): Method that implements the DT[...] = expr call.
__setstate__(): Used by Python module pickle.
__sizeof__(): Used by sys.getsizeof().
__str__(): Used by Python function str().
_repr_html_(): Used to display the frame in Jupyter Lab.
_repr_pretty_(): Used to display the frame in an IPython console.
datatable.Frame.__init__()¶
Create a new Frame from a single or multiple sources.
Argument _data (or **cols) contains the source data for the Frame's columns. Column names are either derived from the data, given explicitly via the names argument, or generated automatically. Either way, the constructor ensures that column names are unique, non-empty, and do not contain certain special characters (see Name mangling for details).
_data: Any
    The first argument to the constructor represents the source from which to construct the Frame. If this argument is given, then the varkwd arguments **cols should not be used. This argument can accept a wide range of data types; see the "Details" section below.
names: List[str|None]
    Explicit list (or tuple) of column names. The number of elements in the list must be the same as the number of columns being constructed. This parameter should not be used when constructing the frame from **cols.
stypes: List[stype-like] | Dict[str, stype-like]
    Explicit list (or tuple) of column types. The number of elements in the list must be the same as the number of columns being constructed.
return: Frame
    A Frame object is constructed and returned.
The shape of the constructed Frame depends on the type of the source argument _data (or **cols). The argument _data and varkwd arguments **cols are mutually exclusive: they cannot be used at the same time. However, it is possible to use neither and construct an empty frame:
dt.Frame() # empty 0x0 frame
dt.Frame(None) # same
dt.Frame([]) # same
The varkwd arguments **cols can be used to construct a Frame by columns. In this case the keys become column names, and the values are column initializers. This form is mostly used for convenience; it is equivalent to converting cols into a dict and passing it as the first argument:
dt.Frame(A = range(7),
B = [0.1, 0.3, 0.5, 0.7, None, 1.0, 1.5],
C = ["red", "orange", "yellow", "green", "blue", "indigo", "violet"])
# equivalent to
dt.Frame({"A": range(7), "B": [0.1, 0.3, ...], "C": ["red", "orange", ...]})
The argument _data accepts a wide range of input types. The following list describes the possible choices:
List[List | Frame | np.array | pd.DataFrame | pd.Series | range | typed_list]
When the source is a non-empty list containing other lists or compound objects, then each item will be interpreted as a column initializer, and the resulting frame will have as many columns as the number of items in the list.
Each element in the list must produce a single column. Thus, it is not allowed to use multi-column Frames, or multi-dimensional numpy arrays or pandas DataFrames.
>>> dt.Frame([[1, 3, 5, 7, 11],
...           [12.5, None, -1.1, 3.4, 9.17]])
   | C0    C1
-- + --  -----
 0 |  1   12.5
 1 |  3   NA
 2 |  5   -1.1
 3 |  7    3.4
 4 | 11    9.17
--
[5 rows x 2 columns]
Note that unlike pandas and numpy, we treat a list of lists as a list of columns, not a list of rows. If you need to create a Frame from a row-oriented store of data, you can use a list of dictionaries or a list of tuples as described below.
List[Dict]
If the source is a list of dict objects, then each element in this list is interpreted as a single row. The keys in each dictionary are column names, and the values contain contents of each individual cell.
The rows don't have to have the same number or order of entries: all missing elements will be filled with NAs:
>>> dt.Frame([{"A": 3, "B": 7},
...           {"A": 0, "B": 11, "C": -1},
...           {"C": 5}])
   |  A   B   C
-- + --  --  --
 0 |  3   7  NA
 1 |  0  11  -1
 2 | NA  NA   5
--
[3 rows x 3 columns]
If the names parameter is given, then only the keys given in the list of names will be taken into account; all extra fields will be discarded.
List[Tuple]
If the source is a list of tuples, then each tuple represents a single row. The tuples must have the same size, otherwise an exception will be raised:
>>> dt.Frame([(39, "Mary"),
...           (17, "Jasmine"),
...           (23, "Lily")], names=['age', 'name'])
   | age  name
-- + ---  -------
 0 |  39  Mary
 1 |  17  Jasmine
 2 |  23  Lily
--
[3 rows x 2 columns]
If the tuples are in fact namedtuples, then the field names will be used for the column names in the resulting Frame. No check is made whether the named tuples in fact belong to the same class.
List[Any]
If the list's first element does not match any of the cases above, then it is considered a "list of primitives". Such a list will be parsed as a single column.
The entries are typically bools, ints, floats, strs, or Nones; numpy scalars are also allowed. If the list has elements of heterogeneous types, then we will attempt to convert them to the smallest common stype.
- If the list contains only boolean values (or Nones), then it will create a column of type bool8.
- If the list contains only integers (or Nones), then the resulting column will be int8 if all integers are 0 or 1; or int32 if all entries are less than \(2^{31}\) in magnitude; otherwise int64 if all entries are less than \(2^{63}\) in magnitude; or otherwise float64.
- If the list contains floats, then the resulting column will have stype float64. Both None and math.nan can be used to input NA values.
- Finally, if the list contains strings then the column produced will have stype str32 if the total size of the character data is less than 2Gb, or str64 otherwise.
typed_list
A typed list can be created by taking a regular list and dividing it by an stype. It behaves similarly to a simple list of primitives, except that it is parsed into the specific stype.
>>> dt.Frame([1.5, 2.0, 3.87] / dt.float32).stype
stype.float32
Dict[str, Any]
The keys are column names, and values can be any objects from which a single-column frame can be constructed: list, range, np.array, single-column Frame, pandas series, etc.
Constructing a frame from a dictionary d is exactly equivalent to calling dt.Frame(list(d.values()), names=list(d.keys())).
range
Same as if the range was expanded into a list of integers, except that the column created from a range is virtual and its creation time is nearly instant regardless of the range’s length.
Frame
If the argument is a Frame, then a shallow copy of that frame will be created, same as copy().
str
If the source is a simple string, then the frame is created by fread-ing this string. In particular, if the string contains the name of a file, the data will be loaded from that file; if it is a URL, the data will be downloaded and parsed from that URL. Lastly, the string may simply contain a table of data.
>>> DT1 = dt.Frame("train.csv")
>>> DT2 = dt.Frame("""
...    Name    Age
...    Mary    39
...    Jasmine 17
...    Lily    23
... """)
pd.DataFrame | pd.Series
A pandas DataFrame (Series) will be converted into a datatable Frame. Column names will be preserved.
Column types will generally be the same, provided they have a corresponding stype in datatable. If not, the column will be converted: for example, a pandas date/time column will be converted into a string column, while float16 will be converted into float32.

If a pandas frame has an object column, we will attempt to refine it into a more specific stype. In particular, we can detect a string or boolean column stored as object in pandas.
np.array
A numpy array will get converted into a Frame of the same shape (provided that it is 2- or less- dimensional) and the same type.
If possible, we will create a Frame without copying the data (however, this is subject to numpy’s approval). The resulting frame will have copy-on-write semantics.
None
When the source is not given at all, then a 0x0 frame will be created; unless a
names
parameter is provided, in which case the resulting frame will have 0 rows but as many columns as given in thenames
list.
datatable.Frame.__copy__()¶
This method facilitates copying of a Frame via the python standard module
copy
. See copy()
for more details.
datatable.Frame.__delitem__()¶
This method deletes rows and columns that would have been selected from
the frame if not for the del
keyword.
All parameters have the same meaning as in the getter
DT[i, j, ...]
, with the only
restriction that j
must select columns from the main frame only (i.e. not
from the joined frame(s)), and it must select them by reference. Selecting
by reference means it should be possible to tell where each column was in
the original frame.
There are several modes of the delete operation, depending on whether i or j is the “slice-all” symbol.
datatable.Frame.__getitem__()¶
The main method for accessing data and computing on the frame. Sometimes
we also refer to it as the DT[i, j, ...]
call.
Since Python does not support keyword arguments inside square brackets,
all arguments are positional. The first is the row selector i
, the
second is the column selector j
, and the rest are optional. Thus,
DT[i, j]
selects rows i
and columns j
from frame DT
.
If an additional by
argument is present, then the selectors i
and
j
work within groups generated by the by()
expression. The sort
argument reorders the rows of the frame, and the join
argument allows
performing SQL joins between several frames.
The signature listed here is the most generic. But there are also
special-case signatures DT[j]
and DT[i, j]
described below.
int
| slice
| Frame
| FExpr
| List[bool]
| List[Any]
The row selector.
If this is an integer or a slice, then the behavior is the same as in
Python when working on a list with nrows
elements. In particular, the integer value must be within the range
[-nrows; nrows)
. On the other hand when i
is a slice, then either
its start or end or both may be safely outside the row-range of the
frame. The trivial slice :
always selects all rows.
i
may also be a single-column boolean Frame. It must have the
same number of rows as the current frame, and it serves as a mask for
which rows are to be selected: True
indicates that the row should
be included in the result, while False
and None
skip the row.
i
may also be a single-column integer Frame. Such column specifies
directly which row indices are to be selected. This is more flexible
compared to a boolean column: the rows may be repeated, reordered,
omitted, etc. All values in the column i
must be in the range
[0; nrows)
or an error will be thrown. In particular, negative
indices are not allowed. Also, if the column contains NA values, then
each will produce an “invalid row”, i.e. a row filled with NAs.
i
may also be an expression, which must evaluate into a single
column, either boolean or integer. In this case the result is the
same as described above for a single-column frame.
When i
is a list of booleans, then it is equivalent to a single-column
boolean frame. In particular, the length of the list must be equal to
nrows
.
Finally, i
can be a list of any of the above (integers, slices, frames,
expressions, etc), in which case each element of the list is evaluated
separately and then all selected rows are put together. The list may
contain None
s, which will be simply skipped.
int
| str
| slice
| list
| dict
| type
| FExpr
| update
This argument may either select columns, or perform computations with the columns.
int
Select a single column at the specified index. An
IndexError
is raised if j is not in the range [-ncols; ncols).

str
Select a single column by name. A
KeyError
is raised if the column with such a name does not exist.:
This is the trivial slice, meaning “select everything”; it is roughly equivalent to SQL’s *. In the simple case of a DT[i, j] call, “selecting everything” means all columns from frame DT. However, when a by() clause is added, : selects all columns except those used in the groupby. And if the expression has a join(), then “selecting everything” produces all columns from all frames, excluding those that were duplicated during a natural join.

slice[int]
An integer slice can be used to select a subset of columns. The behavior of a slice is exactly the same as in base Python.
slice[str]
A string slice is an expression like "colA":"colZ". In this case all columns from "colA" to "colZ" inclusive are selected. And if "colZ" appears before "colA" in the frame, then the returned columns will be in the reverse order.

Both endpoints of the slice must be valid columns (or omitted), otherwise a KeyError will be raised.
|stype
|ltype
Select only columns of the matching type.
FExpr
An expression formula is computed within the current evaluation context (i.e. it takes into account the current frame, the filter
i
, the presence of groupby/join parameters, etc). The result of this evaluation is used as if that column existed in the frame.

List[bool]
If
j
is a list of boolean values, then it must have the length ofncols
, and it describes which columns are to be selected into the result.List[Any]
The
j
can also be a list of elements of any other type listed above, with the only restriction that the items must be homogeneous. For example, you can mixint
s andslice[int]
s, but notint
s andFExpr
s, orint
s andstr
s.Each item in the list will be evaluated separately (as if each was the sole element in
j
), and then all the results will be put together.Dict[str, FExpr]
A dictionary can be used to select columns/expressions similarly to a list, but assigning them explicit names.
update
As a special case, the j argument may be the update() function, which turns the selection operation into an update: instead of being returned, the chosen rows/columns will be updated with the user-supplied values.
by
When by()
clause is present in the square brackets, the rest of the
computations are carried out within the “context of a groupby”. This
should generally be equivalent to (a) splitting the frame into separate
sub-frames corresponding to each group, (b) applying DT[i, j]
separately within each group, (c) row-binding the results for each
group. In practice the following operations are affected:
all reduction operators such as
dt.min()
ordt.sum()
now work separately within each group. Thus, instead of computing sum over the entire column, it is computed separately within each group inby()
, and the resulting column will have as many rows as the number of groups.certain
i
expressions are re-interpreted as being applied within each group. For example, ifi
is an integer or a slice, then it will now be selecting row(s) within each group.certain functions (such as
dt.shift()
) are also “group-aware”, and produce results that take into account the groupby context. Check documentation for each individual function to find out whether it has special treatment for groupby contexts.
In addition, by()
also affects the order of columns in the output
frame. Specifically, all columns listed as the groupby keys will be
automatically placed at the front of the resulting frame, and also
excluded from :
or f[:]
within j
.
sort
This argument can be used to rearrange rows in the resulting frame.
See sort()
for details.
join
Performs a JOIN operation with another frame. The
join()
clause will calculate how the rows
of the current frame match against the rows of the joined frame, and
allow you to refer to the columns of the joined frame within i
, j
or by
. In order to access columns of the joined frame, use the g. namespace.
This parameter may be listed multiple times if you need to join with several frames.
The order of evaluation of expressions is that first the join
clause(s)
are computed, creating a mapping between the rows of the current frame and
the joined frame(s). After that we evaluate by
+sort
. Next, the i
filter is applied creating the final index of rows that will be selected.
Lastly, we evaluate the j
part, taking into account the current groupby
and row index(es).
When evaluating j
, it is essentially converted into a tree (DAG) of
expressions, where each expression is evaluated from the bottom up. That
is, we start evaluating from the leaf nodes (which are usually column
selectors such as f[0]
), and then at each node convert the set of columns
into a new set. Importantly, each subexpression node may produce columns
of 3 types: “scalar”, “grouped”, and “full-size”. Whenever subexpressions
of different levels are mixed together, they are upgraded to the highest
level. Thus, a scalar may be reused for each group, and a grouped column
can interoperate with a regular column by auto-expanding in such a way
that it becomes constant within each group.
If, after the j
is fully evaluated, it produces a column set of type
“grouped”, then the resulting frame will have as many rows as there are
groups. If, on the other hand, the column set is “full-size”, then the
resulting frame will have as many rows as the original frame.
DT[i, j, ...] = R
– update values in the frame.del DT[i, j, ...]
– delete rows/columns of the frame.
Extract a single column j
from the frame.
The single-argument version of DT[i, j]
works only for j
being
either an integer (indicating column index) or a string (column name).
If you need any other way of addressing column(s) of the frame, use the
more versatile DT[:, j]
form.
int
| str
The index or name of a column to retrieve.
Frame
Single-column frame containing the column at the specified index or with the given name.
KeyError
The exception is raised if the column with the given name does not exist in the frame.
IndexError
The exception is raised if the column does not exist at the provided
index j
.
datatable.Frame.__getstate__()¶
This method allows the frame to be pickle
-able.
Pickling a Frame involves saving it into a bytes
object in Jay format,
but may be less efficient than saving into a file directly because Python
creates a copy of the data for the bytes object.
See to_jay()
for more details and caveats about saving into Jay
format.
datatable.Frame.__iter__()¶
Returns an iterator over the frame’s columns.
The iterator is a light-weight object of type frame_iterator
,
which yields consecutive columns of the frame with each iteration.
Thus, the iterator produces the sequence frame[0], frame[1],
frame[2], ...
until the end of the frame. This works even if
the user adds or deletes columns in the frame while iterating.
Be careful when inserting/deleting columns at an index that was
already iterated over, as it will cause some columns to be
skipped or visited more than once.
This method is not intended for manual use. Instead, it is
invoked by Python runtime either when you call iter()
,
or when you use the frame in a loop:
for column in frame:
# column is a Frame of shape (frame.nrows, 1)
...
datatable.Frame.__len__()¶
datatable.Frame.__repr__()¶
Returns a simple representation of the frame as a string. This
method is used by Python’s built-in function repr()
.
The returned string has the following format:
"<Frame#{ID} {nrows}x{ncols}>"
where {ID}
is the value of id(frame)
in hex format. Thus,
each frame has its own unique id, though after one frame is
deleted its id may be reused by another frame.
datatable.Frame.__reversed__()¶
Returns an iterator over the frame’s columns in reverse order.
This is similar to __iter__()
, except that the columns
are returned in the reverse order, i.e. frame[-1], frame[-2],
frame[-3], ...
.
This function is not intended for manual use. Instead, it is
invoked by Python builtin function reversed()
.
datatable.Frame.__setitem__()¶
This methods updates values within the frame, or adds new columns to the frame.
All parameters have the same meaning as in the getter
DT[i, j, ...]
, with the only
restriction that j
must select columns by reference (i.e. there
cannot be any computed columns there). On the other hand, j
may
contain columns that do not exist in the frame yet: these columns will be
created.
...
Row selector.
...
Column selector. Computed columns are forbidden, but not-existing (new) columns are allowed.
by
Groupby condition.
join
Join criterion.
FExpr
| List[FExpr]
| Frame
| type
| None
| bool
| int
| float
| str
The replacement for the selection on the left-hand-side.
None
|bool
|int
|float
|str
A simple python scalar can be assigned to any-shape selection on the LHS. If
i
selects all rows (i.e. the assignment is of the formDT[:, j] = R
), then each column inj
will be replaced with a constant column containing the valueR
.If, on the other hand,
i
selects only some rows, then the type ofR
must be consistent with the type of column(s) selected inj
. In this case only cells in subset[i, j]
will be updated with the value ofR
; the columns may be promoted within their ltype if the value ofR
is large in magnitude.type
|stype
|ltype
Assigning a type to one or more columns will change the types of those columns. The row selector
i
must be the “slice-all” slice :.

Frame
|FExpr
|List[FExpr]
When a frame or an expression is assigned, then the shape of the RHS must match the shape of the LHS. Similarly to the assignment of scalars, types must be compatible when assigning to a subset of rows.
update()
– An alternative way to update values in the frame withinDT[i, j]
getter.Frame.replace()
– Search and replace for certain values within the entire frame.
A simplified form of the setter, suitable for a single-column replacement.
In this case j
may only be an integer or a string.
datatable.Frame.__sizeof__()¶
Return the size of this Frame in memory.
The function attempts to compute the total memory size of the frame as precisely as possible. In particular, it takes into account not only the size of data in columns, but also sizes of all auxiliary internal structures.
Special cases: if frame is a view (say, d2 = DT[:1000, :]
), then
the reported size will not contain the size of the data, because that
data “belongs” to the original datatable and is not copied. However if
a frame selects only a subset of columns (say, d3 = DT[:, :5]
),
then a view is not created and instead the columns are copied by
reference. Frame d3
will report the “full” size of its columns,
even though they do not occupy any extra memory compared to DT
.
This behavior may be changed in the future.
This function is not intended for manual use. Instead, in order to
get the size of a frame DT
, call sys.getsizeof(DT)
.
datatable.Frame.__str__()¶
Returns a string with the Frame’s data formatted as a table, i.e. the same representation as displayed when trying to inspect the frame from Python console.
Different aspects of the stringification process can be controlled
via dt.options.display
options; but under the default settings
the returned string will be sufficiently small to fit into a
typical terminal window. If the frame has too many rows/columns,
then only a small sample near the start+end of the frame will be
rendered.
datatable.Frame.cbind()¶
Append columns of one or more frames
to the current Frame.
For example, if the current frame has n
columns, and you are
appending another frame with k
columns, then after this method
succeeds, the current frame will have n + k
columns. Thus, this
method is roughly equivalent to pandas.concat(axis=1)
.
The frames being cbound must all either have the same number of rows, or some of them may have only a single row. Such single-row frames will be automatically expanded, replicating the value as needed. This makes it easy to create constant columns or to append reduction results (such as min/max/mean/etc) to the current Frame.
If some of the frames
have an incompatible number of rows, then the
operation will fail with an InvalidOperationError
. However, if
you set the flag force
to True, then the error will no longer be
raised - instead all frames that are shorter than the others will be
padded with NAs.
If the frames being appended have the same column names as the current frame, then those names will be mangled to ensure that the column names in the current frame remain unique. A warning will also be issued in this case.
Frame
| List[Frame]
| None
The list/tuple/sequence/generator expression of Frames to append
to the current frame. The list may also contain None
values,
which will be simply skipped.
bool
If True, allows Frames to be appended even if they have an unequal number of rows. The resulting Frame will have as many rows as the largest of the Frames being appended. Frames with fewer rows will be padded with NAs (except single-row Frames, which will be replicated instead of being filled with NAs).
None
This method alters the current frame in-place, and doesn’t return anything.
InvalidOperationError
If trying to cbind frames with the number of rows different from
the current frame’s, and the option force
is not set.
Cbinding frames is a very cheap operation: the columns are copied by
reference, which means the complexity of the operation depends only
on the number of columns, not on the number of rows. Still, if you
are planning to cbind a large number of frames, it will be beneficial
to collect them in a list first and then call a single cbind()
instead of cbinding them one-by-one.
It is possible to cbind frames using the standard DT[i,j]
syntax:
df[:, update(**frame1, **frame2, ...)]
Or, if you need to append just a single column:
df["newcol"] = frame1
>>> DT = dt.Frame(A=[1, 2, 3], B=[4, 7, 0])
>>> frame1 = dt.Frame(N=[-1, -2, -5])
>>> DT.cbind(frame1)
>>> DT

|   | A | B | N  |
|---|---|---|----|
| 0 | 1 | 4 | −1 |
| 1 | 2 | 7 | −2 |
| 2 | 3 | 0 | −5 |
dt.cbind()
– function for cbinding frames “out-of-place” instead of in-place;rbind()
– method for row-binding frames.
datatable.Frame.colindex()¶
Return the position of the column
in the Frame.
The index of the first column is 0
, just as with regular python
lists.
str
| int
| FExpr
If string, then this is the name of the column whose index you want to find.
If integer, then this represents a column’s index. The return
value is thus the same as the input argument column
, provided
that it is in the correct range. If the column
argument is
negative, then it is interpreted as counting from the end of the
frame. In this case the positive value column + ncols
is
returned.
Lastly, column
argument may also be an
f-expression such as f.A
or f[3]
. This
case is treated as if the argument was simply "A"
or 3
. More
complicated f-expressions are not allowed and will result in a
TypeError
.
int
The numeric index of the provided column
. This will be an
integer between 0
and self.ncols - 1
.
KeyError
| IndexError
If the column
argument is a string, and the column with such
name does not exist in the frame, then a KeyError
is raised.
When this exception is thrown, the error message may contain
suggestions for up to 3 similarly looking column names that
actually exist in the Frame.
If the column
argument is an integer that is either greater
than or equal to ncols
or less than
-ncols
, then an IndexError
is raised.
>>> df = dt.Frame(A=[3, 14, 15], B=["enas", "duo", "treis"],
...               C=[0, 0, 0])
>>> df.colindex("B")
1
>>> df.colindex(-1)
2
>>> from datatable import f
>>> df.colindex(f.A)
0
datatable.Frame.copy()¶
Make a copy of the frame.
The returned frame will be an identical copy of the original, including column names, types, and keys.
By default, copying is shallow with copy-on-write semantics. This means that only the minimal information about the frame is copied, while all the internal data buffers are shared between the copies. Nevertheless, due to the copy-on-write semantics, any changes made to one of the frames will not propagate to the other; instead, the data will be copied whenever the user attempts to modify it.
It is also possible to explicitly request a deep copy of the frame
by setting the parameter deep
to True
. With this flag, the
returned copy will be truly independent from the original. The
returned frame will also be fully materialized in this case.
bool
Flag indicating whether to return a “shallow” (default), or a “deep” copy of the original frame.
Frame
A new Frame, which is the copy of the current frame.
>>> DT1 = dt.Frame(range(5))
>>> DT2 = DT1.copy()
>>> DT2[0, 0] = -1
>>> DT2.to_list()
[[-1, 1, 2, 3, 4]]
>>> DT1.to_list()
[[0, 1, 2, 3, 4]]
Non-deep frame copy is a very low-cost operation: its speed depends on the number of columns only, not on the number of rows. On a regular laptop copying a 100-column frame takes about 30-50µs.
Deep copying is more expensive, since the data has to be physically written to new memory, and if the source columns are virtual, then they need to be computed too.
Another way to create a copy of the frame is using a
DT[i, j]
expression (however, this will not copy the key property):DT[:, :]
Frame
class also supports copying via the standard Python librarycopy
:import copy DT_shallow_copy = copy.copy(DT) DT_deep_copy = copy.deepcopy(DT)
datatable.Frame.export_names()¶
Return a tuple of f-expressions for all columns of the frame.
For example, if the frame has columns “A”, “B”, and “C”, then this
method will return a tuple of expressions (f.A, f.B, f.C)
. If you
assign these to, say, variables A
, B
, and C
, then you
will be able to write column expressions using the column names
directly, without using the f
symbol:
>>> A, B, C = DT.export_names()
>>> DT[A + B > C, :]
The variables that are “exported” refer to each column by name. This means that you can use the variables even after reordering the columns. In addition, the variables will work not only for the frame they were exported from, but also for any other frame that has columns with the same names.
Tuple[Expr, ...]
The length of the tuple is equal to the number of columns in the
frame. Each element of the tuple is a datatable expression, and
can be used primarily with the DT[i,j]
notation.
This method is effectively equivalent to:
def export_names(self): return tuple(f[name] for name in self.names)
If you want to export only a subset of column names, then you can either subset the frame first, or use
*
-notation to ignore the names that you do not plan to use:A, B = DT[:, :2].export_names() # export the first two columns A, B, *_ = DT.export_names() # same
Variables that you use in code do not have to have the same names as the columns:
Price, Quantity = DT[:, ["sale price", "quant"]].export_names()
datatable.Frame.head()¶
Return the first n
rows of the frame.
If the number of rows in the frame is less than n
, then all rows
are returned.
This is a convenience function and it is equivalent to DT[:n, :]
.
int
The maximum number of rows to return, 10 by default. This number cannot be negative.
Frame
A frame containing the first up to n
rows from the original
frame, and same columns.
>>> DT = dt.Frame(A=["apples", "bananas", "cherries", "dates",
...                  "eggplants", "figs", "grapes", "kiwi"])
>>> DT.head(4)

|   | A        |
|---|----------|
| 0 | apples   |
| 1 | bananas  |
| 2 | cherries |
| 3 | dates    |
datatable.Frame.key¶
The tuple of column names that are the primary key for this frame.
If the frame has no primary key, this property returns an empty tuple.
The primary key columns are always located at the beginning of the frame, and therefore the following holds:
DT.key == DT.names[:len(DT.key)]
Assigning to this property will make the Frame keyed by the specified column(s). The key columns will be moved to the front, and the Frame will be sorted. The values in the key columns must be unique.
Tuple[str, ...]
When used as a getter, returns the tuple of names of the primary key columns.
str
| List[str]
| Tuple[str, ...]
| None
Specify a column or a list of columns that will become the new primary key of the Frame. Object columns cannot be used for a key. The values in the key column must be unique; if multiple columns are assigned as the key, then their combined (tuple-like) values must be unique.
If newkey
is None
, then this is equivalent to deleting the
key. When the key is deleted, the key columns remain in the frame,
they merely stop being marked as “key”.
ValueError
Raised when the values in the key column(s) are not unique.
KeyError
Raised when newkey
contains a column name that doesn’t exist
in the Frame.
>>> DT = dt.Frame(A=range(5), B=['one', 'two', 'three', 'four', 'five'])
>>> DT.key = 'B'
>>> DT

| B     | A |
|-------|---|
| five  | 4 |
| four  | 3 |
| one   | 0 |
| three | 2 |
| two   | 1 |
datatable.Frame.keys()¶
datatable.Frame.ltypes¶
datatable.Frame.materialize()¶
Force all data in the Frame to be laid out physically.
In datatable, a Frame may contain “virtual” columns, i.e. columns whose data is computed on-the-fly. This allows better performance for certain types of computations, while also reducing the total memory footprint. The use of virtual columns is generally transparent to the user, and datatable will materialize them as needed.
However, there could be situations where you might want to materialize your Frame explicitly. In particular, materialization will carry out all delayed computations and break internal references on other Frames’ data. Thus, for example if you subset a large frame to create a smaller subset, then the new frame will carry an internal reference to the original, preventing it from being garbage-collected. However, if you materialize the small frame, then the data will be physically copied, allowing the original frame’s memory to be freed.
bool
If True, then, in addition to de-virtualizing all columns, this method will also copy all memory-mapped columns into the RAM.
When you open a Jay file, the Frame that is created will contain
memory-mapped columns whose data still resides on disk. Calling
.materialize(to_memory=True)
will force the data to be loaded
into the main memory. This may be beneficial if you are concerned
about the disk speed, or if the file is on a removable drive, or
if you want to delete the source file.
None
This operation modifies the frame in-place.
datatable.Frame.names¶
The tuple of names of all columns in the frame.
Each name is a non-empty string not containing any ASCII control characters, and jointly the names are unique within the frame.
This property is also assignable: setting DT.names
has the effect
of renaming the frame’s columns without changing their order. When
renaming, the length of the new list of names must be the same as the
number of columns in the frame. It is also possible to rename just a
few of the columns by assigning a dictionary {oldname: newname}
.
Any column not listed in the dictionary will keep its old name.
When setting new column names, we will verify whether they satisfy the requirements mentioned above. If not, a warning will be emitted and the names will be automatically mangled.
Tuple[str, ...]
When used in getter form, this property returns the names of all
frame’s columns, as a tuple. The length of the tuple is equal to
the number of columns in the frame, ncols
.
List[str?]
| Tuple[str?, ...]
| Dict[str, str?]
| None
The most common form is to assign the list or tuple of new
column names. The length of the new list must be equal to the
number of columns in the frame. Some (or all) elements in the list
may be None
’s, indicating that that column should have
an auto-generated name.
If newnames
is a dictionary, then it provides a mapping from
old to new column names. The dictionary may contain fewer entries
than the number of columns in the frame: the columns not mentioned
in the dictionary will retain their names.
Setting the .names
to None
is equivalent to using the
del
keyword: the names will be set to their default values,
which are usually C0, C1, ...
.
ValueError
If the length of the list/tuple newnames
does not match the
number of columns in the frame.
KeyError
If newnames
is a dictionary containing entries that do not
match any of the existing columns in the frame.
>>> DT = dt.Frame([[1], [2], [3]])
>>> DT.names = ['A', 'B', 'C']
>>> DT.names
('A', 'B', 'C')
>>> DT.names = {'B': 'middle'}
>>> DT.names
('A', 'middle', 'C')
>>> del DT.names
>>> DT.names
('C0', 'C1', 'C2')
datatable.Frame.ncols¶
datatable.Frame.nrows¶
Number of rows in the Frame.
Assigning to this property will change the height of the Frame, either by truncating if the new number of rows is smaller than the current, or filling with NAs if the new number of rows is greater.
Increasing the number of rows of a keyed Frame is not allowed.
int
The number of rows can be either zero or a positive integer.
int
The new number of rows for the frame, this should be a nonnegative integer.
datatable.Frame.rbind()¶
Append rows of frames
to the current frame.
This is equivalent to list.extend()
in Python: the frames are
combined by rows, i.e. rbinding a frame of shape [n x k] to a Frame
of shape [m x k] produces a frame of shape [(m + n) x k].
This method modifies the current frame in-place. If you do not want
the current frame modified, then use the rbind()
function.
If frame(s) being appended have columns of types different from the current frame, then these columns will be promoted to the largest of their types: bool -> int -> float -> string.
If you need to append multiple frames, then it is more efficient to
collect them into an array first and then do a single rbind()
, than
it is to append them one-by-one in a loop.
Appending data to a frame opened from disk will force loading the current frame into memory, which may fail with an OutOfMemory exception if the frame is sufficiently big.
Frame
| List[Frame]
One or more frames to append. These frames should have the same
columnar structure as the current frame (unless option force
is
used).
bool
If True, then the frames are allowed to have mismatching set of columns. Any gaps in the data will be filled with NAs.
bool
If True (default), the columns in frames are matched by their
names. For example, if one frame has columns [“colA”, “colB”,
“colC”] and the other [“colB”, “colA”, “colC”] then we will swap
the order of the first two columns of the appended frame before
performing the append. However if bynames
is False, then the
column names will be ignored, and the columns will be matched
according to their order, i.e. i-th column in the current frame
to the i-th column in each appended frame.
None
datatable.Frame.replace()¶
Replace given value(s) replace_what
with replace_with
in the entire Frame.
For each replace value, this method operates only on columns of types
appropriate for that value. For example, if replace_what
is a list
[-1, math.inf, None, "??"]
, then the value -1
will be replaced in integer
columns only, math.inf
only in real columns, None
in columns of all types,
and finally "??"
only in string columns.
The replacement value must match the type of the target being replaced,
otherwise an exception will be thrown. That is, a bool must be replaced with a
bool, an int with an int, a float with a float, and a string with a string.
The None
value (representing NA) matches any column type, and therefore can
be used as either replacement target, or replace value for any column. In
particular, the following is valid: DT.replace(None, [-1, -1.0, ""])
. This
will replace NA values in int columns with -1
, in real columns with -1.0
,
and in string columns with an empty string.
The replace operation never causes a column to change its logical type. Thus,
an integer column will remain integer, a string column will remain string, etc.
However, replacing may cause a column to change its stype, provided that
ltype remains constant. For example, replacing 0
with -999
within an int8
column will cause that column to be converted into the int32
stype.
None
| bool
| int
| float
| list
| dict
Value(s) to search for and replace.
single value
| list
The replacement value(s). If replace_what
is a single value, then this
must be a single value too. If replace_what
is a list, then this could
be either a single value, or a list of the same length. If replace_what
is a dict, then this value should not be passed.
None
Nothing is returned, the replacement is performed in-place.
df = dt.Frame([1, 2, 3] * 3)
df.replace(1, -1)
df.to_list()
[[-1, 2, 3, -1, 2, 3, -1, 2, 3]]
df.replace({-1: 100, 2: 200, "foo": None})
df.to_list()
[[100, 200, 3, 100, 200, 3, 100, 200, 3]]
datatable.Frame.shape¶
datatable.Frame.sort()¶
Sort frame by the specified column(s).
List[str | int]
Names or indices of the columns to sort by. If no columns are given, the Frame will be sorted on all columns.
Frame
New Frame sorted by the provided column(s). The current frame remains unmodified.
datatable.Frame.source¶
The name of the file where this frame was loaded from.
This is a read-only property that describes the origin of the frame. When a frame is loaded from a Jay or CSV file, this property will contain the name of that file. Similarly, if the frame was opened from a URL or a from a shell command, the source will report the original URL / the command.
Certain sources may be converted into a Frame only partially,
in such case the source
property will attempt to reflect this
fact. For example, when opening a multi-file zip archive, the
source will contain the name of the file within the archive.
Similarly, when opening an XLS file with several worksheets, the
source property will contain the name of the XLS file, the name of
the worksheet, and possibly even the range of cells that were read.
str
| None
If the frame was loaded from a file or similar resource, the
name of that file is returned. If the frame was computed, or its
data modified, the property will return None
.
datatable.Frame.stype¶
The common stype
for all columns.
This property is well-defined only for frames where all columns have the same stype.
stype
| None
For frames where all columns have the same stype, this common
stype is returned. If a frame has 0 columns, None
will be
returned.
InvalidOperationError
This exception will be raised if the columns in the frame have different stypes.
datatable.Frame.stypes¶
datatable.Frame.tail()¶
Return the last n
rows of the frame.
If the number of rows in the frame is less than n
, then all rows
are returned.
This is a convenience function and it is equivalent to DT[-n:, :]
(except when n
is 0).
int
The maximum number of rows to return, 10 by default. This number cannot be negative.
Frame
A frame containing the last up to n
rows from the original
frame, and same columns.
DT = dt.Frame(A=["apples", "bananas", "cherries", "dates",
"eggplants", "figs", "grapes", "kiwi"])
DT.tail(3)
A | |
---|---|
▪▪▪▪ | |
0 | figs |
1 | grapes |
2 | kiwi |
datatable.Frame.to_csv()¶
Write the contents of the Frame into a CSV file.
This method uses multiple threads to serialize the Frame’s data. The
number of threads can be configured using the global option
dt.options.nthreads
.
The method supports simple writing to file, appending to an existing file, or creating a python string if no filename was provided. Optionally, the output could be gzip-compressed.
str
Path to the output CSV file that will be created. If the file already exists, it will be overwritten. If no path is given, then the Frame will be serialized into a string, and that string will be returned.
csv.QUOTE_*
| "minimal"
| "all"
| "nonnumeric"
| "none"
"minimal"
|csv.QUOTE_MINIMAL
quote the string fields only as necessary, i.e. if the string starts or ends with whitespace, or contains quote characters, the separator, or any of the C0 control characters (including newlines, etc).
"all"
|csv.QUOTE_ALL
all fields will be quoted: string, numeric, and boolean alike.
"nonnumeric"
|csv.QUOTE_NONNUMERIC
all string fields will be quoted.
"none"
|csv.QUOTE_NONE
none of the fields will be quoted. This option must be used at the user’s own risk: the file produced may not be valid CSV.
bool
| "auto"
This option controls whether or not to write headers into the
output file. If this option is not given (or equal to …), then
the headers will be written unless the option append
is True
and the file path
already exists. Thus, by default the headers
will be written in all cases except when appending content into
an existing file.
bool
If True, then insert the byte-order mark into the output file (the option is False by default). Even if the option is True, the BOM will not be written when appending data to an existing file.
According to the Unicode standard, including a BOM in text files is “neither required nor recommended”. However, some programs (e.g. Excel) may not be able to recognize file encoding without this mark.
bool
If True, then all floating-point values will be printed in hex
format (equivalent to %a format in C printf
). This format is
around 3 times faster to write/read compared to usual decimal
representation, so its use is recommended if you need maximum
speed.
None
| "gzip"
| "auto"
Which compression method to use for the output stream. The default
is “auto”, which tries to infer the compression method from the
output file’s name. The only compression format currently supported
is “gzip”. Compression may not be used when append
is True.
bool
If True, some extra information will be printed to the console, which may help to debug the inner workings of the algorithm.
"mmap"
| "write"
| "auto"
Which method to use for writing to disk. On certain systems ‘mmap’ gives better performance; on other OSes ‘mmap’ may not work at all.
None
| str
| bytes
None if path
is non-empty. This is the most common case: the
output is written to the file provided.
String containing the CSV text as if it would have been written to a file, if the path is empty or None. If the compression is turned on, a bytes object will be returned instead.
datatable.Frame.to_dict()¶
Convert the frame into a dictionary of lists, by columns.
In Python 3.6+ the order of records in the dictionary will be the same as the order of columns in the frame.
Dict[str, List]
Dictionary with ncols
records. Each record
represents a single column: the key is the column’s name, and the
value is the list with the column’s data.
DT = dt.Frame(A=[1, 2, 3], B=["aye", "nay", "tain"])
DT.to_dict()
{"A": [1, 2, 3], "B": ["aye", "nay", "tain"]}
to_list()
: convert the frame into a list of lists
to_tuples()
: convert the frame into a list of tuples by rows
datatable.Frame.to_jay()¶
Save this frame to a binary file on disk, in .jay
format.
str
| None
The destination file name. Although not necessary, we recommend
using extension “.jay” for the file. If the file exists, it will
be overwritten.
If this argument is omitted, the file will be created in memory
instead, and returned as a bytes
object.
'mmap'
| 'write'
| 'auto'
Which method to use for writing the file to disk. The “write”
method is more portable across different operating systems, but
may be slower. This parameter has no effect when path
is
omitted.
datatable.Frame.to_list()¶
Convert the frame into a list of lists, by columns.
List[List]
A list of ncols
lists, each inner list
representing one column of the frame.
DT = dt.Frame(A=[1, 2, 3], B=["aye", "nay", "tain"])
DT.to_list()
[[1, 2, 3], ["aye", "nay", "tain"]]
dt.Frame(id=range(10)).to_list()
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
datatable.Frame.to_numpy()¶
Convert frame into a 2D numpy array, optionally forcing it into the specified stype/dtype.
In a limited set of circumstances the returned numpy array will be created as a data view, avoiding copying the data. This happens if all of these conditions are met:
the frame has only 1 column, which is not virtual;
the column’s type is not string;
the
stype
argument was not used.
In all other cases the returned numpy array will have a copy of the frame’s data. If the frame has multiple columns of different stypes, then the values will be upcasted into the smallest common stype.
If the frame has any NA values, then the returned numpy array will
be an instance of numpy.ma.masked_array
.
stype
| numpy.dtype
| str
| type
Cast frame into this stype before converting it into a numpy array.
int
Convert only the specified column; the returned value will be a 1D-array instead of a regular 2D-array.
numpy.array
ImportError
If the numpy
module is not installed.
datatable.Frame.to_pandas()¶
Convert this frame to a pandas DataFrame.
datatable.Frame.to_tuples()¶
datatable.Frame.view()¶
This function is currently not working properly
datatable.ltype¶
Enumeration of possible “logical” types of a column.
A logical type is a type with the details of its physical
storage stripped away. For example, ltype.int
represents an integer. Under the hood,
this integer can be stored in several “physical” formats: from
stype.int8
to stype.int64
. Thus, there is a one-to-many relationship
between ltypes and stypes.
Values¶
The following ltype values are currently available:
ltype.bool
ltype.int
ltype.real
ltype.str
ltype.time
ltype.obj
Examples¶
dt.ltype.bool
ltype.bool
dt.ltype("int32")
ltype.int
For each ltype, you can find the set of stypes that correspond to it:
dt.ltype.real.stypes
[stype.float32, stype.float64]
dt.ltype.time.stypes
[]
datatable.ltype.__new__()¶
Find an ltype corresponding to value
.
This method is similar to stype.__new__()
, except that it
returns an ltype instead of an stype.
datatable.Namespace¶
A namespace is an environment that provides lazy access to columns of
a frame when performing computations within
DT[i,j,...]
.
This class should not be instantiated directly, instead use the
singleton instances f
and g
exported from the datatable
module.
Special methods¶
__getattribute__() – access columns as attributes.
__getitem__() – access columns by their names / indices.
datatable.Namespace.__getitem__()¶
Retrieve column(s) by their indices/names/types.
By “retrieve” we actually mean that an expression is created
such that when that expression is used within the
DT[i,j]
call, it would locate and
return the specified column(s).
int
| str
| slice
| None
| type
| stype
| ltype
The column selector:
int
Retrieve the column at the specified index. For example,
f[0] denotes the first column, while f[-1] is the last.
str
Retrieve a column by name.
slice
Retrieve a slice of columns from the namespace. Both integer and string slices are supported.
None
Retrieve no columns (an empty columnset).
type
|stype
|ltype
Retrieve columns matching the specified type.
FExpr
An expression that selects the specified column from a frame.
f-expressions – user guide on using f-expressions.
datatable.Namespace.__getattribute__()¶
Retrieve a column from the namespace by name
.
This is a convenience form that can be used to access simply-named
columns. For example: f.Age
denotes a column called "Age"
,
and is exactly equivalent to f['Age']
.
str
Name of the column to select.
FExpr
An expression that selects the specified column from a frame.
__getitem__() – retrieving columns via the [] notation.
datatable.stype¶
Enumeration of possible “storage” types of columns in a Frame.
Each column in a Frame is a vector of values of the same type. We call
this column’s type the “stype”. Most stypes correspond to primitive C types,
such as int32_t
or double
. However some stypes (corresponding to
strings and categoricals) have a more complicated underlying structure.
Notably, datatable
does not support arbitrary structures as
elements of a Column, so the set of stypes is small.
Values¶
The following stype values are currently available:
stype.bool8
stype.int8
stype.int16
stype.int32
stype.int64
stype.float32
stype.float64
stype.str32
stype.str64
stype.obj64
They are available either as properties of the dt.stype
class,
or directly as constants in the dt.
namespace.
For example:
>>> dt.stype.int32
stype.int32
>>> dt.int64
stype.int64
Methods¶
__new__() – find stype corresponding to value.
__call__() – cast a column into the specific stype.
ctype – ctypes type corresponding to this stype.
dtype – numpy dtype corresponding to this stype.
ltype – ltype corresponding to this stype.
min – the smallest numeric value for this stype.
max – the largest numeric value for this stype.
datatable.stype.__call__()¶
Cast column col
into the new stype.
An stype can be used as a function that converts columns into that
specific stype. In the same way as you could write int(3.14)
in
Python to convert a float value into integer, you can likewise
write dt.int32(f.A)
to convert column A
into stype int32
.
FExpr
A single- or multi-column expression. All columns will be converted into the desired stype.
FExpr
Expression that converts its inputs into the current stype.
datatable.stype.__new__()¶
Find an stype corresponding to value
.
This method is called when you attempt to construct a new
stype
object, for example dt.stype(int)
. Instead of
actually creating any new stypes, we return one of the existing
stype values.
str
| type
| np.dtype
An object that will be converted into an stype. This could be
a string such as "integer"
or "int"
or "int8"
, a python
type such as bool
or float
, or a numpy dtype.
ValueError
Raised if value
does not correspond to any stype.
dt.stype(str)
stype.str64
dt.stype("double")
stype.float64
dt.stype(numpy.dtype("object"))
stype.obj64
dt.stype("int64")
stype.int64
datatable.stype.ctype¶
ctypes
class that describes the C-level type of each element
in a column with this stype.
For non-fixed-width columns (such as str32
) this will return the ctype
of only the fixed-width component of that column. Thus,
stype.str32.ctype == ctypes.c_int32
.
datatable.stype.dtype¶
numpy.dtype
object that corresponds to this stype.
datatable.stype.ltype¶
ltype
corresponding to this stype. Several stypes may map to
the same ltype, whereas each stype is described by exactly one ltype.
datatable.stype.max¶
The largest finite value that this stype can represent.
datatable.stype.min¶
The smallest finite value that this stype can represent.
datatable.build_info¶
This is a python struct that contains information about the installed datatable module. The following fields are available:
str
The version string of the current build. Several formats of the version string are possible:
{MAJOR}.{MINOR}.{MICRO}
– the release version string, such as"0.11.0"
.{RELEASE}a{DEVNUM}
– version string for the development build of datatable, where{RELEASE}
is the normal release string and{DEVNUM}
is an integer that is incremented with each build. For example:"0.11.0a1776"
.{RELEASE}a0+{SUFFIX}
– version string for a PR build of datatable, where the{SUFFIX}
is formed from the PR number and the build sequence number. For example,"0.11.0a0+pr2602.13"
.{RELEASE}a0+{FLAVOR}.{TIMESTAMP}.{USER}
– version string used for local builds. This contains the “flavor” of the build, such as normal build, or debug, or coverage, etc; the unix timestamp of the build; and lastly the system user name of the user who made the build.
str
UTC timestamp (date + time) of the build.
str
Git-hash of the revision from which the build was made, as
obtained from git rev-parse HEAD
.
str
Name of the git branch from where the build was made. This will
be obtained from environment variable CHANGE_BRANCH
if defined,
or from command git rev-parse --abbrev-ref HEAD
otherwise.
str
Timestamp of the git commit from which the build was made.
str
If the source tree contains any uncommitted changes (compared
to the checked out git revision), then the summary of these
changes will be in this field, as reported by
git diff HEAD --stat --no-color
. Otherwise, this field
is an empty string.
datatable.by()¶
Group-by clause for use in Frame’s square-bracket selector.
Whenever a by()
object is present inside a DT[i, j, ...]
expression, it causes all other expressions to be evaluated in
group-by mode. This mode causes the following changes to the
evaluation semantics:
A “Groupby” object will be computed for the frame DT, grouping it by the columns specified as arguments to the by() call. This object keeps track of which rows of the frame belong to which group.

If an i expression is present (row filter), it will be interpreted within each group. For example, if i is a slice, then the slice will be applied separately to each group. Similarly, if the i expression contains a formula with reduce functions, then those functions will be evaluated for each group. For example: DT[f.A == max(f.A), :, by(f.group_id)] will select those rows where column A reaches its peak value within each group (there could be multiple such rows within each group).

Before j is evaluated, the by() clause adds all its columns at the start of j (unless the add_columns argument is False). If j is a “select-all” slice (i.e. :), then those columns will also be excluded from the list of all columns so that they will be present in the output only once.

During the evaluation of j, the reducer functions, such as min(), sum(), etc., will be evaluated by-group, that is, they will find the minimal value in each group, the sum of values in each group, and so on. If a reducer expression is combined with a regular column expression, then the reduced column will be auto-expanded into a column that is constant within each group.

Note that if both i and j contain reducer functions, then those functions will have slightly different notions of groups: the reducers in i will see each group “in full”, whereas the reducers in j will see each group after it was filtered by the expression in i (and possibly not see some of the groups at all, if they were filtered out completely).

If j contains only reducer expressions, then the final result will be a Frame containing just a single row for each group. This resulting frame will also be keyed by the grouped-by columns.
The by()
function expects a single column or a sequence of columns
as the argument(s). It accepts either a column name, or an
f-expression. In particular, you can perform a group-by on a
dynamically computed expression:
DT[:, :, by(dt.math.floor(f.A/100))]
The default behavior of groupby is to sort the groups in the ascending
order, with NA values appearing before any other values. As a special
case, if you group by an expression -f.A
, then it will be
treated as if you requested to group by the column “A” sorting it in
the descending order. This will work even with column types that are
not arithmetic, for example “A” could be a string column here.
datatable.cbind()¶
Create a new Frame by appending columns from several frames
.
This function is exactly equivalent to:
dt.Frame().cbind(*frames, force=force)
See also¶
rbind()
– function for row-binding several frames.Frame.cbind()
– Frame method for cbinding some frames to another.
datatable.corr()¶
Calculate the
Pearson correlation
between col1
and col2
.
Parameters¶
Expr
Input columns.
datatable.count()¶
Calculate the number of non-missing values for each column from cols
.
datatable.cov()¶
Calculate
covariance
between col1
and col2
.
Parameters¶
Expr
Input columns.
datatable.cut()¶
Cut all the columns from cols
by binning their values into
equal-width discrete intervals.
Parameters¶
FExpr
Input data for equal-width interval binning.
int
| List[int]
bool
Each binning interval is half-open. This flag indicates which side of the interval is closed.
FExpr
f-expression that converts input columns into the columns filled with the respective bin ids.
datatable.dt¶
This is the datatable
module itself.
The purpose of exporting this symbol is so that you can easily import all the things you need from the datatable module in one go:
from datatable import dt, f, g, by, join, mean
Note: while it is possible to write
test = dt.dt.dt.dt.dt.dt.dt.dt.dt.fread('test.jay')
train = dt.dt.dt.dt.dt.dt.dt.dt.dt.dt.dt.dt.dt.fread('train.jay')
we do not in fact recommend doing so (except possibly on April 1st).
datatable.f¶
The main Namespace
object.
The function of this object is that during the evaluation of a
DT[i,j]
call, the variable f
represents the columns of frame DT
.
Specifically, within expression DT[i, j]
the following
is true:
f.A means “column A” of frame DT;
f[2] means “3rd column” of frame DT;
f[int] means “all integer columns” of DT;
f[:] means “all columns” of DT.
datatable.first()¶
datatable.fread()¶
This function is capable of reading data from a variety of input formats,
producing a Frame
as the result. The recognized formats are:
CSV, Jay, XLSX, and plain text. In addition, the data may be inside an
archive such as .tar
, .gz
, .zip
, .gz2
, and .tgz
.
Parameters¶
str
| bytes
| file
| Pathlike
| List
The first (unnamed) argument to fread is the input source.
Multiple types of sources are supported, and they can be named
explicitly: file
, text
, cmd
, and url
. When the source is
not named, fread will attempt to guess its type. The most common
type is file
, but sometimes the argument is resolved as text
(if the string contains newlines) or url
(if the string starts
with https://
or similar).
Only one argument out of anysource
, file
, text
, cmd
or
url
can be specified at once.
str
| file
| Pathlike
A file source can be either the name of the file on disk, or a
python “file-like” object – i.e. any object having method
.read()
.
Generally, specifying a file name should be preferred, since
reading from a Python file
can only be done in single-threaded
mode.
This argument also supports addressing files inside an archive,
or sheets inside an Excel workbook. Simply write the name of the
file as if the archive was a folder: "data.zip/train.csv"
.
str
| bytes
Instead of reading data from file, this argument provides the data as a simple in-memory blob.
str
A command that will be executed in the shell and its output then read as text.
str
This parameter can be used to specify the URL of the input file. The data will first be downloaded into a temporary directory and then read from there. In the end the temporary files will be removed.
We use the standard urllib.request
module to download the
data. Changing the settings of that module, for example installing
proxy, password, or cookie managers will allow you to customize
the download process.
...
Limit which columns to read from the input file.
str
| None
Field separator in the input file. If this value is None
(default) then the separator will be auto-detected. Otherwise it
must be a single-character string. When sep='\n'
, then the
data will be read in single-column mode. Characters
["'`0-9a-zA-Z]
are not allowed as the separator, as well as
any non-ASCII characters.
"."
| ","
Decimal point symbol for floating-point numbers.
int
The maximum number of rows to read from the file. Setting this parameter to any negative number is equivalent to having no limit at all. Currently this parameter doesn’t always work correctly.
bool
| None
If True
then the first line of the CSV file contains the header.
If False
then there is no header. By default the presence of the
header is heuristically determined from the contents of the file.
List[str]
The list of strings that were used in the input file to represent NA values.
bool
If True
then the lines of the CSV file are allowed to have
uneven number of fields. All missing fields will be filled with
NAs in the resulting frame.
str
| None
If this parameter is provided, then the input will be recoded
from this encoding into UTF-8 before reading. Any encoding
registered with the python codecs
module can be used.
str
| None
Start reading the file from the line containing this string. All
previous lines will be skipped and discarded. This parameter
cannot be used together with skip_to_line
.
int
If this setting is given, then this many lines in the file will
be skipped before we start to parse the file. This can be used
for example when several first lines in the file contain non-CSV
data and therefore must be skipped. This parameter cannot be
used together with skip_to_string
.
bool
If True
then any empty lines in the input will be skipped. If
this parameter is False
then: (a) in single-column mode empty
lines are kept as empty lines; otherwise (b) if fill=True
then
empty lines produce a single line filled with NAs in the output;
otherwise (c) an IOError
is raised.
bool
If True
, then the leading/trailing whitespace will be stripped
from unquoted string fields. Whitespace is always skipped from
numeric fields.
'"'
| "'"
| "`"
The character that was used to quote fields in the CSV file. By
default the double-quote mark '"'
is assumed.
str
| None
Use this directory for storing temporary files as needed. If not
provided then the system temporary directory will be used, as
determined via the tempfile
Python module.
int
| None
Number of threads to use when reading the file. This number cannot
exceed the number of threads in the pool dt.options.nthreads
.
If 0
or a negative number of threads is requested, then it will be
treated as that many threads less than the maximum. By default
all threads in the thread pool are used.
bool
If True
, then print detailed information about the internal
workings of fread to stdout (or to logger
if provided).
object
Logger object that will receive verbose information about fread’s
progress. When this parameter is specified, verbose
mode will
be turned on automatically.
"warn"
| "error"
| "ignore"
Action that should be taken when the input resolves to multiple
distinct sources. By default ("warn"
) a warning will be issued
and only the first source will be read and returned as a Frame.
The "ignore"
action is similar, except that the extra sources
will be discarded without a warning. Lastly, an IOError
can be raised if the value of this parameter is "error"
.
If you want all sources to be read instead of only the first one
then consider using iread()
.
int
Try not to exceed this amount of memory allocation (in bytes) when reading the data. This limit is advisory and not enforced very strictly.
This setting is useful when reading data from a file that is substantially larger than the amount of RAM available on your machine.
When this parameter is specified and fread sees that it needs more RAM than the limit in order to read the input file, then it will dump the data that was read so far into a temporary file in binary format. In the end the returned Frame will be partially composed from data located on disk, and partially from the data in memory. It is advised to either store this data as a Jay file, or to filter and materialize the frame (otherwise the performance may be slow).
Frame
A single Frame
object is always returned.
Changed in version 0.11.0: Previously a dict
of Frames was returned when multiple
input sources were provided.
IOError
datatable.g¶
Secondary Namespace
object.
The function of this object is that during the evaluation of a
DT[..., join(X)]
call, the variable
g
represents the columns of the joined frame X
. In SQL
this would have been equivalent to ... JOIN tableX AS g ...
.
datatable.init_styles()¶
Inject datatable’s stylesheets into the Jupyter notebook. This function does nothing when it runs in a normal Python environment outside of Jupyter.
When datatable runs in a Jupyter notebook, it renders its Frames as HTML tables. The appearance of these tables is enhanced using a custom stylesheet, which must be injected into the notebook at any point on the page. This is exactly what this function does.
Normally, this function is called automatically when datatable
is imported. However, in some circumstances Jupyter erases these
stylesheets (for example, if you run import datatable
cell
twice). In such cases, you may need to call this method manually.
datatable.ifelse()¶
Produce a column that chooses one of the two values based on the condition.
This function will only compute those values that are needed for
the result. Thus, for each row we will evaluate either expr_if_true
or expr_if_false
(based on the condition
value) but not both.
This may be relevant when one of the expressions is expensive to compute.
Parameters¶
FExpr
An expression yielding a single boolean column.
FExpr
Values that will be used when the condition evaluates to True. This must be a single column.
FExpr
Values that will be used when the condition evaluates to False. This must be a single column.
FExpr
The resulting expression is a single column whose stype is the
stype which is common for expr_if_true
and expr_if_false
,
i.e. it is the smallest stype into which both exprs can be
upcasted.
datatable.intersect()¶
Find the intersection of sets of values in the frames
.
Each frame should have only a single column or be empty.
The values in each frame will be treated as a set, and this function will
perform the
intersection operation
on these sets, returning those values that are present in each
of the provided frames
.
Parameters¶
Frame
| Frame
| ...
Input single-column frames.
Frame
A single-column frame. The column stype is the smallest common
stype of columns in the frames
.
ValueError
The exception is raised when one of the input frames has more than one column.
NotImplementedError
The exception is raised when one of the frame columns has stype
obj64
.
datatable.iread()¶
This function is similar to fread()
, but allows reading
multiple sources at once. For example, this can be used when the
input is a list of files, or a glob pattern, or a multi-file archive,
or multi-sheet XLSX file, etc.
Parameters¶
...
Most parameters are the same as in fread()
. All parse
parameters will be applied to all input files.
"warn"
| "raise"
| "ignore"
| "store"
What action to take when one of the input sources produces an
error. Possible actions are: "warn"
– each error is converted
into a warning and emitted to the user; the source that produced the
error is then skipped; "raise"
– the errors are raised
immediately and the iteration stops; "ignore"
– the erroneous
sources are silently ignored; "store"
– when an error is
raised, it is captured and returned to the user, then the iterator
continues reading the subsequent sources.
Iterator[Frame]
| Iterator[Frame|Exception]
The returned object is an iterator that produces Frame
s.
The iterator is lazy: each frame is read only as needed, after the
previous frame was “consumed” by the user. Thus, the user can
interrupt the iterator without having to read all the frames.
Each Frame
produced by the iterator has a .source
attribute that describes the source of each frame as best as
possible. Each source depends on the type of the input: either a
file name, or a URL, or the name of the file in an archive, etc.
If the errors
parameter is "store"
then the iterator may
produce either Frames or exception objects.
datatable.join()¶
Join clause for use in Frame’s square-bracket selector.
This clause is equivalent to the SQL JOIN
, though for the moment
datatable only supports left outer joins. In order to join,
the frame
must be keyed
first, and then joined
to another frame DT
as
DT[:, :, join(X)]
provided that DT
has the column(s) with the same name(s) as
the key in frame
.
Parameters¶
Frame
An input keyed frame to be joined to the current one.
Join Object
In most cases the returned object is directly used in the Frame’s square-bracket selector.
TypeError
The exception is raised if the input frame is missing.
ValueError
The exception is raised if frame
is not keyed.
See Also¶
datatable.last()¶
datatable.max()¶
Calculate the maximum value for each column from cols
. It is recommended
to use it as dt.max()
to prevent conflict with the Python built-in
max()
function.
datatable.mean()¶
Calculate the mean value for each column from cols.
Parameters¶
Expr
Input columns.
Expr
f-expression having one row, and the same names and number of columns as in cols. The column stypes are float32 for float32 columns, and float64 for all the other numeric types.
TypeError
The exception is raised when one of the columns from cols has a non-numeric type.
datatable.median()¶
datatable.min()¶
Calculate the minimum value for each column from cols. It is recommended to use it as dt.min() to prevent conflict with the Python built-in min() function.
datatable.qcut()¶
Bin all the columns from cols into intervals with approximately equal populations. Thus, the intervals are chosen according to the sample quantiles of the data.
If there are duplicate values in the data, they will all be placed into the same bin. In extreme cases this may cause the bins to be highly unbalanced.
Parameters¶
FExpr
Input data for quantile binning.
int | List[int]
FExpr
f-expression that converts input columns into the columns filled with the respective quantile ids.
datatable.rowall()¶
For each row in cols return True if all values in that row are True, or otherwise return False.
datatable.rowany()¶
For each row in cols return True if any of the values in that row are True, or otherwise return False. The function uses shortcut evaluation: if a True value is found in one of the columns, then the subsequent columns are skipped.
datatable.rowcount()¶
datatable.rowfirst()¶
For each row, find the first non-missing value in cols. If all values in a row are missing, then this function will also produce a missing value.
datatable.rowlast()¶
For each row, find the last non-missing value in cols. If all values in a row are missing, then this function will also produce a missing value.
Parameters¶
Expr
Input columns.
Expr
f-expression consisting of one column and the same number of rows as in cols.
TypeError
The exception is raised when input columns have incompatible types.
See Also¶
rowfirst() – find the first non-missing value row-wise.
datatable.rowmax()¶
datatable.rowmean()¶
datatable.rowmin()¶
datatable.rowsd()¶
datatable.rowsum()¶
For each row, calculate the sum of all values in cols. Missing values are treated as if they were zeros and skipped during the calculation.
Parameters¶
Expr
Input columns.
TypeError
The exception is raised when one of the columns from cols has a non-numeric type.
See Also¶
rowcount() – count non-missing values row-wise.
datatable.rbind()¶
Produce a new frame by appending rows of frames.
This function is equivalent to:
dt.Frame().rbind(*frames, force=force, by_names=by_names)
See also¶
cbind() – function for col-binding several frames.
Frame.rbind() – Frame method for rbinding some frames to another.
datatable.repeat()¶
datatable.sd()¶
Calculate the standard deviation for each column from cols.
Parameters¶
Expr
Input columns.
Expr
f-expression having one row, and the same names and number of columns as in cols. The column stypes are float32 for float32 columns, and float64 for all the other numeric types.
TypeError
The exception is raised when one of the columns from cols has a non-numeric type.
datatable.setdiff()¶
Find the set difference between frame0 and the other frames.
Each frame should have only a single column or be empty.
The values in each frame will be treated as a set, and this function will compute the set difference between frame0 and the union of the other frames, returning those values that are present in frame0, but not present in any of the frames.
Parameters¶
Frame
Input single-column frame.
Frame | Frame | ...
Input single-column frames.
Frame
A single-column frame. The column stype is the smallest common stype of columns from the frames.
NotImplementedError
The exception is raised when one of the frame columns has stype obj64.
See Also¶
intersect() – calculate the set intersection of values in the frames.
symdiff() – calculate the symmetric difference between the sets of values in the frames.
union() – calculate the union of values in the frames.
unique() – find unique values in a frame.
datatable.shift()¶
Produce a column obtained from col shifting it n rows forward.
The shift amount, n, can be both positive and negative. If positive, a “lag” column is created; if negative, it will be a “lead” column.
The shifted column will have the same number of rows as the original column, with n observations in the beginning becoming missing, and n observations at the end discarded.
This function is group-aware, i.e. in the presence of a groupby it will perform the shift separately within each group.
datatable.sort()¶
Sort clause for use in Frame’s square-bracket selector.
When a sort() object is present inside a DT[i, j, ...] expression, it will sort the rows of the resulting Frame according to the columns cols passed as the arguments to sort().
When used together with by(), the sort clause applies after the group-by, i.e. we sort elements within each group. Note, however, that because we use stable sorting, the operations of grouping and sorting are commutative: the result of applying groupby and then sort is the same as the result of sorting first and then doing groupby.
When used together with i (row filter), the i filter is applied after the sorting. For example:
DT[:10, :, sort(f.Highscore, reverse=True)]
will select the first 10 records from the frame DT ordered by the Highscore column.
datatable.split_into_nhot()¶
Split and nhot-encode a single-column frame.
Each value in the frame, which must have a single string column, is split according to the provided separator sep, the whitespace is trimmed, and the resulting pieces (labels) are converted into the individual columns of the output frame.
Parameters¶
Frame
An input single-column frame. The column stype must be either str32 or str64.
str
Single-character separator to be used for splitting.
bool
An option to control whether the resulting column names, i.e. labels, should be sorted. If set to True, the column names are returned in alphabetical order; otherwise their order is not guaranteed due to the algorithm parallelization.
Frame
The output frame. It will have as many rows as the input frame, and as many boolean columns as there were unique labels found. The labels will also become the output column names.
ValueError
The exception is raised if the input frame is missing or it has more than one column. It is also raised if sep is not a single-character string.
TypeError
The exception is raised if the single column of frame has a type different from string.
Examples¶
DT = dt.Frame(["cat,dog", "mouse", "cat,mouse", "dog,rooster", "mouse,dog,cat"])
C0 | |
---|---|
▪▪▪▪ | |
0 | cat,dog |
1 | mouse |
2 | cat,mouse |
3 | dog,rooster |
4 | mouse,dog,cat |
split_into_nhot(DT)
cat | dog | mouse | rooster | |
---|---|---|---|---|
▪ | ▪ | ▪ | ▪ | |
0 | 1 | 1 | 0 | 0 |
1 | 0 | 0 | 1 | 0 |
2 | 1 | 0 | 1 | 0 |
3 | 0 | 1 | 0 | 1 |
4 | 1 | 1 | 1 | 0 |
datatable.symdiff()¶
Find the symmetric difference between the sets of values in all frames.
Each frame should have only a single column or be empty. The values in each frame will be treated as a set, and this function will perform the symmetric difference operation on these sets.
The symmetric difference of two frames contains those values that are present in either of the frames, but not in both. The symmetric difference of more than two frames contains those values that are present in an odd number of frames.
Parameters¶
Frame | Frame | ...
Input single-column frames.
Frame
A single-column frame. The column stype is the smallest common stype of columns from the frames.
ValueError
The exception is raised when one of the input frames has more than one column.
NotImplementedError
The exception is raised when one of the frame columns has stype obj64.
See Also¶
intersect() – calculate the set intersection of values in the frames.
setdiff() – calculate the set difference between the frames.
union() – calculate the union of values in the frames.
unique() – find unique values in a frame.
datatable.sum()¶
Calculate the sum of values for each column from cols.
Parameters¶
Expr
Input columns.
Expr
f-expression having one row, and the same names and number of columns as in cols. The column stypes are int64 for boolean and integer columns, float32 for float32 columns, and float64 for float64 columns.
TypeError
The exception is raised when one of the columns from cols has a non-numeric type.
datatable.union()¶
Find the union of values in all frames.
Each frame should have only a single column or be empty. The values in each frame will be treated as a set, and this function will perform the union operation on these sets.
The dt.union(*frames) operation is equivalent to dt.unique(dt.rbind(*frames)).
Parameters¶
Frame | Frame | ...
Input single-column frames.
Frame
A single-column frame. The column stype is the smallest common stype of columns in the frames.
ValueError
The exception is raised when one of the input frames has more than one column.
NotImplementedError
The exception is raised when one of the frame columns has stype obj64.
See Also¶
intersect() – calculate the set intersection of values in the frames.
setdiff() – calculate the set difference between the frames.
symdiff() – calculate the symmetric difference between the sets of values in the frames.
unique() – find unique values in a frame.
datatable.unique()¶
Find the unique values in all the columns of the frame.
This function sorts the values in order to find the uniques, so the return values will be ordered. However, this should be considered an implementation detail: in the future datatable may switch to a different algorithm, such as hash-based, which may return the results in a different order.
Parameters¶
Frame
Input frame.
NotImplementedError
The exception is raised when one of the frame columns has stype obj64.
See Also¶
intersect() – calculate the set intersection of values in the frames.
setdiff() – calculate the set difference between the frames.
symdiff() – calculate the symmetric difference between the sets of values in the frames.
union() – calculate the union of values in the frames.
datatable.update()¶
Create new or update existing columns within a frame.
This expression is intended to be used in the “j” place of a DT[i, j] call. It takes an arbitrary number of key/value pairs, each describing a column name and the expression for how that column has to be created/updated.
Development¶
Creating a new FExpr¶
The majority of functions available from the datatable module are implemented via the FExpr mechanism. These functions have the same common API: they accept one or more FExprs (or fexpr-like objects) as arguments and produce an FExpr as the output. The resulting FExprs can then be used inside the DT[...] call to apply these expressions to a particular frame.
In this document we describe how to create such an FExpr-based function. In particular, we describe adding the gcd(a, b) function for computing the greatest common divisor of two integers.
C++ “backend” class¶
The core of the functionality will reside within a class derived from the class dt::expr::FExpr. So let’s create the file expr/fexpr_gcd.cc and declare the skeleton of our class:
#include "expr/fexpr_func.h"
#include "expr/eval_context.h"
#include "expr/workframe.h"
namespace dt {
namespace expr {
class FExpr_Gcd : public FExpr_Func {
private:
ptrExpr a_;
ptrExpr b_;
public:
FExpr_Gcd(ptrExpr&& a, ptrExpr&& b)
: a_(std::move(a)), b_(std::move(b)) {}
std::string repr() const override;
Workframe evaluate_n(EvalContext& ctx) const override;
};
}}
In this example we are inheriting from FExpr_Func, which is a slightly more specialized version of FExpr.
You can also see that the two arguments in gcd(a, b) are stored within the class as ptrExpr a_, b_. This ptrExpr is actually a typedef for std::shared_ptr<FExpr>, which means that the arguments to our FExpr are also FExprs.
The first method that needs to be implemented is repr(), which is more-or-less equivalent to Python’s __repr__. The returned string should not have the name of the class in it; instead, it must be ready to be combined with the reprs of other expressions:
std::string repr() const override {
std::string out = "gcd(";
out += a_->repr();
out += ", ";
out += b_->repr();
out += ')';
return out;
}
We construct our repr out of the reprs of a_ and b_. They are joined with a comma, which has the lowest precedence in Python. For some other FExprs we may need to take into account the precedence of the arguments as well, in order to properly set up parentheses around subexpressions.
The second method to implement is evaluate_n(). The _n suffix here stands for “normal”. If you look into the source of the FExpr class, you’ll see that there are other evaluation methods too: evaluate_i(), evaluate_j(), etc. However, none of those are needed when implementing a simple function.
The method evaluate_n() takes an EvalContext object as the argument. This object contains information about the current evaluation environment. The output from evaluate_n() should be a Workframe object. A workframe can be thought of as a “work-in-progress” frame. In our case it is sufficient to treat it as a simple vector of columns.
We begin implementing evaluate_n() by evaluating the arguments a_ and b_ and then making sure that those frames are compatible with each other (i.e. have the same number of columns and rows). After that we compute the result by iterating through the columns of both frames and calling a simple method evaluate1(Column&&, Column&&) (that we still need to implement):
Workframe evaluate_n(EvalContext& ctx) const override {
Workframe awf = a_->evaluate_n(ctx);
Workframe bwf = b_->evaluate_n(ctx);
if (awf.ncols() == 1) awf.repeat_column(bwf.ncols());
if (bwf.ncols() == 1) bwf.repeat_column(awf.ncols());
if (awf.ncols() != bwf.ncols()) {
throw TypeError() << "Incompatible number of columns in " << repr()
<< ": the first argument has " << awf.ncols() << ", while the "
<< "second has " << bwf.ncols();
}
awf.sync_grouping_mode(bwf);
auto gmode = awf.get_grouping_mode();
Workframe outputs(ctx);
for (size_t i = 0; i < awf.ncols(); ++i) {
Column rescol = evaluate1(awf.retrieve_column(i),
bwf.retrieve_column(i));
outputs.add_column(std::move(rescol), std::string(), gmode);
}
return outputs;
}
The method evaluate1() will take a pair of two columns and produce the output column containing the result of the gcd(a, b) calculation. We must take into account the stypes of both columns, and decide which stypes are acceptable for our function:
Column evaluate1(Column&& a, Column&& b) const {
SType stype1 = a.stype();
SType stype2 = b.stype();
SType stype0 = common_stype(stype1, stype2);
switch (stype0) {
case SType::BOOL:
case SType::INT8:
case SType::INT16:
case SType::INT32: return make<int32_t>(std::move(a), std::move(b), SType::INT32);
case SType::INT64: return make<int64_t>(std::move(a), std::move(b), SType::INT64);
default:
throw TypeError() << "Invalid columns of types " << stype1 << " and "
<< stype2 << " in " << repr();
}
}
template <typename T>
Column make(Column&& a, Column&& b, SType stype0) const {
a.cast_inplace(stype0);
b.cast_inplace(stype0);
return Column(new Column_Gcd<T>(std::move(a), std::move(b)));
}
As you can see, the job of the FExpr_Gcd class is to produce a workframe containing one or more Column_Gcd virtual columns. This is where the actual calculation of GCD values will take place, and we shall declare this class too. It can be done either in a separate file in the core/column/ folder, or inside the current file expr/fexpr_gcd.cc.
#include "column/virtual.h"
template <typename T>
class Column_Gcd : public Virtual_ColumnImpl {
private:
Column acol_;
Column bcol_;
public:
Column_Gcd(Column&& a, Column&& b)
: Virtual_ColumnImpl(a.nrows(), a.stype()),
acol_(std::move(a)), bcol_(std::move(b))
{
xassert(acol_.nrows() == bcol_.nrows());
xassert(acol_.stype() == bcol_.stype());
xassert(compatible_type<T>(acol_.stype()));
}
ColumnImpl* clone() const override {
return new Column_Gcd(Column(acol_), Column(bcol_));
}
size_t n_children() const noexcept override { return 2; }
const Column& child(size_t i) const override { return i == 0? acol_ : bcol_; }
bool get_element(size_t i, T* out) const override {
T a, b;
bool avalid = acol_.get_element(i, &a);
bool bvalid = bcol_.get_element(i, &b);
if (avalid && bvalid) {
while (b) {
T tmp = b;
b = a % b;
a = tmp;
}
*out = a;
return true;
}
return false;
}
};
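The loop inside get_element() is just Euclid's algorithm; here is the same logic as a plain-Python sketch, for readers less familiar with the C++ idiom:

```python
def gcd(a, b):
    # Same iteration as in Column_Gcd<T>::get_element: repeatedly
    # replace (a, b) with (b, a mod b) until b becomes zero.
    while b:
        a, b = b, a % b
    return a
```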
Python-facing gcd() function¶
Now that we have created the FExpr_Gcd class, we also need a Python function responsible for creating these objects. This is done in 4 steps:
First, declare a function with signature py::oobj(const py::XArgs&). The py::XArgs object here encapsulates all parameters that were passed to the function. The function returns a py::oobj, which is a simple wrapper around Python’s PyObject*.
static py::oobj py_gcd(const py::XArgs& args) {
auto a = args[0].to_oobj();
auto b = args[1].to_oobj();
return PyFExpr::make(new FExpr_Gcd(as_fexpr(a), as_fexpr(b)));
}
This function takes the Python arguments, validates and converts them into C++ objects if necessary, then creates a new FExpr_Gcd object, and returns it wrapped into a PyFExpr (which is the Python equivalent of the generic FExpr class).
In the second step, we declare the signature and the docstring of this python function:
static const char* doc_gcd =
R"(gcd(a, b)
--
Compute the greatest common divisor of `a` and `b`.
Parameters
----------
a, b: FExpr
Only integer columns are supported.
return: FExpr
The returned column will have stype int64 if either `a` or `b` are
of type int64, or otherwise it will be int32.
)";
DECLARE_PYFN(&py_gcd)
->name("gcd")
->docs(doc_gcd)
->arg_names({"a", "b"})
->n_positional_args(2)
->n_required_args(2);
At this point the method will be visible from Python in the _datatable module. So the next step is to import it into the main datatable module. To do this, go to src/datatable/__init__.py and write
from .lib._datatable import (
...
gcd,
...
)
...
__all__ = (
...
"gcd",
...
)
Tests¶
Any functionality must be properly tested. We recommend creating a dedicated test file for each new function. Thus, create the file tests/expr/test-gcd.py and add some tests in it. We use the pytest framework for testing. In this framework, each test is a single function (whose name starts with test_) which performs some actions and then asserts the validity of the results.
import pytest
import random
from datatable import dt, f, gcd
from tests import assert_equals # checks equality of Frames
from math import gcd as math_gcd
def test_equal_columns():
DT = dt.Frame(A=[1, 2, 3, 4, 5])
RES = DT[:, gcd(f.A, f.A)]
assert_equals(RES, dt.Frame([1, 1, 1, 1, 1]/dt.int32))
@pytest.mark.parametrize("seed", [random.getrandbits(63)])
def test_random(seed):
random.seed(seed)
n = 100
src1 = [random.randint(1, 1000) for i in range(n)]
src2 = [random.randint(1, 100) for i in range(n)]
DT = dt.Frame(A=src1, B=src2)
RES = DT[:, gcd(f.A, f.B)]
assert_equals(RES, dt.Frame([math_gcd(src1[i], src2[i])
for i in range(n)]))
When writing tests, try to cover any corner cases that you can think of. For example, what if one of the numbers is 0? Or negative? Add tests for various column types, including invalid ones.
Documentation¶
The final piece of the puzzle is the documentation. We’ve already written the documentation for our function: the doc_gcd variable declared earlier. However, for now this is only visible from Python when you run help(gcd). We also want the documentation to be visible on our official readthedocs website, which requires a few more steps. So:
First, create the file docs/api/dt/gcd.rst. The content of the file should be just a few lines:
.. xfunction:: datatable.gcd
:doc: src/core/fexpr/fexpr_gcd.cc doc_gcd
:src: src/core/fexpr/fexpr_gcd.cc py_gcd
:tests: tests/expr/test-gcd.py
In these lines we declare in which source file the docstring can be found, and what the name of its variable is. The documentation generator will be looking for a static const char* doc_gcd variable in the source. Then we also declare the name of the function which provides the gcd functionality. The generator will look for a function with that name in the specified source file and create a link to that source in the output doc file. Lastly, the :tests: parameter says which file contains the tests dedicated to this function; this will also become a link in the generated documentation.
This RST file now needs to be added to the toctree: open the file docs/api/index-api.rst and add it into the .. toctree:: list at the bottom, and also add it to the table of all functions.
Lastly, open docs/releases/v{LATEST}.rst (this is our changelog) and write a brief paragraph about the new function:
Frame
-----
...
-[new] Added new function :func:`gcd()` to compute the greatest common
divisor of two columns. [#NNNN]
The [#NNNN] is a link to the GitHub issue where the gcd() function was requested.
Submodules¶
Some functions are declared within submodules of the datatable module. For example, math-related functions can be found in dt.math, string functions in dt.str, etc. Declaring such functions is not much different from what is described above. For example, if we wanted our gcd() function to be in the dt.math submodule, we’d make the following changes:
Create the file expr/math/fexpr_gcd.cc instead of expr/fexpr_gcd.cc;
Instead of importing the function in src/datatable/__init__.py, import it from src/datatable/math.py;
The test file name can be tests/math/test-gcd.py instead of tests/expr/test-gcd.py;
The doc file name can be docs/api/math/gcd.rst instead of docs/api/dt/gcd.rst, and it should be added to the toctree in docs/api/math.rst.
Release History¶
Contributors¶
This page lists all people who have contributed to the development of datatable. We take into account both code and documentation contributions, as well as contributions in the form of bug reports and feature requests.
More specifically, a code contribution is considered any PR (pull request) that was merged into the codebase. The “complexity” of the PR is not taken into account as it is highly subjective. Next, an issue contribution is any closed issue except for those that are tagged as “question”, “wont-fix” or “cannot-reproduce”. Issues are attributed according to their closing date, not their creation date.
In the table, the contributors are sorted according to their total contribution score, which is the weighted sum of the count of each user’s code and issue contributions. Code contributions have more weight than issue contributions, and more recent contributions more weight than the older ones.
Developer’s note: this table is auto-generated based on the contributor lists in each of the version files, specified via the ..contributors:: directive. In turn, the list of contributors for each version has to be generated via the script ci/gh.py at the time of each release. The issues/PRs are filtered according to their milestone; thus, issues/PRs that are not tagged with any milestone are not taken into account.