Datatable is a Python library for manipulating tabular data. It supports out-of-memory datasets, multi-threaded data processing, and a flexible API.
Getting Started¶
Installation¶
This page describes how to install datatable
on various systems.
Prerequisites¶
Python 3.6+ is required. Generally, we will support each version of Python until its official end of life. You can verify your python version via
$ python --version
Python 3.6.6
In addition, we recommend using pip version 20.3+, especially if you're planning to install datatable from source, or if you are on a Unix machine.
$ pip install pip --upgrade
Collecting pip
Using cached pip-21.1.2-py3-none-any.whl (1.5 MB)
Installing collected packages: pip
Attempting uninstall: pip
Found existing installation: pip 20.2.4
Uninstalling pip-20.2.4:
Successfully uninstalled pip-20.2.4
Successfully installed pip-21.1.2
There are no other prerequisites. Datatable does not depend on any other python module [1], nor on any non-standard system library.
Basic installation¶
On most platforms datatable can be installed directly from PyPI using pip:
$ pip install datatable
The following platforms are supported:
macOS
    Datatable has been tested to work on macOS 10.12.5 (Sierra), macOS 10.13.6 (High Sierra), macOS 10.15.7 (Catalina), and macOS 11.2.3 (Big Sur). The produced wheels are tagged as macosx_10_9, so they should work on earlier versions of macOS as well.
Linux x86_64 / ppc64le
    We produce binary wheels tagged as manylinux_2_12 (for the x86_64 architecture) and manylinux2014 (for ppc64le). Consequently, they will work with your Linux distribution if it is compatible with one of these tags. Please refer to PEP 600 for details.
Windows
    Windows wheels are available for Windows 10 or later.
Install latest dev version¶
If you wish to test the latest version of datatable
before it has been
officially released, then you can use one of the binary wheels that we build
as part of our Continuous Integration process.
If you are on Windows, then pre-built wheels are available on AppVeyor. Click on a green main build of your choice, then navigate to the “Artifacts” tab, copy the wheel URL that corresponds to your Python version, and finally install it as:
C:\> pip install YOUR_WHEEL_URL
For macOS and Linux, development wheels can be found at our S3 repository. Scroll to the bottom of the page to find the latest links, and then download or copy the URL of a wheel that corresponds to your Python version and platform. This wheel can be installed with pip as usual:
$ pip install YOUR_WHEEL_URL
Alternatively, you can instruct pip to go to that repository directly and find the latest version automatically:
$ pip install --trusted-host h2o-release.s3-website-us-east-1.amazonaws.com \
-i http://h2o-release.s3-website-us-east-1.amazonaws.com/ datatable
Build from source¶
In order to build and install the latest development version of datatable
directly from GitHub, run the following command:
$ pip install git+https://github.com/h2oai/datatable
Since datatable is written mostly in C++, your computer must be set up for compiling C++ code. The build script will attempt to find the compiler automatically, searching for GCC, Clang, or MSVC on Windows. If it fails, or if you want to use some other compiler, then set the environment variable CXX before building the code.
Datatable uses the C++14 language standard, which means you must use a compiler that fully implements this standard. The following compiler versions are known to work:
Clang 5+;
GCC 6+;
MSVC 19.14+.
Install datatable in editable mode¶
If you want to tweak certain features of datatable, or even add your own functionality, you are welcome to do so. This section describes how to install datatable for development.
First, you need to fork the repository and then clone it locally:
$ git clone https://github.com/your_user_name/datatable
$ cd datatable
Build the _datatable core library. The two most common options are:
$ # build a "production mode" datatable
$ make build
$ # build datatable in "debug" mode, without optimizations and with
$ # internal asserts enabled
$ make debug
Note that you will need a C++ compiler in order to compile and link the code. Please refer to the previous section for compiler requirements.
On macOS you may also need to install Xcode Command Line Tools.
On Linux, if you see an error that 'Python.h' file not found, then it means you need to install a "development" version of Python, i.e. the one that has the Python header files included.
After the previous step succeeds, you will have a _datatable.*.so file in the src/datatable/lib folder. Now, in order to make datatable usable from Python, run
$ echo "`pwd`/src" >> ${VIRTUAL_ENV}/lib/python*/site-packages/easy-install.pth
(This assumes that you are using a virtualenv-based python. If not, then you'll need to adjust the path to your python's site-packages directory.)
Install additional libraries that are needed to test datatable:
$ pip install -r requirements_tests.txt
$ pip install -r requirements_extra.txt
$ pip install -r requirements_docs.txt
Check that everything works correctly by running the test suite:
$ make test
Once these steps are completed, the subsequent development process is much simpler.
After any change to C++ files, re-run make build (or make debug) and then restart python for the changes to take effect. Datatable only recompiles those files that were modified since the last time, which means that the compile step usually takes only a few seconds. Also note that you can switch between the "build" and "debug" versions of the library without performing make clean.
Troubleshooting¶
Despite our best efforts to keep the installation process hassle-free, problems may still arise. Here we list some of the more frequent ones, where we know how to resolve them. If none of these help you, please ask a question on StackOverflow (tagging it with [py-datatable]), or file an issue on GitHub.
pip._vendor.pep517.wrappers.BackendUnavailable
    This error occurs when you have an old version of pip in your environment. Please upgrade pip to version 20.3+, and the error should disappear.

ImportError: cannot import name '_datatable'
    This means the internal core library _datatable.*.so is either missing entirely, is in the wrong location, or has the wrong name. The first step is therefore to find where that file actually is. Use the system find tool, limiting the search to your python directory.
    If the file is missing entirely, then it was either deleted, or the installation used a broken wheel file. In either case, the only solution is to rebuild or reinstall the library completely.
    If the file is present but not within the site-packages/datatable/lib/ directory, then moving it there should solve the issue.
    If the file is present and is in the correct directory, then there must be a name conflict. In python run:
        import sysconfig
        sysconfig.get_config_var("SOABI")
        'cpython-36m-ppc64le-linux-gnu'
    The reported suffix should match the suffix of the _datatable.*.so file. If it doesn't, then renaming the file will fix the problem.

Python.h: no such file or directory when compiling from source
    Your Python distribution was shipped without the Python.h header file. This has been observed on certain Linux machines. You would need to install a Python package with a -dev suffix, for example python3.6-dev.

fatal error: 'sys/mman.h' file not found on macOS
    In order to compile from source on mac computers, you need to have Xcode Command Line Tools installed. Run:
    $ xcode-select --install

ImportError: This package should not be accessible
    The most likely cause of this error is a misconfigured PYTHONPATH environment variable. Unset that variable and try again.
Footnotes
[1] Since version v0.11.0
Getting started¶
Install datatable¶
Let’s begin by installing the latest stable version of datatable
from PyPI:
$ pip install datatable
If this didn’t work for you, or if you want to install the bleeding edge version of the library, please check the Installation page.
Assuming the installation was successful, you can now import the library in a JupyterLab notebook or in a Python console:
import datatable as dt
print(dt.__version__)
Loading data¶
The fundamental unit of analysis in datatable is a data Frame. It is the same notion as a pandas DataFrame or a SQL table: data arranged in a two-dimensional array with rows and columns.
You can create a Frame
object from a variety of data sources: from a python
list or dictionary, from a numpy array, or from a pandas DataFrame:
import math

DT1 = dt.Frame(A=range(5), B=[1.7, 3.4, 0, None, -math.inf],
               stypes={"A": dt.int64})
DT2 = dt.Frame(pandas_dataframe)
DT3 = dt.Frame(numpy_array)
You can also load a CSV/text/Excel file, or open a previously saved binary
.jay
file:
DT4 = dt.fread("~/Downloads/dataset_01.csv")
DT5 = dt.open("data.jay")
The fread()
function shown above is both powerful and extremely fast. It can
automatically detect parse parameters for the majority of text files, load data
from .zip archives or URLs, read Excel files, and much more.
Data manipulation¶
Once the data is loaded into a Frame, you may want to do certain operations with it: extract/remove/modify subsets of the data, perform calculations, reshape, group, join with other datasets, etc. In datatable, the primary vehicle for all these operations is the square-bracket notation inspired by traditional matrix indexing but overcharged with power (this notation was pioneered in R data.table and is the main axis of intersection between these two libraries).
In short, almost all operations with a Frame can be expressed as
DT[i, j, ...]
where i
is the row selector, j
is the column selector, and ...
indicates
that additional modifiers might be added. If this looks familiar to you,
that’s because it is. Exactly the same DT[i, j]
notation is used in
mathematics when indexing matrices, in C/C++, in R, in pandas, in numpy, etc.
The only difference that datatable introduces is that it allows
i
to be anything that can conceivably be
interpreted as a row selector: an integer to select just one row, a slice,
a range, a list of integers, a list of slices, an expression, a boolean-valued
Frame, an integer-valued Frame, an integer numpy array, a generator, and so on.
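For instance, all of the following are valid row selectors (a quick sketch, assuming a frame DT with a numeric column A, and f imported from datatable):

DT[0, :]           # a single row
DT[2:5, :]         # a slice of rows
DT[[0, 2, 4], :]   # a list of row indices
DT[f.A > 0, :]     # an expression filter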
The j
column selector is even more versatile.
In the simplest case, you can select just a single column by its index or name. But
also accepted are a list of columns, a slice, a string slice (of the form "A":"Z"
), a
list of booleans indicating which columns to pick, an expression, a list of
expressions, and a dictionary of expressions. (The keys will be used as new names
for the columns being selected.) The j
expression can even be a python type (such as int
or dt.float32
),
selecting all columns matching that type.
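Likewise, a few valid j selectors (a sketch; assumes DT has columns "A" through "D"):

DT[:, "A"]                   # one column, by name
DT[:, 0]                     # one column, by position
DT[:, ["A", "C"]]            # a list of columns
DT[:, "A":"C"]               # a string slice, inclusive of both ends
DT[:, {"total": f.A + f.B}]  # a dict of expressions; keys become new names
DT[:, int]                   # all integer columns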
In addition to the selector expression shown above, we support the update and delete statements too:
DT[i, j] = r
del DT[i, j]
The first expression will replace values in the subset [i, j]
of Frame
DT
with the values from r
, which could be either a constant, or a
suitably-sized Frame, or an expression that operates on frame DT
.
The second expression deletes values in the subset [i, j]
. This is
interpreted as follows: if i
selects all rows, then the columns given by
j
are removed from the Frame; if j
selects all columns, then the rows
given by i
are removed; if neither i
nor j
span all rows/columns
of the Frame, then the elements in the subset [i, j]
are replaced with
NAs.
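A few concrete forms of these statements (a sketch, assuming DT has columns A and B):

DT[f.A < 0, "B"] = 0   # replace a subset with a constant
del DT[:, "B"]         # i spans all rows: column B is removed
del DT[:3, :]          # j spans all columns: the first 3 rows are removed
del DT[:3, "A"]        # neither spans everything: DT[0:3, "A"] becomes NA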
What the f.?¶
You may have noticed already that we mentioned several times the possibility
of using expressions in i
or j
and in other places. In the simplest form
an expression looks like:
f.ColA
which indicates a column ColA
in some Frame. Here f
is a variable that
has to be imported from the datatable module. This variable provides a convenient
way to reference any column in a Frame. In addition to the notation above, the
following is also supported:
f[3]
f["ColB"]
denoting the fourth column and the column ColB
respectively.
These f-expressions support arithmetic operations as well as various mathematical and aggregate functions. For example, in order to select the values from column A normalized to the range [0; 1] we can write the following:
from datatable import f, min, max
DT[:, (f.A - min(f.A))/(max(f.A) - min(f.A))]
This is equivalent to the following SQL query:
SELECT (f.A - MIN(f.A))/(MAX(f.A) - MIN(f.A)) FROM DT AS f
So, what exactly is f
? We call it a “frame proxy”, as it becomes a
simple way to refer to the Frame that we currently operate on. More precisely,
whenever DT[i, j]
is evaluated and we encounter an f
-expression there,
that f
becomes replaced with the frame DT
, and the columns are looked
up on that Frame. The same expression can later on be applied to a different
Frame, and it will refer to the columns in that other Frame.
At some point you may notice that datatable also exports the symbol g. This g is also a frame proxy; however, it refers to the second frame in the evaluated expression. This second frame appears when you are joining two or more frames together (more on that later). When that happens, the symbol g is used to refer to the columns of the joined frame.
Groupbys/joins¶
In the Data Manipulation section we mentioned that the DT[i, j, ...]
selector
can take zero or more modifiers, which we denoted as “...
”. The available
modifiers are by()
, join()
and sort()
. Thus, the full form of the
square-bracket selector is:
DT[i, j, by(), sort(), join()]
by(…)¶
This modifier splits the frame into groups by the provided column(s), and then
applies i
and j
within
each group. This mostly affects aggregator functions such as sum()
,
min()
or sd()
, but may also apply in other circumstances. For example,
if i
is a slice that takes the first 5 rows of a frame,
then in the presence of the by()
modifier it will take the first 5 rows of
each group.
For example, in order to find the total amount of each product sold, write:
from datatable import f, by, sum
DT = dt.fread("transactions.csv")
DT[:, sum(f.quantity), by(f.product_id)]
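And because i is applied within each group, a slice in i becomes a per-group slice. For instance (a sketch, reusing the hypothetical transactions frame above):

DT[:2, :, by(f.product_id)]   # the first 2 rows of each product group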
sort(…)¶
This modifier controls the order of the rows in the result, much like SQL clause
ORDER BY
. If used in conjunction with by()
, it will order the rows
within each group.
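For instance (a sketch, reusing the hypothetical transactions frame above; negating an f-expression inside sort() reverses the order):

from datatable import sort

DT[:, :, sort(f.quantity)]    # ascending by quantity
DT[:, :, sort(-f.quantity)]   # descending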
join(…)¶
As the name suggests, this operator allows you to join another frame to the
current, equivalent to the SQL JOIN
operator. Currently we support only
left outer joins.
In order to join frame X
, it must be keyed. A keyed frame is conceptually
similar to a SQL table with a unique primary key. This key may be either a
single column, or several columns:
X.key = "id"
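# a composite key is also possible (a sketch; assumes X has both columns)
X.key = ["id", "region"]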
Once a frame is keyed, it can be joined to another frame DT
, provided that
DT
has the column(s) with the same name(s) as the key in X
:
DT[:, :, join(X)]
This has the semantics of a natural left outer join. The X frame can be considered as a dictionary, where the key column contains the keys, and all other columns are the corresponding values. Then during the join each row of DT will be matched against the row of X with the same value of the key column, and if there is no such value in X, against an all-NA row.
The columns of the joined frame can be used in expressions using the g.
prefix, for example:
DT[:, sum(f.quantity * g.price), join(products)]
Note
In the future, we will expand the syntax of the join operator to allow other kinds of joins and also to remove the limitation that only keyed frames can be joined.
Offloading data¶
Just as our work started with loading some data into datatable, eventually you will want to do the opposite: store or move the data somewhere else. We support multiple mechanisms for this.
First, the data can be converted into a pandas DataFrame or into a numpy array (obviously, you need to have the pandas or numpy library installed):
DT.to_pandas()
DT.to_numpy()
A frame can also be converted into python native data structures: a dictionary, keyed by the column names; a list of columns, where each column is itself a list of values; or a list of rows, where each row is a tuple of values:
DT.to_dict()
DT.to_list()
DT.to_tuples()
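For a small two-column frame, these conversions would be expected to produce roughly the following (a sketch, not a live session):

small = dt.Frame(A=[1, 2], B=["x", "y"])
small.to_dict()    # {'A': [1, 2], 'B': ['x', 'y']}
small.to_list()    # [[1, 2], ['x', 'y']]
small.to_tuples()  # [(1, 'x'), (2, 'y')]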
You can also save a frame into a CSV file, or into a binary .jay file:
DT.to_csv("out.csv")
DT.to_jay("data.jay")
Using datatable¶
This section describes common functionality and commands that you can run in datatable
.
Create Frame¶
You can create a Frame from a variety of sources, including numpy arrays, pandas DataFrames, raw Python objects, etc:
import datatable as dt
import numpy as np
np.random.seed(1)
dt.Frame(np.random.randn(1000000))
C0 | ||
---|---|---|
float64 | ||
0 | 1.62435 | |
1 | -0.611756 | |
2 | -0.528172 | |
3 | -1.07297 | |
4 | 0.865408 | |
5 | -2.30154 | |
6 | 1.74481 | |
7 | -0.761207 | |
8 | 0.319039 | |
9 | -0.24937 | |
10 | 1.46211 | |
11 | -2.06014 | |
12 | -0.322417 | |
13 | -0.384054 | |
14 | 1.13377 | |
… | … | |
999995 | 0.0595784 | |
999996 | 0.140349 | |
999997 | -0.596161 | |
999998 | 1.18604 | |
999999 | 0.313398 |
import pandas as pd
pf = pd.DataFrame({"A": range(1000)})
dt.Frame(pf)
A | ||
---|---|---|
int64 | ||
0 | 0 | |
1 | 1 | |
2 | 2 | |
3 | 3 | |
4 | 4 | |
5 | 5 | |
6 | 6 | |
7 | 7 | |
8 | 8 | |
9 | 9 | |
10 | 10 | |
11 | 11 | |
12 | 12 | |
13 | 13 | |
14 | 14 | |
… | … | |
995 | 995 | |
996 | 996 | |
997 | 997 | |
998 | 998 | |
999 | 999 |
dt.Frame({"n": [1, 3], "s": ["foo", "bar"]})
n | s | ||
---|---|---|---|
int32 | str32 | ||
0 | 1 | foo | |
1 | 3 | bar |
Convert a Frame¶
Convert an existing Frame into a numpy array, a pandas DataFrame, or a pure Python object:
nparr = DT.to_numpy()
pddfr = DT.to_pandas()
pyobj = DT.to_list()
Parse Text (csv) Files¶
datatable
provides fast and convenient parsing of text (csv) files:
DT = dt.fread("train.csv")
The datatable parser:
- Automatically detects separators, headers, column types, quoting rules, etc.
- Reads from file, URL, shell, raw text, archives, glob
- Provides multi-threaded file reading for maximum speed
- Includes a progress indicator when reading large files
- Reads both RFC4180-compliant and non-compliant files
Write the Frame¶
Write the Frame’s content into a csv
file (also multi-threaded):
DT.to_csv("out.csv")
Save a Frame¶
Save a Frame into a binary format on disk, then open it later instantly, regardless of the data size:
DT.to_jay("out.jay")
DT2 = dt.open("out.jay")
Basic Frame Properties¶
Basic Frame properties include:
print(DT.shape) # (nrows, ncols)
print(DT.names) # column names
print(DT.stypes) # column types
Compute Per-Column Summary Stats¶
Compute per-column summary stats using:
DT.sum()
DT.max()
DT.min()
DT.mean()
DT.sd()
DT.mode()
DT.nmodal()
DT.nunique()
Select Subsets of Rows/Columns¶
Select subsets of rows and/or columns using:
DT[:, "A"] # select 1 column
DT[:10, :] # first 10 rows
DT[::-1, "A":"D"] # reverse rows order, columns from A to D
DT[27, 3] # single element in row 27, column 3 (0-based)
Delete Rows/Columns¶
Delete rows and/or columns using:
del DT[:, "D"] # delete column D
del DT[f.A < 0, :] # delete rows where column A has negative values
Filter Rows¶
Filter rows via an expression using the following. In this example, mean, sd, and f are all symbols imported from datatable:
DT[(f.x > mean(f.y) + 2.5 * sd(f.y)) | (f.x < -mean(f.y) - sd(f.y)), :]
Compute Columnar Expressions¶
Compute columnar expressions using:
DT[:, {"x": f.x, "y": f.y, "x+y": f.x + f.y, "x-y": f.x - f.y}]
Append Rows/Columns¶
Append rows/columns to a Frame using Frame.cbind() and Frame.rbind():
:
DT1.cbind(DT2, DT3)
DT1.rbind(DT4, force=True)
User Guide¶
Name mangling¶
Column names in a Frame satisfy several invariants:
- they are all non-empty strings;
- within a single Frame column names must be unique;
- no column name may contain characters from the ASCII C0 control block. This set of forbidden characters includes: the NULL character \0, the TAB character \t, the newline \n, and similar.
If the user attempts to create a Frame that would violate some of these assumptions, then instead of failing we will attempt to mangle the provided names, forcing them to satisfy the above requirements.
Given a list of column names requested by the user, the following algorithm is used:
1. First, we check all the non-empty names in the list, from left to right. If a name contains characters in the range \x00-\x1F, then every run of one or more such characters is replaced with a single dot.

2. Once the special characters are removed from the name, we check it against the set of names that were already encountered. If the current name hasn't been seen before, then we add it to the final list of names and proceed to consider the next name in the list. However, if the name was seen before, then it goes into the deduplication stage.

3. When a name needs to be deduplicated, we do the following:
   - If the name ends with a number, then split it into two parts: the stem and the numeric suffix. Let count be the value of the numeric suffix plus 1.
   - If the name does not end with a number, then append a dot (.) to the name and consider this the stem. For the count variable, take the value of the option dt.options.frame.name_auto_index.
   - Concatenate stem and count, and check whether this name has been seen before. If it has, then increment count by 1, and repeat this step.
   - Use stem + count as this column's final name. Continue processing other columns.

4. Finally, re-scan the list of column names once again, this time replacing all the empty names. For each empty name we proceed exactly as in (3), using dt.options.frame.name_auto_prefix as the stem, and dt.options.frame.name_auto_index as the initial count.
Examples¶
The default value of dt.options.frame.name_auto_prefix
is "C"
, and the
default value of dt.options.frame.name_auto_index
is 0
. This means that
if no column names are given, they will be named as C0, C1, C2, ...
:
dt.Frame([[]] * 5).names
If the column names contain duplicates, then they will gain a numeric suffix (or reuse the existing suffix, if any):
dt.Frame(names=["A", "A", "A"]).names
dt.Frame(names=["R3"] * 4).names
If some of the column names are given, while others are missing, then the
missing names will be filled as C0, C1, ...
:
dt.Frame(names=["A", None, "B", None]).names
When replacing the missing names, explicitly given names will have a higher precedence and tend to retain their names:
dt.Frame(names=["A", None, "C0", "C1"]).names
However, deduplication of the existing names happen from left to right, which may affect the subsequent columns:
dt.Frame(names=["A1", "A1", "A2", "A3"]).names
f-expressions¶
The datatable
module exports a special symbol f
, which can be used
to refer to the columns of a frame currently being operated on. If this sounds
cryptic, consider that the most common way to operate on a frame is via the
square-bracket call DT[i, j, by, ...]
. It is often the case that within
this expression you would want to refer to individual columns of the frame:
either to create a filter, a transform, or specify a grouping variable, etc.
In all such cases the f
symbol is used, and it is considered to be
evaluated within the context of the frame DT
.
For example, consider the expression:
f.price
By itself, it just means a column named “price”, in an unspecified frame. This expression becomes concrete, however, when used with a particular frame. For example:
train_dt[f.price > 0, :]
selects all rows in train_dt
where the price is positive. Thus, within the
call to train_dt[...]
, the symbol f
refers to the frame train_dt
.
The standalone f-expression may occasionally be useful too: it can be saved in
a variable and then re-applied to several different frames. Each time f
will refer to the frame to which it is being applied:
price_filter = (f.price > 0)
train_filtered = train_dt[price_filter, :]
test_filtered = test_dt[price_filter, :]
The simple expression f.price
can be saved in a variable too. In fact,
there is a Frame helper method .export_names()
which does exactly that:
returns a tuple of variables for each column name in the frame, allowing you to
omit the f.
prefix:
Id, Price, Quantity = DT.export_names()
DT[:, [Id, Price, Quantity, Price * Quantity]]
Single-column selector¶
As you have seen, the expression f.NAME
refers to a column called “NAME”.
This notation is handy, but not universal. What do you do if the column’s name
contains spaces or unicode characters? Or if a column’s name is not known, only
its index? Or if the name is in a variable? For these purposes f
supports
the square-bracket selectors:
f[-1] # select the last column
f["Price ($)"] # select column names "Price ($)"
Generally, f[i]
means either the column at index i
if i
is an
integer, or the column with name i
if i
is a string.
Using an integer index follows the standard Python rule for list subscripts:
negative indices are interpreted as counting from the end of the frame, and
requesting a column with an index outside of [-ncols; ncols)
will raise
an error.
This square-bracket form is also useful when you want to access a column
dynamically, i.e. if its name is not known in advance. For example, suppose
there is a frame with columns "2017_01"
, "2017_02"
, …, "2019_12"
.
Then all these columns can be addressed as:
[f["%d_%02d" % (year, month)]
for month in range(1, 13)
for year in [2017, 2018, 2019]]
Multi-column selector¶
In the previous section you have seen that f[i] refers to a single column when i is either an integer or a string. However, we also support the case when i is a slice or a type:
f[:] # select all columns
f[::-1] # select all columns in reverse order
f[:5] # select the first 5 columns
f[3:4] # select the fourth column
f["B":"H"] # select columns from B to H, inclusive
f[int] # select all integer columns
f[float] # select all floating-point columns
f[dt.str32] # select all columns with stype `str32`
f[None] # select no columns (empty columnset)
In all these cases a columnset is returned. This columnset may contain a variable number of columns or even no columns at all, depending on the frame to which this f-expression is applied.
Applying a slice to the symbol f follows the same semantics as if f was a list of columns. Thus f[:10] means the first 10 columns of a frame, or all of its columns if the frame has fewer than 10. Similarly, f[9:10] selects the 10th column of a frame if it exists, or nothing if the frame has fewer than 10 columns. Compare this to the selector f[9], which also selects the 10th column of a frame if it exists, but throws an exception if it doesn't.
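To make the difference concrete (a sketch with a hypothetical 3-column frame):

small = dt.Frame(A=[1], B=[2], C=[3])
small[:, f[:10]]    # all 3 columns: the slice is silently truncated
small[:, f[9:10]]   # an empty frame: there is no 10th column
small[:, f[9]]      # error: column index 9 is out of bounds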
Besides the usual numeric ranges, you can also use name ranges. These ranges
include the first named column, the last named column, and all columns in
between. It is not possible to mix positional and named columns in a range,
and it is not possible to specify a step. If the range is x:y
, yet column
x
comes after y
in the frame, then the columns will be selected in the
reverse order: first x
, then the column preceding x
, and so on, until
column y
is selected last:
f["C1":"C9"] # Select columns from C1 up to C9
f["C9":"C1"] # Select columns C9, C8, C7, ..., C2, C1
f[:"C3"] # Select all columns up to C3
f["C5":] # Select all columns after C5
Finally, you can select all columns of a particular type by using that type
as an f-selector. You can pass either common python types bool
, int
,
float
, str
; or you can pass an stype such as dt.int32
, or an ltype such as
dt.ltype.obj
. You can also pass None to not select any columns. By itself
this may not be very useful, but occasionally you may need this as a fallback
in conditional expressions:
f[int if select_types == "integer" else
float if select_types == "floating" else
None] # otherwise don't select any columns
A columnset can be used in situations where a sequence of columns is expected, such as:
- the j node of DT[i, j, ...];
- within the by() and sort() functions;
- with certain functions that operate on sequences of columns: rowsum(), rowmean(), rowmin(), etc;
- many other functions that normally operate on a single column will automatically map over all columns in a columnset:
sum(f[:])        # equivalent to [sum(f[i]) for i in range(DT.ncols)]
f[:3] + f[-3:]   # same as [f[0]+f[-3], f[1]+f[-2], f[2]+f[-1]]
Modifying a columnset¶
Columnsets support operations that either add or remove elements from the set.
This is done using methods .extend()
and .remove()
.
The .extend() method takes a columnset as an argument (also a list, or dict, or sequence of columns) and produces a new columnset containing both the original and the new columns. The columns need not be unique: the same column may appear multiple times in a columnset. This method also allows adding transformed columns into the columnset:
f[int].extend(f[float]) # integer and floating-point columns
f[:3].extend(f[-3:]) # the first and the last 3 columns
f.A.extend(f.B) # columns "A" and "B"
f[str].extend(dt.str32(f[int])) # string columns, and also all integer
# columns converted to strings
# All columns, and then one additional column named 'cost', which contains
# column `price` multiplied by `quantity`:
f[:].extend({"cost": f.price * f.quantity})
When a columnset is extended, the order of the elements is preserved. Thus, a columnset is closer in functionality to a python list than to a set. In addition, some of the elements in a columnset can have names if the columnset is created from a dictionary. The names may be non-unique too.
The .remove()
method is the opposite of .extend()
: it takes an existing
columnset and then removes all columns that are passed as the argument:
f[:].remove(f[str]) # all columns except columns of type string
f[:10].remove(f.A) # the first 10 columns without column "A"
f[:].remove(f[3:-3]) # same as `f[:3].extend(f[-3:])`, at least in the
# context of a frame with 6+ columns
Removing a column that is not in the columnset is not considered an error,
similar to how set-difference operates. Thus, f[:].remove(f.A)
may be
safely applied to a frame that doesn’t have column “A”: the columns that cannot
be removed are simply ignored.
If a columnset includes some column several times, and then you request to
remove that column, then only the first occurrence in the sequence will be
removed. Generally, the multiplicity of some column “A” in columnset
cs1.remove(cs2)
will be equal to the multiplicity of “A” in cs1
minus the
multiplicity of “A” in cs2
, or 0 if such difference would be negative.
Thus:
f[:].extend(f[int]).remove(f[int])
will have the effect of moving all integer columns to the end of the columnset
(since .remove()
removes the first occurrence of a column it finds).
It is not possible to remove a transformed column from a columnset. An error
will be thrown if the argument of .remove()
contains any transformed
columns.
Fread Examples¶
This function is capable of reading data from a variety of input formats (text files, plain text, files embedded in archives, Excel files, …), producing a Frame as the result. You can even read in data from the command line.
See fread()
for all the available parameters.
Note: If you wish to read in multiple files, use iread()
; it
returns an iterator of Frames.
Read data¶
Read from a text file:
from datatable import dt, fread
fread('iris.csv')
sepal_length | sepal_width | petal_length | petal_width | species | ||
---|---|---|---|---|---|---|
float64 | float64 | float64 | float64 | str32 | ||
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | |
1 | 4.9 | 3 | 1.4 | 0.2 | setosa | |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | |
4 | 5 | 3.6 | 1.4 | 0.2 | setosa | |
5 | 5.4 | 3.9 | 1.7 | 0.4 | setosa | |
6 | 4.6 | 3.4 | 1.4 | 0.3 | setosa | |
7 | 5 | 3.4 | 1.5 | 0.2 | setosa | |
8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa | |
9 | 4.9 | 3.1 | 1.5 | 0.1 | setosa | |
10 | 5.4 | 3.7 | 1.5 | 0.2 | setosa | |
11 | 4.8 | 3.4 | 1.6 | 0.2 | setosa | |
12 | 4.8 | 3 | 1.4 | 0.1 | setosa | |
13 | 4.3 | 3 | 1.1 | 0.1 | setosa | |
14 | 5.8 | 4 | 1.2 | 0.2 | setosa | |
… | … | … | … | … | … | |
145 | 6.7 | 3 | 5.2 | 2.3 | virginica | |
146 | 6.3 | 2.5 | 5 | 1.9 | virginica | |
147 | 6.5 | 3 | 5.2 | 2 | virginica | |
148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica | |
149 | 5.9 | 3 | 5.1 | 1.8 | virginica |
Read text data directly:
data = ('col1,col2,col3\n'
'a,b,1\n'
'a,b,2\n'
'c,d,3')
fread(data)
col1 | col2 | col3 | ||
---|---|---|---|---|
str32 | str32 | int32 | ||
0 | a | b | 1 | |
1 | a | b | 2 | |
2 | c | d | 3 |
Read from a URL:
url = "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"
fread(url)
year | month | day | dep_delay | arr_delay | carrier | origin | dest | air_time | distance | hour | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
int32 | int32 | int32 | int32 | int32 | str32 | str32 | str32 | int32 | int32 | int32 | ||
0 | 2014 | 1 | 1 | 14 | 13 | AA | JFK | LAX | 359 | 2475 | 9 | |
1 | 2014 | 1 | 1 | -3 | 13 | AA | JFK | LAX | 363 | 2475 | 11 | |
2 | 2014 | 1 | 1 | 2 | 9 | AA | JFK | LAX | 351 | 2475 | 19 | |
3 | 2014 | 1 | 1 | -8 | -26 | AA | LGA | PBI | 157 | 1035 | 7 | |
4 | 2014 | 1 | 1 | 2 | 1 | AA | JFK | LAX | 350 | 2475 | 13 | |
5 | 2014 | 1 | 1 | 4 | 0 | AA | EWR | LAX | 339 | 2454 | 18 | |
6 | 2014 | 1 | 1 | -2 | -18 | AA | JFK | LAX | 338 | 2475 | 21 | |
7 | 2014 | 1 | 1 | -3 | -14 | AA | JFK | LAX | 356 | 2475 | 15 | |
8 | 2014 | 1 | 1 | -1 | -17 | AA | JFK | MIA | 161 | 1089 | 15 | |
9 | 2014 | 1 | 1 | -2 | -14 | AA | JFK | SEA | 349 | 2422 | 18 | |
10 | 2014 | 1 | 1 | -5 | -17 | AA | EWR | MIA | 161 | 1085 | 16 | |
11 | 2014 | 1 | 1 | 7 | -5 | AA | JFK | SFO | 365 | 2586 | 17 | |
12 | 2014 | 1 | 1 | 3 | 1 | AA | JFK | BOS | 39 | 187 | 12 | |
13 | 2014 | 1 | 1 | 142 | 133 | AA | JFK | LAX | 345 | 2475 | 19 | |
14 | 2014 | 1 | 1 | -5 | -26 | AA | JFK | BOS | 35 | 187 | 17 | |
… | … | … | … | … | … | … | … | … | … | … | … | |
253311 | 2014 | 10 | 31 | 1 | -30 | UA | LGA | IAH | 201 | 1416 | 14 | |
253312 | 2014 | 10 | 31 | -5 | -14 | UA | EWR | IAH | 189 | 1400 | 8 | |
253313 | 2014 | 10 | 31 | -8 | 16 | MQ | LGA | RDU | 83 | 431 | 11 | |
253314 | 2014 | 10 | 31 | -4 | 15 | MQ | LGA | DTW | 75 | 502 | 11 | |
253315 | 2014 | 10 | 31 | -5 | 1 | MQ | LGA | SDF | 110 | 659 | 8 |
Read from an archive (if there are multiple files, only the first will be read; you can specify the path to the specific file you are interested in):
fread("data.zip/mtcars.csv")
Note: Use iread()
if you wish to read in multiple files in an
archive; an iterator of Frames is returned.
Read from .xls or .xlsx files:
fread("excel.xlsx")
For excel files, you can specify the sheet to be read:
fread("excel.xlsx/Sheet1")
Read in data from the command line. Simply pass the command line statement to the cmd parameter:
# https://blog.jpalardy.com/posts/awk-tutorial-part-2/
# You specify the `cmd` parameter
# Here we filter data for the year 2015
fread(cmd = """cat netflix.tsv | awk 'NR==1; /^2015-/'""")
The command line can be very handy with large data; you can do some of the
preprocessing before reading in the data to datatable
.
Detect Thousand Separator¶
Fread handles the thousands separator, with the assumption that the separator is a comma (,):
fread("""Name|Salary|Position
James|256,000|evangelist
Ragnar|1,000,000|conqueror
Loki|250360|trickster""")
Name | Salary | Position | ||
---|---|---|---|---|
str32 | int32 | str32 | ||
0 | James | 256000 | evangelist | |
1 | Ragnar | 1000000 | conqueror | |
2 | Loki | 250360 | trickster |
Specify the Delimiter¶
You can specify the delimiter via the sep parameter. Note that the separator must be a single-character string; non-ASCII characters are not allowed as the separator, nor are any of the characters in ["'`0-9a-zA-Z]:
data = """
1:2:3:4
5:6:7:8
9:10:11:12
"""
C0 | C1 | C2 | C3 | ||
---|---|---|---|---|---|
int32 | int32 | int32 | int32 | ||
0 | 1 | 2 | 3 | 4 | |
1 | 5 | 6 | 7 | 8 | |
2 | 9 | 10 | 11 | 12 |
Dealing with Null Values and Blank Rows¶
You can pass a list of values to be treated as null, via the na_strings
parameter:
data = """
ID|Charges|Payment_Method
634-VHG|28|Cheque
365-DQC|33.5|Credit card
264-PPR|631|--
845-AJO|42.3|
789-KPO|56.9|Bank Transfer
"""
fread(data, na_strings=['--', ''])
ID | Charges | Payment_Method | ||
---|---|---|---|---|
str32 | float64 | str32 | ||
0 | 634-VHG | 28 | Cheque | |
1 | 365-DQC | 33.5 | Credit card | |
2 | 264-PPR | 631 | NA | |
3 | 845-AJO | 42.3 | NA | |
4 | 789-KPO | 56.9 | Bank Transfer |
For rows with fewer values than the other rows, you can set fill=True; fread will fill the missing values with NA:
data = ('a,b,c,d\n'
'1,2,3,4\n'
'5,6,7,8\n'
'9,10,11')
fread(data, fill=True)
a | b | c | d | ||
---|---|---|---|---|---|
int32 | int32 | int32 | int32 | ||
0 | 1 | 2 | 3 | 4 | |
1 | 5 | 6 | 7 | 8 | |
2 | 9 | 10 | 11 | NA |
You can skip empty lines:
data = ('a,b,c,d\n'
'\n'
'1,2,3,4\n'
'5,6,7,8\n'
'\n'
'9,10,11,12')
fread(data, skip_blank_lines=True)
a | b | c | d | ||
---|---|---|---|---|---|
int32 | int32 | int32 | int32 | ||
0 | 1 | 2 | 3 | 4 | |
1 | 5 | 6 | 7 | 8 | |
2 | 9 | 10 | 11 | 12 |
Dealing with Column Names¶
If the data has no headers, fread
will assign default column names:
data = ('1,2\n'
'3,4\n')
fread(data)
C0 | C1 | ||
---|---|---|---|
int32 | int32 | ||
0 | 1 | 2 | |
1 | 3 | 4 |
You can pass in column names via the columns
parameter:
fread(data, columns=['A','B'])
A | B | ||
---|---|---|---|
int32 | int32 | ||
0 | 1 | 2 | |
1 | 3 | 4 |
You can change column names:
data = ('a,b,c,d\n'
'1,2,3,4\n'
'5,6,7,8\n'
'9,10,11,12')
fread(data, columns=["A","B","C","D"])
A | B | C | D | ||
---|---|---|---|---|---|
int32 | int32 | int32 | int32 | ||
0 | 1 | 2 | 3 | 4 | |
1 | 5 | 6 | 7 | 8 | |
2 | 9 | 10 | 11 | 12 |
You can change some of the column names via a dictionary:
fread(data, columns={"a":"A", "b":"B"})
A | B | c | d | ||
---|---|---|---|---|---|
int32 | int32 | int32 | int32 | ||
0 | 1 | 2 | 3 | 4 | |
1 | 5 | 6 | 7 | 8 | |
2 | 9 | 10 | 11 | 12 |
Fread uses heuristics to determine whether the first row is data or not; occasionally it may guess incorrectly, in which case you can set the header parameter to False:
fread(data, header=False)
C0 | C1 | C2 | C3 | ||
---|---|---|---|---|---|
str32 | str32 | str32 | str32 | ||
0 | a | b | c | d | |
1 | 1 | 2 | 3 | 4 | |
2 | 5 | 6 | 7 | 8 | |
3 | 9 | 10 | 11 | 12 |
You can pass a new list of column names as well:
fread(data, header=False, columns=["A","B","C","D"])
A | B | C | D | ||
---|---|---|---|---|---|
str32 | str32 | str32 | str32 | ||
0 | a | b | c | d | |
1 | 1 | 2 | 3 | 4 | |
2 | 5 | 6 | 7 | 8 | |
3 | 9 | 10 | 11 | 12 |
Row Selection¶
Fread
has a skip_to_line
parameter, where you can specify what line to
read the data from:
data = ('skip this line\n'
'a,b,c,d\n'
'1,2,3,4\n'
'5,6,7,8\n'
'9,10,11,12')
fread(data, skip_to_line=2)
a | b | c | d | ||
---|---|---|---|---|---|
int32 | int32 | int32 | int32 | ||
0 | 1 | 2 | 3 | 4 | |
1 | 5 | 6 | 7 | 8 | |
2 | 9 | 10 | 11 | 12 |
You can also skip to a line containing a particular string with the
skip_to_string
parameter, and start reading data from that line. Note that
skip_to_string
and skip_to_line
cannot be combined; you can only use
one:
data = ('skip this line\n'
'a,b,c,d\n'
'first, second, third, last\n'
'1,2,3,4\n'
'5,6,7,8\n'
'9,10,11,12')
fread(data, skip_to_string='first')
first | second | third | last | ||
---|---|---|---|---|---|
int32 | int32 | int32 | int32 | ||
0 | 1 | 2 | 3 | 4 | |
1 | 5 | 6 | 7 | 8 | |
2 | 9 | 10 | 11 | 12 |
You can set the maximum number of rows to read with the max_nrows
parameter:
data = ('a,b,c,d\n'
'1,2,3,4\n'
'5,6,7,8\n'
'9,10,11,12')
fread(data, max_nrows=2)
a | b | c | d | ||
---|---|---|---|---|---|
int32 | int32 | int32 | int32 | ||
0 | 1 | 2 | 3 | 4 | |
1 | 5 | 6 | 7 | 8 |
data = ('skip this line\n'
'a,b,c,d\n'
'1,2,3,4\n'
'5,6,7,8\n'
'9,10,11,12')
fread(data, skip_to_line=2, max_nrows=2)
a | b | c | d | ||
---|---|---|---|---|---|
int32 | int32 | int32 | int32 | ||
0 | 1 | 2 | 3 | 4 | |
1 | 5 | 6 | 7 | 8 |
Setting Column Type¶
You can specify the column data types via the columns parameter:
data = ('a,b,c,d\n'
'1,2,3,4\n'
'5,6,7,8\n'
'9,10,11,12')
# this is useful when you are interested in only a subset of the columns
fread(data, columns={"a":dt.float32, "b":dt.str32})
a | b | c | d | ||
---|---|---|---|---|---|
float64 | str32 | int32 | int32 | ||
0 | 1 | 2 | 3 | 4 | |
1 | 5 | 6 | 7 | 8 | |
2 | 9 | 10 | 11 | 12 |
You can also pass in the data types by position:
fread(data, columns = [dt.int32, dt.str32, None, dt.float32])
a | b | d | ||
---|---|---|---|---|
int32 | str32 | float64 | ||
0 | 1 | 2 | 4 | |
1 | 5 | 6 | 8 | |
2 | 9 | 10 | 12 |
You can also change all the column data types with a single assignment:
fread(data, columns = dt.float32)
a | b | c | d | ||
---|---|---|---|---|---|
float64 | float64 | float64 | float64 | ||
0 | 1 | 2 | 3 | 4 | |
1 | 5 | 6 | 7 | 8 | |
2 | 9 | 10 | 11 | 12 |
You can change the data type for a slice of the columns (here slice(3)
is equivalent to [:3]
):
# this changes the data type to float for the first three columns
fread(data, columns={float:slice(3)})
a | b | c | d | ||
---|---|---|---|---|---|
float64 | float64 | float64 | int32 | ||
0 | 1 | 2 | 3 | 4 | |
1 | 5 | 6 | 7 | 8 | |
2 | 9 | 10 | 11 | 12 |
Selecting Columns¶
There are various ways to select columns in fread
:
Select with a dictionary:
data = ('a,b,c,d\n'
        '1,2,3,4\n'
        '5,6,7,8\n'
        '9,10,11,12')

# pass ``Ellipsis : None`` or ``... : None``
# to discard any columns that are not needed
fread(data, columns={"a":"a", ... : None})

       a
   int32
0      1
1      5
2      9

Selecting via a dictionary makes more sense when selecting and renaming columns at the same time.

Select columns with a set:

fread(data, columns={"a","b"})

       a      b
   int32  int32
0      1      2
1      5      6
2      9     10

Select a range of columns with a slice:

# select the second and third column
fread(data, columns=slice(1, 3))

       b      c
   int32  int32
0      2      3
1      6      7
2     10     11

# select the first column
# jump two hoops and
# select the third column
fread(data, columns=slice(None, 3, 2))

       a      c
   int32  int32
0      1      3
1      5      7
2      9     11

Select a range of columns with range:

fread(data, columns=range(1, 3))

       b      c
   int32  int32
0      2      3
1      6      7
2     10     11

Boolean selection:

fread(data, columns=[False, False, True, True])

       c      d
   int32  int32
0      3      4
1      7      8
2     11     12

Select with a list comprehension:

fread(data, columns=lambda cols: [col.name in ("a", "c") for col in cols])

       a      c
   int32  int32
0      1      3
1      5      7
2      9     11

Exclude columns with None:

fread(data, columns=['a', None, None, 'd'])

       a      d
   int32  int32
0      1      4
1      5      8
2      9     12

Exclude columns with a list comprehension:

fread(data, columns=lambda cols: [col.name not in ("a", "c") for col in cols])

       b      d
   int32  int32
0      2      4
1      6      8
2     10     12

Drop columns by assigning None to them via a dictionary:

data = ("A,B,C,D\n"
        "1,3,5,7\n"
        "2,4,6,8\n")

fread(data, columns={"B":None, "D":None})

       A      C
   int32  int32
0      1      5
1      2      6

Drop a column and change a data type:

fread(data, columns={"B":None, "C":str})

       A      C      D
   int32  str32  int32
0      1      5      7
1      2      6      8

Change a column name and type, and drop a column:

# pass a tuple, where the first item in the tuple is the new column name,
# and the other item is the new data type.
fread(data, columns={"A":("first", float), "B":None, "D":None})

     first      C
   float64  int32
0        1      5
1        2      6
You can also select which columns to read dynamically, based on the names/types of the columns in the file:
def colfilter(columns):
return [col.name=='species' or "length" in col.name
for col in columns]
fread('iris.csv', columns=colfilter, max_nrows=5)
sepal_length | petal_length | species | ||
---|---|---|---|---|
float64 | float64 | str32 | ||
0 | 5.1 | 1.4 | setosa | |
1 | 4.9 | 1.4 | setosa | |
2 | 4.7 | 1.3 | setosa | |
3 | 4.6 | 1.5 | setosa | |
4 | 5 | 1.4 | setosa |
The same approach can be used to auto-rename columns as they are read from the file:
def rename(columns):
return [col.name.upper() for col in columns]
fread('iris.csv', columns=rename, max_nrows=5)
SEPAL_LENGTH | SEPAL_WIDTH | PETAL_LENGTH | PETAL_WIDTH | SPECIES | ||
---|---|---|---|---|---|---|
float64 | float64 | float64 | float64 | str32 | ||
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | |
1 | 4.9 | 3 | 1.4 | 0.2 | setosa | |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | |
4 | 5 | 3.6 | 1.4 | 0.2 | setosa |
Selecting Data¶
Selecting Data – Columns¶
Column selection is via the j section in the DT[i, j, ...]
syntax. First,
let’s construct a simple Frame
:
from datatable import dt, f
from datetime import date
source = {"dates" : [date(2000, 1, 5), date(2010, 11, 23), date(2020, 2, 29), None],
"integers" : range(1, 5),
"floats" : [10.0, 11.5, 12.3, -13],
"strings" : ['A', 'B', None, 'D']
}
DT = dt.Frame(source)
DT
dates | integers | floats | strings | ||
---|---|---|---|---|---|
date32 | int32 | float64 | str32 | ||
0 | 2000-01-05 | 1 | 10 | A | |
1 | 2010-11-23 | 2 | 11.5 | B | |
2 | 2020-02-29 | 3 | 12.3 | NA | |
3 | NA | 4 | -13 | D |
Column selection is possible via a number of options:
By column name¶
DT[:, 'dates']
dates | ||
---|---|---|
date32 | ||
0 | 2000-01-05 | |
1 | 2010-11-23 | |
2 | 2020-02-29 | |
3 | NA |
When selecting all rows, the i
section can also be ...
.
By position¶
DT[..., 2] # 3rd column
floats | ||
---|---|---|
float64 | ||
0 | 10 | |
1 | 11.5 | |
2 | 12.3 | |
3 | -13 |
With position, you can select with a negative number – the column will be selected from the end; this is similar to indexing a python list:
DT[:, -2] # 2nd column from the end
floats | ||
---|---|---|
float64 | ||
0 | 10 | |
1 | 11.5 | |
2 | 12.3 | |
3 | -13 |
For a single column, it is possible to skip the : in the i section and pass the column name or position only:
DT['dates']
dates | ||
---|---|---|
date32 | ||
0 | 2000-01-05 | |
1 | 2010-11-23 | |
2 | 2020-02-29 | |
3 | NA |
DT[0]
dates | ||
---|---|---|
date32 | ||
0 | 2000-01-05 | |
1 | 2010-11-23 | |
2 | 2020-02-29 | |
3 | NA |
When selecting via column name or position, an error is raised if the name or position does not exist:
DT[:, 5]
DT[:, 'categoricals']
By data type¶
Column selection is possible by using python's built-in types that correspond to one of datatable's types:
DT[:, int]
integers | ||
---|---|---|
int32 | ||
0 | 1 | |
1 | 2 | |
2 | 3 | |
3 | 4 |
DT[:, dt.float64]
floats | ||
---|---|---|
float64 | ||
0 | 10 | |
1 | 11.5 | |
2 | 12.3 | |
3 | -13 |
DT[:, dt.ltype.time]
dates | ||
---|---|---|
date32 | ||
0 | 2000-01-05 | |
1 | 2010-11-23 | |
2 | 2020-02-29 | |
3 | NA |
A list of types can be selected as well:
DT[:, [date, str]]
dates | strings | ||
---|---|---|---|
date32 | str32 | ||
0 | 2000-01-05 | A | |
1 | 2010-11-23 | B | |
2 | 2020-02-29 | NA | |
3 | NA | D |
By list¶
Using a list allows for selection of multiple columns:
DT[:, ['integers', 'strings']]
integers | strings | ||
---|---|---|---|
int32 | str32 | ||
0 | 1 | A | |
1 | 2 | B | |
2 | 3 | NA | |
3 | 4 | D |
A tuple of selectors is also allowed, although not recommended from a stylistic perspective:
DT[:, (-3, 2, 3)]
integers | floats | strings | ||
---|---|---|---|---|
int32 | float64 | str32 | ||
0 | 1 | 10 | A | |
1 | 2 | 11.5 | B | |
2 | 3 | 12.3 | NA | |
3 | 4 | -13 | D |
Selection via list comprehension/generator expression is possible:
DT[:, [num for num in range(DT.ncols) if num % 2 == 0]]
dates | floats | ||
---|---|---|---|
date32 | float64 | ||
0 | 2000-01-05 | 10 | |
1 | 2010-11-23 | 11.5 | |
2 | 2020-02-29 | 12.3 | |
3 | NA | -13 |
Selecting columns via a mix of column names and positions (integers) is not allowed:
DT[:, ['dates', 2]]
Via slicing¶
When slicing with strings, both the start
and end
column names are
included in the returned frame:
DT[:, 'dates':'strings']
dates | integers | floats | strings | ||
---|---|---|---|---|---|
date32 | int32 | float64 | str32 | ||
0 | 2000-01-05 | 1 | 10 | A | |
1 | 2010-11-23 | 2 | 11.5 | B | |
2 | 2020-02-29 | 3 | 12.3 | NA | |
3 | NA | 4 | -13 | D |
However, when slicing via position, the columns are returned up to, but not including the final position; this is similar to the slicing pattern for Python’s sequences:
DT[:, 1:3]
integers | floats | ||
---|---|---|---|
int32 | float64 | ||
0 | 1 | 10 | |
1 | 2 | 11.5 | |
2 | 3 | 12.3 | |
3 | 4 | -13 |
DT[:, ::-1]
strings | floats | integers | dates | ||
---|---|---|---|---|---|
str32 | float64 | int32 | date32 | ||
0 | A | 10 | 1 | 2000-01-05 | |
1 | B | 11.5 | 2 | 2010-11-23 | |
2 | NA | 12.3 | 3 | 2020-02-29 | |
3 | D | -13 | 4 | NA |
It is possible to select columns via slicing, even if the indices are not in the Frame:
DT[:, 3:10] # there are only four columns in the Frame
strings | ||
---|---|---|
str32 | ||
0 | A | |
1 | B | |
2 | NA | |
3 | D |
Unlike with integer slicing, providing a name of the column that is not in the Frame will result in an error:
DT[:, "integers" : "categoricals"]
Slicing is also possible with the standard slice
function:
DT[:, slice('integers', 'strings')]
integers | floats | strings | ||
---|---|---|---|---|
int32 | float64 | str32 | ||
0 | 1 | 10 | A | |
1 | 2 | 11.5 | B | |
2 | 3 | 12.3 | NA | |
3 | 4 | -13 | D |
With the slice
function, multiple slicing on the columns is possible:
DT[:, [slice("dates", "integers"), slice("floats", "strings")]]
dates | integers | floats | strings | ||
---|---|---|---|---|---|
date32 | int32 | float64 | str32 | ||
0 | 2000-01-05 | 1 | 10 | A | |
1 | 2010-11-23 | 2 | 11.5 | B | |
2 | 2020-02-29 | 3 | 12.3 | NA | |
3 | NA | 4 | -13 | D |
DT[:, [slice("integers", "dates"), slice("strings", "floats")]]
integers | dates | strings | floats | ||
---|---|---|---|---|---|
int32 | date32 | str32 | float64 | ||
0 | 1 | 2000-01-05 | A | 10 | |
1 | 2 | 2010-11-23 | B | 11.5 | |
2 | 3 | 2020-02-29 | NA | 12.3 | |
3 | 4 | NA | D | -13 |
Slicing on strings can be combined with column names during selection:
DT[:, [slice("integers", "dates"), "strings"]]
integers | dates | strings | ||
---|---|---|---|---|
int32 | date32 | str32 | ||
0 | 1 | 2000-01-05 | A | |
1 | 2 | 2010-11-23 | B | |
2 | 3 | 2020-02-29 | NA | |
3 | 4 | NA | D |
But not with integers:
DT[:, [slice("integers", "dates"), 1]]
Slicing on position can be combined with column position:
DT[:, [slice(1, 3), 0]]
integers | floats | dates | ||
---|---|---|---|---|
int32 | float64 | date32 | ||
0 | 1 | 10 | 2000-01-05 | |
1 | 2 | 11.5 | 2010-11-23 | |
2 | 3 | 12.3 | 2020-02-29 | |
3 | 4 | -13 | NA |
But not with strings:
DT[:, [slice(1, 3), "dates"]]
Via booleans¶
When selecting via booleans, the sequence length must be equal to the number of columns in the frame:
DT[:, [True, True, False, False]]
dates | integers | ||
---|---|---|---|
date32 | int32 | ||
0 | 2000-01-05 | 1 | |
1 | 2010-11-23 | 2 | |
2 | 2020-02-29 | 3 | |
3 | NA | 4 |
Booleans generated from a list comprehension/generator expression allow for nifty selections:
DT[:, ["i" in name for name in DT.names]]
integers | strings | ||
---|---|---|---|
int32 | str32 | ||
0 | 1 | A | |
1 | 2 | B | |
2 | 3 | NA | |
3 | 4 | D |
In this example we want to select columns that are numeric (integers or floats) and whose average is greater than 3:
DT[:, [column.stype.ltype.name in ("real", "int") and column.mean1() > 3
for column in DT]]
floats | ||
---|---|---|
float64 | ||
0 | 10 | |
1 | 11.5 | |
2 | 12.3 | |
3 | -13 |
Via f-expressions¶
All the selection options above (except boolean) are also possible via f-expressions:
DT[:, f.dates]
dates | ||
---|---|---|
date32 | ||
0 | 2000-01-05 | |
1 | 2010-11-23 | |
2 | 2020-02-29 | |
3 | NA |
DT[:, f[-1]]
strings | ||
---|---|---|
str32 | ||
0 | A | |
1 | B | |
2 | NA | |
3 | D |
DT[:, f['integers':'strings']]
integers | floats | strings | ||
---|---|---|---|---|
int32 | float64 | str32 | ||
0 | 1 | 10 | A | |
1 | 2 | 11.5 | B | |
2 | 3 | 12.3 | NA | |
3 | 4 | -13 | D |
DT[:, f['integers':]]
integers | floats | strings | ||
---|---|---|---|---|
int32 | float64 | str32 | ||
0 | 1 | 10 | A | |
1 | 2 | 11.5 | B | |
2 | 3 | 12.3 | NA | |
3 | 4 | -13 | D |
DT[:, f[1::-1]]
integers | dates | ||
---|---|---|---|
int32 | date32 | ||
0 | 1 | 2000-01-05 | |
1 | 2 | 2010-11-23 | |
2 | 3 | 2020-02-29 | |
3 | 4 | NA |
DT[:, f[date, int, float]]
dates | integers | floats | ||
---|---|---|---|---|
date32 | int32 | float64 | ||
0 | 2000-01-05 | 1 | 10 | |
1 | 2010-11-23 | 2 | 11.5 | |
2 | 2020-02-29 | 3 | 12.3 | |
3 | NA | 4 | -13 |
DT[:, f["dates":"integers", "floats":"strings"]]
dates | integers | floats | strings | ||
---|---|---|---|---|---|
date32 | int32 | float64 | str32 | ||
0 | 2000-01-05 | 1 | 10 | A | |
1 | 2010-11-23 | 2 | 11.5 | B | |
2 | 2020-02-29 | 3 | 12.3 | NA | |
3 | NA | 4 | -13 | D |
Note
If the column names are python keywords (def, del, …), the dot notation is not possible with f-expressions; you have to use the bracket notation to access these columns.
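For example (a sketch with a hypothetical keyword-named column):

DT_kw = dt.Frame({"def": [1, 2]})
DT_kw[:, f["def"]]   # works; f.def would be a SyntaxError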
Note
Selecting columns with DT[:, f[None]]
returns an empty Frame. This is
different from DT[:, None]
, which currently returns all the columns.
The behavior of DT[:, None]
may change in the future:
DT[:, None]
dates | integers | floats | strings | ||
---|---|---|---|---|---|
date32 | int32 | float64 | str32 | ||
0 | 2000-01-05 | 1 | 10 | A | |
1 | 2010-11-23 | 2 | 11.5 | B | |
2 | 2020-02-29 | 3 | 12.3 | NA | |
3 | NA | 4 | -13 | D |
DT[:, f[None]]
0 | |
1 | |
2 | |
3 |
Selecting Data – Rows¶
There are a number of ways to select rows of data via the i
section.
Note
The index labels in a Frame
are just for aesthetics; they
serve no actual purpose during selection.
By Position¶
Only integer values are acceptable:
DT[0, :]
dates | integers | floats | strings | ||
---|---|---|---|---|---|
date32 | int32 | float64 | str32 | ||
0 | 2000-01-05 | 1 | 10 | A |
DT[-1, :] # last row
dates | integers | floats | strings | ||
---|---|---|---|---|---|
date32 | int32 | float64 | str32 | ||
0 | NA | 4 | -13 | D |
Via Sequence of Positions¶
Any acceptable sequence of positions is applicable here. Listed below are some of these sequences.
List (or tuple):

DT[[1, 2, 3], :]

        dates  integers   floats  strings
       date32     int32  float64    str32
0  2010-11-23         2     11.5        B
1  2020-02-29         3     12.3       NA
2          NA         4      -13        D

An integer numpy 1-D array:

DT[np.arange(3), :]

        dates  integers   floats  strings
       date32     int32  float64    str32
0  2000-01-05         1       10        A
1  2010-11-23         2     11.5        B
2  2020-02-29         3     12.3       NA

A one-column integer Frame:

DT[dt.Frame([1, 2, 3]), :]

        dates  integers   floats  strings
       date32     int32  float64    str32
0  2010-11-23         2     11.5        B
1  2020-02-29         3     12.3       NA
2          NA         4      -13        D

An integer pandas Series:

DT[pd.Series([1, 2, 3]), :]

        dates  integers   floats  strings
       date32     int32  float64    str32
0  2010-11-23         2     11.5        B
1  2020-02-29         3     12.3       NA
2          NA         4      -13        D

A python range:

DT[range(1, 3), :]

        dates  integers   floats  strings
       date32     int32  float64    str32
0  2010-11-23         2     11.5        B
1  2020-02-29         3     12.3       NA

A generator expression:

DT[(num for num in range(4)), :]

        dates  integers   floats  strings
       date32     int32  float64    str32
0  2000-01-05         1       10        A
1  2010-11-23         2     11.5        B
2  2020-02-29         3     12.3       NA
3          NA         4      -13        D
If the position passed to i does not exist, an error is raised:
DT[(num for num in range(7)), :]
The set sequence is not acceptable in the i or j sections. Except for lists/tuples, all the other sequence types passed into the i section can only contain positive integers.
Via booleans¶
When selecting rows via boolean sequence, the length of the sequence must be the same as the number of rows:
DT[[True, True, False, False], :]
dates | integers | floats | strings | ||
---|---|---|---|---|---|
date32 | int32 | float64 | str32 | ||
0 | 2000-01-05 | 1 | 10 | A | |
1 | 2010-11-23 | 2 | 11.5 | B |
DT[(n%2 == 0 for n in range(DT.nrows)), :]
dates | integers | floats | strings | ||
---|---|---|---|---|---|
date32 | int32 | float64 | str32 | ||
0 | 2000-01-05 | 1 | 10 | A | |
1 | 2020-02-29 | 3 | 12.3 | NA |
Via slicing¶
Slicing works similarly to slicing a python list
:
DT[1:3, :]
dates | integers | floats | strings | ||
---|---|---|---|---|---|
date32 | int32 | float64 | str32 | ||
0 | 2010-11-23 | 2 | 11.5 | B | |
1 | 2020-02-29 | 3 | 12.3 | NA |
DT[::-1, :]
dates | integers | floats | strings | ||
---|---|---|---|---|---|
date32 | int32 | float64 | str32 | ||
0 | NA | 4 | -13 | D | |
1 | 2020-02-29 | 3 | 12.3 | NA | |
2 | 2010-11-23 | 2 | 11.5 | B | |
3 | 2000-01-05 | 1 | 10 | A |
DT[-1:-3:-1, :]
dates | integers | floats | strings | ||
---|---|---|---|---|---|
date32 | int32 | float64 | str32 | ||
0 | NA | 4 | -13 | D | |
1 | 2020-02-29 | 3 | 12.3 | NA |
Slicing is also possible with the slice
function:
DT[slice(1, 3), :]
dates | integers | floats | strings | ||
---|---|---|---|---|---|
date32 | int32 | float64 | str32 | ||
0 | 2010-11-23 | 2 | 11.5 | B | |
1 | 2020-02-29 | 3 | 12.3 | NA |
It is possible to select rows with multiple slices. Let’s increase the number of rows in the Frame:
DT = dt.repeat(DT, 3)
DT
dates | integers | floats | strings | ||
---|---|---|---|---|---|
date32 | int32 | float64 | str32 | ||
0 | 2000-01-05 | 1 | 10 | A | |
1 | 2010-11-23 | 2 | 11.5 | B | |
2 | 2020-02-29 | 3 | 12.3 | NA | |
3 | NA | 4 | -13 | D | |
4 | 2000-01-05 | 1 | 10 | A | |
5 | 2010-11-23 | 2 | 11.5 | B | |
6 | 2020-02-29 | 3 | 12.3 | NA | |
7 | NA | 4 | -13 | D | |
8 | 2000-01-05 | 1 | 10 | A | |
9 | 2010-11-23 | 2 | 11.5 | B | |
10 | 2020-02-29 | 3 | 12.3 | NA | |
11 | NA | 4 | -13 | D |
DT[[slice(1, 3), slice(5, 8)], :]
dates | integers | floats | strings | ||
---|---|---|---|---|---|
date32 | int32 | float64 | str32 | ||
0 | 2010-11-23 | 2 | 11.5 | B | |
1 | 2020-02-29 | 3 | 12.3 | NA | |
2 | 2010-11-23 | 2 | 11.5 | B | |
3 | 2020-02-29 | 3 | 12.3 | NA | |
4 | NA | 4 | -13 | D |
DT[[slice(5, 8), 1, 3, slice(10, 12)], :]
dates | integers | floats | strings | ||
---|---|---|---|---|---|
date32 | int32 | float64 | str32 | ||
0 | 2010-11-23 | 2 | 11.5 | B | |
1 | 2020-02-29 | 3 | 12.3 | NA | |
2 | NA | 4 | -13 | D | |
3 | 2010-11-23 | 2 | 11.5 | B | |
4 | NA | 4 | -13 | D | |
5 | 2020-02-29 | 3 | 12.3 | NA | |
6 | NA | 4 | -13 | D |
Via f-expressions¶
f-expressions produce boolean columns that can be used to filter/select the appropriate rows:
DT[f.dates < dt.Frame([date(2020,1,1)]), :]
dates | integers | floats | strings | ||
---|---|---|---|---|---|
date32 | int32 | float64 | str32 | ||
0 | 2000-01-05 | 1 | 10 | A | |
1 | 2010-11-23 | 2 | 11.5 | B |
DT[f.integers % 2 != 0, :]
dates | integers | floats | strings | ||
---|---|---|---|---|---|
date32 | int32 | float64 | str32 | ||
0 | 2000-01-05 | 1 | 10 | A | |
1 | 2020-02-29 | 3 | 12.3 | NA |
DT[(f.integers == 3) & (f.strings == None), ...]
dates | integers | floats | strings | ||
---|---|---|---|---|---|
date32 | int32 | float64 | str32 | ||
0 | 2020-02-29 | 3 | 12.3 | NA | |
1 | 2020-02-29 | 3 | 12.3 | NA | |
2 | 2020-02-29 | 3 | 12.3 | NA |
Selection is also possible via data types:
DT[f[float] < 1, :]
dates | integers | floats | strings | ||
---|---|---|---|---|---|
date32 | int32 | float64 | str32 | ||
0 | NA | 4 | -13 | D | |
1 | NA | 4 | -13 | D | |
2 | NA | 4 | -13 | D |
DT[dt.rowsum(f[int, float]) > 12, :]
dates | integers | floats | strings | ||
---|---|---|---|---|---|
date32 | int32 | float64 | str32 | ||
0 | 2010-11-23 | 2 | 11.5 | B | |
1 | 2020-02-29 | 3 | 12.3 | NA | |
2 | 2010-11-23 | 2 | 11.5 | B | |
3 | 2020-02-29 | 3 | 12.3 | NA | |
4 | 2010-11-23 | 2 | 11.5 | B | |
5 | 2020-02-29 | 3 | 12.3 | NA |
Select rows and columns¶
Rows and columns can be selected simultaneously:
DT[0, slice(1, 3)]
integers | floats | ||
---|---|---|---|
int32 | float64 | ||
0 | 1 | 10 |
DT[2 : 6, ["i" in name for name in DT.names]]
integers | strings | ||
---|---|---|---|
int32 | str32 | ||
0 | 3 | NA | |
1 | 4 | D | |
2 | 1 | A | |
3 | 2 | B |
DT[f.integers > dt.mean(f.floats) - 3, f['strings' : 'integers']]
strings | floats | integers | ||
---|---|---|---|---|
str32 | float64 | int32 | ||
0 | NA | 12.3 | 3 | |
1 | D | -13 | 4 | |
2 | NA | 12.3 | 3 | |
3 | D | -13 | 4 | |
4 | NA | 12.3 | 3 | |
5 | D | -13 | 4 |
Single value access¶
Passing single integers into the i and j sections returns a scalar value:
DT[0, 0]            # datetime.date(2000, 1, 5)
DT[0, 2]            # 10.0
DT[-3, 'strings']   # 'B'
Deselect rows/columns¶
Deselection of rows/columns is possible via a list comprehension or generator expression.
Deselect a single column/row:
# The list comprehension returns the specific column names
DT[:, [name for name in DT.names if name != "integers"]]
dates | floats | strings | ||
---|---|---|---|---|
date32 | float64 | str32 | ||
0 | 2000-01-05 | 10 | A | |
1 | 2010-11-23 | 11.5 | B | |
2 | 2020-02-29 | 12.3 | NA | |
3 | NA | -13 | D | |
4 | 2000-01-05 | 10 | A | |
5 | 2010-11-23 | 11.5 | B | |
6 | 2020-02-29 | 12.3 | NA | |
7 | NA | -13 | D | |
8 | 2000-01-05 | 10 | A | |
9 | 2010-11-23 | 11.5 | B | |
10 | 2020-02-29 | 12.3 | NA | |
11 | NA | -13 | D |
# A boolean sequence is returned in the list comprehension
DT[[num != 5 for num in range(DT.nrows)], 'dates']
dates | ||
---|---|---|
date32 | ||
0 | 2000-01-05 | |
1 | 2010-11-23 | |
2 | 2020-02-29 | |
3 | NA | |
4 | 2000-01-05 | |
5 | 2020-02-29 | |
6 | NA | |
7 | 2000-01-05 | |
8 | 2010-11-23 | |
9 | 2020-02-29 | |
10 | NA |
Deselect multiple columns/rows:
DT[:, [name not in ("integers", "dates") for name in DT.names]]
floats | strings | ||
---|---|---|---|
float64 | str32 | ||
0 | 10 | A | |
1 | 11.5 | B | |
2 | 12.3 | NA | |
3 | -13 | D | |
4 | 10 | A | |
5 | 11.5 | B | |
6 | 12.3 | NA | |
7 | -13 | D | |
8 | 10 | A | |
9 | 11.5 | B | |
10 | 12.3 | NA | |
11 | -13 | D |
DT[(num not in range(3, 8) for num in range(DT.nrows)), ['integers', 'floats']]
integers | floats | ||
---|---|---|---|
int32 | float64 | ||
0 | 1 | 10 | |
1 | 2 | 11.5 | |
2 | 3 | 12.3 | |
3 | 1 | 10 | |
4 | 2 | 11.5 | |
5 | 3 | 12.3 | |
6 | 4 | -13 |
DT[:, [num not in (2, 3) for num in range(DT.ncols)]]
dates | integers | ||
---|---|---|---|
date32 | int32 | ||
0 | 2000-01-05 | 1 | |
1 | 2010-11-23 | 2 | |
2 | 2020-02-29 | 3 | |
3 | NA | 4 | |
4 | 2000-01-05 | 1 | |
5 | 2010-11-23 | 2 | |
6 | 2020-02-29 | 3 | |
7 | NA | 4 | |
8 | 2000-01-05 | 1 | |
9 | 2010-11-23 | 2 | |
10 | 2020-02-29 | 3 | |
11 | NA | 4 |
# an alternative to the previous example
DT[:, [num not in (2, 3) for num, _ in enumerate(DT.names)]]
dates | integers | ||
---|---|---|---|
date32 | int32 | ||
0 | 2000-01-05 | 1 | |
1 | 2010-11-23 | 2 | |
2 | 2020-02-29 | 3 | |
3 | NA | 4 | |
4 | 2000-01-05 | 1 | |
5 | 2010-11-23 | 2 | |
6 | 2020-02-29 | 3 | |
7 | NA | 4 | |
8 | 2000-01-05 | 1 | |
9 | 2010-11-23 | 2 | |
10 | 2020-02-29 | 3 | |
11 | NA | 4 |
Deselect by data type:
# This selects columns that are not numeric
DT[2 : 7, (dtype.name not in ("real", "int") for dtype in DT.ltypes)]
dates | strings | ||
---|---|---|---|
date32 | str32 | ||
0 | 2020-02-29 | NA | |
1 | NA | D | |
2 | 2000-01-05 | A | |
3 | 2010-11-23 | B | |
4 | 2020-02-29 | NA |
Slicing can also be used to exclude rows/columns. The code below excludes rows from position 3 to 6:
DT[[slice(None, 3), slice(7, None)], :]
dates | integers | floats | strings | ||
---|---|---|---|---|---|
date32 | int32 | float64 | str32 | ||
0 | 2000-01-05 | 1 | 10 | A | |
1 | 2010-11-23 | 2 | 11.5 | B | |
2 | 2020-02-29 | 3 | 12.3 | NA | |
3 | NA | 4 | -13 | D | |
4 | 2000-01-05 | 1 | 10 | A | |
5 | 2010-11-23 | 2 | 11.5 | B | |
6 | 2020-02-29 | 3 | 12.3 | NA | |
7 | NA | 4 | -13 | D |
Columns can also be deselected via the remove() method, where the column name, the column position, or the data type is passed to the f symbol:
DT[:, f[:].remove(f.dates)]
integers | floats | strings | ||
---|---|---|---|---|
int32 | float64 | str32 | ||
0 | 1 | 10 | A | |
1 | 2 | 11.5 | B | |
2 | 3 | 12.3 | NA | |
3 | 4 | -13 | D | |
4 | 1 | 10 | A | |
5 | 2 | 11.5 | B | |
6 | 3 | 12.3 | NA | |
7 | 4 | -13 | D | |
8 | 1 | 10 | A | |
9 | 2 | 11.5 | B | |
10 | 3 | 12.3 | NA | |
11 | 4 | -13 | D |
DT[:, f[:].remove(f[0])]
integers | floats | strings | ||
---|---|---|---|---|
int32 | float64 | str32 | ||
0 | 1 | 10 | A | |
1 | 2 | 11.5 | B | |
2 | 3 | 12.3 | NA | |
3 | 4 | -13 | D | |
4 | 1 | 10 | A | |
5 | 2 | 11.5 | B | |
6 | 3 | 12.3 | NA | |
7 | 4 | -13 | D | |
8 | 1 | 10 | A | |
9 | 2 | 11.5 | B | |
10 | 3 | 12.3 | NA | |
11 | 4 | -13 | D |
DT[:, f[:].remove(f[1:3])]
dates | strings | ||
---|---|---|---|
date32 | str32 | ||
0 | 2000-01-05 | A | |
1 | 2010-11-23 | B | |
2 | 2020-02-29 | NA | |
3 | NA | D | |
4 | 2000-01-05 | A | |
5 | 2010-11-23 | B | |
6 | 2020-02-29 | NA | |
7 | NA | D | |
8 | 2000-01-05 | A | |
9 | 2010-11-23 | B | |
10 | 2020-02-29 | NA | |
11 | NA | D |
DT[:, f[:].remove(f['strings':'integers'])]
dates | ||
---|---|---|
date32 | ||
0 | 2000-01-05 | |
1 | 2010-11-23 | |
2 | 2020-02-29 | |
3 | NA | |
4 | 2000-01-05 | |
5 | 2010-11-23 | |
6 | 2020-02-29 | |
7 | NA | |
8 | 2000-01-05 | |
9 | 2010-11-23 | |
10 | 2020-02-29 | |
11 | NA |
DT[:, f[:].remove(f[int, float])]
dates | strings | ||
---|---|---|---|
date32 | str32 | ||
0 | 2000-01-05 | A | |
1 | 2010-11-23 | B | |
2 | 2020-02-29 | NA | |
3 | NA | D | |
4 | 2000-01-05 | A | |
5 | 2010-11-23 | B | |
6 | 2020-02-29 | NA | |
7 | NA | D | |
8 | 2000-01-05 | A | |
9 | 2010-11-23 | B | |
10 | 2020-02-29 | NA | |
11 | NA | D |
DT[:, f[:].remove(f[:])]
0 | |
1 | |
2 | |
3 | |
4 | |
5 | |
6 | |
7 | |
8 | |
9 | |
10 | |
11 |
Delete rows/columns¶
To actually delete a row (or a column), use the del statement; this is an in-place operation, so no reassignment is needed.
Delete multiple rows:
del DT[3:7, :]
DT
dates | integers | floats | strings | ||
---|---|---|---|---|---|
date32 | int32 | float64 | str32 | ||
0 | 2000-01-05 | 1 | 10 | A | |
1 | 2010-11-23 | 2 | 11.5 | B | |
2 | 2020-02-29 | 3 | 12.3 | NA | |
3 | NA | 4 | -13 | D | |
4 | 2000-01-05 | 1 | 10 | A | |
5 | 2010-11-23 | 2 | 11.5 | B | |
6 | 2020-02-29 | 3 | 12.3 | NA | |
7 | NA | 4 | -13 | D |
Delete a column:
del DT['strings']
DT
dates | integers | floats | ||
---|---|---|---|---|
date32 | int32 | float64 | ||
0 | 2000-01-05 | 1 | 10 | |
1 | 2010-11-23 | 2 | 11.5 | |
2 | 2020-02-29 | 3 | 12.3 | |
3 | NA | 4 | -13 | |
4 | 2000-01-05 | 1 | 10 | |
5 | 2010-11-23 | 2 | 11.5 | |
6 | 2020-02-29 | 3 | 12.3 | |
7 | NA | 4 | -13 |
Delete a single row:
del DT[3, :]
DT
dates | integers | floats | ||
---|---|---|---|---|
date32 | int32 | float64 | ||
0 | 2000-01-05 | 1 | 10 | |
1 | 2010-11-23 | 2 | 11.5 | |
2 | 2020-02-29 | 3 | 12.3 | |
3 | 2000-01-05 | 1 | 10 | |
4 | 2010-11-23 | 2 | 11.5 | |
5 | 2020-02-29 | 3 | 12.3 | |
6 | NA | 4 | -13 |
Delete values in specific cells (the frame keeps its shape; the deleted values are replaced with NAs):
del DT[2:4, ['integers', 'floats']]
DT
dates | integers | floats | ||
---|---|---|---|---|
date32 | int32 | float64 | ||
0 | 2000-01-05 | 1 | 10 | |
1 | 2010-11-23 | 2 | 11.5 | |
2 | 2020-02-29 | NA | NA | |
3 | 2000-01-05 | NA | NA | |
4 | 2010-11-23 | 2 | 11.5 | |
5 | 2020-02-29 | 3 | 12.3 | |
6 | NA | 4 | -13 |
Delete multiple columns:
del DT[:, ['dates', 'floats']]
DT
integers | ||
---|---|---|
int32 | ||
0 | 1 | |
1 | 2 | |
2 | NA | |
3 | NA | |
4 | 2 | |
5 | 3 | |
6 | 4 |
Grouping with by()¶
The by() modifier splits a dataframe into groups, either via the provided column(s) or via f-expressions, and then applies i and j within each group. This split-apply-combine strategy allows for a number of operations:
Aggregations per group,
Transformation of a column or columns, where the shape of the dataframe is maintained,
Filtration, where some data are kept and the others discarded, based on a condition or conditions.
Aggregation¶
The aggregate function is applied in the j section.
Group by one column:
from datatable import (dt, f, by, ifelse, update, sort,
count, min, max, mean, sum, rowsum)
df = dt.Frame("""Fruit Date Name Number
Apples 10/6/2016 Bob 7
Apples 10/6/2016 Bob 8
Apples 10/6/2016 Mike 9
Apples 10/7/2016 Steve 10
Apples 10/7/2016 Bob 1
Oranges 10/7/2016 Bob 2
Oranges 10/6/2016 Tom 15
Oranges 10/6/2016 Mike 57
Oranges 10/6/2016 Bob 65
Oranges 10/7/2016 Tony 1
Grapes 10/7/2016 Bob 1
Grapes 10/7/2016 Tom 87
Grapes 10/7/2016 Bob 22
Grapes 10/7/2016 Bob 12
Grapes 10/7/2016 Tony 15""")
df[:, sum(f.Number), by('Fruit')]
Fruit | Number | ||
---|---|---|---|
str32 | int64 | ||
0 | Apples | 35 | |
1 | Grapes | 137 | |
2 | Oranges | 140 |
Group by multiple columns:
df[:, sum(f.Number), by('Fruit', 'Name')]
Fruit | Name | Number | ||
---|---|---|---|---|
str32 | str32 | int64 | ||
0 | Apples | Bob | 16 | |
1 | Apples | Mike | 9 | |
2 | Apples | Steve | 10 | |
3 | Grapes | Bob | 35 | |
4 | Grapes | Tom | 87 | |
5 | Grapes | Tony | 15 | |
6 | Oranges | Bob | 67 | |
7 | Oranges | Mike | 57 | |
8 | Oranges | Tom | 15 | |
9 | Oranges | Tony | 1 |
By column position:
df[:, sum(f.Number), by(f[0])]
Fruit | Number | ||
---|---|---|---|
str32 | int64 | ||
0 | Apples | 35 | |
1 | Grapes | 137 | |
2 | Oranges | 140 |
By boolean expression:
df[:, sum(f.Number), by(f.Fruit == "Apples")]
C0 | Number | ||
---|---|---|---|
bool8 | int64 | ||
0 | 0 | 277 | |
1 | 1 | 35 |
Combination of column and boolean expression:
df[:, sum(f.Number), by(f.Name, f.Fruit == "Apples")]
Name | C0 | Number | ||
---|---|---|---|---|
str32 | bool8 | int64 | ||
0 | Bob | 0 | 102 | |
1 | Bob | 1 | 16 | |
2 | Mike | 0 | 57 | |
3 | Mike | 1 | 9 | |
4 | Steve | 1 | 10 | |
5 | Tom | 0 | 102 | |
6 | Tony | 0 | 16 |
The grouping column can be excluded from the final output:
df[:, sum(f.Number), by('Fruit', add_columns=False)]
Number | ||
---|---|---|
int64 | ||
0 | 35 | |
1 | 137 | |
2 | 140 |
Note
The resulting dataframe has the grouping column(s) as the first column(s). The grouping columns are excluded from j, unless explicitly included. The grouping columns are sorted in ascending order.
Apply multiple aggregate functions to a column in the j section:
df[:, {"min": min(f.Number),
"max": max(f.Number)},
by('Fruit','Date')]
Fruit | Date | min | max | ||
---|---|---|---|---|---|
str32 | str32 | int32 | int32 | ||
0 | Apples | 10/6/2016 | 7 | 9 | |
1 | Apples | 10/7/2016 | 1 | 10 | |
2 | Grapes | 10/7/2016 | 1 | 87 | |
3 | Oranges | 10/6/2016 | 15 | 65 | |
4 | Oranges | 10/7/2016 | 1 | 2 |
Functions can be applied across a columnset. Task: get the sum of col3 and col4, grouped by col1 and col2:
df = dt.Frame(""" col1 col2 col3 col4 col5
a c 1 2 f
a c 1 2 f
a d 1 2 f
b d 1 2 g
b e 1 2 g
b e 1 2 g""")
df[:, sum(f["col3":"col4"]), by('col1', 'col2')]
col1 | col2 | col3 | col4 | ||
---|---|---|---|---|---|
str32 | str32 | int64 | int64 | ||
0 | a | c | 2 | 4 | |
1 | a | d | 1 | 2 | |
2 | b | d | 1 | 2 | |
3 | b | e | 2 | 4 |
Apply different aggregate functions to different columns:
df[:, [max(f.col3), min(f.col4)], by('col1', 'col2')]
col1 | col2 | col3 | col4 | ||
---|---|---|---|---|---|
str32 | str32 | int8 | int32 | ||
0 | a | c | 1 | 2 | |
1 | a | d | 1 | 2 | |
2 | b | d | 1 | 2 | |
3 | b | e | 1 | 2 |
Nested aggregations in j. Task: group by column cat and get the row sum of A and B, and of C and D:
df = dt.Frame(""" idx A B C D cat
J 1 2 3 1 x
K 4 5 6 2 x
L 7 8 9 3 y
M 1 2 3 4 y
N 4 5 6 5 z
O 7 8 9 6 z""")
df[:,
{"AB" : sum(rowsum(f['A':'B'])),
"CD" : sum(rowsum(f['C':'D']))},
by('cat')
]
cat | AB | CD | ||
---|---|---|---|---|
str32 | int64 | int64 | ||
0 | x | 12 | 12 | |
1 | y | 18 | 19 | |
2 | z | 24 | 26 |
Computation between aggregated columns. Task: get the difference between the largest and smallest value within each group:
df = dt.Frame("""GROUP VALUE
1 5
2 2
1 10
2 20
1 7""")
df[:, max(f.VALUE) - min(f.VALUE), by('GROUP')]
GROUP | C0 | ||
---|---|---|---|
int32 | int32 | ||
0 | 1 | 5 | |
1 | 2 | 18 |
Null values are not excluded from the grouping column:
df = dt.Frame(""" a b c
1 2.0 3
1 NaN 4
2 1.0 3
1 2.0 2""")
df[:, sum(f[:]), by('b')]
b | a | c | ||
---|---|---|---|---|
float64 | int64 | int64 | ||
0 | NA | 1 | 4 | |
1 | 1 | 2 | 3 | |
2 | 2 | 2 | 5 |
If you wish to ignore null values, first filter them out:
df[f.b != None, :][:, sum(f[:]), by('b')]
b | a | c | ||
---|---|---|---|---|
float64 | int64 | int64 | ||
0 | 1 | 2 | 3 | |
1 | 2 | 2 | 5 |
Filtration¶
This occurs in the i section of the groupby, where only a subset of the data per group is needed; selection is limited to integers or slicing.
Note
i is applied after the grouping, not before. f-expressions in the i section are not yet implemented for groupby.
Select the first row per group:
df = dt.Frame("""A B
1 10
1 20
2 30
2 40
3 10""")
# passing 0 as index gets the first row after the grouping
# note that python's index starts from 0, not 1
df[0, :, by('A')]
A | B | ||
---|---|---|---|
int32 | int32 | ||
0 | 1 | 10 | |
1 | 2 | 30 | |
2 | 3 | 10 |
Select the last row per group:
df[-1, :, by('A')]
A | B | ||
---|---|---|---|
int32 | int32 | ||
0 | 1 | 20 | |
1 | 2 | 40 | |
2 | 3 | 10 |
Select the nth row per group. Task: select the second row per group:
df[1, :, by('A')]
A | B | ||
---|---|---|---|
int32 | int32 | ||
0 | 1 | 20 | |
1 | 2 | 40 |
Note
Filtering this way can be used to drop duplicates; you can decide to keep the first or last non-duplicate.
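Building on that note, a minimal deduplication sketch (df_dup is a hypothetical frame, separate from the running example):
from datatable import dt, by

df_dup = dt.Frame(A=[1, 1, 2], B=["x", "y", "z"])
df_dup[0, :, by("A")]   # keeps only the first row for each value of A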
Select the latest entry per group:
df = dt.Frame(""" id product date
220 6647 2014-09-01
220 6647 2014-09-03
220 6647 2014-10-16
826 3380 2014-11-11
826 3380 2014-12-09
826 3380 2015-05-19
901 4555 2014-09-01
901 4555 2014-10-05
901 4555 2014-11-01""")
df[-1, :, by('id'), sort('date')]
id | product | date | ||
---|---|---|---|---|
int32 | int32 | str32 | ||
0 | 220 | 6647 | 2014-10-16 | |
1 | 826 | 3380 | 2015-05-19 | |
2 | 901 | 4555 | 2014-11-01 |
Note
If the sort and by modifiers are both present, the sorting occurs after the grouping, and within each group.
Replicate SQL's HAVING clause. Task: filter for groups where the length/count is greater than 1:
df = dt.Frame([[1, 1, 5], [2, 3, 6]], names=['A', 'B'])
df
A | B | ||
---|---|---|---|
int32 | int32 | ||
0 | 1 | 2 | |
1 | 1 | 3 | |
2 | 5 | 6 |
# Get the count of each group,
# and assign to a new column, using the update method
# note that the update operation is in-place;
# there is no need to assign back to the dataframe
df[:, update(filter_col = count()), by('A')]
# The new column will be added to the end
# We use an f-expression to return rows
# in each group where the count is greater than 1
df[f.filter_col > 1, f[:-1]]
A | B | ||
---|---|---|---|
int32 | int32 | ||
0 | 1 | 2 | |
1 | 1 | 3 |
Keep only rows per group where diff is the minimum:
df = dt.Frame(""" item diff otherstuff
1 2 1
1 1 2
1 3 7
2 -1 0
2 1 3
2 4 9
2 -6 2
3 0 0
3 2 9""")
df[:,
# get booleans marking rows where the diff column is at the group minimum
update(filter_col = f.diff == min(f.diff)),
by('item')]
df[f.filter_col == 1, :-1]
item | diff | otherstuff | ||
---|---|---|---|---|
int32 | int32 | int32 | ||
0 | 1 | 1 | 2 | |
1 | 2 | -6 | 2 | |
2 | 3 | 0 | 0 |
Keep only entries where make has both 0 and 1 in sale:
df = dt.Frame(""" make country other_columns sale
honda tokyo data 1
honda hirosima data 0
toyota tokyo data 1
toyota hirosima data 0
suzuki tokyo data 0
suzuki hirosima data 0
ferrari tokyo data 1
ferrari hirosima data 0
nissan tokyo data 1
nissan hirosima data 0""")
df[:,
update(filter_col = sum(f.sale)),
by('make')]
df[f.filter_col == 1, :-1]
make | country | other_columns | sale | ||
---|---|---|---|---|---|
str32 | str32 | str32 | bool8 | ||
0 | honda | tokyo | data | 1 | |
1 | honda | hirosima | data | 0 | |
2 | toyota | tokyo | data | 1 | |
3 | toyota | hirosima | data | 0 | |
4 | ferrari | tokyo | data | 1 | |
5 | ferrari | hirosima | data | 0 | |
6 | nissan | tokyo | data | 1 | |
7 | nissan | hirosima | data | 0 |
Transformation¶
This is when a function is applied to a column after a groupby and the resulting column is appended back to the dataframe. The number of rows of the dataframe is unchanged. The update() method makes this possible and easy.
Get the minimum and maximum of column c per group, and append them to the dataframe:
df = dt.Frame(""" c y
9 0
8 0
3 1
6 2
1 3
2 3
5 3
4 4
0 4
7 4""")
# Assign the new columns via the update method
df[:,
update(min_col = min(f.c),
max_col = max(f.c)),
by('y')]
df
c | y | min_col | max_col | ||
---|---|---|---|---|---|
int32 | int32 | int32 | int32 | ||
0 | 9 | 0 | 8 | 9 | |
1 | 8 | 0 | 8 | 9 | |
2 | 3 | 1 | 3 | 3 | |
3 | 6 | 2 | 6 | 6 | |
4 | 1 | 3 | 1 | 5 | |
5 | 2 | 3 | 1 | 5 | |
6 | 5 | 3 | 1 | 5 | |
7 | 4 | 4 | 0 | 7 | |
8 | 0 | 4 | 0 | 7 | |
9 | 7 | 4 | 0 | 7 |
Fill missing values by group mean:
df = dt.Frame({'value' : [1, None, None, 2, 3, 1, 3, None, 3],
'name' : ['A','A', 'B','B','B','B', 'C','C','C']})
df
value | name | ||
---|---|---|---|
float64 | str32 | ||
0 | 1 | A | |
1 | NA | A | |
2 | NA | B | |
3 | 2 | B | |
4 | 3 | B | |
5 | 1 | B | |
6 | 3 | C | |
7 | NA | C | |
8 | 3 | C |
# This uses a combination of update and ifelse methods:
df[:,
update(value = ifelse(f.value == None,
mean(f.value),
f.value)),
by('name')]
df
value | name | ||
---|---|---|---|
float64 | str32 | ||
0 | 1 | A | |
1 | 1 | A | |
2 | 2 | B | |
3 | 2 | B | |
4 | 3 | B | |
5 | 1 | B | |
6 | 3 | C | |
7 | 3 | C | |
8 | 3 | C |
Transform and Aggregate on multiple columns¶
Task: get the sum of column a plus the sum of column b, grouped by c and d, and append the result to the dataframe:
df = dt.Frame({'a' : [1,2,3,4,5,6],
'b' : [1,2,3,4,5,6],
'c' : ['q', 'q', 'q', 'q', 'w', 'w'],
'd' : ['z','z','z','o','o','o']})
df
a | b | c | d | ||
---|---|---|---|---|---|
int32 | int32 | str32 | str32 | ||
0 | 1 | 1 | q | z | |
1 | 2 | 2 | q | z | |
2 | 3 | 3 | q | z | |
3 | 4 | 4 | q | o | |
4 | 5 | 5 | w | o | |
5 | 6 | 6 | w | o |
df[:,
update(e = sum(f.a) + sum(f.b)),
by('c', 'd')
]
df
a | b | c | d | e | ||
---|---|---|---|---|---|---|
int32 | int32 | str32 | str32 | int64 | ||
0 | 1 | 1 | q | z | 12 | |
1 | 2 | 2 | q | z | 12 | |
2 | 3 | 3 | q | z | 12 | |
3 | 4 | 4 | q | o | 8 | |
4 | 5 | 5 | w | o | 22 | |
5 | 6 | 6 | w | o | 22 |
Replicate R’s groupby mutate¶
Task: get a ratio by dividing column c by the group sum of the product of columns c and d, grouped by a and b:
df = dt.Frame(dict(a = (1,1,0,1,0),
b = (1,0,0,1,0),
c = (10,5,1,5,10),
d = (3,1,2,1,2))
)
df
a | b | c | d | ||
---|---|---|---|---|---|
int8 | int8 | int32 | int32 | ||
0 | 1 | 1 | 10 | 3 | |
1 | 1 | 0 | 5 | 1 | |
2 | 0 | 0 | 1 | 2 | |
3 | 1 | 1 | 5 | 1 | |
4 | 0 | 0 | 10 | 2 |
df[:,
update(ratio = f.c / sum(f.c * f.d)),
by('a', 'b')
]
df
a | b | c | d | ratio | ||
---|---|---|---|---|---|---|
int8 | int8 | int32 | int32 | float64 | ||
0 | 1 | 1 | 10 | 3 | 0.285714 | |
1 | 1 | 0 | 5 | 1 | 1 | |
2 | 0 | 0 | 1 | 2 | 0.0454545 | |
3 | 1 | 1 | 5 | 1 | 0.142857 | |
4 | 0 | 0 | 10 | 2 | 0.454545 |
Groupby on boolean expressions¶
Conditional sum with groupby¶
Task: sum the data1 column, grouped by key1, for rows where key2 == "one":
df = dt.Frame("""data1 data2 key1 key2
0.361601 0.375297 a one
0.069889 0.809772 a two
1.468194 0.272929 b one
-1.138458 0.865060 b two
-0.268210 1.250340 a one""")
df[:,
sum(f.data1),
by(f.key2 == "one", f.key1)][f.C0 == 1, 1:]
key1 | data1 | ||
---|---|---|---|
str32 | float64 | ||
0 | a | 0.093391 | |
1 | b | 1.46819 |
Conditional sums based on various criteria¶
df = dt.Frame(""" A_id B C
a1 "up" 100
a2 "down" 102
a3 "up" 100
a3 "up" 250
a4 "left" 100
a5 "right" 102""")
df[:,
{"sum_up": sum(f.B == "up"),
"sum_down" : sum(f.B == "down"),
"over_200_up" : sum((f.B == "up") & (f.C > 200))
},
by('A_id')]
A_id | sum_up | sum_down | over_200_up | ||
---|---|---|---|---|---|
str32 | int64 | int64 | int64 | ||
0 | a1 | 1 | 0 | 0 | |
1 | a2 | 0 | 1 | 0 | |
2 | a3 | 2 | 0 | 1 | |
3 | a4 | 0 | 0 | 0 | |
4 | a5 | 0 | 0 | 0 |
More Examples¶
Aggregation on values in a column¶
Task: group by Day and find the minimum Data_Value for elements of type TMIN and the maximum Data_Value for elements of type TMAX:
df = dt.Frame(""" Day Element Data_Value
01-01 TMAX 112
01-01 TMAX 101
01-01 TMIN 60
01-01 TMIN 0
01-01 TMIN 25
01-01 TMAX 113
01-01 TMAX 115
01-01 TMAX 105
01-01 TMAX 111
01-01 TMIN 44
01-01 TMIN 83
01-02 TMAX 70
01-02 TMAX 79
01-02 TMIN 0
01-02 TMIN 60
01-02 TMAX 73
01-02 TMIN 31
01-02 TMIN 26
01-02 TMAX 71
01-02 TMIN 26""")
df[:,
{"TMAX": max(ifelse(f.Element=="TMAX", f.Data_Value, None)),
"TMIN": min(ifelse(f.Element=="TMIN", f.Data_Value, None))},
by(f.Day)]
Day | TMAX | TMIN | ||
---|---|---|---|---|
str32 | int32 | int32 | ||
0 | 01-01 | 115 | 0 | |
1 | 01-02 | 79 | 0 |
Group-by and conditional sum and add back to data frame¶
Task: sum the Count value for each ID where Num is 17 or 12 and Letter is 'D', then add the result back to the original dataframe as column 'Total':
df = dt.Frame(""" ID Num Letter Count
1 17 D 1
1 12 D 2
1 13 D 3
2 17 D 4
2 12 A 5
2 16 D 1
3 16 D 1""")
expression = ((f.Num==17) | (f.Num==12)) & (f.Letter == "D")
df[:, update(Total = sum(expression * f.Count)),
by(f.ID)]
df
ID | Num | Letter | Count | Total | ||
---|---|---|---|---|---|---|
int32 | int32 | str32 | int32 | int64 | ||
0 | 1 | 17 | D | 1 | 3 | |
1 | 1 | 12 | D | 2 | 3 | |
2 | 1 | 13 | D | 3 | 3 | |
3 | 2 | 17 | D | 4 | 4 | |
4 | 2 | 12 | A | 5 | 4 | |
5 | 2 | 16 | D | 1 | 4 | |
6 | 3 | 16 | D | 1 | 0 |
Indexing with multiple min and max in one aggregate¶
Task: find col1 where col2 is max, col2 where col3 is min, and col1 where col3 is max:
df = dt.Frame({
"id" : [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
"col1" : [1, 3, 5, 2, 5, 3, 6, 3, 67, 7],
"col2" : [4, 6, 8, 3, 65, 3, 5, 4, 4, 7],
"col3" : [34, 64, 53, 5, 6, 2, 4, 6, 4, 67],
})
df
id | col1 | col2 | col3 | ||
---|---|---|---|---|---|
int32 | int32 | int32 | int32 | ||
0 | 1 | 1 | 4 | 34 | |
1 | 1 | 3 | 6 | 64 | |
2 | 1 | 5 | 8 | 53 | |
3 | 2 | 2 | 3 | 5 | |
4 | 2 | 5 | 65 | 6 | |
5 | 2 | 3 | 3 | 2 | |
6 | 2 | 6 | 5 | 4 | |
7 | 3 | 3 | 4 | 6 | |
8 | 3 | 67 | 4 | 4 | |
9 | 3 | 7 | 7 | 67 |
df[:,
{'col1' : max(ifelse(f.col2 == max(f.col2), f.col1, None)),
'col2' : max(ifelse(f.col3 == min(f.col3), f.col2, None)),
'col3' : max(ifelse(f.col3 == max(f.col3), f.col1, None))
},
by('id')]
id | col1 | col2 | col3 | ||
---|---|---|---|---|---|
int32 | int32 | int32 | int32 | ||
0 | 1 | 5 | 4 | 3 | |
1 | 2 | 5 | 3 | 5 | |
2 | 3 | 7 | 4 | 7 |
Filter rows based on aggregate value¶
Task: for every word, find the tag that has the highest count:
df = dt.Frame("""word tag count
a S 30
the S 20
a T 60
an T 5
the T 10""")
# The solution builds on the knowledge that sorting
# while grouping sorts within each group.
df[0, :, by('word'), sort(-f.count)]
word | tag | count | ||
---|---|---|---|---|
str32 | str32 | int32 | ||
0 | a | T | 60 | |
1 | an | T | 5 | |
2 | the | S | 20 |
Get the rows where the value column is at its minimum, and rename columns:
df = dt.Frame({"category": ["A"]*3 + ["B"]*3,
"date": ["9/6/2016", "10/6/2016",
"11/6/2016", "9/7/2016",
"10/7/2016", "11/7/2016"],
"value": [7,8,9,10,1,2]})
df
category | date | value | ||
---|---|---|---|---|
str32 | str32 | int32 | ||
0 | A | 9/6/2016 | 7 | |
1 | A | 10/6/2016 | 8 | |
2 | A | 11/6/2016 | 9 | |
3 | B | 9/7/2016 | 10 | |
4 | B | 10/7/2016 | 1 | |
5 | B | 11/7/2016 | 2 |
df[0,
{"value_date": f.date,
"value_min": f.value},
by("category"),
sort('value')]
category | value_date | value_min | ||
---|---|---|---|---|
str32 | str32 | int32 | ||
0 | A | 9/6/2016 | 7 | |
1 | B | 10/7/2016 | 1 |
Get the rows where the value column is at its maximum, and rename columns:
df[0,
{"value_date": f.date,
"value_max": f.value},
by("category"),
sort(-f.value)]
category | value_date | value_max | ||
---|---|---|---|---|
str32 | str32 | int32 | ||
0 | A | 11/6/2016 | 9 | |
1 | B | 9/7/2016 | 10 |
Get the average of the last three instances per group:
import random
random.seed(3)
df = dt.Frame({"Student": ["Bob", "Bill",
"Bob", "Bob",
"Bill","Joe",
"Joe", "Bill",
"Bob", "Joe",],
"Score": random.sample(range(10,30), 10)})
df
Student | Score | ||
---|---|---|---|
str32 | int32 | ||
0 | Bob | 17 | |
1 | Bill | 28 | |
2 | Bob | 27 | |
3 | Bob | 14 | |
4 | Bill | 21 | |
5 | Joe | 24 | |
6 | Joe | 19 | |
7 | Bill | 29 | |
8 | Bob | 20 | |
9 | Joe | 23 |
df[-3:, mean(f[:]), by(f.Student)]
Student | Score | ||
---|---|---|---|
str32 | float64 | ||
0 | Bill | 26 | |
1 | Bob | 20.3333 | |
2 | Joe | 22 |
Group by on a condition¶
Get the sum of Amount for Number in the ranges 1 to 4 and 5 and above:
df = dt.Frame("""Number, Amount
1, 5
2, 10
3, 11
4, 3
5, 5
6, 8
7, 9
8, 6""")
df[:, sum(f.Amount), by(ifelse(f.Number>=5, "B", "A"))]
C0 | Amount | ||
---|---|---|---|
str32 | int64 | ||
0 | A | 29 | |
1 | B | 28 |
Row Functions¶
Functions rowall(), rowany(), rowcount(), rowfirst(), rowlast(), rowmax(), rowmean(), rowmin(), rowsd() and rowsum() aggregate across rows instead of columns and return a single column. They are equivalent to pandas aggregation functions with the parameter (axis=1).
These functions make it easy to compute rowwise aggregations. For instance, to get the sum of columns A, B, C and D you could write f.A + f.B + f.C + f.D; rowsum makes it easier – dt.rowsum(f['A':'D']).
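A minimal sketch of that equivalence, on a hypothetical frame DT_rows (one assumption worth noting: rowsum() skips missing values, whereas + propagates them):
from datatable import dt, f

DT_rows = dt.Frame(A=[1, 2], B=[3, 4], C=[5, 6], D=[7, 8])
DT_rows[:, f.A + f.B + f.C + f.D]   # manual rowwise sum
DT_rows[:, dt.rowsum(f['A':'D'])]   # same totals: 16 and 20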
rowall, rowany¶
These work only on boolean expressions – rowall() checks if all the values in the row are True, while rowany() checks if any value in the row is True. They are similar to pandas' all or any with the parameter (axis=1). A single boolean column is returned:
from datatable import dt, f, by
df = dt.Frame({'A': [True, True], 'B': [True, False]})
df
A | B | ||
---|---|---|---|
bool8 | bool8 | ||
0 | 1 | 1 | |
1 | 1 | 0 |
# rowall :
df[:, dt.rowall(f[:])]
C0 | ||
---|---|---|
bool8 | ||
0 | 1 | |
1 | 0 |
# rowany :
df[:, dt.rowany(f[:])]
C0 | ||
---|---|---|
bool8 | ||
0 | 1 | |
1 | 1 |
The single boolean column that is returned can be very handy when filtering in the i section.
Filter for rows where at least one cell is greater than 0:
df = dt.Frame({'a': [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
'b': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'c': [0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0],
'd': [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
'e': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'f': [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]})
df
a | b | c | d | e | f | ||
---|---|---|---|---|---|---|---|
int8 | int8 | int32 | int8 | int8 | int8 | ||
0 | 0 | 0 | 0 | 0 | 0 | 0 | |
1 | 0 | 0 | 0 | 0 | 0 | 1 | |
2 | 0 | 0 | 0 | 0 | 0 | 0 | |
3 | 0 | 0 | 0 | 0 | 0 | 0 | |
4 | 0 | 0 | 0 | 0 | 0 | 0 | |
5 | 0 | 0 | 5 | 0 | 0 | 0 | |
6 | 1 | 0 | 0 | 0 | 0 | 0 | |
7 | 0 | 0 | 0 | 0 | 0 | 0 | |
8 | 0 | 0 | 0 | 1 | 0 | 0 | |
9 | 1 | 0 | 0 | 0 | 0 | 0 | |
10 | 0 | 0 | 0 | 0 | 0 | 0 |
df[dt.rowany(f[:] > 0), :]
a | b | c | d | e | f | ||
---|---|---|---|---|---|---|---|
int8 | int8 | int32 | int8 | int8 | int8 | ||
0 | 0 | 0 | 0 | 0 | 0 | 1 | |
1 | 0 | 0 | 5 | 0 | 0 | 0 | |
2 | 1 | 0 | 0 | 0 | 0 | 0 | |
3 | 0 | 0 | 0 | 1 | 0 | 0 | |
4 | 1 | 0 | 0 | 0 | 0 | 0 |
Filter for rows where all the cells are 0:
df[dt.rowall(f[:] == 0), :]
a | b | c | d | e | f | ||
---|---|---|---|---|---|---|---|
int8 | int8 | int32 | int8 | int8 | int8 | ||
0 | 0 | 0 | 0 | 0 | 0 | 0 | |
1 | 0 | 0 | 0 | 0 | 0 | 0 | |
2 | 0 | 0 | 0 | 0 | 0 | 0 | |
3 | 0 | 0 | 0 | 0 | 0 | 0 | |
4 | 0 | 0 | 0 | 0 | 0 | 0 | |
5 | 0 | 0 | 0 | 0 | 0 | 0 |
Filter for rows where all the columns’ values are the same:
df = dt.Frame("""Name A1 A2 A3 A4
deff 0 0 0 0
def1 0 1 0 0
def2 0 0 0 0
def3 1 0 0 0
def4 0 0 0 0""")
# compare the first integer column with the rest,
# use rowall to find rows where all comparisons are True,
# and filter with the resulting boolean
df[dt.rowall(f[1] == f[1:]), :]
Name | A1 | A2 | A3 | A4 | ||
---|---|---|---|---|---|---|
str32 | bool8 | bool8 | bool8 | bool8 | ||
0 | deff | 0 | 0 | 0 | 0 | |
1 | def2 | 0 | 0 | 0 | 0 | |
2 | def4 | 0 | 0 | 0 | 0 |
Filter for rows where the values are increasing:
df = dt.Frame({"A": [1, 2, 6, 4],
"B": [2, 4, 5, 6],
"C": [3, 5, 4, 7],
"D": [4, -3, 3, 8],
"E": [5, 1, 2, 9]})
df
A | B | C | D | E | ||
---|---|---|---|---|---|---|
int32 | int32 | int32 | int32 | int32 | ||
0 | 1 | 2 | 3 | 4 | 5 | |
1 | 2 | 4 | 5 | -3 | 1 | |
2 | 6 | 5 | 4 | 3 | 2 | |
3 | 4 | 6 | 7 | 8 | 9 |
df[dt.rowall(f[1:] >= f[:-1]), :]
A | B | C | D | E | ||
---|---|---|---|---|---|---|
int32 | int32 | int32 | int32 | int32 | ||
0 | 1 | 2 | 3 | 4 | 5 | |
1 | 4 | 6 | 7 | 8 | 9 |
rowfirst, rowlast¶
These look for the first and last non-missing value in a row respectively:
df = dt.Frame({'A':[1, None, None, None],
'B':[None, 3, 4, None],
'C':[2, None, 5, None]})
df
A | B | C | ||
---|---|---|---|---|
int8 | int32 | int32 | ||
0 | 1 | NA | 2 | |
1 | NA | 3 | NA | |
2 | NA | 4 | 5 | |
3 | NA | NA | NA |
# rowfirst :
df[:, dt.rowfirst(f[:])]
C0 | ||
---|---|---|
int32 | ||
0 | 1 | |
1 | 3 | |
2 | 4 | |
3 | NA |
# rowlast :
df[:, dt.rowlast(f[:])]
C0 | ||
---|---|---|
int32 | ||
0 | 2 | |
1 | 3 | |
2 | 5 | |
3 | NA |
Get rows where the last value in the row is greater than the first value in the row:
df = dt.Frame({'a': [50, 40, 30, 20, 10],
'b': [60, 10, 40, 0, 5],
'c': [40, 30, 20, 30, 40]})
df
a | b | c | ||
---|---|---|---|---|
int32 | int32 | int32 | ||
0 | 50 | 60 | 40 | |
1 | 40 | 10 | 30 | |
2 | 30 | 40 | 20 | |
3 | 20 | 0 | 30 | |
4 | 10 | 5 | 40 |
df[dt.rowlast(f[:]) > dt.rowfirst(f[:]), :]
a | b | c | ||
---|---|---|---|---|
int32 | int32 | int32 | ||
0 | 20 | 0 | 30 | |
1 | 10 | 5 | 40 |
rowmax, rowmin¶
These get the maximum and minimum values per row, respectively:
df = dt.Frame({"C": [2, 5, 30, 20, 10],
"D": [10, 8, 20, 20, 1]})
df
C | D | ||
---|---|---|---|
int32 | int32 | ||
0 | 2 | 10 | |
1 | 5 | 8 | |
2 | 30 | 20 | |
3 | 20 | 20 | |
4 | 10 | 1 |
# rowmax
df[:, dt.rowmax(f[:])]
C0 | ||
---|---|---|
int32 | ||
0 | 10 | |
1 | 8 | |
2 | 30 | |
3 | 20 | |
4 | 10 |
# rowmin
df[:, dt.rowmin(f[:])]
C0 | ||
---|---|---|
int32 | ||
0 | 2 | |
1 | 5 | |
2 | 20 | |
3 | 20 | |
4 | 1 |
Find the difference between the maximum and minimum of each row:
df = dt.Frame("""Value1 Value2 Value3 Value4
5 4 3 2
4 3 2 1
3 3 5 1""")
df[:, dt.update(max_min = dt.rowmax(f[:]) - dt.rowmin(f[:]))]
df
Value1 | Value2 | Value3 | Value4 | max_min | ||
---|---|---|---|---|---|---|
int32 | int32 | int32 | int32 | int32 | ||
0 | 5 | 4 | 3 | 2 | 3 | |
1 | 4 | 3 | 2 | 1 | 3 | |
2 | 3 | 3 | 5 | 1 | 4 |
rowsum, rowmean, rowcount, rowsd¶
rowsum() and rowmean() compute the sum and the mean of each row, respectively; rowcount() counts the number of non-missing values in a row, while rowsd() computes the standard deviation across a row.
Get the count, sum, mean and standard deviation for each row:
df = dt.Frame("""ORD A B C D
198 23 45 NaN 12
138 25 NaN NaN 62
625 52 36 49 35
457 NaN NaN NaN 82
626 52 32 39 45""")
df[:, dt.update(rowcount = dt.rowcount(f[:]),
rowsum = dt.rowsum(f[:]),
rowmean = dt.rowmean(f[:]),
rowsd = dt.rowsd(f[:])
)]
df
ORD | A | B | C | D | rowcount | rowsum | rowmean | rowsd | ||
---|---|---|---|---|---|---|---|---|---|---|
int32 | float64 | float64 | float64 | int32 | int32 | float64 | float64 | float64 | ||
0 | 198 | 23 | 45 | NA | 12 | 4 | 278 | 69.5 | 86.7583 | |
1 | 138 | 25 | NA | NA | 62 | 3 | 225 | 75 | 57.6108 | |
2 | 625 | 52 | 36 | 49 | 35 | 5 | 797 | 159.4 | 260.389 | |
3 | 457 | NA | NA | NA | 82 | 2 | 539 | 269.5 | 265.165 | |
4 | 626 | 52 | 32 | 39 | 45 | 5 | 794 | 158.8 | 261.277 |
Find rows where the number of nulls is greater than 3:
df = dt.Frame({'city': ["city1", "city2", "city3", "city4"],
'state': ["state1", "state2", "state3", "state4"],
'2005': [144, 205, 123, None],
'2006': [173, 211, 123, 124],
'2007': [None, None, None, None],
'2008': [None, 206, None, None],
'2009': [None, None, 124, 123],
'2010': [128, 273, None, None]})
df
city | state | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | ||
---|---|---|---|---|---|---|---|---|---|
str32 | str32 | int32 | int32 | void | int32 | int32 | int32 | ||
0 | city1 | state1 | 144 | 173 | NA | NA | NA | 128 | |
1 | city2 | state2 | 205 | 211 | NA | 206 | NA | 273 | |
2 | city3 | state3 | 123 | 123 | NA | NA | 124 | NA | |
3 | city4 | state4 | NA | 124 | NA | NA | 123 | NA |
# get columns that are null, then sum on the rows
# and finally filter where the sum is greater than 3
df[dt.rowsum(dt.isna(f[:])) > 3, :]
city | state | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | ||
---|---|---|---|---|---|---|---|---|---|
str32 | str32 | int32 | int32 | void | int32 | int32 | int32 | ||
0 | city4 | state4 | NA | 124 | NA | NA | 123 | NA |
Rowwise sum of the float columns:
df = dt.Frame("""ID W_1 W_2 W_3
1 0.1 0.2 0.3
1 0.2 0.4 0.5
2 0.3 0.3 0.2
2 0.1 0.3 0.4
2 0.2 0.0 0.5
1 0.5 0.3 0.2
1 0.4 0.2 0.1""")
df[:, dt.update(sum_floats = dt.rowsum(f[float]))]
df
ID | W_1 | W_2 | W_3 | sum_floats | ||
---|---|---|---|---|---|---|
int32 | float64 | float64 | float64 | float64 | ||
0 | 1 | 0.1 | 0.2 | 0.3 | 0.6 | |
1 | 1 | 0.2 | 0.4 | 0.5 | 1.1 | |
2 | 2 | 0.3 | 0.3 | 0.2 | 0.8 | |
3 | 2 | 0.1 | 0.3 | 0.4 | 0.8 | |
4 | 2 | 0.2 | 0 | 0.5 | 0.7 | |
5 | 1 | 0.5 | 0.3 | 0.2 | 1 | |
6 | 1 | 0.4 | 0.2 | 0.1 | 0.7 |
More Examples¶
Divide columns A, B, C, D by the total column, square the results, and sum rowwise:
df = dt.Frame({'A': [2, 3],
'B': [1, 2],
'C': [0, 1],
'D': [1, 0],
'total': [4, 6]})
df
A | B | C | D | total | ||
---|---|---|---|---|---|---|
int32 | int32 | int8 | int8 | int32 | ||
0 | 2 | 1 | 0 | 1 | 4 | |
1 | 3 | 2 | 1 | 0 | 6 |
df[:, update(result = dt.rowsum((f[:-1]/f[-1])**2))]
df
A | B | C | D | total | result | ||
---|---|---|---|---|---|---|---|
int32 | int32 | int8 | int8 | int32 | float64 | ||
0 | 2 | 1 | 0 | 1 | 4 | 0.375 | |
1 | 3 | 2 | 1 | 0 | 6 | 0.388889 |
Get the row sum of the COUNT columns:
df = dt.Frame("""USER OBSERVATION COUNT.1 COUNT.2 COUNT.3
A 1 0 1 1
A 2 1 1 2
A 3 3 0 0""")
columns = [f[column] for column in df.names if column.startswith("COUNT")]
df[:, update(total = dt.rowsum(columns))]
df
USER | OBSERVATION | COUNT.1 | COUNT.2 | COUNT.3 | total | ||
---|---|---|---|---|---|---|---|
str32 | int32 | int32 | bool8 | int32 | int32 | ||
0 | A | 1 | 0 | 1 | 1 | 2 | |
1 | A | 2 | 1 | 1 | 2 | 4 | |
2 | A | 3 | 3 | 0 | 0 | 3 |
Sum selected columns rowwise:
df = dt.Frame({'location' : ("a","b","c","d"),
'v1' : (3,4,3,3),
'v2' : (4,56,3,88),
'v3' : (7,6,2,9),
'v4': (7,6,1,9),
'v5' : (4,4,7,9),
'v6' : (2,8,4,6)})
df
location | v1 | v2 | v3 | v4 | v5 | v6 | ||
---|---|---|---|---|---|---|---|---|
str32 | int32 | int32 | int32 | int32 | int32 | int32 | ||
0 | a | 3 | 4 | 7 | 7 | 4 | 2 | |
1 | b | 4 | 56 | 6 | 6 | 4 | 8 | |
2 | c | 3 | 3 | 2 | 1 | 7 | 4 | |
3 | d | 3 | 88 | 9 | 9 | 9 | 6 |
df[:, {"x1": dt.rowsum(f[1:4]), "x2": dt.rowsum(f[4:])}]
x1 | x2 | ||
---|---|---|---|
int32 | int32 | ||
0 | 14 | 13 | |
1 | 66 | 18 | |
2 | 8 | 12 | |
3 | 100 | 24 |
Comparison with pandas¶
A lot of potential datatable users are likely to have some familiarity with pandas; as such, this page provides some examples of how various pandas operations can be performed within datatable. The datatable module emphasizes speed and big-data support (an area where pandas struggles); it also has an expressive and concise syntax, which makes datatable useful for small datasets as well.
Note: in pandas there are two fundamental data structures: the Series and the DataFrame. In datatable, there is only one fundamental data structure — the Frame. Most of the comparisons will be between the pandas DataFrame and the datatable Frame.
import pandas as pd
import numpy as np
from datatable import dt, f, by, g, join, sort, update, ifelse
data = {"A": [1, 2, 3, 4, 5],
"B": [4, 5, 6, 7, 8],
"C": [7, 8, 9, 10, 11],
"D": [5, 7, 2, 9, -1]}
# datatable
DT = dt.Frame(data)
# pandas
df = pd.DataFrame(data)
Row and Column Selection¶
pandas | datatable |
---|---|
Select a single row | |
df.loc[2]
| DT[2, :]
|
Select several rows by their indices | |
df.iloc[[2, 3, 4]]
| DT[[2, 3, 4], :]
|
Select a slice of rows by position | |
df.iloc[2:5]
# or
df.iloc[range(2, 5)]
| DT[2:5, :]
# or
DT[range(2, 5), :]
|
Select every second row | |
df.iloc[::2]
| DT[::2, :]
|
Select rows using a boolean mask | |
df.iloc[[True, True, False, False, True]]
| DT[[True, True, False, False, True], :]
|
Select rows on a condition | |
df.loc[df['A']>3]
| DT[f.A>3, :]
|
Select rows on multiple conditions, using OR | |
df.loc[(df['A'] > 3) | (df['B']<5)]
| DT[(f.A>3) | (f.B<5), :]
|
Select rows on multiple conditions, using AND | |
df.loc[(df['A'] > 3) & (df['B']<8)]
| DT[(f.A>3) & (f.B<8), :]
|
Select a single column by column name | |
df['A']
df.loc[:, 'A']
| DT['A']
DT[:, 'A']
|
Select a single column by position | |
df.iloc[:, 1]
| DT[1]
DT[:, 1]
|
Select multiple columns by column names | |
df.loc[:, ["A", "B"]]
| DT[:, ["A", "B"]]
|
Select multiple columns by position | |
df.iloc[:, [0, 1]]
| DT[:, [0, 1]]
|
Select multiple columns by slicing names | |
df.loc[:, "A":"B"]
| DT[:, "A":"B"]
|
Select multiple columns by slicing positions | |
df.iloc[:, 1:3]
| DT[:, 1:3]
|
Select columns by boolean mask | |
df.loc[:, [True,False,False,True]]
| DT[:, [True,False,False,True]]
|
Select multiple rows and columns | |
df.loc[2:5, "A":"B"]
| DT[2:5, "A":"B"]
|
Select multiple rows and columns by position | |
df.iloc[2:5, :2]
| DT[2:5, :2]
|
Select a single value (returns a scalar) | |
df.at[2, 'A']
df.loc[2, 'A']
| DT[2, "A"]
|
Select a single value by position | |
df.iat[2, 0]
df.iloc[2, 0]
| DT[2, 0]
|
Select a single value, return as Series | |
df.loc[2, ["A"]]
| DT[2, ["A"]]
|
Select a single value (return as Series/Frame) by position | |
df.iloc[2, [0]]
| DT[2, [0]]
|
In pandas every frame has a row index, and if a filtration is executed, the row numbers are kept:
df.loc[df['A'] > 3]
   A  B   C  D
3  4  7  10  9
4  5  8  11 -1
| Datatable has no notion of a row index; the row numbers displayed are just for convenience:
DT[f.A > 3, :]
|
In pandas, the index can be numbers, characters, or intervals:
df1 = df.set_index(pd.Index(['a','b','c','d','e']))
   A  B   C  D
a  1  4   7  5
b  2  5   8  7
c  3  6   9  2
d  4  7  10  9
e  5  8  11 -1
df1.loc["a":"c"]
   A  B  C  D
a  1  4  7  5
b  2  5  8  7
c  3  6  9  2
| The closest datatable analogue is a key; however, a keyed frame cannot be sliced with string labels:
data = {"A": [1, 2, 3, 4, 5],
        "B": [4, 5, 6, 7, 8],
        "C": [7, 8, 9, 10, 11],
        "D": [5, 7, 2, 9, -1],
        "E": ['a','b','c','d','e']}
DT1 = dt.Frame(data)
DT1.key = 'E'
DT1
DT1["a":"c", :]  # this will fail
TypeError: A string slice cannot be used as a row selector
|
Pandas' index can also be set from an existing column:
df1 = df.set_index('C')
    A  B  D
C
7   1  4  5
8   2  5  7
9   3  6  2
10  4  7  9
11  5  8 -1
Selecting with .loc then uses the index labels:
df1.loc[7]
A    1
B    4
D    5
Name: 7, dtype: int64
However, selecting by position with df.iloc[7] fails, since there are only 5 rows:
IndexError: single positional indexer is out-of-bounds
| A datatable key does not change row numbering, and row selection stays positional:
DT.key = 'C'
DT
DT[7, :]  # this will fail
ValueError: Row 7 is invalid for a frame with 5 rows
|
Add new/update existing columns¶
pandas | datatable |
---|---|
Add a new column with a scalar value | |
df['new_col'] = 2
df = df.assign(new_col = 2)
| DT['new_col'] = 2
DT[:, update(new_col=2)]
|
Add a new column with a list of values | |
df['new_col'] = range(len(df))
df = df.assign(new_col = range(len(df)))
| DT['new_col'] = range(DT.nrows)
DT[:, update(new_col = range(DT.nrows))]
|
Update a single value | |
df.at[2, 'new_col'] = 200
| DT[2, 'new_col'] = 200
|
Update an entire column | |
df.loc[:, "A"] = 5 # or
df["A"] = 5
df = df.assign(A = 5)
| DT["A"] = 5
DT[:, update(A = 5)]
|
Update multiple columns | |
df.loc[:, "A":"C"] = np.arange(15).reshape(-1,3)
| DT[:, "A":"C"] = np.arange(15).reshape(-1,3)
|
Note
In datatable, the update() method is in-place; reassignment to the Frame DT is not required.
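A tiny sketch of that in-place behavior, using a hypothetical throwaway frame tmp:
from datatable import dt, f, update

tmp = dt.Frame(A=[1, 2, 3])
tmp[:, update(B = f.A * 10)]   # mutates tmp directly; no "tmp = ..." needed
tmp.names                      # ('A', 'B')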
Rename columns¶
pandas | datatable |
---|---|
Rename a column | |
df = df.rename(columns={"A": "col_A"})
| DT.names = {"A": "col_A"}
|
Rename multiple columns | |
df = df.rename(columns={"A": "col_A", "B": "col_B"})
| DT.names = {"A": "col_A", "B": "col_B"}
|
In datatable, you can select and rename columns at the same time, by passing a dictionary of f-expressions into the j section:
# datatable
DT[:, {"A": f.A, "box": f.B, "C": f.C, "D": f.D * 2}]
A | box | C | D | ||
---|---|---|---|---|---|
int32 | int32 | int32 | int32 | ||
0 | 1 | 4 | 7 | 10 | |
1 | 2 | 5 | 8 | 14 | |
2 | 3 | 6 | 9 | 4 | |
3 | 4 | 7 | 10 | 18 | |
4 | 5 | 8 | 11 | -2 |
Delete Columns¶
pandas | datatable |
---|---|
Delete a column | |
del df['B']
| del DT['B']
|
Same as above | |
df = df.drop('B', axis=1)
| DT = DT[:, f[:].remove(f.B)]
|
Remove multiple columns | |
df = df.drop(['B', 'C'], axis=1)
del DT[:, ['B', 'C']] # or
DT = DT[:, f[:].remove([f.B, f.C])]
|
Sorting¶
pandas | datatable |
---|---|
Sort by a column – default ascending | |
df.sort_values('A')
| DT.sort('A') # or
DT[:, : , sort('A')]
|
Sort by a column – descending | |
df.sort_values('A',ascending=False)
| DT.sort(-f.A) # or
DT[:, :, sort(-f.A)] # or
DT[:, :, sort('A', reverse=True)]
|
Sort by multiple columns – default ascending | |
df.sort_values(['A', 'C'])
| DT.sort('A', 'C') # or
DT[:, :, sort('A', 'C')]
|
Sort by multiple columns – both descending | |
df.sort_values(['A','C'],ascending=[False,False])
| DT.sort(-f.A, -f.C) # or
DT[:, :, sort(-f.A, -f.C)] # or
DT[:, :, sort('A', 'C', reverse=[True, True])]
|
Sort by multiple columns – different sort directions | |
df.sort_values(['A', 'C'], ascending=[True, False])
| DT.sort(f.A, -f.C) # or
DT[:, :, sort(f.A, -f.C)] # or
DT[:, :, sort('A', 'C', reverse=[False, True])]
|
Note
By default, pandas puts NAs last in the sorted data, while datatable puts them first.
Note
In pandas, there is an option to sort with a Callable; this option is not supported in datatable.
Note
In pandas, you can sort on the rows or columns; in datatable sorting is column-wise only.
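A small sketch of the NA-placement difference noted above (DT_na is a hypothetical frame):
import pandas as pd
from datatable import dt

DT_na = dt.Frame(A=[3, None, 1])
DT_na.sort("A")                         # datatable places the NA row first
pd.Series([3, None, 1]).sort_values()   # pandas places NaN last by default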
Grouping and Aggregation¶
data = {"a": [1, 1, 2, 1, 2],
"b": [2, 20, 30, 2, 4],
"c": [3, 30, 50, 33, 50]}
# pandas
df = pd.DataFrame(data)
# datatable
DT = dt.Frame(data)
DT
a | b | c | ||
---|---|---|---|---|
int32 | int32 | int32 | ||
0 | 1 | 2 | 3 | |
1 | 1 | 20 | 30 | |
2 | 2 | 30 | 50 | |
3 | 1 | 2 | 33 | |
4 | 2 | 4 | 50 |
pandas | datatable |
---|---|
Group by column | |
df.groupby("a").agg("sum")
| DT[:, dt.sum(f[:]), by("a")]
|
Group by multiple columns | |
df.groupby(["a", "b"]).agg("sum")
| DT[:, dt.sum(f.c), by("a", "b")]
|
Get size per group | |
df.groupby("a").size()
| DT[:, dt.count(), by("a")]
|
Grouping with multiple aggregation functions | |
df.groupby("a").agg({"b": "sum", "c": "mean"})
| DT[:, {"b": dt.sum(f.b), "c": dt.mean(f.c)}, by("a")]
|
Get the first row per group | |
df.groupby("a").first()
| DT[0, :, by("a")]
|
Get the last row per group | |
df.groupby('a').last()
| DT[-1, :, by("a")]
|
Get the first two rows per group | |
df.groupby("a").head(2)
| DT[:2, :, by("a")]
|
Get the last two rows per group | |
df.groupby("a").tail(2)
| DT[-2:, :, by("a")]
|
Transformations within groups in pandas are done using the transform method:
# pandas
grouping = df.groupby("a")["b"].transform("min")
df.assign(min_b=grouping)
In datatable, transformations occur within the j section; in the presence of by(), the computations within j are per group:
# datatable
DT[:, f[:].extend({"min_b": dt.min(f.b)}), by("a")]
a | b | c | min_b | ||
---|---|---|---|---|---|
int32 | int32 | int32 | int32 | ||
0 | 1 | 2 | 3 | 2 | |
1 | 1 | 20 | 30 | 2 | |
2 | 1 | 2 | 33 | 2 | |
3 | 2 | 30 | 50 | 4 | |
4 | 2 | 4 | 50 | 4 |
Note that the result above is sorted by the grouping column. If you want the data to maintain the same shape as the source data, then update() is a better option (and usually faster):
# datatable
DT[:, update(min_b = dt.min(f.b)), by("a")]
DT
a | b | c | min_b | ||
---|---|---|---|---|---|
int32 | int32 | int32 | int32 | ||
0 | 1 | 2 | 3 | 2 | |
1 | 1 | 20 | 30 | 2 | |
2 | 2 | 30 | 50 | 4 | |
3 | 1 | 2 | 33 | 2 | |
4 | 2 | 4 | 50 | 4 |
In pandas, some computations might require creating the column first before aggregation within a groupby. Take the example below, where we need to calculate the revenue per group:
data = {'shop': ['A', 'B', 'A'],
'item_price': [123, 921, 28],
'item_sold': [1, 2, 4]}
df1 = pd.DataFrame(data) # pandas
DT1 = dt.Frame(data) # datatable
DT1
shop | item_price | item_sold | ||
---|---|---|---|---|
str32 | int32 | int32 | ||
0 | A | 123 | 1 | |
1 | B | 921 | 2 | |
2 | A | 28 | 4 |
To get the total revenue, we first need to create a revenue column, then sum it in the groupby:
# pandas
df1['revenue'] = df1['item_price'] * df1['item_sold']
df1.groupby("shop")['revenue'].sum().reset_index()
In datatable, there is no need to create a temporary column; you can nest your computations in the j section, and they will be executed per group:
# datatable
DT1[:, {"revenue": dt.sum(f.item_price * f.item_sold)}, by("shop")]
shop | revenue | ||
---|---|---|---|
str32 | int64 | ||
0 | A | 235 | |
1 | B | 1842 |
You can learn more about the by() function at the Grouping with by() documentation.
Concatenate¶
In pandas you can combine multiple dataframes using the concat() function; the concatenation is based on the indices:
# pandas
df1 = pd.DataFrame({"A": ["a", "a", "a"], "B": range(3)})
df2 = pd.DataFrame({"A": ["b", "b", "b"], "B": range(4, 7)})
By default, pandas concatenates the rows, with one dataframe on top of the other:
pd.concat([df1, df2], axis = 0)
The same functionality can be replicated in datatable using the rbind() function:
# datatable
DT1 = dt.Frame(df1)
DT2 = dt.Frame(df2)
dt.rbind(DT1, DT2)
A | B | ||
---|---|---|---|
str32 | int64 | ||
0 | a | 0 | |
1 | a | 1 | |
2 | a | 2 | |
3 | b | 4 | |
4 | b | 5 | |
5 | b | 6 |
Notice how in pandas the indices are preserved (you can get rid of the indices with the ignore_index argument), whereas in datatable there are no indices to carry along.
To combine data across the columns in pandas, you set the axis argument to 1:
# pandas
df1 = pd.DataFrame({"A": ["a", "a", "a"], "B": range(3)})
df2 = pd.DataFrame({"C": ["b", "b", "b"], "D": range(4, 7)})
df3 = pd.DataFrame({"E": ["c", "c", "c"], "F": range(7, 10)})
pd.concat([df1, df2, df3], axis = 1)
In datatable, you combine frames along the columns using the cbind() function:
DT1 = dt.Frame(df1)
DT2 = dt.Frame(df2)
DT3 = dt.Frame(df3)
dt.cbind([DT1, DT2, DT3])
A | B | C | D | E | F | ||
---|---|---|---|---|---|---|---|
str32 | int64 | str32 | int64 | str32 | int64 | ||
0 | a | 0 | b | 4 | c | 7 | |
1 | a | 1 | b | 5 | c | 8 | |
2 | a | 2 | b | 6 | c | 9 |
In pandas, if you concatenate dataframes along the rows, and the columns do not match, a dataframe of all the columns is returned, with null values for the missing rows:
# pandas
pd.concat([df1, df2, df3], axis = 0)
In datatable, if you concatenate along the rows and the columns in the frames do not match, you get an error message; you can however force the row combination by passing force=True:
# datatable
dt.rbind([DT1, DT2, DT3], force=True)
A | B | C | D | E | F | ||
---|---|---|---|---|---|---|---|
str32 | int64 | str32 | int64 | str32 | int64 | ||
0 | a | 0 | NA | NA | NA | NA | |
1 | a | 1 | NA | NA | NA | NA | |
2 | a | 2 | NA | NA | NA | NA | |
3 | NA | NA | b | 4 | NA | NA | |
4 | NA | NA | b | 5 | NA | NA | |
5 | NA | NA | b | 6 | NA | NA | |
6 | NA | NA | NA | NA | c | 7 | |
7 | NA | NA | NA | NA | c | 8 | |
8 | NA | NA | NA | NA | c | 9 |
Join/merge¶
pandas has a variety of options for joining dataframes, using the join or merge methods; in datatable, only the left join is possible, and there are certain limitations. You have to set a key on the dataframe to be joined, and for that, the keyed columns must be unique. The main function in datatable for joining dataframes based on column values is the join() function. As such, our comparison will be limited to left joins only.
In pandas, you can join dataframes easily with the merge method:
df1 = pd.DataFrame({"x" : ["b"]*3 + ["a"]*3 + ["c"]*3,
"y" : [1, 3, 6] * 3,
"v" : range(1, 10)})
df2 = pd.DataFrame({"x": ('c','b'),
"v": (8,7),
"foo": (4,2)})
df1.merge(df2, on="x", how="left")
In datatable, there are currently some limitations. First, the dataframe being joined must be keyed. Second, the values in the column(s) used as the joining key(s) must be unique, otherwise the keying operation will fail. Third, the join columns must have the same name in both frames.
DT1 = dt.Frame(df1)
DT2 = dt.Frame(df2)
# set key on DT2
DT2.key = 'x'
DT1[:, :, join(DT2)]
x | y | v | v.0 | foo | ||
---|---|---|---|---|---|---|
str32 | int64 | int64 | int64 | int64 | ||
0 | b | 1 | 1 | 7 | 2 | |
1 | b | 3 | 2 | 7 | 2 | |
2 | b | 6 | 3 | 7 | 2 | |
3 | a | 1 | 4 | NA | NA | |
4 | a | 3 | 5 | NA | NA | |
5 | a | 6 | 6 | NA | NA | |
6 | c | 1 | 7 | 8 | 4 | |
7 | c | 3 | 8 | 8 | 4 | |
8 | c | 6 | 9 | 8 | 4 |
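Because the join columns must share a name, a common pattern is to rename the key column before joining. A minimal sketch with hypothetical frames DT_left and DT_right:
from datatable import dt, join

DT_left = dt.Frame(x=["a", "b", "a"], v=[1, 2, 3])
DT_right = dt.Frame(z=["a", "b"], foo=[10, 20])
DT_right.names = {"z": "x"}   # align the key column name with the left frame
DT_right.key = "x"            # the frame being joined must be keyed
DT_left[:, :, join(DT_right)]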
More details about joins in datatable can be found at the join() API page; also have a look at the tutorial on the join operator.
More examples¶
This section shows how some solutions in pandas can be translated to datatable; the examples used here, as well as the pandas solutions, are from the pandas cookbook.
Feel free to submit a pull request on github for examples you would like to share with the community.
if-then-else¶
# Initial data frame:
df = pd.DataFrame({"AAA": [4, 5, 6, 7],
"BBB": [10, 20, 30, 40],
"CCC": [100, 50, -30, -50]})
df
In pandas this can be achieved using numpy’s where():
df['logic'] = np.where(df['AAA'] > 5, 'high', 'low')
In datatable, this can be solved using the ifelse() function:
# datatable
DT = dt.Frame(df)
DT["logic"] = ifelse(f.AAA > 5, "high", "low")
DT
AAA | BBB | CCC | logic | ||
---|---|---|---|---|---|
int64 | int64 | int64 | str32 | ||
0 | 4 | 10 | 100 | low | |
1 | 5 | 20 | 50 | low | |
2 | 6 | 30 | -30 | high | |
3 | 7 | 40 | -50 | high |
Select rows with data closest to certain value¶
# pandas
df = pd.DataFrame({"AAA": [4, 5, 6, 7],
"BBB": [10, 20, 30, 40],
"CCC": [100, 50, -30, -50]})
aValue = 43.0
Solution in pandas, using argsort:
df.loc[(df.CCC - aValue).abs().argsort()]
In datatable, the sort() function can be used to rearrange rows in the desired order:
DT = dt.Frame(df)
DT[:, :, sort(dt.math.abs(f.CCC - aValue))]
AAA | BBB | CCC | ||
---|---|---|---|---|
int64 | int64 | int64 | ||
0 | 5 | 20 | 50 | |
1 | 4 | 10 | 100 | |
2 | 6 | 30 | -30 | |
3 | 7 | 40 | -50 |
Efficiently and dynamically creating new columns using applymap¶
# pandas
df = pd.DataFrame({"AAA": [1, 2, 1, 3],
"BBB": [1, 1, 2, 2],
"CCC": [2, 1, 3, 1]})
source_cols = df.columns
new_cols = [str(x) + "_cat" for x in source_cols]
categories = {1: 'Alpha', 2: 'Beta', 3: 'Charlie'}
df[new_cols] = df[source_cols].applymap(categories.get)
df
We can replicate the solution above in datatable:
# datatable
import itertools as it
DT = dt.Frame(df)
mixer = it.product(DT.names, categories)
conditions = [(name, f[name] == value, categories[value])
for name, value in mixer]
for name, cond, value in conditions:
DT[cond, f"{name}_cat"] = value
AAA | BBB | CCC | AAA_cat | BBB_cat | CCC_cat | ||
---|---|---|---|---|---|---|---|
int64 | int64 | int64 | str32 | str32 | str32 | ||
0 | 1 | 1 | 2 | Alpha | Alpha | Beta | |
1 | 2 | 1 | 1 | Beta | Alpha | Alpha | |
2 | 1 | 2 | 3 | Alpha | Beta | Charlie | |
3 | 3 | 2 | 1 | Charlie | Beta | Alpha |
Keep other columns when using min() with groupby¶
# pandas
df = pd.DataFrame({'AAA': [1, 1, 1, 2, 2, 2, 3, 3],
'BBB': [2, 1, 3, 4, 5, 1, 2, 3]})
df
Solution in pandas:
df.loc[df.groupby("AAA")["BBB"].idxmin()]
In datatable, you can sort() within a group to achieve the same result:
# datatable
DT = dt.Frame(df)
DT[0, :, by("AAA"), sort(f.BBB)]
AAA | BBB | ||
---|---|---|---|
int64 | int64 | ||
0 | 1 | 1 | |
1 | 2 | 1 | |
2 | 3 | 2 |
Apply to different items in a group¶
# pandas
df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(),
'size': list('SSMMMLL'),
'weight': [8, 10, 11, 1, 20, 12, 12],
'adult': [False] * 5 + [True] * 2})
df
Solution in pandas:
def GrowUp(x):
avg_weight = sum(x[x['size'] == 'S'].weight * 1.5)
avg_weight += sum(x[x['size'] == 'M'].weight * 1.25)
avg_weight += sum(x[x['size'] == 'L'].weight)
avg_weight /= len(x)
return pd.Series(['L', avg_weight, True],
index=['size', 'weight', 'adult'])
gb = df.groupby("animal")
expected_df = gb.apply(GrowUp)
In datatable, we can use the ifelse() function to replicate the solution above, since it is based on a series of conditions:
DT = dt.Frame(df)
conditions = ifelse(f.size == "S", f.weight * 1.5,
f.size == "M", f.weight * 1.25,
f.size == "L", f.weight,
None)
DT[:, {"size": "L",
"avg_wt": dt.sum(conditions) / dt.count(),
"adult": True},
by("animal")]
animal | size | avg_wt | adult | ||
---|---|---|---|---|---|
str32 | str32 | float64 | bool8 | ||
0 | cat | L | 12.4375 | 1 | |
1 | dog | L | 20 | 1 | |
2 | fish | L | 1.25 | 1 |
Note
ifelse() can take multiple conditions, along with a default return value.
Note
Custom functions are not supported in datatable yet.
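One pragmatic workaround — sketched here under the assumption that a pandas round-trip is acceptable for the slow path — is to convert, apply the custom function in pandas, and convert back:
pdf = DT.to_pandas()                        # hand the data to pandas
expected_df = pdf.groupby("animal").apply(GrowUp)   # any custom Python callable
DT2 = dt.Frame(expected_df.reset_index())   # and back into a datatable Frame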
Sort groups by aggregated data¶
# pandas
df = pd.DataFrame({'code': ['foo', 'bar', 'baz'] * 2,
'data': [0.16, -0.21, 0.33, 0.45, -0.59, 0.62],
'flag': [False, True] * 3})
Solution in pandas:
code_groups = df.groupby('code')
agg_n_sort_order = code_groups[['data']].transform(sum).sort_values(by='data')
sorted_df = df.loc[agg_n_sort_order.index]
sorted_df
The solution above sorts the data based on the sum of the data column per group in the code column.
We can replicate this in datatable:
DT = dt.Frame(df)
DT[:, update(sum_data = dt.sum(f.data)), by("code")]
DT[:, :-1, sort(f.sum_data)]
code | data | flag | ||
---|---|---|---|---|
str32 | float64 | bool8 | ||
0 | bar | -0.21 | 1 | |
1 | bar | -0.59 | 0 | |
2 | foo | 0.16 | 0 | |
3 | foo | 0.45 | 1 | |
4 | baz | 0.33 | 0 | |
5 | baz | 0.62 | 1 |
Create a value counts column and reassign back to the DataFrame¶
# pandas
df = pd.DataFrame({'Color': 'Red Red Red Blue'.split(),
'Value': [100, 150, 50, 50]})
df
Solution in pandas:
df['Counts'] = df.groupby(['Color']).transform(len)
df
In datatable, you can replicate the solution above with the count() function:
DT = dt.Frame(df)
DT[:, update(Counts=dt.count()), by("Color")]
DT
Color | Value | Counts | ||
---|---|---|---|---|
str32 | int64 | int64 | ||
0 | Red | 100 | 3 | |
1 | Red | 150 | 3 | |
2 | Red | 50 | 3 | |
3 | Blue | 50 | 1 |
Shift groups of the values in a column based on the index¶
# pandas
df = pd.DataFrame({'line_race': [10, 10, 8, 10, 10, 8],
'beyer': [99, 102, 103, 103, 88, 100]},
index=['Last Gunfighter', 'Last Gunfighter',
'Last Gunfighter', 'Paynter', 'Paynter',
'Paynter'])
df
Solution in pandas:
df['beyer_shifted'] = df.groupby(level=0)['beyer'].shift(1)
df
Datatable has an equivalent shift() function:
DT = dt.Frame(df.reset_index())
DT[:, update(beyer_shifted = dt.shift(f.beyer)), by("index")]
DT
index | line_race | beyer | beyer_shifted | ||
---|---|---|---|---|---|
str32 | int64 | int64 | int64 | ||
0 | Last Gunfighter | 10 | 99 | NA | |
1 | Last Gunfighter | 10 | 102 | 99 | |
2 | Last Gunfighter | 8 | 103 | 102 | |
3 | Paynter | 10 | 103 | NA | |
4 | Paynter | 10 | 88 | 103 | |
5 | Paynter | 8 | 100 | 88 |
Frequency table like plyr in R¶
grades = [48, 99, 75, 80, 42, 80, 72, 68, 36, 78]
df = pd.DataFrame({'ID': ["x%d" % r for r in range(10)],
'Gender': ['F', 'M', 'F', 'M', 'F',
'M', 'F', 'M', 'M', 'M'],
'ExamYear': ['2007', '2007', '2007', '2008', '2008',
'2008', '2008', '2009', '2009', '2009'],
'Class': ['algebra', 'stats', 'bio', 'algebra',
'algebra', 'stats', 'stats', 'algebra',
'bio', 'bio'],
'Participated': ['yes', 'yes', 'yes', 'yes', 'no',
'yes', 'yes', 'yes', 'yes', 'yes'],
'Passed': ['yes' if x > 50 else 'no' for x in grades],
'Employed': [True, True, True, False,
False, False, False, True, True, False],
'Grade': grades})
df
Solution in pandas:
df.groupby('ExamYear').agg({'Participated': lambda x: x.value_counts()['yes'],
'Passed': lambda x: sum(x == 'yes'),
'Employed': lambda x: sum(x),
'Grade': lambda x: sum(x) / len(x)})
In datatable you can nest conditions within aggregations:
DT = dt.Frame(df)
DT[:, {"Participated": dt.sum(f.Participated == "yes"),
"Passed": dt.sum(f.Passed == "yes"),
"Employed": dt.sum(f.Employed),
"Grade": dt.mean(f.Grade)},
by("ExamYear")]
ExamYear | Participated | Passed | Employed | Grade | ||
---|---|---|---|---|---|---|
str32 | int64 | int64 | int64 | float64 | ||
0 | 2007 | 3 | 2 | 3 | 74 | |
1 | 2008 | 3 | 3 | 0 | 68.5 | |
2 | 2009 | 3 | 2 | 2 | 60.6667 |
Missing functionality¶
Listed below are some functions in pandas that do not have an equivalent in datatable yet, but are likely to be implemented:
Reshaping functions
Convenience functions for filtering and subsetting
Missing values
Aggregation functions
String functions
If there are any functions that you would like to see in datatable, please head over to github and raise a feature request.
Comparison with R’s data.table¶
datatable
is closely related to R’s data.table and attempts to mimic
its API; however, there are differences due to language constraints.
This page shows how to perform similar basic operations in R’s data.table
versus datatable
.
Subsetting Rows¶
The examples used here are from the examples data in R’s data.table
.
library(data.table)
DT = data.table(x=rep(c("b","a","c"),each=3),
y=c(1,3,6), v=1:9)
from datatable import dt, f, g, by, update, join, sort
DT = dt.Frame(x = ["b"]*3 + ["a"]*3 + ["c"]*3,
y = [1, 3, 6] * 3,
v = range(1, 10))
DT
x | y | v | ||
---|---|---|---|---|
str32 | int32 | int32 | ||
0 | b | 1 | 1 | |
1 | b | 3 | 2 | |
2 | b | 6 | 3 | |
3 | a | 1 | 4 | |
4 | a | 3 | 5 | |
5 | a | 6 | 6 | |
6 | c | 1 | 7 | |
7 | c | 3 | 8 | |
8 | c | 6 | 9 |
Action |
data.table |
datatable |
---|---|---|
Select 2nd row |
|
|
Select 2nd and 3rd row |
|
|
Select 3rd and 2nd row |
|
|
Select 2nd and 5th rows |
|
|
Select all rows from 2nd to 5th |
|
|
Select rows in reverse from 5th to the 1st |
|
|
Select the last row |
|
|
All rows where |
|
|
Compound logical expressions |
|
|
All rows other than rows 2,3,4 |
|
|
Sort by column |
|
DT.sort("x") orDT[:, :, sort("x")] |
Sort by column |
|
DT.sort(-f.x) or DT[:, :, sort(-f.x)] |
Sort by column |
|
DT.sort(f.x, -f.y) or DT[:, :, sort(f.x, -f.y)] |
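Written out, the datatable side of several of these row-subsetting operations looks like this (a sketch; DT and the imports are those defined above):
DT[1, :]                         # select 2nd row
DT[[1, 2], :]                    # select 2nd and 3rd rows
DT[1:5, :]                       # rows 2 through 5
DT[::-1, :]                      # all rows in reverse order
DT[-1, :]                        # the last row
DT[f.v > 5, :]                   # all rows where v > 5
DT[(f.v > 4) & (f.y == 6), :]    # compound logical expression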
Note
Note the use of the f
symbol when performing computations or
sorting in descending order. You can read more about f-expressions.
Note
In R, DT[2]
would mean 2nd row, whereas in python DT[2]
would
select the 3rd column.
In data.table
, when selecting rows you do not need to indicate the columns.
So, something like the code below works fine:
# data.table
DT[y==3]
x y v
1: b 3 2
2: a 3 5
3: c 3 8
In datatable
, however, when selecting rows there has to be a column
selector, or you get an error:
DT[f.y == 3]
The code above fails because, with a single argument inside the brackets, datatable only allows selecting a single column, as in:
DT['y']
y | ||
---|---|---|
int32 | ||
0 | 1 | |
1 | 3 | |
2 | 6 | |
3 | 1 | |
4 | 3 | |
5 | 6 | |
6 | 1 | |
7 | 3 | |
8 | 6 |
As such, when datatable sees an f-expression as the only argument, it thinks you are selecting a column, and appropriately errors out. Since, in this case, we are selecting all columns, we can use either a colon (:) or the Ellipsis symbol (...):
DT[f.y==3, :]
DT[f.y==3, ...]
Selecting columns¶
Action |
data.table |
datatable |
---|---|---|
Select column |
|
|
Select multiple columns |
|
|
Rename and select column |
|
|
Sum column |
|
|
Return two columns, |
|
|
Select the second column |
|
|
Select last column |
|
|
Select columns |
|
|
Exclude columns |
|
DT[:, [name not in ("x","y") for name in DT.names]] or DT[:, f[:].remove(f['x':'y'])] |
Select columns that start with |
|
DT[:, [name.startswith(("x", "v")) for name in DT.names]] |
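Similarly, a few of the column-selection idioms written out (a sketch using the same DT):
DT['v']                       # a single column
DT[:, ['x', 'y']]             # multiple columns
DT[:, {"w": f.v}]             # rename while selecting
DT[:, dt.sum(f.v)]            # sum of column v
DT[:, f[-1]]                  # the last column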
In data.table
, you can select a column by using a variable name with the
double dots prefix:
col = 'v'
DT[, ..col]
In datatable
, you do not need the prefix:
col = 'v'
DT[:, col] # or DT[col]
v | ||
---|---|---|
float64 | ||
0 | 1 | |
1 | 1.41421 | |
2 | 1.73205 | |
3 | 2 | |
4 | 2.23607 | |
5 | 2.44949 | |
6 | 2.64575 | |
7 | 2.82843 | |
8 | 3 |
If the column names are stored in a character vector, the double dots prefix also works:
cols = c('v', 'y')
DT[, ..cols]
In datatable, you can store the list/tuple of column names in a variable:
cols = ['v', 'y']
DT[:, cols]
v | y | ||
---|---|---|---|
float64 | float64 | ||
0 | 1 | 1 | |
1 | 1.41421 | 1.73205 | |
2 | 1.73205 | 2.44949 | |
3 | 2 | 1 | |
4 | 2.23607 | 1.73205 | |
5 | 2.44949 | 2.44949 | |
6 | 2.64575 | 1 | |
7 | 2.82843 | 1.73205 | |
8 | 3 | 2.44949 |
Subset rows and Select/Aggregate¶
Action |
data.table |
datatable |
---|---|---|
Sum column |
|
|
Same as above, new column name |
|
|
Filter in |
|
|
Same as above, return as scalar |
|
|
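Written out, these aggregate operations look like (a sketch):
DT[:, dt.sum(f.v)]                         # sum of column v, returned as a frame
DT[:, {"sv": dt.sum(f.v)}]                 # same, with a new column name
DT[f.x != "a", :][:, dt.sum(f.v)]          # filter first, then aggregate
DT[f.x != "a", :][:, dt.sum(f.v)][0, 0]    # extract the result as a python scalar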
In R, indexing starts at 1, and when slicing, both the first and the last items are included. In Python, indexing starts at 0, and slices include the first item but exclude the last.
Some SD (Subset of Data) operations can be replicated in datatable.
Aggregate several columns¶
DT[, lapply(.SD, mean), .SDcols = c("y","v")]
y v
1: 3.333333 5
DT[:, dt.mean([f.y,f.v])]
y | v | ||
---|---|---|---|
float64 | float64 | ||
0 | 3.33333 | 5 |
Modify columns using a condition¶
DT[, .SD - 1, .SDcols = is.numeric]
y v
1: 0 0
2: 2 1
3: 5 2
4: 0 3
5: 2 4
6: 5 5
7: 0 6
8: 2 7
9: 5 8
DT[:, f[int] - 1]
C0 | C1 | ||
---|---|---|---|
int32 | int32 | ||
0 | 0 | 0 | |
1 | 2 | 1 | |
2 | 5 | 2 | |
3 | 0 | 3 | |
4 | 2 | 4 | |
5 | 5 | 5 | |
6 | 0 | 6 | |
7 | 2 | 7 | |
8 | 5 | 8 |
Modify several columns and keep others unchanged¶
DT[, c("y", "v") := lapply(.SD, sqrt),
.SDcols = c("y", "v")]
x y v
1: b 1.000000 1.000000
2: b 1.732051 1.414214
3: b 2.449490 1.732051
4: a 1.000000 2.000000
5: a 1.732051 2.236068
6: a 2.449490 2.449490
7: c 1.000000 2.645751
8: c 1.732051 2.828427
9: c 2.449490 3.000000
# there is a square root function in the datatable math module
DT[:, update(**{name:f[name]**0.5 for name in ("y","v")})]
DT
x | y | v | ||
---|---|---|---|---|
str32 | float64 | float64 | ||
0 | b | 1 | 1 | |
1 | b | 1.73205 | 1.41421 | |
2 | b | 2.44949 | 1.73205 | |
3 | a | 1 | 2 | |
4 | a | 1.73205 | 2.23607 | |
5 | a | 2.44949 | 2.44949 | |
6 | c | 1 | 2.64575 | |
7 | c | 1.73205 | 2.82843 | |
8 | c | 2.44949 | 3 |
Grouping with by()¶
Action |
data.table |
datatable |
---|---|---|
Get the sum of column |
|
|
Get sum of |
|
|
Number of rows per group |
|
|
Select first row of |
|
|
Get row count and sum columns |
|
|
Expressions in |
|
|
Get row per group where column |
|
|
First 2 rows of each group |
|
|
Last 2 rows of each group |
|
|
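Typical by() idioms, written out (a sketch):
DT[:, dt.sum(f.v), by("x")]                   # sum of v per group
DT[:, dt.count(), by("x")]                    # number of rows per group
DT[0, :, by("x")]                             # first row of each group
DT[:2, :, by("x")]                            # first 2 rows of each group
DT[-2:, :, by("x")]                           # last 2 rows of each group
DT[f.x != "a", :][:, dt.sum(f.v), by("x")]    # chaining: filter, then grouped sum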
In R’s data.table
, the order of the groupings is preserved; in
datatable
, the returned dataframe is sorted on the grouping column.
DT[, sum(v), keyby=x]
in data.table returns a dataframe ordered by
column x
.
In data.table
, i
is executed before the grouping, while in
datatable
, i
is executed after the grouping.
Also, in datatable, f-expressions in the i section of a groupby are not yet implemented, hence the chaining method to get the sum of column v where x != "a".
Multiple aggregations within a group can be executed in R’s data.table
with the syntax below:
DT[, list(MySum=sum(v),
MyMin=min(v),
MyMax=max(v)),
by=.(x, y%%2)]
The same can be replicated in datatable
by using a dictionary:
DT[:, {'MySum': dt.sum(f.v),
'MyMin': dt.min(f.v),
'MyMax': dt.max(f.v)},
by(f.x, f.y%2)]
Add/Update/Delete Columns¶
Action |
data.table |
datatable |
---|---|---|
Add new column |
|
DT[:, update(z=42)] or DT['z'] = 42 or DT[:, 'z'] = 42 or DT = DT[:, f[:].extend({"z":42})] |
Add multiple columns |
|
DT[:, update(sv = dt.sum(f.v), mv = "X")] or DT[:, f[:].extend({"sv": dt.sum(f.v), "mv": "X"})] |
Remove column |
|
del DT['z'] or del DT[:, 'z'] or DT = DT[:, f[:].remove(f.z)] |
Subassign to existing |
|
DT[f.x=="a", update(v=42)] orDT[f.x=="a", 'v'] = 42 |
Subassign to new column (NA padded) |
|
DT[f.x=="b", update(v2=84)] orDT[f.x=='b', 'v2'] = 84 |
Add new column, assigning values group-wise |
|
DT[:, update(m=dt.mean(f.v)), by("x")] |
In data.table
, you can create a new column with a variable
col = 'rar'
DT[, ..col:=4242]
Similar operation for the above in datatable
:
col = 'rar'
DT[col] = 4242
# or DT[:, update(**{col: 4242})]; note that update(col=4242) would create a column literally named "col"
Joins¶
At the moment, only the left outer join is implemented in datatable
.
Another aspect is that the dataframe being joined must be keyed, the column or
columns to be keyed must not have duplicates, and the joining column has to
have the same name in both dataframes. You can read more about the
join()
API and have a look at the Tutorial on join operators.
Left join in R’s data.table
:
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
X = data.table(x=c("c","b"), v=8:7, foo=c(4,2))
X[DT, on="x"]
x v foo y i.v
1: b 7 2 1 1
2: b 7 2 3 2
3: b 7 2 6 3
4: a NA NA 1 4
5: a NA NA 3 5
6: a NA NA 6 6
7: c 8 4 1 7
8: c 8 4 3 8
9: c 8 4 6 9
Join in datatable
:
DT = dt.Frame(x = ["b"]*3 + ["a"]*3 + ["c"]*3,
y = [1, 3, 6] * 3,
v = range(1, 10))
X = dt.Frame({"x":('c','b'),
"v":(8,7),
"foo":(4,2)})
X.key = "x"
DT[:, :, join(X)]
x | y | v | v.0 | foo | ||
---|---|---|---|---|---|---|
str32 | int32 | int32 | int32 | int32 | ||
0 | b | 1 | 1 | 7 | 2 | |
1 | b | 3 | 2 | 7 | 2 | |
2 | b | 6 | 3 | 7 | 2 | |
3 | a | 1 | 4 | NA | NA | |
4 | a | 3 | 5 | NA | NA | |
5 | a | 6 | 6 | NA | NA | |
6 | c | 1 | 7 | 8 | 4 | |
7 | c | 3 | 8 | 8 | 4 | |
8 | c | 6 | 9 | 8 | 4 |
An inner join could be simulated by removing the nulls. Again, a join()
only works if the joining dataframe is keyed.
DT[X, on="x", nomatch=NULL]
x y v i.v foo
1: c 1 7 8 4
2: c 3 8 8 4
3: c 6 9 8 4
4: b 1 1 7 2
5: b 3 2 7 2
6: b 6 3 7 2
DT[g[-1] != None, :, join(X)] # g refers to the joining dataframe X
x | y | v | v.0 | foo | ||
---|---|---|---|---|---|---|
str32 | int32 | int32 | int32 | int32 | ||
0 | b | 1 | 1 | 7 | 2 | |
1 | b | 3 | 2 | 7 | 2 | |
2 | b | 6 | 3 | 7 | 2 | |
3 | c | 1 | 7 | 8 | 4 | |
4 | c | 3 | 8 | 8 | 4 | |
5 | c | 6 | 9 | 8 | 4 |
A not join can be simulated as well:
DT[!X, on="x"]
x y v
1: a 1 4
2: a 3 5
3: a 6 6
DT[g[-1]==None, f[:], join(X)]
x | y | v | ||
---|---|---|---|---|
str32 | int32 | int32 | ||
0 | a | 1 | 4 | |
1 | a | 3 | 5 | |
2 | a | 6 | 6 |
Select the first row for each group:
DT[X, on="x", mult="first"]
x y v i.v foo
1: c 1 7 8 4
2: b 1 1 7 2
DT[g[-1] != None, :, join(X)][0, :, by('x')] # chaining comes in handy here
x | y | v | v.0 | foo | ||
---|---|---|---|---|---|---|
str32 | int32 | int32 | int32 | int32 | ||
0 | b | 1 | 1 | 7 | 2 | |
1 | c | 1 | 7 | 8 | 4 |
Select the last row for each group:
DT[X, on="x", mult="last"]
x y v i.v foo
1: c 6 9 8 4
2: b 6 3 7 2
DT[g[-1]!=None, :, join(X)][-1, :, by('x')]
x | y | v | v.0 | foo | ||
---|---|---|---|---|---|---|
str32 | int32 | int32 | int32 | int32 | ||
0 | b | 6 | 3 | 7 | 2 | |
1 | c | 6 | 9 | 8 | 4 |
Join and evaluate j
for each row in i
:
DT[X, sum(v), by=.EACHI, on="x"]
x V1
1: c 24
2: b 6
DT[g[-1]!=None, :, join(X)][:, dt.sum(f.v), by("x")]
x | v | ||
---|---|---|---|
str32 | int64 | ||
0 | b | 6 | |
1 | c | 24 |
Aggregate on columns from both dataframes in j
:
DT[X, sum(v)*foo, by=.EACHI, on="x"]
x V1
1: c 96
2: b 12
DT[:, dt.sum(f.v*g.foo), join(X), by(f.x)][f[-1]!=0, :]
x | C0 | ||
---|---|---|---|
str32 | int64 | ||
0 | b | 12 | |
1 | c | 96 |
Aggregate on columns with same name from both dataframes in j
:
DT[X, sum(v)*i.v, by=.EACHI, on="x"]
x V1
1: c 192
2: b 42
DT[:, dt.sum(f.v*g.v), join(X), by(f.x)][f[-1]!=0, :]
x | C0 | ||
---|---|---|---|
str32 | int64 | ||
0 | b | 42 | |
1 | c | 192 |
Expect significant improvements in join functionality in the future, including more concise syntax and additional features.
Functions in R/data.table not yet implemented¶
This is a list of some functions in data.table that do not have an equivalent in datatable yet, but that we will likely implement:
Reshaping functions
Convenience functions for filtering and subsetting
Duplicate functions
Aggregation functions
Missing values functions
Also, at the moment, custom aggregations in the j
section are not supported
in datatable
– we intend to implement that at some point.
There are no datetime functions in datatable
, and string operations are
limited as well.
If there are any functions that you would like to see in datatable
, please
head over to github and raise a feature request.
Comparison with SQL¶
This page provides some examples of how various SQL operations can be
performed in datatable
. The datatable
library is still growing; as such,
not all functions in SQL
can be replicated yet. If there is a feature you
would love to have in datatable
, please make a feature request on the
github issues page.
Most of the examples will be based on the famous iris dataset. SQLite will be the flavour of SQL used in the comparison.
Let’s import datatable
and read in the data using its fread()
function:
from datatable import dt, f, g, by, join, sort, update, fread
iris = fread('https://raw.githubusercontent.com/h2oai/datatable/main/docs/_static/iris.csv')
iris
sepal_length | sepal_width | petal_length | petal_width | species | ||
---|---|---|---|---|---|---|
float64 | float64 | float64 | float64 | str32 | ||
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | |
1 | 4.9 | 3 | 1.4 | 0.2 | setosa | |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | |
4 | 5 | 3.6 | 1.4 | 0.2 | setosa | |
5 | 5.4 | 3.9 | 1.7 | 0.4 | setosa | |
6 | 4.6 | 3.4 | 1.4 | 0.3 | setosa | |
7 | 5 | 3.4 | 1.5 | 0.2 | setosa | |
8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa | |
9 | 4.9 | 3.1 | 1.5 | 0.1 | setosa | |
10 | 5.4 | 3.7 | 1.5 | 0.2 | setosa | |
11 | 4.8 | 3.4 | 1.6 | 0.2 | setosa | |
12 | 4.8 | 3 | 1.4 | 0.1 | setosa | |
13 | 4.3 | 3 | 1.1 | 0.1 | setosa | |
14 | 5.8 | 4 | 1.2 | 0.2 | setosa | |
… | … | … | … | … | … | |
145 | 6.7 | 3 | 5.2 | 2.3 | virginica | |
146 | 6.3 | 2.5 | 5 | 1.9 | virginica | |
147 | 6.5 | 3 | 5.2 | 2 | virginica | |
148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica | |
149 | 5.9 | 3 | 5.1 | 1.8 | virginica |
Loading data into an SQL table is a bit more involved: you need to create the structure of the table (a schema) before importing the csv file. Have a look at the SQLite import tutorial for an example of loading data into a SQLite database.
SELECT¶
In SQL
, you can select a subset of the columns with the SELECT
clause:
SELECT sepal_length,
sepal_width,
petal_length
FROM iris
LIMIT 5;
In datatable
, columns are selected in the j
section:
iris[:5, ['sepal_length', 'sepal_width', 'petal_length']]
sepal_length | sepal_width | petal_length | ||
---|---|---|---|---|
float64 | float64 | float64 | ||
0 | 5.1 | 3.5 | 1.4 | |
1 | 4.9 | 3 | 1.4 | |
2 | 4.7 | 3.2 | 1.3 | |
3 | 4.6 | 3.1 | 1.5 | |
4 | 5 | 3.6 | 1.4 |
In SQL
, you can select all columns with the *
symbol:
SELECT *
FROM iris
LIMIT 5;
In datatable, all columns can be selected with a simple “select-all” slice (:), or with f-expressions:
iris[:5, :]
sepal_length | sepal_width | petal_length | petal_width | species | ||
---|---|---|---|---|---|---|
float64 | float64 | float64 | float64 | str32 | ||
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | |
1 | 4.9 | 3 | 1.4 | 0.2 | setosa | |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | |
4 | 5 | 3.6 | 1.4 | 0.2 | setosa |
If you are selecting a single column, datatable
allows you to access just
the j
section within the square brackets; you do not need to include the
i
section: DT[j]
SELECT sepal_length
FROM iris
LIMIT 5;
# datatable
iris['sepal_length'].head(5)
sepal_length | ||
---|---|---|
float64 | ||
0 | 5.1 | |
1 | 4.9 | |
2 | 4.7 | |
3 | 4.6 | |
4 | 5 |
How about adding new columns? In SQL
, this is done also in the SELECT
clause:
SELECT *,
sepal_length*2 AS sepal_length_doubled
FROM iris
LIMIT 5;
In datatable
, addition of new columns occurs in the j
section:
iris[:5,
f[:].extend({"sepal_length_doubled": f.sepal_length * 2})]
sepal_length | sepal_width | petal_length | petal_width | species | sepal_length_doubled | ||
---|---|---|---|---|---|---|---|
float64 | float64 | float64 | float64 | str32 | float64 | ||
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 10.2 | |
1 | 4.9 | 3 | 1.4 | 0.2 | setosa | 9.8 | |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 9.4 | |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 9.2 | |
4 | 5 | 3.6 | 1.4 | 0.2 | setosa | 10 |
The update()
function can also be used to add new columns. The operation
occurs in-place; reassignment is not required:
iris[:, update(sepal_length_doubled = f.sepal_length * 2)]
iris[:5, :]
sepal_length | sepal_width | petal_length | petal_width | species | sepal_length_doubled | ||
---|---|---|---|---|---|---|---|
float64 | float64 | float64 | float64 | str32 | float64 | ||
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 10.2 | |
1 | 4.9 | 3 | 1.4 | 0.2 | setosa | 9.8 | |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 9.4 | |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 9.2 | |
4 | 5 | 3.6 | 1.4 | 0.2 | setosa | 10 |
WHERE¶
Filtering in SQL
is done via the WHERE
clause.
SELECT *
FROM iris
WHERE species = 'virginica'
LIMIT 5;
In datatable
, filtration is done in the i
section:
iris[f.species=="virginica", :].head(5)
sepal_length | sepal_width | petal_length | petal_width | species | sepal_length_doubled | ||
---|---|---|---|---|---|---|---|
float64 | float64 | float64 | float64 | str32 | float64 | ||
0 | 6.3 | 3.3 | 6 | 2.5 | virginica | 12.6 | |
1 | 5.8 | 2.7 | 5.1 | 1.9 | virginica | 11.6 | |
2 | 7.1 | 3 | 5.9 | 2.1 | virginica | 14.2 | |
3 | 6.3 | 2.9 | 5.6 | 1.8 | virginica | 12.6 | |
4 | 6.5 | 3 | 5.8 | 2.2 | virginica | 13 |
Note that in SQL
, equality comparison is done with the =
symbol,
whereas in python
, it is with the ==
operator. You can filter with
multiple conditions too:
SELECT *
FROM iris
WHERE species = 'setosa' AND sepal_length = 5;
In datatable
each condition is wrapped in parentheses; the &
operator
is the equivalent of AND
, while |
is the equivalent of OR
:
iris[(f.species=="setosa") & (f.sepal_length==5), :]
sepal_length | sepal_width | petal_length | petal_width | species | sepal_length_doubled | ||
---|---|---|---|---|---|---|---|
float64 | float64 | float64 | float64 | str32 | float64 | ||
0 | 5 | 3.6 | 1.4 | 0.2 | setosa | 10 | |
1 | 5 | 3.4 | 1.5 | 0.2 | setosa | 10 | |
2 | 5 | 3 | 1.6 | 0.2 | setosa | 10 | |
3 | 5 | 3.4 | 1.6 | 0.4 | setosa | 10 | |
4 | 5 | 3.2 | 1.2 | 0.2 | setosa | 10 | |
5 | 5 | 3.5 | 1.3 | 0.3 | setosa | 10 | |
6 | 5 | 3.5 | 1.6 | 0.6 | setosa | 10 | |
7 | 5 | 3.3 | 1.4 | 0.2 | setosa | 10 |
Now suppose you have a frame where some values are missing (NA):
null_data = dt.Frame(""" a b c
1 2 3
1 NaN 4
2 1 3
1 2 2""")
null_data
a | b | c | ||
---|---|---|---|---|
int32 | float64 | int32 | ||
0 | 1 | 2 | 3 | |
1 | 1 | NA | 4 | |
2 | 2 | 1 | 3 | |
3 | 1 | 2 | 2 |
In SQL you could filter out those values like this:
SELECT *
FROM null_data
WHERE b is NOT NULL;
In datatable
, the NOT
operator is replicated with the !=
symbol:
null_data[f.b!=None, :]
a | b | c | ||
---|---|---|---|---|
int32 | float64 | int32 | ||
0 | 1 | 2 | 3 | |
1 | 2 | 1 | 3 | |
2 | 1 | 2 | 2 |
You could also use the isna() function with the ~ operator, which inverts boolean expressions:
null_data[~dt.math.isna(f.b), :]
a | b | c | ||
---|---|---|---|---|
int32 | float64 | int32 | ||
0 | 1 | 2 | 3 | |
1 | 2 | 1 | 3 | |
2 | 1 | 2 | 2 |
Keeping the null rows is easily achievable; it is simply the inverse of the above code:
SELECT *
FROM null_data
WHERE b is NULL;
null_data[dt.isna(f.b), :]
a | b | c | ||
---|---|---|---|---|
int32 | float64 | int32 | ||
0 | 1 | NA | 4 |
Note
SQL
has the IN
operator, which does not have an equivalent in
datatable
yet.
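Until then, a common workaround is to OR together equality comparisons (a sketch):
# emulate: WHERE species IN ('setosa', 'virginica')
iris[(f.species == "setosa") | (f.species == "virginica"), :]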
ORDER BY¶
In SQL, sorting is executed with the ORDER BY
clause, while in datatable
it is handled by the sort()
function.
SELECT *
FROM iris
ORDER BY sepal_length ASC
limit 5;
iris[:5, :, sort('sepal_length')]
sepal_length | sepal_width | petal_length | petal_width | species | sepal_length_doubled | ||
---|---|---|---|---|---|---|---|
float64 | float64 | float64 | float64 | str32 | float64 | ||
0 | 4.3 | 3 | 1.1 | 0.1 | setosa | 8.6 | |
1 | 4.4 | 2.9 | 1.4 | 0.2 | setosa | 8.8 | |
2 | 4.4 | 3 | 1.3 | 0.2 | setosa | 8.8 | |
3 | 4.4 | 3.2 | 1.3 | 0.2 | setosa | 8.8 | |
4 | 4.5 | 2.3 | 1.3 | 0.3 | setosa | 9 |
Sorting in descending order in SQL is done with the DESC keyword.
SELECT *
FROM iris
ORDER BY sepal_length DESC
limit 5;
In datatable, this can be achieved in two ways:
iris[:5, :, sort('sepal_length', reverse=True)]
sepal_length | sepal_width | petal_length | petal_width | species | sepal_length_doubled | ||
---|---|---|---|---|---|---|---|
float64 | float64 | float64 | float64 | str32 | float64 | ||
0 | 7.9 | 3.8 | 6.4 | 2 | virginica | 15.8 | |
1 | 7.7 | 3.8 | 6.7 | 2.2 | virginica | 15.4 | |
2 | 7.7 | 2.6 | 6.9 | 2.3 | virginica | 15.4 | |
3 | 7.7 | 2.8 | 6.7 | 2 | virginica | 15.4 | |
4 | 7.7 | 3 | 6.1 | 2.3 | virginica | 15.4 |
or, you could negate the sorting column; datatable will correctly interpret the negation (-) as descending order:
iris[:5, :, sort(-f.sepal_length)]
sepal_length | sepal_width | petal_length | petal_width | species | sepal_length_doubled | ||
---|---|---|---|---|---|---|---|
float64 | float64 | float64 | float64 | str32 | float64 | ||
0 | 7.9 | 3.8 | 6.4 | 2 | virginica | 15.8 | |
1 | 7.7 | 3.8 | 6.7 | 2.2 | virginica | 15.4 | |
2 | 7.7 | 2.6 | 6.9 | 2.3 | virginica | 15.4 | |
3 | 7.7 | 2.8 | 6.7 | 2 | virginica | 15.4 | |
4 | 7.7 | 3 | 6.1 | 2.3 | virginica | 15.4 |
GROUP BY¶
SQL’s GROUP BY
operations can be performed in datatable
with the
by()
function. Have a look at the by()
API, as well as the
Grouping with by() user guide.
Let’s look at some common grouping operations in SQL
, and their equivalents
in datatable
.
Single aggregation per group¶
SELECT species,
COUNT(*) AS N
FROM iris
GROUP BY species;
iris[:, dt.count(), by('species')]
species | count | ||
---|---|---|---|
str32 | int64 | ||
0 | setosa | 50 | |
1 | versicolor | 50 | |
2 | virginica | 50 |
Multiple aggregations per group¶
SELECT species,
COUNT(*) AS N,
AVG(sepal_length) AS mean_sepal_length
FROM iris
GROUP BY species;
iris[:,
{"mean_sepal_length": dt.mean(f.sepal_length),
"N": dt.count()},
by('species')]
species | mean_sepal_length | N | ||
---|---|---|---|---|
str32 | float64 | int64 | ||
0 | setosa | 5.006 | 50 | |
1 | versicolor | 5.936 | 50 | |
2 | virginica | 6.588 | 50 |
Grouping on multiple columns¶
fruits_data
Fruit | Date | Name | Number | ||
---|---|---|---|---|---|
str32 | str32 | str32 | int32 | ||
0 | Apples | 10/6/2016 | Bob | 7 | |
1 | Apples | 10/6/2016 | Bob | 8 | |
2 | Apples | 10/6/2016 | Mike | 9 | |
3 | Apples | 10/7/2016 | Steve | 10 | |
4 | Apples | 10/7/2016 | Bob | 1 | |
5 | Oranges | 10/7/2016 | Bob | 2 | |
6 | Oranges | 10/6/2016 | Tom | 15 | |
7 | Oranges | 10/6/2016 | Mike | 57 | |
8 | Oranges | 10/6/2016 | Bob | 65 | |
9 | Oranges | 10/7/2016 | Tony | 1 | |
10 | Grapes | 10/7/2016 | Bob | 1 | |
11 | Grapes | 10/7/2016 | Tom | 87 | |
12 | Grapes | 10/7/2016 | Bob | 22 | |
13 | Grapes | 10/7/2016 | Bob | 12 | |
14 | Grapes | 10/7/2016 | Tony | 15 |
SELECT fruit,
name,
SUM(number) AS sum_num
FROM fruits_data
GROUP BY fruit, name;
fruits_data[:,
{"sum_num": dt.sum(f.Number)},
by('Fruit', 'Name')]
Fruit | Name | sum_num | ||
---|---|---|---|---|
str32 | str32 | int64 | ||
0 | Apples | Bob | 16 | |
1 | Apples | Mike | 9 | |
2 | Apples | Steve | 10 | |
3 | Grapes | Bob | 35 | |
4 | Grapes | Tom | 87 | |
5 | Grapes | Tony | 15 | |
6 | Oranges | Bob | 67 | |
7 | Oranges | Mike | 57 | |
8 | Oranges | Tom | 15 | |
9 | Oranges | Tony | 1 |
WHERE with GROUP BY¶
SELECT species,
AVG(sepal_length) AS avg_sepal_length
FROM iris
WHERE sepal_width >= 3
GROUP BY species;
iris[f.sepal_width >=3, :][:,
{"avg_sepal_length": dt.mean(f.sepal_length)},
by('species')]
species | avg_sepal_length | ||
---|---|---|---|
str32 | float64 | ||
0 | setosa | 5.02917 | |
1 | versicolor | 6.21875 | |
2 | virginica | 6.76897 |
HAVING with GROUP BY¶
SELECT fruit,
name,
SUM(number) AS sum_num
FROM fruits_data
GROUP BY fruit, name
HAVING sum_num > 50;
fruits_data[:,
{'sum_num': dt.sum(f.Number)},
by('Fruit','Name')][f.sum_num > 50, :]
Fruit | Name | sum_num | ||
---|---|---|---|---|
str32 | str32 | int64 | ||
0 | Grapes | Tom | 87 | |
1 | Oranges | Bob | 67 | |
2 | Oranges | Mike | 57 |
Grouping on a condition¶
SELECT sepal_width >=3 AS width_larger_than_3,
AVG(sepal_length) AS avg_sepal_length
FROM iris
GROUP BY sepal_width>=3;
iris[:,
{"avg_sepal_length": dt.mean(f.sepal_length)},
by(f.sepal_width >= 3)]
C0 | avg_sepal_length | ||
---|---|---|---|
bool8 | float64 | ||
0 | 0 | 5.95263 | |
1 | 1 | 5.77634 |
At the moment, names cannot be assigned in the by
section.
LEFT OUTER JOIN¶
We will compare the left outer join, as that is the only join currently
implemented in datatable
. Another aspect is that the frame being joined
must be keyed, the column or columns to be keyed must not have duplicates,
and the joining column has to have the same name in both frames. You can read
more about the join()
API and have a look at the join(…).
Example data:
DT = dt.Frame(x = ["b"]*3 + ["a"]*3 + ["c"]*3,
y = [1, 3, 6] * 3,
v = range(1, 10))
X = dt.Frame({"x":('c','b'),
"v":(8,7),
"foo":(4,2)})
A left outer join in SQL:
SELECT DT.x,
DT.y,
DT.v,
X.foo
FROM DT
left JOIN X
ON DT.x = X.x
A left outer join in datatable
:
X.key = 'x'
DT[:, [f.x, f.y, f.v, g.foo], join(X)]
x | y | v | foo | ||
---|---|---|---|---|---|
str32 | int32 | int32 | int32 | ||
0 | b | 1 | 1 | 2 | |
1 | b | 3 | 2 | 2 | |
2 | b | 6 | 3 | 2 | |
3 | a | 1 | 4 | NA | |
4 | a | 3 | 5 | NA | |
5 | a | 6 | 6 | NA | |
6 | c | 1 | 7 | 4 | |
7 | c | 3 | 8 | 4 | |
8 | c | 6 | 9 | 4 |
UNION¶
The UNION ALL
clause in SQL can be replicated in datatable
with
rbind()
.
SELECT x, v
FROM DT
UNION ALL
SELECT x, v
FROM X
In datatable, rbind() takes a list/tuple of frames and lumps them into one:
dt.rbind([DT[:, ('x','v')], X[:, ('x', 'v')]])
x | v | ||
---|---|---|---|
str32 | int32 | ||
0 | b | 1 | |
1 | b | 2 | |
2 | b | 3 | |
3 | a | 4 | |
4 | a | 5 | |
5 | a | 6 | |
6 | c | 7 | |
7 | c | 8 | |
8 | c | 9 | |
9 | b | 7 | |
10 | c | 8 |
SQL’s UNION
removes duplicate rows after combining the results of the
individual queries; there is no built-in function in datatable
yet that
handles duplicates.
SQL’s WINDOW functions¶
Some SQL window functions can be replicated in datatable (rank() is one of the window functions not currently implemented in datatable):
TOP n rows per group
SELECT * from
(SELECT *,
ROW_NUMBER() OVER(PARTITION BY species ORDER BY sepal_length DESC) AS row_num
FROM iris)
WHERE row_num <= 3;
iris[:3, :, by('species'), sort(-f.sepal_length)]
species | sepal_length | sepal_width | petal_length | petal_width | ||
---|---|---|---|---|---|---|
str32 | float64 | float64 | float64 | float64 | ||
0 | setosa | 5.8 | 4 | 1.2 | 0.2 | |
1 | setosa | 5.7 | 4.4 | 1.5 | 0.4 | |
2 | setosa | 5.7 | 3.8 | 1.7 | 0.3 | |
3 | versicolor | 7 | 3.2 | 4.7 | 1.4 | |
4 | versicolor | 6.9 | 3.1 | 4.9 | 1.5 | |
5 | versicolor | 6.8 | 2.8 | 4.8 | 1.4 | |
6 | virginica | 7.9 | 3.8 | 6.4 | 2 | |
7 | virginica | 7.7 | 3.8 | 6.7 | 2.2 | |
8 | virginica | 7.7 | 2.6 | 6.9 | 2.3 |
Filter for rows above the mean sepal length:
SELECT sepal_length,
sepal_width,
petal_length,
petal_width,
species
FROM
(SELECT *,
AVG(sepal_length) OVER (PARTITION BY species) AS avg_sepal_length
FROM iris)
WHERE sepal_length > avg_sepal_length
LIMIT 5;
iris[:,
update(temp = f.sepal_length > dt.mean(f.sepal_length)),
by('species')]
iris[f.temp == 1, f[:-1]].head(5)
sepal_length | sepal_width | petal_length | petal_width | species | ||
---|---|---|---|---|---|---|
float64 | float64 | float64 | float64 | str32 | ||
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | |
1 | 5.4 | 3.9 | 1.7 | 0.4 | setosa | |
2 | 5.4 | 3.7 | 1.5 | 0.2 | setosa | |
3 | 5.8 | 4 | 1.2 | 0.2 | setosa | |
4 | 5.7 | 4.4 | 1.5 | 0.4 | setosa |
Lead and lag
SELECT name,
destination,
dep_date,
LEAD(dep_date) OVER (ORDER BY dep_date, name) AS lead1,
LEAD(dep_date, 2) OVER (ORDER BY dep_date, name) AS lead2,
LAG(dep_date) OVER (ORDER BY dep_date, name) AS lag1,
LAG(dep_date, 3) OVER (ORDER BY dep_date, name) AS lag3
FROM source_data;
source_data = dt.Frame({'name': ['Ann', 'Ann', 'Ann', 'Bob', 'Bob'],
'destination': ['Japan', 'Korea', 'Switzerland',
'USA', 'Switzerland'],
'dep_date': ['2019-02-02', '2019-01-01',
'2020-01-11', '2019-05-05',
'2020-01-11'],
'duration': [7, 21, 14, 10, 14]})
source_data[:,
f[:].extend({"lead1": dt.shift(f.dep_date, -1),
"lead2": dt.shift(f.dep_date, -2),
"lag1": dt.shift(f.dep_date),
"lag3": dt.shift(f.dep_date,3)
}),
sort('dep_date','name')]
name | destination | dep_date | duration | lead1 | lead2 | lag1 | lag3 | ||
---|---|---|---|---|---|---|---|---|---|
str32 | str32 | str32 | int32 | str32 | str32 | str32 | str32 | ||
0 | Ann | Korea | 2019-01-01 | 21 | 2019-02-02 | 2019-05-05 | NA | NA | |
1 | Ann | Japan | 2019-02-02 | 7 | 2019-05-05 | 2020-01-11 | 2019-01-01 | NA | |
2 | Bob | USA | 2019-05-05 | 10 | 2020-01-11 | 2020-01-11 | 2019-02-02 | NA | |
3 | Ann | Switzerland | 2020-01-11 | 14 | 2020-01-11 | NA | 2019-05-05 | 2019-01-01 | |
4 | Bob | Switzerland | 2020-01-11 | 14 | NA | NA | 2020-01-11 | 2019-02-02 |
The equivalent of SQL’s LAG
is shift()
with a positive number,
while SQL’s LEAD
is shift()
with a negative number.
Note
datatable's native datetime support is still limited; in this example the dates are stored as strings, which nevertheless sort correctly in the ISO format.
Total sum and the proportions:
proportions = dt.Frame({"t": [1, 2, 3]})
proportions
t | ||
---|---|---|
int32 | ||
0 | 1 | |
1 | 2 | |
2 | 3 |
SELECT t,
SUM(t) OVER () AS sum,
CAST(t as FLOAT)/SUM(t) OVER () AS pct
FROM proportions;
proportions[:,
f[:].extend({"sum": dt.sum(f.t),
"pct": f.t/dt.sum(f.t)})]
t | sum | pct | ||
---|---|---|---|---|
int32 | int64 | float64 | ||
0 | 1 | 6 | 0.166667 | |
1 | 2 | 6 | 0.333333 | |
2 | 3 | 6 | 0.5 |
Dates and time¶
datatable
has several builtin types to support working with date/time
variables.
date32¶
The date32
type is used to represent a particular calendar date without a
time component. Internally, this type is stored as a 32-bit integer containing
the number of days since the epoch (Jan 1, 1970). Thus, this type accommodates
dates within the range of approximately ±5.8 million years.
The calendar used for this type is proleptic Gregorian, meaning that it extends the modern-day Gregorian calendar into the past before this calendar was first adopted.
time64¶
The time64
type is used to represent a specific moment in time. This
corresponds to datetime
in Python, or timestamp
in Arrow or pandas.
Internally, this type is stored as a 64-bit integer containing the number of
milliseconds since the epoch (Jan 1, 1970) in UTC.
This type is not leap-seconds aware, meaning that it assumes that each day
has exactly 24×3600 seconds. In practice it means that calculating time
difference between two time64
moments may be off by the number of leap
seconds that have occurred between them.
A time64
column may also carry a time zone as meta information. This time
zone is used to convert the timestamp from the absolute UTC time to the local
calendar. For example, suppose you have two time64
columns: one is in UTC
while the other is in America/Los_Angeles time zone. Assume both columns
store the same value 1577836800000
. Then these two columns represent the
same moment in time, however their calendar representations are different:
2020-01-01T00:00:00Z
and 2019-12-31T16:00:00-0800
respectively.
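As a minimal sketch (assuming datatable 1.0+, where Python date and datetime objects are converted automatically), columns of both types can be created directly from Python objects:
from datetime import date, datetime
from datatable import dt

DT = dt.Frame(d=[date(2020, 1, 1), date(2021, 6, 15)],
              t=[datetime(2020, 1, 1, 12, 0), datetime(2021, 6, 15, 8, 30)])
DT.types   # expected: [Type.date32, Type.time64]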
FTRL Model¶
This section provides a brief introduction to the FTRL (Follow the Regularized Leader)
model as implemented in datatable. For detailed information on the API,
please refer to the Ftrl
Python class documentation.
FTRL Model Information¶
The Follow the Regularized Leader (FTRL) model is a datatable implementation of the FTRL-Proximal online learning algorithm for binomial logistic regression. It uses a hashing trick for feature vectorization and the Hogwild approach for parallelization. FTRL for multinomial classification and continuous targets are implemented experimentally.
Create an FTRL Model¶
The FTRL model is implemented as the Ftrl
Python class, which is a
part of dt.models
, so to use the model you should first do:
from datatable.models import Ftrl
and then create a model as:
ftrl_model = Ftrl()
FTRL Model Parameters¶
The FTRL model requires a list of parameters for training and making predictions, namely:
alpha
– learning rate, defaults to0.005
.beta
– beta parameter, defaults to1.0
.lambda1
– L1 regularization parameter, defaults to0.0
.lambda2
– L2 regularization parameter, defaults to1.0
.nbins
– the number of bins for the hashing trick, defaults to10**6
.mantissa_nbits
– the number of bits from mantissa to be used for hashing, defaults to10
.nepochs
– the number of epochs to train the model for, defaults to1
.negative_class
– whether to create and train on a “negative” class in the case of multinomial classification, defaults toFalse
.interactions
— a list or a tuple of interactions. In turn, each interaction should be a list or a tuple of feature names, where each feature name is a column name from the training frame. This setting defaults toNone
.model_type
— training mode that can be one of the following:"auto"
to automatically set model type based on the target column data,"binomial"
for binomial classification,"multinomial"
for multinomial classification or"regression"
for continuous targets. Defaults to"auto"
.
If some parameters need to be changed from their default values, this can be done either when creating the model, as
ftrl_model = Ftrl(alpha = 0.1, nbins = 100)
or, if the model already exists, as
ftrl_model.alpha = 0.1
ftrl_model.nbins = 100
If some parameters were not set explicitly, they will be assigned the default values.
Training a Model¶
Use the fit()
method to train a model:
ftrl_model.fit(X_train, y_train)
where X_train
is a frame of shape (nrows, ncols)
to be trained on,
and y_train
is a target frame of shape (nrows, 1)
. The following datatable column types are supported for the X_train frame: bool, int, real and str.
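A toy end-to-end sketch (the data here is made up purely for illustration):
from datatable import dt
from datatable.models import Ftrl

X_train = dt.Frame(x=[0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5])
y_train = dt.Frame(target=[False] * 5 + [True] * 5)   # boolean target -> binomial

ftrl_model = Ftrl(nepochs=10)
ftrl_model.fit(X_train, y_train)
p = ftrl_model.predict(X_train)   # frame of predicted probabilities, shape (10, 1)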
The FTRL model can also do early stopping, if the relative validation error does not improve. For this, the model should be fit as
res = ftrl_model.fit(X_train, y_train, X_validation, y_validation,
nepochs_validation, validation_error,
validation_average_niterations)
where X_train
and y_train
are training and target frames,
respectively, X_validation
and y_validation
are validation frames,
nepochs_validation
specifies how often, in epoch units, validation
error should be checked, validation_error
is the relative
validation error improvement that the model should demonstrate within
nepochs_validation
to continue training, and
validation_average_niterations
is the number of iterations
to average when calculating the validation error. The returned res tuple contains the epoch at which training stopped and the corresponding loss.
Resetting a Model¶
Use the reset()
method to reset a model:
ftrl_model.reset()
This will reset model weights, but it will not affect learning parameters. To reset parameters to default values, you can do
ftrl_model.params = Ftrl().params
Making Predictions¶
Use the predict()
method to make predictions:
targets = ftrl_model.predict(X)
where X
is a frame of shape (nrows, ncols)
to make predictions for.
X
should have the same number of columns as the training frame.
The predict()
method returns a new frame of shape (nrows, 1)
with
the predicted probability for each row of frame X
.
Feature Importances¶
To estimate feature importances, the overall weight contributions are calculated feature-wise during training and predicting. Feature importances can be accessed as
fi = ftrl_model.feature_importances
where fi will be a frame of shape (nfeatures, 2) containing feature names and their importances, normalized to the [0, 1] range.
Feature Interactions¶
By default each column of a training dataset is considered a feature by the FTRL model. Users can provide additional features by specifying a list or a tuple of feature interactions, for instance as
ftrl_model.interactions = [["C0", "C1", "C3"], ["C2", "C5"]]
where C*
are column names from a training dataset. In the above example
two additional features, namely, C0:C1:C3
and C2:C5
, are created.
interactions should be set before calling the fit() method, and cannot be changed once the model is trained.
datatable API¶
Symbols listed here are available for import from the root of the datatable
module.
Submodules¶
exceptions | Warnings and exceptions that datatable may throw during runtime. |
internal | Access to some internal details of datatable. |
math | Mathematical functions, similar to python’s math module. |
models | A small set of data analysis tools. |
re | Functions using regular expressions. |
str | Functions for working with string columns. |
time | Functions for working with date/time columns. |
Classes¶
Frame | Main “table of data” class. This is the equivalent of pandas’ or Julia’s DataFrame, R’s data.table or tibble, SQL’s table, etc. |
FExpr | Helper class for computing formulas over a frame. |
Namespace | Helper class for addressing columns in a frame. |
Type | Column’s type, similar to numpy’s dtype. |
stype | [DEPRECATED] Enum of column “storage” types. |
ltype | [DEPRECATED] Enum of column “logical” types. |
Functions¶
fread() | Read CSV/text/XLSX/Jay/other files |
iread() | Same as fread(), but read multiple files at once |
by() | Group-by clause for use in Frame’s square-bracket selector |
join() | Join clause for use in Frame’s square-bracket selector |
sort() | Sort clause for use in Frame’s square-bracket selector |
update() | Create new or update existing columns within a frame |
cbind() | Combine frames by columns |
rbind() | Combine frames by rows |
repeat() | Concatenate frame by rows |
as_type() | Cast column into another type |
ifelse() | Ternary if operator |
shift() | Shift column by a given number of rows |
cut() | Bin a column into equal-width intervals |
qcut() | Bin a column into equal-population intervals |
split_into_nhot() | [DEPRECATED] Split and nhot-encode a single-column frame |
init_styles() | Inject datatable’s stylesheets into the Jupyter notebook |
rowall() | Row-wise all() function |
rowany() | Row-wise any() function |
rowcount() | Calculate the number of non-missing values per row |
rowfirst() | Find the first non-missing value row-wise |
rowlast() | Find the last non-missing value row-wise |
rowmax() | Find the largest element row-wise |
rowmean() | Calculate the mean value row-wise |
rowmin() | Find the smallest element row-wise |
rowsd() | Calculate the standard deviation row-wise |
rowsum() | Calculate the sum of all values row-wise |
intersect() | Calculate the set intersection of values in the frames |
setdiff() | Calculate the set difference between the frames |
symdiff() | Calculate the symmetric difference between the sets of values in the frames |
union() | Calculate the union of values in the frames |
unique() | Find unique values in a frame |
corr() | Calculate correlation between two columns |
count() | Count non-missing values per column |
cov() | Calculate covariance between two columns |
max() | Find the largest element per column |
mean() | Calculate mean value per column |
median() | Find the median element per column |
min() | Find the smallest element per column |
sd() | Calculate the standard deviation per column |
sum() | Calculate the sum of all values per column |
Other¶
build_info | Information about the build of the datatable module. |
dt | The datatable module itself. |
f | The primary namespace used during DT[i, j, ...] call. |
g | Secondary namespace used during DT[i, j, ...] call with a join. |
datatable.exceptions¶
This module contains warnings and exceptions that datatable
may
throw during runtime.
Exceptions¶
Exceptions are thrown when a special, unexpected condition is encountered during runtime. All datatable exceptions are descendants of the DtException class, so that they can be easily caught. The following exceptions may be thrown:
ImportError | Equivalent to the built-in ImportError. |
IndexError | Equivalent to the built-in IndexError. |
InvalidOperationError | The operation requested is illegal for the given combination of parameters. |
IOError | Equivalent to the built-in IOError. |
KeyError | Equivalent to the built-in KeyError. |
MemoryError | Equivalent to the built-in MemoryError. |
NotImplementedError | Equivalent to the built-in NotImplementedError. |
OverflowError | Equivalent to the built-in OverflowError. |
TypeError | Equivalent to the built-in TypeError. |
ValueError | Equivalent to the built-in ValueError. |
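Because all of these classes derive from DtException, a single except clause can catch any datatable error (a minimal sketch; the failing expression is arbitrary):
from datatable import dt
from datatable.exceptions import DtException

try:
    dt.Frame(A=[1, 2])[5, 0]    # out-of-bounds access raises a datatable exception
except DtException as exc:
    print(type(exc).__name__, exc)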
Warnings¶
Warnings are issued when it is helpful to inform the user of some condition in a program that doesn’t result in an exception and program termination. We may issue the following warnings:
DeprecationWarning | A built-in python warning about deprecated features. |
DatatableWarning | A datatable warning. |
IOWarning | A warning regarding an input/output operation. |
datatable.exceptions.DtException¶
Base class for all exceptions raised by datatable
.
datatable.exceptions.ImportError¶
This exception may be raised when a datatable operation requires an external module or library, but that module is not available. Examples of such operations include: converting a Frame into a pandas DataFrame, or into an Arrow Table, or reading an Excel file.
Inherits from Python ImportError
and dt.exceptions.DtException
.
datatable.exceptions.InvalidOperationError¶
Raised in multiple scenarios whenever the requested operation is logically invalid with the given combination of parameters.
For example, cbind
-ing several frames with
incompatible shapes.
Inherits from dt.exceptions.DtException
.
datatable.exceptions.IndexError¶
Raised when accessing an element of a frame by index, but the value of the index falls outside of the boundaries of the frame.
Inherits from Python IndexError
and dt.exceptions.DtException
.
datatable.exceptions.IOError¶
Raised during any IO operation, such as reading/writing CSV or Jay files. The most common cause for such an error is an invalid input file.
Inherits from Python IOError
and dt.exceptions.DtException
.
datatable.exceptions.IOWarning¶
This warning is raised whenever you read an input file and there are some irregularities in the input that we can recover from, but perhaps the user should be informed that something wasn’t quite right.
datatable.exceptions.KeyError¶
Raised when accessing a column of a frame by name, but the name lookup fails to find such a column.
Inherits from Python KeyError
and dt.exceptions.DtException
.
datatable.exceptions.MemoryError¶
This exception is raised whenever any operation fails to allocate the required amount of memory.
Inherits from Python MemoryError
and dt.exceptions.DtException
.
datatable.exceptions.NotImplementedError¶
Raised whenever an operation with given parameter values or input types is in theory valid, but hasn’t been implemented yet.
Inherits from Python NotImplementedError
and dt.exceptions.DtException
.
datatable.exceptions.OverflowError¶
Rare error that may occur if you pass a parameter that is too large to fit
into C++ int64
type, or sometimes larger than a double
.
Inherits from Python OverflowError
and dt.exceptions.DtException
.
datatable.exceptions.TypeError¶
One of the most common exceptions raised by datatable
, this occurs
when either a function receives an argument of unexpected type, or incorrect
number of arguments, or whenever an operation is requested on a column whose
type is not suitable for that operation.
Inherits from Python TypeError
and dt.exceptions.DtException
.
datatable.exceptions.ValueError¶
Very common exception that occurs whenever an argument is passed to a function and that argument has the correct type, yet the value is not valid.
Inherits from Python ValueError
and dt.exceptions.DtException
.
datatable.internal¶
Warning
The functions in this sub-module are considered to be “internal” and not useful for day-to-day work with the datatable module.
frame_column_data_r() | C pointer to column’s data |
frame_columns_virtual() | Indicators of which columns in the frame are virtual. |
frame_integrity_check() | Run checks on whether the frame’s state is corrupted. |
get_thread_ids() | Get ids of threads spawned by datatable. |
datatable.internal.frame_column_data_r()¶
datatable.internal.frame_columns_virtual()¶
datatable.internal.frame_integrity_check()¶
This function performs a range of tests on the frame
to verify
that its internal state is consistent. It returns None on success,
or throws an AssertionError
if any problems were found.
datatable.internal.get_thread_ids()¶
Return system ids of all threads used internally by datatable.
Calling this function will cause the threads to spawn if they haven’t done already. (This behavior may change in the future).
Returns¶
List[str]
The list of thread ids used by datatable. The first element in the list is the id of the main thread.
See Also¶
dt.options.nthreads
– global option that controls the number of threads in use.
datatable.math¶
Trigonometric functions¶
sin() | Compute \(\sin x\) (the trigonometric sine of x). |
cos() | Compute \(\cos x\) (the trigonometric cosine of x). |
tan() | Compute \(\tan x\) (the trigonometric tangent of x). |
arcsin() | Compute \(\sin^{-1} x\) (the inverse sine of x). |
arccos() | Compute \(\cos^{-1} x\) (the inverse cosine of x). |
arctan() | Compute \(\tan^{-1} x\) (the inverse tangent of x). |
atan2() | Compute \(\tan^{-1} (x/y)\). |
hypot() | Compute \(\sqrt{x^2 + y^2}\). |
deg2rad() | Convert an angle measured in degrees into radians. |
rad2deg() | Convert an angle measured in radians into degrees. |
Hyperbolic functions¶
sinh() | Compute \(\sinh x\) (the hyperbolic sine of x). |
cosh() | Compute \(\cosh x\) (the hyperbolic cosine of x). |
tanh() | Compute \(\tanh x\) (the hyperbolic tangent of x). |
arsinh() | Compute \(\sinh^{-1} x\) (the inverse hyperbolic sine of x). |
arcosh() | Compute \(\cosh^{-1} x\) (the inverse hyperbolic cosine of x). |
artanh() | Compute \(\tanh^{-1} x\) (the inverse hyperbolic tangent of x). |
Exponential/logarithmic functions¶
exp() | Compute \(e^x\) (the exponent of x). |
exp2() | Compute \(2^x\). |
expm1() | Compute \(e^x - 1\). |
log() | Compute \(\ln x\) (the natural logarithm of x). |
log10() | Compute \(\log_{10} x\) (the decimal logarithm of x). |
log1p() | Compute \(\ln(1 + x)\). |
log2() | Compute \(\log_{2} x\) (the binary logarithm of x). |
logaddexp() | Compute \(\ln(e^x + e^y)\). |
logaddexp2() | Compute \(\log_2(2^x + 2^y)\). |
cbrt() | Compute \(\sqrt[3]{x}\) (the cubic root of x). |
pow() | Compute \(x^a\). |
sqrt() | Compute \(\sqrt{x}\) (the square root of x). |
square() | Compute \(x^2\) (the square of x). |
Special mathematical functions¶
erf() | Error function erf(x). |
erfc() | Complementary error function erfc(x) = 1 - erf(x). |
gamma() | Euler Gamma function of x. |
Floating-point functions¶
abs() | Absolute value of x. |
ceil() | The smallest integer not less than x. |
copysign() | Number with the magnitude of x and the sign of y. |
fabs() | The absolute value of x, returned as float. |
floor() | The largest integer not greater than x. |
fmod() | Remainder of a floating-point division. |
isclose() | Check whether x and y are close within the requested tolerance. |
isfinite() | Check if x is finite. |
isinf() | Check if x is +/- infinity. |
isna() | Check if x is a missing value. |
ldexp() | Compute \(x\cdot 2^y\). |
round() | Round x to a requested number of precision digits. |
sign() | The sign of x, returned as float. |
signbit() | The sign of x, returned as boolean. |
trunc() | The value of x truncated towards zero. |
Mathematical constants¶
e | Euler’s number \(e\). |
golden | The golden ratio \(\varphi\). |
inf | Positive infinity. |
nan | Not-a-number. |
pi | Mathematical constant \(\pi\). |
tau | Mathematical constant \(\tau\). |
Comparison table¶
The set of functions provided by the dt.math module is very similar to the standard Python math module, or to numpy math functions. The available functions fall into the categories listed above: trigonometric/hyperbolic functions, exponential/logarithmic/power functions, special mathematical functions, floating-point functions, miscellaneous functions, and mathematical constants.
datatable.math.abs()¶
Return the absolute value of x
. This function can only be applied
to numeric arguments (i.e. boolean, integer, or real).
This function upcasts columns of types bool8
, int8
and int16
into
int32
; for columns of other types the stype is kept.
Parameters¶
FExpr
Column expression producing one or more numeric columns.
Returns¶
FExpr
The resulting FExpr evaluates absolute values in all elements
in all columns of x
.
Examples¶
DT = dt.Frame(A=[-3, 2, 4, -17, 0])
DT[:, abs(f.A)]
A | ||
---|---|---|
int32 | ||
0 | 3 | |
1 | 2 | |
2 | 4 | |
3 | 17 | |
4 | 0 |
datatable.math.arccos()¶
Inverse trigonometric cosine of x
.
In mathematics, this may be written as \(\arccos x\) or \(\cos^{-1}x\).
The returned value is in the interval \([0, \frac12\tau]\),
and NA for the values of x
that lie outside the interval
[-1, 1]
. This function is the inverse of
cos()
in the sense that
cos(arccos(x)) == x
for all x
in the interval [-1, 1]
.
datatable.math.arcosh()¶
datatable.math.arcsin()¶
Inverse trigonometric sine of x
.
In mathematics, this may be written as \(\arcsin x\) or \(\sin^{-1}x\).
The returned value is in the interval \([-\frac14 \tau, \frac14\tau]\),
and NA for the values of x
that lie outside the interval [-1, 1]
.
This function is the inverse of sin()
in the sense
that sin(arcsin(x)) == x
for all x
in the interval [-1, 1]
.
datatable.math.arctan()¶
Inverse trigonometric tangent of x
.
This function satisfies the property that tan(arctan(x)) == x
.
See also¶
atan2(x, y)
– two-argument inverse tangent function;tan(x)
– the trigonometric tangent function.
datatable.math.arsinh()¶
datatable.math.artanh()¶
datatable.math.atan2()¶
The inverse trigonometric tangent of y/x
, taking into account the signs
of x
and y
to produce the correct result.
If (x,y)
is a point in a Cartesian plane, then arctan2(y, x)
returns
the radian measure of an angle formed by two rays: one starting at the origin
and passing through point (1,0)
, and the other starting at the origin
and passing through point (x,y)
. The angle is assumed positive if the
rotation from the first ray to the second occurs counter-clockwise, and
negative otherwise.
As a special case, arctan2(0, 0) == 0
, and arctan2(0, -1) == tau/2
.
datatable.math.cbrt()¶
Cubic root of x.
datatable.math.ceil()¶
The smallest integer value not less than x
, returned as float.
This function produces a float32
column if the input is of type
float32
, or float64
columns for inputs of all other numeric
stypes.
datatable.math.copysign()¶
datatable.math.cos()¶
Compute the trigonometric cosine of angle x
measured in radians.
This function can only be applied to numeric columns (real, integer, or
boolean), and produces a float64 result, except when the argument x
is
float32, in which case the result is float32 as well.
datatable.math.cosh()¶
datatable.math.deg2rad()¶
Convert angle measured in degrees into radians: \(\operatorname{deg2rad}(x) = x\cdot\frac{\tau}{360}\).
See also¶
rad2deg(x)
– convert radians into degrees.
datatable.math.e¶
datatable.math.erf()¶
datatable.math.erfc()¶
Complementary error function erfc(x) = 1 - erf(x)
.
The complementary error function is defined as the integral \(\operatorname{erfc}(x) = \frac{2}{\sqrt{\pi}} \int_x^{\infty} e^{-t^2}\, dt\).
Although mathematically erfc(x) = 1-erf(x)
, in practice the RHS
suffers catastrophic loss of precision at large values of x
. This
function, however, does not have such a drawback.
datatable.math.exp()¶
datatable.math.exp2()¶
datatable.math.expm1()¶
datatable.math.floor()¶
The largest integer value not greater than x
, returned as float.
This function produces a float32
column if the input is of type
float32
, or float64
columns for inputs of all other numeric
stypes.
datatable.math.fmod()¶
Floating-point remainder of the division x/y. The result is always a float, even if the arguments are integers. This function uses std::fmod() from the standard C++ library; its convention for handling negative numbers may differ from Python’s.
datatable.math.gamma()¶
Euler Gamma function of x.
The gamma function is defined for all x except for the negative integers. For positive x it can be computed via the integral
\(\Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t}\, dt\)
For negative x it can be computed as
\(\Gamma(x) = \frac{\Gamma(x + k)}{x (x+1) \cdots (x+k-1)}\)
where \(k\) is any integer such that \(x+k\) is positive.
If x
is a positive integer, then \(\Gamma(x) = (x - 1)!\).
datatable.math.golden¶
The golden ratio \(\varphi = (1 + \sqrt{5})/2\), also known as golden section. This is a number such that if \(a = \varphi b\), for some non-zero \(a\) and \(b\), then it must also be true that \(a + b = \varphi a\).
The constant is stored with float64
precision, and its value is
1.618033988749895
.
datatable.math.hypot()¶
datatable.math.inf¶
Number representing positive infinity \(\infty\). Write -inf
for
negative infinity.
datatable.math.isclose()¶
Compare two numbers x and y, and return True if they are close within the requested relative/absolute tolerance. This function only returns True/False, never NA.
More specifically, isclose(x, y) is True if either of the following are true:
x == y
(including the case when x and y are NAs),abs(x - y) <= atol + rtol * abs(y)
and neither x nor y are NA
The tolerance parameters rtol
, atol
must be positive floats,
and cannot be expressions.
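For example (a sketch):
from datatable import dt, f
DT = dt.Frame(a=[1.0, 2.0, None], b=[1.0 + 1e-9, 2.1, None])
DT[:, dt.math.isclose(f.a, f.b)]   # -> True, False, True (NA == NA counts as close)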
datatable.math.isfinite()¶
datatable.math.isinf()¶
Returns True if the argument is +/- infinity, and False otherwise.
Note that isinf(NA) == False
.
datatable.math.isna()¶
Returns True
if the argument is NA, and False
otherwise.
datatable.math.ldexp()¶
Multiply x by 2 raised to the power y, i.e. compute x * 2**y
.
Column x is expected to be float, and y integer.
datatable.math.log()¶
datatable.math.log10()¶
Decimal (base-10) logarithm of x, which is \(\lg(x)\) or
\(\log_{10} x\). This function is the inverse of
pow(10, x)
.
datatable.math.log1p()¶
datatable.math.log2()¶
datatable.math.logaddexp()¶
The logarithm of the sum of exponents of x and y. This function is
equivalent to log(exp(x) + exp(y))
, but does not suffer from
catastrophic precision loss for small values of x and y.
datatable.math.logaddexp2()¶
Binary logarithm of the sum of binary exponents of x and y. This
function is equivalent to log2(exp2(x) + exp2(y))
, but does
not suffer from catastrophic precision loss for small values of
x and y.
datatable.math.nan¶
Not-a-number, a special floating-point constant that denotes a missing
number. In most datatable functions you can use None
instead
of nan
.
datatable.math.pi¶
Mathematical constant \(\pi = \frac12\tau\), also known as Archimedes’ constant, equal to the length of a semicircle with radius 1, or equivalently the arc-length of a \(180^\circ\) angle [1].
The constant is stored at float64
precision, and its value is
3.141592653589793
.
datatable.math.pow()¶
Number x raised to the power y. The return value will be float, even if the arguments x and y are integers.
This function is equivalent to x ** y
.
datatable.math.rad2deg()¶
Convert angle measured in radians into degrees: \(\operatorname{rad2deg}(x) = x\cdot\frac{360}{\tau}\).
See also¶
deg2rad(x)
– convert degrees into radians.
datatable.math.round()¶
Round the values in cols
up to the specified number of the digits
of precision ndigits
. If the number of digits is omitted, rounds
to the nearest integer.
Generally, this operation is equivalent to:
rint(col * 10**ndigits) / 10**ndigits
where function rint()
rounds to the nearest integer.
Parameters¶
FExpr
Input data for rounding. This could be an expression yielding
either a single or multiple columns. The round()
function will
apply to each column independently and produce as many columns
in the output as there were in the input.
Only numeric columns are allowed: boolean, integer or float.
An exception will be raised if cols
contains a non-numeric
column.
int | None
The number of precision digits to retain. This parameter could be either positive or negative (or None). If positive then it gives the number of digits after the decimal point. If negative, then it rounds the result up to the corresponding power of 10.
For example, 123.45
rounded to ndigits=1
is 123.4
, whereas
rounded to ndigits=-1
it becomes 120.0
.
Returns¶
FExpr
f-expression that rounds the values in its first argument to the specified number of precision digits.
Each input column will produce the column of the same stype in
the output; except for the case when ndigits
is None
and
the input is either float32
or float64
, in which case an
int64
column is produced (similarly to python’s round()
).
Notes¶
Values that are exactly half way in between their rounded neighbors
are converted towards their nearest even value. For example, both
7.5
and 8.5
are rounded into 8
, whereas 6.5
is rounded as 6
.
Rounding integer columns may produce unexpected results for values
that are close to the min/max value of that column’s storage type.
For example, when an int8
value 127
is rounded to nearest 10
, it
becomes 130
. However, since 130
cannot be represented as int8
a
wrap-around occurs and the result becomes -126
.
Rounding an integer column to a positive ndigits
is a noop: the
column will be returned unchanged.
Rounding an integer column to a large negative ndigits
will produce
a constant all-0 column.
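For example (a sketch):
from datatable import dt, f
DT = dt.Frame(A=[7.5, 8.5, 6.5, 123.45])
DT[:, dt.math.round(f.A)]              # -> 8, 8, 6, 123 (half-to-even, int64)
DT[:, dt.math.round(f.A, ndigits=1)]   # -> 7.5, 8.5, 6.5, 123.4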
datatable.math.sign()¶
datatable.math.signbit()¶
datatable.math.sin()¶
Compute the trigonometric sine of angle x measured in radians.
This function can only be applied to numeric columns (real, integer, or boolean), and produces a float64 result, except when the argument x is float32, in which case the result is float32 as well.
datatable.math.sinh()¶
datatable.math.sqrt()¶
The square root of x, same as x ** 0.5.
datatable.math.square()¶
The square of x, same as x ** 2.0. As with all other math functions, the result is floating-point, even if the argument x is integer.
datatable.math.tan()¶
Compute the trigonometric tangent of x, which is the ratio sin(x)/cos(x).
This function can only be applied to numeric columns (real, integer, or boolean), and produces a float64 result, except when the argument x is float32, in which case the result is float32 as well.
datatable.math.tanh()¶
datatable.math.tau¶
Mathematical constant \(\tau\), also known as a turn, equal to the circumference of a circle with a unit radius.
The constant is stored at float64 precision, and its value is 6.283185307179586.
datatable.math.trunc()¶
The nearest integer value not greater than x in magnitude.
If x is integer or boolean, then trunc() will return this value converted to float64. If x is floating-point, then trunc(x) acts as floor(x) for positive values of x, and as ceil(x) for negative values of x. This rounding mode is known as rounding towards zero.
datatable.models¶
Classes¶
Ftrl – FTRL-Proximal online learning model.
LinearModel – Linear model with stochastic gradient descent learning.
Functions¶
aggregate() – Aggregate a frame.
kfold() – Perform k-fold split.
kfold_random() – Perform randomized k-fold split.
datatable.models.Ftrl¶
This class implements the Follow the Regularized Leader (FTRL) model, which is based on the FTRL-Proximal online learning algorithm for binomial logistic regression. Multinomial classification and regression for continuous targets are also implemented, though these implementations are experimental. This model is fully parallel and is based on the Hogwild approach for parallelization.
The model supports numerical (boolean, integer and float types), temporal (date and time types) and string features. To vectorize features a hashing trick is employed, such that all the values are hashed with a 64-bit hashing function. This function is implemented as follows:
- for booleans and integers the hashing function is essentially an identity function;
- for floats the hashing function trims the mantissa, taking into account mantissa_nbits, and interprets the resulting bit representation as a 64-bit unsigned integer;
- for date and time types the hashing function is essentially an identity function based on their internal integer representations;
- for strings the 64-bit Murmur2 hashing function is used.
To compute the final hash x, the Murmur2-hashed feature name is added to the hashed feature value, and the result is taken modulo the number of requested bins, i.e. nbins.
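For the float case, a conceptual sketch of the mantissa trimming (a minimal Python illustration, not the actual C++ implementation):
import struct

def hash_float(value, mantissa_nbits):
    # Reinterpret the float64 bits as a 64-bit unsigned integer,
    # then drop the lowest (52 - mantissa_nbits) mantissa bits.
    # mantissa_nbits must be between 0 and 52.
    bits = struct.unpack('<Q', struct.pack('<d', value))[0]
    return bits >> (52 - mantissa_nbits)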
For each hashed row of data, the per-coordinate FTRL-Proximal algorithm described in Ad Click Prediction: a View from the Trenches is employed.
When trained, the model can be used to make predictions, or it can be re-trained on new datasets as many times as needed, improving model weights from run to run.
Methods¶
fit() – Train model on the input samples and targets.
predict() – Predict for the input samples.
reset() – Reset the model.
Properties¶
alpha – \(\alpha\) in per-coordinate FTRL-Proximal algorithm.
beta – \(\beta\) in per-coordinate FTRL-Proximal algorithm.
colnames – Column names of the training frame, i.e. features.
colname_hashes – Hashes of the column names.
double_precision – An option to control precision of the internal computations.
feature_importances – Feature importances calculated during training.
interactions – Feature interactions.
labels – Classification labels.
lambda1 – L1 regularization parameter, \(\lambda_1\) in per-coordinate FTRL-Proximal algorithm.
lambda2 – L2 regularization parameter, \(\lambda_2\) in per-coordinate FTRL-Proximal algorithm.
mantissa_nbits – Number of mantissa bits for hashing floats.
model – The model’s weights (the z and n coefficients).
model_type – The model type to be built.
model_type_trained – The model type that was built.
nbins – Number of bins for the hashing trick.
negative_class – An option to indicate if the “negative” class should be created for multinomial classification.
nepochs – Number of training epochs.
params – All the input model parameters as a named tuple.
datatable.models.Ftrl.__init__()¶
Create a new Ftrl object.
float
\(\alpha\) in per-coordinate FTRL-Proximal algorithm, should be positive.
float
\(\beta\) in per-coordinate FTRL-Proximal algorithm, should be non-negative.
float
L1 regularization parameter, \(\lambda_1\) in per-coordinate FTRL-Proximal algorithm. It should be non-negative.
float
L2 regularization parameter, \(\lambda_2\) in per-coordinate FTRL-Proximal algorithm. It should be non-negative.
int
Number of bins to be used for the hashing trick, should be positive.
int
Number of mantissa bits to take into account when hashing floats. It should be non-negative and less than or equal to 52, which is the number of mantissa bits allocated for a C++ 64-bit double.
float
Number of training epochs, should be non-negative. When nepochs is an integer number, the model will train on all the data provided to the .fit() method nepochs times. If nepochs has a fractional part {nepochs}, the model will train on all the data [nepochs] times, i.e. the integer part of nepochs, and will also perform an additional training iteration on the {nepochs} fraction of data.
bool
An option to indicate whether double precision (float64) or single precision (float32) arithmetic should be used for computations. It is not guaranteed that setting double_precision to True will automatically improve the model accuracy. It will, however, roughly double the memory footprint of the Ftrl object.
bool
An option to indicate if a “negative” class should be created in the case of multinomial classification. For the “negative” class the model will train on all the negatives, and if a new label is encountered in the target column, its weights will be initialized to the current “negative” class weights. If negative_class is set to False, the initial weights become zeros.
List[List[str] | Tuple[str]] | Tuple[List[str] | Tuple[str]]
A list or a tuple of interactions. In turn, each interaction should be a list or a tuple of feature names, where each feature name is a column name from the training frame. Each interaction should have at least one feature.
"binomial" | "multinomial" | "regression" | "auto"
The model type to be built. When this option is "auto", the model type will be automatically chosen based on the target column stype.
FtrlParams
Named tuple of the above parameters. One can pass either this tuple, or any combination of the individual parameters to the constructor, but not both at the same time.
ValueError
The exception is raised if both the params argument and one of the individual model parameters are passed at the same time.
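A minimal usage sketch (the frames and parameter values below are invented for illustration):
from datatable import dt
from datatable.models import Ftrl

X = dt.Frame(x1=[0.1, 0.9, 0.3, 0.7], x2=[1, 0, 1, 0])
y = dt.Frame(target=[0, 1, 0, 1])

model = Ftrl(alpha=0.01, nbins=10**6, nepochs=2)
res = model.fit(X, y)        # FtrlFitOutput with the final epoch and loss
preds = model.predict(X)     # frame of predicted probabilities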
datatable.models.Ftrl.alpha¶
\(\alpha\) in per-coordinate FTRL-Proximal algorithm.
float
Current alpha value.
float
New alpha value, should be positive.
ValueError
The exception is raised when new_alpha is not positive.
datatable.models.Ftrl.beta¶
\(\beta\) in per-coordinate FTRL-Proximal algorithm.
float
Current beta value.
float
New beta value, should be non-negative.
ValueError
The exception is raised when new_beta is negative.
datatable.models.Ftrl.colnames¶
Column names of the training frame, i.e. the feature names.
List[str]
A list of the column names.
.colname_hashes – the hashed column names.
datatable.models.Ftrl.colname_hashes¶
datatable.models.Ftrl.double_precision¶
An option to indicate whether double precision (float64) or single precision (float32) arithmetic should be used for computations. This option is read-only and can only be set during the Ftrl object construction.
bool
Current double_precision value.
datatable.models.Ftrl.feature_importances¶
Feature importances as calculated during the model training and normalized to [0; 1]. The normalization is done by dividing the accumulated feature importances by the maximum value.
Frame
A frame with two columns: feature_name that has stype str32, and feature_importance that has stype float32 or float64 depending on whether the .double_precision option is False or True.
datatable.models.Ftrl.fit()¶
Train model on the input samples and targets.
Frame
Training frame.
Frame
Target frame having as many rows as X_train and one column.
Frame
Validation frame having the same number of columns as X_train.
Frame
Validation target frame of shape (nrows, 1).
float
Parameter that specifies how often, in epoch units, validation error should be checked.
float
The improvement of the relative validation error that should be demonstrated by the model within nepochs_validation epochs, otherwise the training will stop.
int
Number of iterations that is used to average the validation error. Each iteration corresponds to nepochs_validation epochs.
FtrlFitOutput
FtrlFitOutput is a Tuple[float, float] with two fields: epoch and loss, representing the final fitting epoch and the final loss, respectively. If a validation dataset is not provided, the returned epoch equals nepochs and the loss is just float('nan').
.predict() – predict for the input samples.
.reset() – reset the model.
datatable.models.Ftrl.interactions¶
The feature interactions to be used for model training. This option is read-only for a trained model.
Tuple
Current interactions value.
List[List[str] | Tuple[str]] | Tuple[List[str] | Tuple[str]]
New interactions value. Each particular interaction should be a list or a tuple of feature names, where each feature name is a column name from the training frame.
ValueError
The exception is raised when
- trying to change this option for a model that has already been trained;
- one of the interactions has zero features.
datatable.models.Ftrl.labels¶
Classification labels the model was trained on.
Frame
A one-column frame with the classification labels. In the case of numeric regression, the label is the target column name.
datatable.models.Ftrl.lambda1¶
L1 regularization parameter, \(\lambda_1\) in per-coordinate FTRL-Proximal algorithm.
float
Current lambda1 value.
float
New lambda1 value, should be non-negative.
ValueError
The exception is raised when new_lambda1 is negative.
datatable.models.Ftrl.lambda2¶
L2 regularization parameter, \(\lambda_2\) in per-coordinate FTRL-Proximal algorithm.
float
Current lambda2 value.
float
New lambda2 value, should be non-negative.
ValueError
The exception is raised when new_lambda2 is negative.
datatable.models.Ftrl.model¶
Trained model weights, i.e. the z and n coefficients in per-coordinate FTRL-Proximal algorithm.
datatable.models.Ftrl.model_type¶
A type of the model Ftrl should build:
- "binomial" for binomial classification;
- "multinomial" for multinomial classification;
- "regression" for numeric regression;
- "auto" for automatic model type detection based on the target column stype.
This option is read-only for a trained model.
str
Current model_type value.
"binomial" | "multinomial" | "regression" | "auto"
New model_type value.
ValueError
The exception is raised when
- trying to change this option for a model that has already been trained;
- new_model_type value is not one of the following: "binomial", "multinomial", "regression" or "auto".
.model_type_trained – the model type Ftrl has built.
datatable.models.Ftrl.model_type_trained¶
The model type Ftrl has built.
str
Could be one of the following: "regression", "binomial", "multinomial", or "none" for an untrained model.
.model_type – the model type Ftrl should build.
datatable.models.Ftrl.mantissa_nbits¶
Number of mantissa bits to take into account for hashing floats. This option is read-only for a trained model.
int
Current mantissa_nbits value.
int
New mantissa_nbits value, should be non-negative and less than or equal to 52, which is the number of mantissa bits in a C++ 64-bit double.
ValueError
The exception is raised when
- trying to change this option for a model that has already been trained;
- new_mantissa_nbits value is negative or larger than 52.
datatable.models.Ftrl.nbins¶
Number of bins to be used for the hashing trick. This option is read-only for a trained model.
int
Current nbins value.
int
New nbins value, should be positive.
ValueError
The exception is raised when
- trying to change this option for a model that has already been trained;
- new_nbins value is not positive.
datatable.models.Ftrl.negative_class¶
An option to indicate if a “negative” class should be created in the case of multinomial classification. For the “negative” class the model will train on all the negatives, and if a new label is encountered in the target column, its weights are initialized to the current “negative” class weights. If negative_class is set to False, the initial weights become zeros.
This option is read-only for a trained model.
bool
Current negative_class value.
bool
New negative_class value.
ValueError
The exception is raised when trying to change this option for a model that has already been trained.
datatable.models.Ftrl.nepochs¶
Number of training epochs. When nepochs is an integer number, the model will train on all the data provided to the .fit() method nepochs times. If nepochs has a fractional part {nepochs}, the model will train on all the data [nepochs] times, i.e. the integer part of nepochs, and will also perform an additional training iteration on the {nepochs} fraction of data.
float
Current nepochs value.
float
New nepochs value, should be non-negative.
ValueError
The exception is raised when new_nepochs value is negative.
datatable.models.Ftrl.params¶
Ftrl model parameters as a named tuple FtrlParams, see .__init__() for more details. This option is read-only for a trained model.
FtrlParams
Current params value.
FtrlParams
New params value.
ValueError
The exception is raised when
- trying to change this option for a model that has already been trained;
- individual parameter values are incompatible with the corresponding setters.
datatable.models.Ftrl.predict()¶
Predict for the input samples.
datatable.models.Ftrl.reset()¶
Reset the Ftrl model by resetting all the model weights, labels and feature importance information.
None
.fit() – train model on a dataset.
.predict() – predict on a dataset.
datatable.models.LinearModel¶
This class implements a linear model with stochastic gradient descent learning. It supports linear regression, as well as binomial and multinomial classification. Both .fit() and .predict() methods are fully parallel.
Construction¶
__init__() – Construct a LinearModel object.
Methods¶
fit() – Train model on the input samples and targets.
is_fitted() – Report model status.
predict() – Predict for the input samples.
reset() – Reset the model.
Properties¶
eta0 – Initial learning rate.
eta_decay – Decay for the "time-based" and "step-based" learning rate schedules.
eta_drop_rate – Drop rate for the "step-based" learning rate schedule.
eta_schedule – Learning rate schedule.
double_precision – An option to control precision of the internal computations.
labels – Classification labels.
lambda1 – L1 regularization parameter.
lambda2 – L2 regularization parameter.
model – Model coefficients.
model_type – Model type to be built.
negative_class – An option to indicate if the “negative” class should be created for multinomial classification.
nepochs – Number of training epochs.
params – All the input model parameters as a named tuple.
seed – Seed for the quasi-random data shuffling.
datatable.models.LinearModel.__init__()¶
Create a new LinearModel object.
float
The initial learning rate, should be positive.
float
Decay for the "time-based" and "step-based" learning rate schedules, should be non-negative.
float
Drop rate for the "step-based" learning rate schedule, should be positive.
"constant" | "time-based" | "step-based" | "exponential"
Learning rate schedule. When it is "constant", the learning rate eta is constant and equals eta0. Otherwise, after each training iteration eta is updated as follows:
- for "time-based" schedule as eta0 / (1 + eta_decay * epoch);
- for "step-based" schedule as eta0 * eta_decay ^ floor((1 + epoch) / eta_drop_rate);
- for "exponential" schedule as eta0 / exp(eta_decay * epoch).
By default, the size of the training iteration is one epoch; it becomes nepochs_validation when a validation dataset is specified.
float
L1 regularization parameter, should be non-negative.
float
L2 regularization parameter, should be non-negative.
float
Number of training epochs, should be non-negative. When nepochs is an integer number, the model will train on all the data provided to the .fit() method nepochs times. If nepochs has a fractional part {nepochs}, the model will train on all the data [nepochs] times, i.e. the integer part of nepochs, and will also perform an additional training iteration on the {nepochs} fraction of data.
bool
An option to indicate whether double precision (float64) or single precision (float32) arithmetic should be used for computations. It is not guaranteed that setting double_precision to True will automatically improve the model accuracy. It will, however, roughly double the memory footprint of the LinearModel object.
bool
An option to indicate if a “negative” class should be created in the case of multinomial classification. For the “negative” class the model will train on all the negatives, and if a new label is encountered in the target column, its coefficients will be initialized to the current “negative” class coefficients. If negative_class is set to False, the initial coefficients become zeros.
"binomial" | "multinomial" | "regression" | "auto"
The model type to be built. When this option is "auto", the model type will be automatically chosen based on the target column stype.
int
Seed for the quasi-random number generator that is used for data shuffling when fitting the model, should be non-negative. If seed is zero, no shuffling is performed.
LinearModelParams
Named tuple of the above parameters. One can pass either this tuple, or any combination of the individual parameters to the constructor, but not both at the same time.
ValueError
The exception is raised if both the params argument and one of the individual model parameters are passed at the same time.
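A minimal usage sketch (data and parameter values invented for illustration):
from datatable import dt
from datatable.models import LinearModel

X = dt.Frame(x=[0.1, 0.4, 0.7, 1.0])
y = dt.Frame(target=[0.2, 0.8, 1.4, 2.0])

lm = LinearModel(eta0=0.05, nepochs=10, seed=1)
res = lm.fit(X, y)           # LinearModelFitOutput with the final epoch and loss
assert lm.is_fitted()
preds = lm.predict(X)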
datatable.models.LinearModel.eta0¶
Initial learning rate.
float
Current eta0 value.
float
New eta0 value, should be positive.
ValueError
The exception is raised when new_eta0 is not positive.
datatable.models.LinearModel.eta_decay¶
Decay for the "time-based" and "step-based" learning rate schedules.
float
Current eta_decay value.
float
New eta_decay value, should be non-negative.
ValueError
The exception is raised when new_eta_decay is negative.
datatable.models.LinearModel.eta_drop_rate¶
Drop rate for the "step-based" learning rate schedule.
float
Current eta_drop_rate value.
float
New eta_drop_rate value, should be positive.
ValueError
The exception is raised when new_eta_drop_rate is not positive.
datatable.models.LinearModel.eta_schedule¶
Learning rate schedule:
- "constant" for constant eta;
- "time-based" for time-based schedule;
- "step-based" for step-based schedule;
- "exponential" for exponential schedule.
str
Current eta_schedule value.
"constant" | "time-based" | "step-based" | "exponential"
New eta_schedule value.
datatable.models.LinearModel.double_precision¶
An option to indicate whether double precision (float64) or single precision (float32) arithmetic should be used for computations. This option is read-only and can only be set during the LinearModel object construction.
bool
Current double_precision value.
datatable.models.LinearModel.fit()¶
Train model on the input samples and targets using the parallel stochastic gradient descent method.
Frame
Training frame.
Frame
Target frame having as many rows as X_train and one column.
Frame
Validation frame having the same number of columns as X_train.
Frame
Validation target frame of shape (nrows, 1).
float
Parameter that specifies how often, in epoch units, validation error should be checked.
float
The improvement of the relative validation error that should be demonstrated by the model within nepochs_validation epochs, otherwise the training will stop.
int
Number of iterations that is used to average the validation error. Each iteration corresponds to nepochs_validation epochs.
LinearModelFitOutput
LinearModelFitOutput is a Tuple[float, float] with two fields: epoch and loss, representing the final fitting epoch and the final loss, respectively. If a validation dataset is not provided, the returned epoch equals nepochs and the loss is just float('nan').
.predict() – predict for the input samples.
.reset() – reset the model.
datatable.models.LinearModel.is_fitted()¶
Report model status.
bool
True if the model is trained, False otherwise.
datatable.models.LinearModel.labels¶
Classification labels the model was trained on.
Frame
A one-column frame with the classification labels. In the case of numeric regression, the label is the target column name.
datatable.models.LinearModel.lambda1¶
L1 regularization parameter.
float
Current lambda1 value.
float
New lambda1 value, should be non-negative.
ValueError
The exception is raised when new_lambda1 is negative.
datatable.models.LinearModel.lambda2¶
L2 regularization parameter.
float
Current lambda2 value.
float
New lambda2 value, should be non-negative.
ValueError
The exception is raised when new_lambda2 is negative.
datatable.models.LinearModel.model¶
Trained model coefficients.
Frame
A frame of shape (nfeatures + 1, nlabels), where nlabels is the number of labels the model was trained on, and nfeatures is the number of features. Each column contains model coefficients for the corresponding label: starting from the intercept and followed by the coefficients for each of the nfeatures features.
datatable.models.LinearModel.model_type¶
A type of the model LinearModel should build:
- "binomial" for binomial classification;
- "multinomial" for multinomial classification;
- "regression" for numeric regression;
- "auto" for automatic model type detection based on the target column stype.
This option is read-only for a trained model.
str
Current model_type value.
"binomial" | "multinomial" | "regression" | "auto"
New model_type value.
ValueError
The exception is raised when trying to change this option for a model that has already been trained.
datatable.models.LinearModel.negative_class¶
An option to indicate if a “negative” class should be created in the case of multinomial classification. For the “negative” class the model will train on all the negatives, and if a new label is encountered in the target column, its coefficients are initialized to the current “negative” class coefficients. If negative_class is set to False, the initial coefficients become zeros.
This option is read-only for a trained model.
bool
Current negative_class value.
bool
New negative_class value.
ValueError
The exception is raised when trying to change this option for a model that has already been trained.
datatable.models.LinearModel.nepochs¶
Number of training epochs. When nepochs is an integer number, the model will train on all the data provided to the .fit() method nepochs times. If nepochs has a fractional part {nepochs}, the model will train on all the data [nepochs] times, i.e. the integer part of nepochs, and will also perform an additional training iteration on the {nepochs} fraction of data.
float
Current nepochs value.
float
New nepochs value, should be non-negative.
ValueError
The exception is raised when new_nepochs value is negative.
datatable.models.LinearModel.params¶
LinearModel model parameters as a named tuple LinearModelParams, see .__init__() for more details. This option is read-only for a trained model.
LinearModelParams
Current params value.
LinearModelParams
New params value.
ValueError
The exception is raised when
- trying to change this option for a model that has already been trained;
- individual parameter values are incompatible with the corresponding setters.
datatable.models.LinearModel.predict()¶
Predict for the input samples.
datatable.models.LinearModel.reset()¶
Reset the linear model by resetting all the model coefficients and labels.
None
.fit() – train model on a dataset.
.predict() – predict on a dataset.
datatable.models.LinearModel.seed¶
Seed for the quasi-random number generator that is used for data shuffling when fitting the model. If seed is 0, no shuffling is performed.
datatable.models.aggregate()¶
Aggregate a frame into clusters. Each cluster consists of a set of members, i.e. a subset of the input frame, and is represented by an exemplar, i.e. one of the members.
For one- and two-column frames the aggregation is based on the standard equal-interval binning for numeric columns and grouping operation for string columns.
In the general case, a parallel one-pass ad hoc algorithm is employed. It starts with an empty exemplar list and does one pass through the data. If a particular observation falls into a bubble with a given radius, centered at one of the exemplars, it is marked as a member of that exemplar’s cluster. If no appropriate exemplar is found, the observation is marked as a new exemplar.
If fixed_radius is None, the algorithm will start with delta, i.e. the radius squared, being equal to the machine precision. When the number of gathered exemplars becomes larger than nd_max_bins, the following procedure is performed:
- find the mean distance between all the gathered exemplars;
- merge all the exemplars that are within half of this distance;
- adjust delta by taking into account the initial bubble radius;
- save the exemplars’ merging information for the final processing.
If fixed_radius is set to a valid numeric value, the algorithm will stick to that value and will not adjust delta.
Note: the general n-dimensional algorithm takes into account the numeric columns only, and all the other columns are ignored.
Parameters¶
Frame
The input frame containing numeric or string columns.
int
Number of bins for 1D aggregation.
int
Number of bins for the first column for 2D aggregation.
int
Number of bins for the second column for 2D aggregation.
int
Maximum number of exemplars for ND aggregation. It is guaranteed that the ND algorithm will return less than nd_max_bins exemplars, but the exact number may vary from run to run due to parallelization.
int
Number of columns at which the projection method is used for ND aggregation.
int
Seed to be used for the projection method.
bool
An option to indicate whether double precision (float64) or single precision (float32) arithmetic should be used for computations.
float
Fixed radius for ND aggregation; use it with caution. If set, nd_max_bins will have no effect and in the worst case the number of exemplars may be equal to the number of rows in the data. For big data this may result in extremely large execution times. Since all the columns are normalized to [0, 1), the fixed_radius value should be chosen accordingly.
Tuple[Frame, Frame]
The first element in the tuple is the aggregated frame, i.e. the frame containing exemplars, with the shape of (nexemplars, frame.ncols + 1), where nexemplars is the number of gathered exemplars. The first frame.ncols columns are the columns from the input frame, and the last column is members_count, of stype int32, containing the number of members per exemplar.
The second element in the tuple is the members frame with the shape of (frame.nrows, 1). Each row in this frame corresponds to the row with the same id in the input frame. The single column exemplar_id has an stype of int32 and contains the exemplar ids that a particular member belongs to. These ids are effectively the ids of the exemplars’ rows in the input frame.
TypeError
The exception is raised when one of the frame’s columns has an unsupported stype, i.e. there is a column that is both non-numeric and non-string.
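A short usage sketch (a minimal example with an invented input frame and default parameters):
from datatable import dt
from datatable.models import aggregate

DT = dt.Frame(A=[0.0, 0.1, 0.2, 5.0, 5.1, 9.9])
exemplars, members = aggregate(DT)
# `exemplars` holds the representative rows plus a members_count column;
# `members` maps every input row to its exemplar_id.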
datatable.models.kfold()¶
Perform k-fold split of data with nrows rows into nsplits train/test subsets. The dataset itself is not passed to this function: it is sufficient to know only the number of rows in order to decide how the data should be split.
The range [0; nrows) is split into nsplits approximately equal parts, i.e. folds, and then each i-th split will use the i-th fold as a test part, and all the remaining rows as the train part. Thus, the i-th split is comprised of:
- train rows: [0; i*nrows/nsplits) + [(i+1)*nrows/nsplits; nrows);
- test rows: [i*nrows/nsplits; (i+1)*nrows/nsplits),
where integer division is assumed.
Parameters¶
int
The number of rows in the frame that is going to be split.
int
Number of folds, must be at least 2, but not larger than nrows.
List[Tuple]
This function returns a list of nsplits tuples (train_rows, test_rows), where each component of the tuple is a rows selector that can be applied to any frame with nrows rows to select the desired folds. Some of these row selectors will be simple python ranges, others will be single-column Frame objects.
See Also¶
kfold_random() – Perform randomized k-fold split.
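For example (the exact fold boundaries follow the formulas above):
from datatable.models import kfold

splits = kfold(nrows=10, nsplits=5)
# `splits` is a list of 5 (train_rows, test_rows) tuples; the first
# test fold covers rows [0; 2), with [2; 10) as the train part.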
datatable.models.kfold_random()¶
Perform randomized k-fold split of data with nrows rows into nsplits train/test subsets. The dataset itself is not passed to this function: it is sufficient to know only the number of rows in order to decide how the data should be split.
The train/test subsets produced by this function will have the following properties:
- all test folds will be of approximately the same size nrows/nsplits;
- all observations have an equal ex-ante chance of getting assigned into each fold;
- the row indices in all train and test folds will be sorted.
The function uses a single-pass parallelized algorithm to construct the folds.
Parameters¶
int
The number of rows in the frame that you want to split.
int
Number of folds, must be at least 2, but not larger than nrows.
int
Seed value for the random number generator used by this function. Calling the function several times with the same seed value will produce the same results each time.
datatable.options¶
Repository of datatable configuration options. This namespace contains the following option groups:
debug – Debug options.
display – Display options.
frame – Frame-related options.
fread – fread logging options.
progress – Progress reporting options.
It also contains the following individual options:
nthreads – Number of threads used by datatable for parallel computations.
datatable.options.debug¶
This namespace contains the following debug options:
arg_max_size – The number of characters to use per function/method argument.
enabled – Option that enables logging of the debug information.
logger – The custom logger object.
report_args – Option that enables logging of the function/method arguments.
datatable.options.debug.arg_max_size¶
This option limits the display size of each argument in order to prevent potentially huge outputs. It has no effect if debug.report_args is False.
Parameters¶
int
Current arg_max_size value. Initially, this option is set to 100.
int
New arg_max_size value, should be non-negative. If new_arg_max_size < 10, then arg_max_size will be set to 10.
TypeError
The exception is raised when new_arg_max_size is negative.
datatable.options.debug.enabled¶
This option controls whether or not all the calls to the datatable core functions should be logged.
Parameters¶
bool
Current enabled value. Initially, this option is set to False.
bool
New enabled value. If set to True, all the calls to the datatable core functions will be logged along with their respective timings.
datatable.options.debug.logger¶
The logger object used for reporting calls to datatable core functions. This option has no effect if debug.enabled is False.
Parameters¶
object
Current logger value. Initially, this option is set to None, meaning that the built-in logger should be used.
object
New logger value.
TypeError
The exception is raised when new_logger is not an object having a method .debug(self, msg).
datatable.options.debug.report_args¶
This option controls whether log messages for the function and method calls should contain information about the arguments of those calls.
Parameters¶
bool
Current report_args value. Initially, this option is set to False.
object
New report_args value.
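Putting the debug options together (a minimal sketch):
from datatable import dt

dt.options.debug.enabled = True        # log all core-function calls
dt.options.debug.report_args = True    # include call arguments in the log
dt.options.debug.arg_max_size = 50     # truncate each logged argument to 50 characters
DT = dt.Frame(A=[1, 2, 3])             # this call is now logged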
datatable.options.display¶
This namespace contains the following display options:
allow_unicode – Option that controls if unicode characters are allowed.
head_nrows – The number of top rows to display when the frame view is truncated.
interactive – Option that controls if the interactive view is enabled or not.
max_column_width – The threshold for the column’s width to be truncated.
max_nrows – The threshold for the number of rows in a frame to be truncated.
tail_nrows – The number of bottom rows to display when the frame view is truncated.
use_colors – Option that controls if colors should be used in the console.
datatable.options.display.allow_unicode¶
This option controls whether or not unicode characters are allowed in the datatable output.
Parameters¶
bool
Current allow_unicode value. Initially, this option is set to True.
bool
New allow_unicode value. If True, datatable will allow unicode characters (encoded as UTF-8) to be printed into the output. If False, then unicode characters will either be avoided, or hex-escaped as necessary.
datatable.options.display.head_nrows¶
This option controls the number of rows from the top of a frame to be displayed when the frame’s output is truncated due to the total number of rows exceeding the display.max_nrows value.
Parameters¶
int
Current head_nrows value. Initially, this option is set to 15.
int
New head_nrows value, should be non-negative.
ValueError
The exception is raised when new_head_nrows is negative.
datatable.options.display.interactive¶
Warning: this option is currently not working properly (#2669).
This option controls the behavior of a Frame when it is viewed in a text console. To enter the interactive mode manually, one can still call the Frame.view() method.
Parameters¶
bool
Current interactive value. Initially, this option is set to False.
bool
New interactive value. If True, frames will be shown in the interactive mode, allowing you to navigate the rows/columns with the keyboard. If False, frames will be shown in regular, non-interactive mode.
datatable.options.display.max_column_width¶
This option controls the threshold for the column’s width to be truncated. If a column’s name or its values exceed the max_column_width, the content of the column is truncated to max_column_width characters when printed.
This option applies to both the rendering of a frame in a terminal, and the rendering in a Jupyter notebook.
Parameters¶
int
Current max_column_width value. Initially, this option is set to 100.
int
New max_column_width value, cannot be less than 2. If new_max_column_width is None, the column’s content will never be truncated.
ValueError
The exception is raised when new_max_column_width is less than 2.
datatable.options.display.max_nrows¶
This option controls the threshold for the number of rows in a frame to be truncated when printed to the console.
If a frame has more rows than max_nrows, it will be displayed truncated: only its first head_nrows and last tail_nrows rows will be printed. Otherwise, no truncation will occur. It is recommended to have head_nrows + tail_nrows <= max_nrows.
Parameters¶
int
Current max_nrows value. Initially, this option is set to 30.
int
New max_nrows value. If this option is set to None or to a negative value, no frame truncation will occur when printed, which may cause the console to become unresponsive for frames with a large number of rows.
datatable.options.display.tail_nrows¶
This option controls the number of rows from the bottom of a frame to be displayed when the frame’s output is truncated due to the total number of rows exceeding the max_nrows value.
Parameters¶
int
Current tail_nrows value. Initially, this option is set to 5.
int
New tail_nrows value, should be non-negative.
ValueError
The exception is raised when new_tail_nrows is negative.
datatable.options.display.use_colors¶
This option controls whether or not to use colors when printing datatable messages into the console. Turn this off if your terminal is unable to display ANSI escape sequences, or if the colors make output not legible.
Parameters¶
bool
Current use_colors value. Initially, this option is set to True.
bool
New use_colors value.
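For example, a quick sketch of tuning the console display:
from datatable import dt

dt.options.display.max_nrows = 20      # truncate frames longer than 20 rows...
dt.options.display.head_nrows = 10     # ...showing the first 10 rows
dt.options.display.tail_nrows = 5      # ...and the last 5 rows
dt.options.display.use_colors = False  # plain output, no ANSI colors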
datatable.options.frame¶
This namespace contains the following Frame options:
names_auto_index – Initial value of the default column name index.
names_auto_prefix – Default column name prefix.
datatable.options.frame.names_auto_index¶
This option controls the starting index that is used for auto-naming the columns. By default, the names that datatable assigns to frame’s columns are C0, C1, C2, etc. Setting names_auto_index, for instance, to 1 will cause the columns to be named C1, C2, C3, etc.
Parameters¶
int
Current names_auto_index value. Initially, this option is set to 0.
int
New names_auto_index value.
See Also¶
Name mangling – tutorial on name mangling.
datatable.options.frame.names_auto_prefix¶
This option controls the prefix that is used for auto-naming the columns. By default, the names that datatable assigns to frame’s columns are C0, C1, C2, etc. Setting names_auto_prefix, for instance, to Z will cause the columns to be named Z1, Z2, Z3, etc.
Parameters¶
str
Current names_auto_prefix value. Initially, this option is set to C.
str
New names_auto_prefix value.
See Also¶
Name mangling – tutorial on name mangling.
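A small sketch of these options in action (per the descriptions above):
from datatable import dt

dt.options.frame.names_auto_prefix = "Z"
dt.options.frame.names_auto_index = 1
DT = dt.Frame([[1], [2], [3]])
# the three columns are now auto-named with prefix "Z" starting at index 1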
datatable.options.fread¶
datatable.options.fread.log¶
This property controls the following logging options:
anonymize – Option that controls log anonymization.
escape_unicode – Option that controls escaping of unicode characters.
datatable.options.fread.log.anonymize¶
This option controls logs anonymization that is useful in production systems, when reading sensitive data that must not accidentally leak into log files or be printed with the error messages.
bool
Current anonymize value. Initially, this option is set to False.
bool
New anonymize value. If True, any snippets of data being read that are printed in the log will first be anonymized by converting all non-zero digits to 1, all lowercase letters to a, all uppercase letters to A, and all unicode characters to U. If False, no data anonymization will be performed.
datatable.options.fread.log.escape_unicode¶
This option controls escaping of the unicode characters.
Use this option if your terminal cannot print unicode, or if the output gets somehow corrupted because of the unicode characters.
bool
Current escape_unicode value. Initially, this option is set to False.
bool
If True, all unicode characters in the verbose log will be written in hexadecimal notation. If False, no escaping of the unicode characters will be performed.
datatable.options.progress¶
This namespace contains the following progress reporting options:
allow_interruption – Option that controls if the datatable tasks could be interrupted.
callback – A custom progress-reporting function.
clear_on_success – Option that controls if the progress bar is cleared on success.
enabled – Option that controls if the progress reporting is enabled.
min_duration – The minimum duration of a task to show the progress bar.
updates_per_second – The progress bar update frequency.
datatable.options.progress.allow_interruption¶
This option controls if the datatable tasks could be interrupted.
Parameters¶
bool
Current allow_interruption value. Initially, this option is set to True.
bool
New allow_interruption value. If True, datatable will be allowed to handle the SIGINT signal to interrupt long-running tasks. If False, it will not be possible to interrupt tasks with SIGINT.
datatable.options.progress.callback¶
This option controls the custom progress-reporting function.
Parameters¶
function
Current callback value. Initially, this option is set to None.
function
New callback value. If None, then the built-in progress-reporting function will be used. Otherwise, new_callback specifies a function to be called at each progress event. The function should take a single parameter p, which is a namedtuple with the following fields:
- p.progress is a float in the range 0.0 .. 1.0;
- p.status is a string, one of 'running', 'finished', 'error' or 'cancelled';
- p.message is a custom string describing the operation currently being performed.
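A minimal custom callback sketch based on the fields described above:
from datatable import dt

def report(p):
    # p.status, p.progress and p.message come from the namedtuple above
    print(f"[{p.status}] {p.progress:.0%} {p.message}")

dt.options.progress.callback = report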
datatable.options.progress.clear_on_success¶
This option controls if the progress bar is cleared on success.
Parameters¶
bool
Current clear_on_success value. Initially, this option is set to False.
bool
New clear_on_success value. If True, the progress bar is cleared when a job finishes successfully. If False, the progress bar remains visible even after the job has finished.
datatable.options.progress.enabled¶
This option controls if the progress reporting is enabled.
Parameters¶
bool
Current enabled value. Initially, this option is set to True if the stdout is connected to a terminal or a Jupyter Notebook, and False otherwise.
bool
New enabled value. If True, the progress reporting functionality will be turned on. If False, it is turned off.
datatable.options.progress.min_duration¶
This option controls the minimum duration of a task to show the progress bar.
Parameters¶
float
Current min_duration value. Initially, this option is set to 0.5.
float
New min_duration value. The progress bar will not be shown if the duration of an operation is smaller than new_min_duration. If this value is non-zero, then the progress bar will only be shown for long-running operations whose duration (estimated or actual) exceeds this threshold.
datatable.options.progress.updates_per_second¶
This option controls the progress bar update frequency.
Parameters¶
float
Current updates_per_second value. Initially, this option is set to 25.0.
float
New updates_per_second value. This is the number of times per second the display of the progress bar should be updated.
datatable.options.nthreads¶
This option controls the number of threads used by datatable for parallel calculations.
Parameters¶
int
Current nthreads value. Initially, this option is set to the value returned by the C++ call std::thread::hardware_concurrency(), and usually equals the number of available cores.
int
New nthreads value. It can be greater or smaller than the initial setting. For example, setting nthreads = 1 will force the library into a single-threaded mode. Setting nthreads to 0 will restore the initial value equal to the number of processor cores. Setting nthreads to a value less than 0 is equivalent to requesting that many fewer threads than the maximum.
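For example (the comments restate the rules above):
from datatable import dt

dt.options.nthreads = 4     # use exactly 4 threads
dt.options.nthreads = -2    # use two threads fewer than the maximum
dt.options.nthreads = 0     # restore the default (all available cores)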
datatable.re¶
match() – Search for a regular expression within a column.
datatable.str¶
len() – Compute length of a string column.
slice() – Apply a slice to a string column.
split_into_nhot() – Split and nhot-encode a single-column frame.
datatable.str.len()¶
Compute lengths of values in a string column.
datatable.str.slice()¶
Apply slice [start:stop:step] to each value in a column of string type.
Instead of this function you can directly apply a slice expression to the column expression: f.A[1:-1] is equivalent to dt.str.slice(f.A, 1, -1).
Parameters¶
FExpr[str]
The column to which the slice should be applied.
FExpr[str]
A column containing sliced string values from the source column.
Examples¶
DT = dt.Frame(A=["apples", "bananas", "cherries", "dates",
"eggplants", "figs", "grapes", "kiwi"])
DT[:, dt.str.slice(f.A, None, 5)]
   | A
   | str32
-- + -----
 0 | apple
 1 | banan
 2 | cherr
 3 | dates
 4 | eggpl
 5 | figs
 6 | grape
 7 | kiwi
datatable.str.split_into_nhot()¶
Split and nhot-encode a single-column frame.
Each value in the frame, which must have a single string column, is split according to the provided separator sep, the whitespace is trimmed, and the resulting pieces (labels) are converted into the individual columns of the output frame.
Parameters¶
Frame
An input single-column frame. The column stype must be either str32 or str64.
str
Single-character separator to be used for splitting.
bool
An option to control whether the resulting column names, i.e. labels, should be sorted. If set to True, the column names are returned in alphabetical order; otherwise their order is not guaranteed due to the algorithm parallelization.
Frame
The output frame. It will have as many rows as the input frame, and as many boolean columns as there were unique labels found. The labels will also become the output column names.
ValueError | TypeError
dt.exceptions.ValueError – raised if the input frame is missing or it has more than one column; also raised if sep is not a single-character string.
dt.exceptions.TypeError – raised if the single column of frame has a non-string stype.
Examples¶
DT = dt.Frame(["cat,dog", "mouse", "cat,mouse", "dog,rooster", "mouse,dog,cat"])
DT
   | C0
   | str32
-- + -------------
 0 | cat,dog
 1 | mouse
 2 | cat,mouse
 3 | dog,rooster
 4 | mouse,dog,cat
dt.split_into_nhot(DT)
   |   cat    dog  mouse  rooster
   | bool8  bool8  bool8    bool8
-- + -----  -----  -----  -------
 0 |     1      1      0        0
 1 |     0      0      1        0
 2 |     1      0      1        0
 3 |     0      1      0        1
 4 |     1      1      1        0
datatable.time¶
day() – Return day component of a date.
day_of_week() – Compute day of week for the given date.
hour() – Return hour component of a timestamp.
minute() – Return minute component of a timestamp.
month() – Return month component of a date.
nanosecond() – Return nanosecond component of a timestamp.
second() – Return the number of seconds in a timestamp.
year() – Return year component of a date.
ymd() – Create a date32 column from year, month and day components.
ymdt() – Create a time64 column from date and time components.
datatable.time.day()¶
Retrieve the “day” component of a date32 or time64 column.
Parameters¶
FExpr[date32] | FExpr[time64]
A column for which you want to compute the day part.
FExpr[int32]
The day part of the source column.
Examples¶
DT = dt.Frame([1, 1000, 100000], stype='date32')
DT[:, {'date': f[0], 'day': dt.time.day(f[0])}]
   | date          day
   | date32      int32
-- + ----------  -----
 0 | 1970-01-02      2
 1 | 1972-09-27     27
 2 | 2243-10-17     17
datatable.time.day_of_week()¶
For a given date column compute the corresponding days of week.
Days of week are returned as integers from 1 to 7, where 1 represents Monday, and 7 is Sunday. Thus, the return value of this function matches the ISO standard.
Parameters¶
FExpr[date32] | FExpr[time64]
The date32 (or time64) column for which you need to calculate days of week.
FExpr[int32]
An integer column, with values between 1 and 7 inclusive.
Examples¶
DT = dt.Frame([18000, 18600, 18700, 18800, None], stype='date32')
DT[:, {"date": f[0], "day-of-week": dt.time.day_of_week(f[0])}]
   | date        day-of-week
   | date32            int32
-- + ----------  -----------
 0 | 2019-04-14            7
 1 | 2020-12-04            5
 2 | 2021-03-14            7
 3 | 2021-06-22            2
 4 | NA                   NA
datatable.time.hour()¶
Retrieve the “hour” component of a time64 column. The returned value will always be in the range [0; 23].
Parameters¶
FExpr[time64]
A column for which you want to compute the hour part.
FExpr[int32]
The hour part of the source column.
Examples¶
from datetime import datetime as d
DT = dt.Frame([d(2020, 5, 11, 12, 0, 0), d(2021, 6, 14, 16, 10, 59, 394873)])
DT[:, {'time': f[0], 'hour': dt.time.hour(f[0])}]
   | time                         hour
   | time64                      int32
-- + --------------------------  -----
 0 | 2020-05-11T12:00:00            12
 1 | 2021-06-14T16:10:59.394873     16
See Also¶
minute() – retrieve the “minute” component of a timestamp
second() – retrieve the “second” component of a timestamp
nanosecond() – retrieve the “nanosecond” component of a timestamp
datatable.time.minute()¶
Retrieve the “minute” component of a time64 column. The produced column will have values in the range [0; 59].
Parameters¶
FExpr[time64]
A column for which you want to compute the minute part.
FExpr[int32]
The minute part of the source column.
Examples¶
from datetime import datetime as d
DT = dt.Frame([d(2020, 5, 11, 12, 0, 0), d(2021, 6, 14, 16, 10, 59, 394873)])
DT[:, {'time': f[0], 'minute': dt.time.minute(f[0])}]
   | time                        minute
   | time64                       int32
-- + --------------------------  ------
 0 | 2020-05-11T12:00:00              0
 1 | 2021-06-14T16:10:59.394873      10
See Also¶
hour() – retrieve the “hour” component of a timestamp
second() – retrieve the “second” component of a timestamp
nanosecond() – retrieve the “nanosecond” component of a timestamp
datatable.time.month()¶
Retrieve the “month” component of a date32 or time64 column.
Parameters¶
FExpr[date32] | FExpr[time64]
A column for which you want to compute the month part.
FExpr[int32]
The month part of the source column.
Examples¶
DT = dt.Frame([1, 1000, 100000], stype='date32')
DT[:, {'date': f[0], 'month': dt.time.month(f[0])}]
   | date        month
   | date32      int32
-- + ----------  -----
 0 | 1970-01-02      1
 1 | 1972-09-27      9
 2 | 2243-10-17     10
datatable.time.nanosecond()¶
Retrieve the “nanosecond” component of a time64 column. The produced column will have values in the range [0; 999999999].
Parameters¶
FExpr[time64]
A column for which you want to compute the nanosecond part.
FExpr[int32]
The “nanosecond” part of the source column.
Examples¶
from datetime import datetime as d
DT = dt.Frame([d(2020, 5, 11, 12, 0, 0), d(2021, 6, 14, 16, 10, 59, 394873)])
DT[:, {'time': f[0], 'ns': dt.time.nanosecond(f[0])}]
   | time                                ns
   | time64                           int32
-- + --------------------------  ----------
 0 | 2020-05-11T12:00:00                  0
 1 | 2021-06-14T16:10:59.394873   394873000
datatable.time.second()¶
Retrieve the “second” component of a time64 column. The produced column will have values in the range [0; 59].
Parameters¶
FExpr[time64]
A column for which you want to compute the second part.
FExpr[int32]
The “second” part of the source column.
Examples¶
from datetime import datetime as d
DT = dt.Frame([d(2020, 5, 11, 12, 0, 0), d(2021, 6, 14, 16, 10, 59, 394873)])
DT[:, {'time': f[0], 'second': dt.time.second(f[0])}]
   | time                        second
   | time64                       int32
-- + --------------------------  ------
 0 | 2020-05-11T12:00:00              0
 1 | 2021-06-14T16:10:59.394873      59
See Also¶
hour() – retrieve the “hour” component of a timestamp
minute() – retrieve the “minute” component of a timestamp
nanosecond() – retrieve the “nanosecond” component of a timestamp
datatable.time.year()¶
Retrieve the “year” component of a date32 or time64 column.
Parameters¶
FExpr[date32] | FExpr[time64]
A column for which you want to compute the year part.
FExpr[int32]
The year part of the source column.
Examples¶
DT = dt.Frame([1, 1000, 100000], stype='date32')
DT[:, {'date': f[0], 'year': dt.time.year(f[0])}]
   | date         year
   | date32      int32
-- + ----------  -----
 0 | 1970-01-02   1970
 1 | 1972-09-27   1972
 2 | 2243-10-17   2243
datatable.time.ymd()¶
Create a date32 column out of year, month and day components.
This function performs range checks on the month and day columns: if a certain combination of year/month/day is not valid in the Gregorian calendar, then an NA value will be produced in that row.
Parameters¶
FExpr[int]
The year part of the resulting date32 column.
FExpr[int]
The month part of the resulting date32 column. Values in this column are expected to be in the 1 .. 12 range.
FExpr[int]
The day part of the resulting date32 column. Values in this column should be from 1 to last_day_of_month(year, month).
FExpr[date32]
Examples¶
DT = dt.Frame(y=[2005, 2010, 2015], m=[2, 3, 7])
DT[:, dt.time.ymd(f.y, f.m, 30)]
   | C0
   | date32
-- + ----------
 0 | NA
 1 | 2010-03-30
 2 | 2015-07-30
datatable.time.ymdt()¶
Create a time64 column out of year, month, day, hour, minute, second and nanosecond (optional) components. Alternatively, instead of the year-month-day triple you can pass a date argument of type date32.
This function performs range checks on the month and day columns: if a certain combination of year/month/day is not valid in the Gregorian calendar, then an NA value will be produced in that row.
At the same time, there are no range checks for the time components. Thus, you can, for example, pass second=3600 instead of hour=1.
Parameters¶
FExpr[int]
The year part of the resulting time64 column.
FExpr[int]
The month part of the resulting time64 column. Values in this column must be in the 1 .. 12 range.
FExpr[int]
The day part of the resulting time64 column. Values in this column should be from 1 to last_day_of_month(year, month).
FExpr[int]
The hour part of the resulting time64 column.
FExpr[int]
The minute part of the resulting time64 column.
FExpr[int]
The second part of the resulting time64 column.
FExpr[int]
The nanosecond part of the resulting time64 column. This parameter is optional.
FExpr[time64]
Examples¶
DT = dt.Frame(Y=[2001, 2003, 2005, 2020, 1960],
M=[1, 5, 4, 11, 8],
D=[12, 18, 30, 1, 14],
h=[7, 14, 22, 23, 12],
m=[15, 30, 0, 59, 0],
s=[12, 23, 0, 59, 27],
ns=[0, 0, 0, 999999000, 123000])
DT[:, [f[:], dt.time.ymdt(f.Y, f.M, f.D, f.h, f.m, f.s, f.ns)]]
   |     Y      M      D      h      m      s         ns  C0
   | int32  int32  int32  int32  int32  int32      int32  time64
-- + -----  -----  -----  -----  -----  -----  ---------  --------------------------
 0 |  2001      1     12      7     15     12          0  2001-01-12T07:15:12
 1 |  2003      5     18     14     30     23          0  2003-05-18T14:30:23
 2 |  2005      4     30     22      0      0          0  2005-04-30T22:00:00
 3 |  2020     11      1     23     59     59  999999000  2020-11-01T23:59:59.999999
 4 |  1960      8     14     12      0     27     123000  1960-08-14T12:00:27.000123
datatable.FExpr¶
FExpr is an object that encapsulates computations to be done on a frame. FExpr objects are rarely constructed directly (though it is possible too); instead, they are more commonly created as inputs/outputs of various functions in datatable.
Consider the following example:
math.sin(2 * f.Angle)
Here accessing column “Angle” in namespace f creates an FExpr. Multiplying this FExpr by a python scalar 2 creates a new FExpr. And finally, applying the sine function creates yet another FExpr. The resulting expression can be applied to a frame via the DT[i,j] method, which will compute that expression using the data of that particular frame.
Thus, an FExpr is a stored computation, which can later be applied to a Frame, or to multiple frames.
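For instance, a small sketch showing one expression reused across frames:
from datatable import dt, f

expr = dt.math.sin(2 * f.Angle)        # stored computation, nothing evaluated yet
DT1 = dt.Frame(Angle=[0.0, 0.5, 1.0])
DT2 = dt.Frame(Angle=[2.0, 3.0])
res1 = DT1[:, expr]                    # evaluated against DT1
res2 = DT2[:, expr]                    # evaluated against DT2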
Because of its delayed nature, an FExpr checks its correctness at the time when it is applied to a frame, not sooner. In particular, it is possible for the same expression to work with one frame, but fail with another. In the example above, the expression may raise an error if there is no column named “Angle” in the frame, or if the column exists but has a non-numeric type.
Most functions in datatable that accept an FExpr as an input return a new FExpr as an output, thus creating a tree of FExprs as the resulting evaluation graph.
Also, all functions that accept FExprs as arguments will also accept certain other python types as an input, essentially converting them into FExprs. Thus, we will sometimes say that a function accepts FExpr-like objects as arguments.
All binary operators op(x, y) listed below work when either x or y, or both, are FExprs.
Construction¶
__init__() – Create an FExpr.
extend() – Append another FExpr.
remove() – Remove columns from the FExpr.
Arithmetic operators¶
__add__() – Addition x + y.
__sub__() – Subtraction x - y.
__mul__() – Multiplication x * y.
__truediv__() – Division x / y.
__floordiv__() – Integer division x // y.
__mod__() – Modulus x % y.
__pow__() – Power x ** y.
__pos__() – Unary plus +x.
__neg__() – Unary minus -x.
Bitwise operators¶
__and__() – Bitwise AND x & y.
__or__() – Bitwise OR x | y.
__xor__() – Bitwise XOR x ^ y.
__invert__() – Bitwise NOT ~x.
__lshift__() – Left shift x << y.
__rshift__() – Right shift x >> y.
Relational operators¶
__eq__() – Equal x == y.
__ne__() – Not equal x != y.
__lt__() – Less than x < y.
__le__() – Less than or equal x <= y.
__gt__() – Greater than x > y.
__ge__() – Greater than or equal x >= y.
Equivalents of base datatable functions¶
Each method in this group is the same as the base datatable function of the same name applied to the FExpr; for example, f.A.max() is equivalent to dt.max(f.A).
datatable.FExpr.__add__()¶
Add two FExprs together, which corresponds to python operator +.
If x or y are multi-column expressions, then they must have the same number of columns, and the + operator will be applied to each corresponding pair of columns. If either x or y is single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.
The result of adding two columns with different stypes will have the following stype:
- max(x.stype, y.stype, int32) if both columns are numeric (i.e. bool, int or float);
- str32/str64 if at least one of the columns is a string. In this case the + operator implements string concatenation, same as in Python.
Parameters¶
FExpr
The arguments must be either FExprs, or expressions that can be converted into FExprs.
FExpr
An expression that evaluates x + y.
datatable.FExpr.__and__()¶
Compute bitwise AND of x and y.
If x or y are multi-column expressions, then they must have the same number of columns, and the & operator will be applied to each corresponding pair of columns. If either x or y is single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.
The AND operator can only be applied to integer or boolean columns. The resulting column will have stype equal to the larger of the stypes of its arguments.
When both x and y are boolean, the bitwise AND operator is equivalent to logical AND. This can be used to combine several logical conditions into a compound one (since Python doesn’t allow overloading of the operator and). Beware, however, that & binds tighter than the comparison operators, so it is advisable to always use parentheses:
DT[(f.x >= 0) & (f.x <= 1), :]
Parameters¶
FExpr
The arguments must be either FExprs, or expressions that can be converted into FExprs.
FExpr
An expression that evaluates x & y.
Notes¶
Note
Use x & y in order to AND two boolean FExprs. Using the standard Python keyword and will result in an error.
datatable.FExpr.__bool__()¶
Using this operator will result in a TypeError
.
The boolean-cast operator is used by Python whenever it wants to
know whether the object is equivalent to a single True
or False
value. This is not applicable for a dt.FExpr
, which represents
stored computation on a column or multiple columns. As such, an
error is raised.
In order to convert a column into the boolean stype, you can use the
type-cast operator dt.bool8(x)
.
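A quick sketch of the resulting behavior:

import datatable as dt
from datatable import f

try:
    bool(f.A)       # implicit truth-value cast of an FExpr
except TypeError as e:
    print(e)        # an FExpr cannot be reduced to a single True/False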
datatable.FExpr.__eq__()¶
Compare whether values in columns x
and y
are equal.
Like all other FExpr operators, the equality operator is elementwise:
it produces a column where each element is the result of comparison
x[i] == y[i]
.
If x
or y
are multi-column expressions, then they must have the
same number of columns, and the ==
operator will be applied to each
corresponding pair of columns. If either x
or y
are single-column
while the other is multi-column, then the single-column expression
will be repeated to the same number of columns as its opponent.
The equality operator can be applied to columns of any type, and the
types of x
and y
are allowed to be different. In the latter
case the columns will be converted into a common stype before the
comparison. In practice it means, for example, that the comparison
1 == "1" evaluates to True.
Lastly, the comparison x == None
is exactly equivalent to the
isna()
function.
Parameters¶
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x == y. The produced column will have stype bool8.
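For instance (a minimal sketch with a made-up frame), comparing against None selects the missing values, just like isna():

import datatable as dt
from datatable import f

DT = dt.Frame(A=[1, None, 3, None])
DT[f.A == None, :]    # same rows as DT[dt.isna(f.A), :]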
datatable.FExpr.__floordiv__()¶
Perform integer division of two FExprs, i.e. x // y
.
The modulus and integer division together satisfy the identity
that x == (x // y) * y + (x % y)
for all non-zero values of y
.
If x
or y
are multi-column expressions, then they must have the
same number of columns, and the //
operator will be applied to each
corresponding pair of columns. If either x
or y
are single-column
while the other is multi-column, then the single-column expression
will be repeated to the same number of columns as its opponent.
The integer division operation can only be applied to integer columns.
The resulting column will have stype equal to the largest of the stypes
of both columns, but at least int32
.
Parameters¶
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x // y.
datatable.FExpr.__ge__()¶
Compare whether x >= y
.
Like all other FExpr operators, the greater-than-or-equal operator is
elementwise: it produces a column where each element is the result of
comparison x[i] >= y[i]
.
If x
or y
are multi-column expressions, then they must have the
same number of columns, and the >=
operator will be applied to each
corresponding pair of columns. If either x
or y
are single-column
while the other is multi-column, then the single-column expression
will be repeated to the same number of columns as its opponent.
The greater-than-or-equal operator can be applied to columns of any type,
and the types of x
and y
are allowed to be different. In the
latter case the columns will be converted into a common stype before the
comparison.
Parameters¶
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x >= y. The produced column will have stype bool8.
datatable.FExpr.__getitem__()¶
Apply a slice to the string column represented by this FExpr.
Parameters¶
self: FExpr[str]
slice
    The slice will be applied to each value in the string column self.
return: FExpr[str]
Examples¶
DT = dt.Frame(season=["Winter", "Summer", "Autumn", "Spring"], i=[1, 2, 3, 4])
DT[:, {"start": f.season[:-f.i], "end": f.season[-f.i:]}]
   | start  end
   | str32  str32
-- + -----  -----
 0 | Winte  r
 1 | Summ   er
 2 | Aut    umn
 3 | Sp     ring
See Also¶
dt.str.slice() – the equivalent function in the dt.str module.
datatable.FExpr.__gt__()¶
Compare whether x > y
.
Like all other FExpr operators, the greater-than operator is elementwise:
it produces a column where each element is the result of comparison
x[i] > y[i]
.
If x
or y
are multi-column expressions, then they must have the
same number of columns, and the >
operator will be applied to each
corresponding pair of columns. If either x
or y
are single-column
while the other is multi-column, then the single-column expression
will be repeated to the same number of columns as its opponent.
The greater-than operator can be applied to columns of any type, and the
types of x
and y
are allowed to be different. In the latter
case the columns will be converted into a common stype before the
comparison.
Parameters¶
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x > y. The produced column will have stype bool8.
datatable.FExpr.__init__()¶
Create a new dt.FExpr
object out of e
.
The FExpr serves as a simple wrapper of the underlying object,
allowing it to be combined with other FExprs.
This constructor almost never needs to be run manually by the user.
Parameters¶
e: None | bool | int | str | float | slice | list | tuple | dict | type
   | stype | ltype | Generator | FExpr | Frame | range | pd.DataFrame
   | pd.Series | np.array | np.ma.masked_array
    The argument that will be converted into an FExpr.
datatable.FExpr.__invert__()¶
Compute bitwise NOT of x
, which corresponds to python operation ~x
.
If x is a multi-column expression, then the ~ operator will be
applied to each column in turn.
Bitwise NOT can only be applied to integer or boolean columns. The resulting column will have the same stype as its argument.
When the argument x
is a boolean column, then ~x
is equivalent to
logical NOT. This can be used to negate a condition, similar to python
operator not
(which is not overloadable).
Parameters¶
x: FExpr
    Either an FExpr, or any object that can be converted into FExpr.
return: FExpr
    An expression that evaluates ~x.
Notes¶
Note
Use ~x
in order to negate a boolean FExpr. Using standard Python
keyword not
will result in an error.
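For example (sketch with a made-up frame):

import datatable as dt
from datatable import f

DT = dt.Frame(x=[0, 1, 2, 3])
DT[~(f.x > 1), :]    # rows where the condition x > 1 is False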
datatable.FExpr.__le__()¶
Compare whether x <= y
.
Like all other FExpr operators, the less-than-or-equal operator is
elementwise: it produces a column where each element is the result of
comparison x[i] <= y[i]
.
If x
or y
are multi-column expressions, then they must have the
same number of columns, and the <=
operator will be applied to each
corresponding pair of columns. If either x
or y
are single-column
while the other is multi-column, then the single-column expression
will be repeated to the same number of columns as its opponent.
The less-than-or-equal operator can be applied to columns of any type,
and the types of x
and y
are allowed to be different. In the
latter case the columns will be converted into a common stype before the
comparison.
Parameters¶
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x <= y. The produced column will have stype bool8.
datatable.FExpr.__lshift__()¶
Shift x
by y
bits to the left, i.e. x << y
. Mathematically this
is equivalent to \(x\cdot 2^y\).
If x
or y
are multi-column expressions, then they must have the
same number of columns, and the <<
operator will be applied to each
corresponding pair of columns. If either x
or y
are single-column
while the other is multi-column, then the single-column expression
will be repeated to the same number of columns as its opponent.
The left-shift operator can only be applied to integer columns, and the resulting column will have the same stype as its argument.
Parameters¶
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x << y.
See also¶
__rshift__(x, y)
– right shift.
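A short sketch of the equivalence with multiplication by a power of two:

import datatable as dt
from datatable import f

DT = dt.Frame(x=[1, 2, 3])
DT[:, f.x << 2]      # elementwise x * 2**2, i.e. [4, 8, 12]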
datatable.FExpr.__lt__()¶
Compare whether x < y
.
Like all other FExpr operators, the less-than operator is elementwise:
it produces a column where each element is the result of comparison
x[i] < y[i]
.
If x
or y
are multi-column expressions, then they must have the
same number of columns, and the <
operator will be applied to each
corresponding pair of columns. If either x
or y
are single-column
while the other is multi-column, then the single-column expression
will be repeated to the same number of columns as its opponent.
The less-than operator can be applied to columns of any type, and the
types of x
and y
are allowed to be different. In the latter
case the columns will be converted into a common stype before the
comparison.
Parameters¶
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x < y. The produced column will have stype bool8.
datatable.FExpr.__mod__()¶
Compute the remainder of division of two FExprs, i.e. x % y
.
The modulus and integer division together satisfy the identity
that x == (x // y) * y + (x % y)
for all non-zero values of y
.
In addition, the result of x % y
is always in the range
[0; y)
for positive y
, and in the range (y; 0]
for
negative y
.
If x
or y
are multi-column expressions, then they must have the
same number of columns, and the %
operator will be applied to each
corresponding pair of columns. If either x
or y
are single-column
while the other is multi-column, then the single-column expression
will be repeated to the same number of columns as its opponent.
The modulus operation can only be applied to integer columns.
The resulting column will have stype equal to the largest of the stypes
of both columns, but at least int32
.
Parameters¶
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x % y.
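A small sketch (made-up frame) verifying the identity stated above, together with the sign convention for the modulus:

import datatable as dt
from datatable import f

DT = dt.Frame(x=[7, -7], y=[3, 3])
# x == (x // y) * y + (x % y) holds elementwise for non-zero y;
# e.g. -7 % 3 is 2, which lies in the range [0; 3).
DT[:, [(f.x // f.y) * f.y + (f.x % f.y), f.x % f.y]]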
datatable.FExpr.__mul__()¶
Multiply two FExprs together, which corresponds to python operator *
.
If x
or y
are multi-column expressions, then they must have the
same number of columns, and the *
operator will be applied to each
corresponding pair of columns. If either x
or y
are single-column
while the other is multi-column, then the single-column expression
will be repeated to the same number of columns as its opponent.
The multiplication operation can only be applied to numeric columns. The
resulting column will have stype equal to the larger of the stypes of its
arguments, but at least int32
.
Parameters¶
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x * y.
datatable.FExpr.__ne__()¶
Compare whether values in columns x
and y
are not equal.
Like all other FExpr operators, the inequality operator is elementwise:
it produces a column where each element is the result of comparison
x[i] != y[i]
.
If x
or y
are multi-column expressions, then they must have the
same number of columns, and the !=
operator will be applied to each
corresponding pair of columns. If either x
or y
are single-column
while the other is multi-column, then the single-column expression
will be repeated to the same number of columns as its opponent.
The inequality operator can be applied to columns of any type, and the
types of x
and y
are allowed to be different. In the latter
case the columns will be converted into a common stype before the
comparison.
Parameters¶
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x != y. The produced column will have stype bool8.
datatable.FExpr.__neg__()¶
Unary minus, which corresponds to python operation -x
.
If x is a multi-column expression, then the - operator will be
applied to each column in turn.
Unary minus can only be applied to numeric columns. The resulting column
will have the same stype as its argument, but not less than int32
.
Parameters¶
x: FExpr
    Either an FExpr, or any object that can be converted into FExpr.
return: FExpr
    An expression that evaluates -x.
datatable.FExpr.__or__()¶
Compute bitwise OR of x
and y
.
If x
or y
are multi-column expressions, then they must have the
same number of columns, and the |
operator will be applied to each
corresponding pair of columns. If either x
or y
are single-column
while the other is multi-column, then the single-column expression
will be repeated to the same number of columns as its opponent.
The OR operator can only be applied to integer or boolean columns. The resulting column will have stype equal to the larger of the stypes of its arguments.
When both x
and y
are boolean, then the bitwise OR operator is
equivalent to logical OR. This can be used to combine several logical
conditions into a compound (since Python doesn’t allow overloading of
operator or
). Beware, however, that |
has higher precedence
than or
, so it is advisable to always use parentheses:
DT[(f.x < -1) | (f.x > 1), :]
Parameters¶
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x | y.
Notes¶
Note
Use x | y
in order to OR two boolean FExpr
s. Using standard
Python keyword or
will result in an error.
datatable.FExpr.__pos__()¶
Unary plus, which corresponds to python operation +x
.
If x is a multi-column expression, then the + operator will be
applied to each column in turn.
Unary plus can only be applied to numeric columns. The resulting column
will have the same stype as its argument, but not less than int32
.
Parameters¶
x: FExpr
    Either an FExpr, or any object that can be converted into FExpr.
return: FExpr
    An expression that evaluates +x.
datatable.FExpr.__pow__()¶
Raise x
to the power y
, or in math notation \(x^y\).
If x
or y
are multi-column expressions, then they must have the
same number of columns, and the **
operator will be applied to each
corresponding pair of columns. If either x
or y
are single-column
while the other is multi-column, then the single-column expression
will be repeated to the same number of columns as its opponent.
The power operator can only be applied to numeric columns, and the
resulting column will have stype float64
in all cases except when both
arguments are float32
(in which case the result is also float32
).
Parameters¶
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x ** y.
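For example (sketch), a fractional exponent gives elementwise roots:

import datatable as dt
from datatable import f

DT = dt.Frame(x=[1.0, 4.0, 9.0])
DT[:, f.x ** 0.5]    # elementwise square root; result has stype float64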
datatable.FExpr.__repr__()¶
Return string representation of this object. This method is used
by Python’s built-in function repr()
.
The returned string has the following format:
"FExpr<...>"
where ...
will attempt to match the expression used to construct
this FExpr
.
Examples¶
repr(3 + 2*(f.A + f["B"]))
datatable.FExpr.__rshift__()¶
Shift x
by y
bits to the right, i.e. x >> y
. Mathematically this
is equivalent to \(\lfloor x\cdot 2^{-y} \rfloor\).
If x
or y
are multi-column expressions, then they must have the
same number of columns, and the >>
operator will be applied to each
corresponding pair of columns. If either x
or y
are single-column
while the other is multi-column, then the single-column expression
will be repeated to the same number of columns as its opponent.
The right-shift operator can only be applied to integer columns, and the resulting column will have the same stype as its argument.
Parameters¶
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x >> y.
See also¶
__lshift__(x, y)
– left shift.
datatable.FExpr.__sub__()¶
Subtract two FExprs, which corresponds to python operation x - y
.
If x
or y
are multi-column expressions, then they must have the
same number of columns, and the -
operator will be applied to each
corresponding pair of columns. If either x
or y
are single-column
while the other is multi-column, then the single-column expression
will be repeated to the same number of columns as its opponent.
The subtraction operation can only be applied to numeric columns. The
resulting column will have stype equal to the larger of the stypes of its
arguments, but at least int32
.
Parameters¶
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x - y.
datatable.FExpr.__truediv__()¶
Divide two FExprs, which corresponds to python operation x / y
.
If x
or y
are multi-column expressions, then they must have the
same number of columns, and the /
operator will be applied to each
corresponding pair of columns. If either x
or y
are single-column
while the other is multi-column, then the single-column expression
will be repeated to the same number of columns as its opponent.
The division operation can only be applied to numeric columns. The
resulting column will have stype float64
in all cases except when both
arguments have stype float32
(in which case the result is also
float32
).
Parameters¶
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x / y.
datatable.FExpr.__xor__()¶
Compute bitwise XOR of x
and y
.
If x
or y
are multi-column expressions, then they must have the
same number of columns, and the ^
operator will be applied to each
corresponding pair of columns. If either x
or y
are single-column
while the other is multi-column, then the single-column expression
will be repeated to the same number of columns as its opponent.
The XOR operator can only be applied to integer or boolean columns. The resulting column will have stype equal to the larger of the stypes of its arguments.
When both x
and y
are boolean, then the bitwise XOR operator is
equivalent to logical XOR. This can be used to combine several logical
conditions into a compound (since Python doesn’t allow overloading of
operator xor
). Beware, however, that ^
has higher precedence
than xor
, so it is advisable to always use parentheses:
DT[(f.x == 0) ^ (f.y == 0), :]
Parameters¶
x, y: FExpr
    The arguments must be either FExprs, or expressions that can be converted into FExprs.
return: FExpr
    An expression that evaluates x ^ y.
datatable.FExpr.count()¶
Equivalent to dt.count(self)
.
datatable.FExpr.extend()¶
Append FExpr
arg
to the current FExpr.
Each FExpr
represents a collection of columns, or a columnset. This
method takes two such columnsets and combines them into a single one,
similar to cbind()
.
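A small sketch (made-up frame) of combining two columnsets:

import datatable as dt
from datatable import f

DT = dt.Frame(A=[1, 2], B=[3, 4], C=[5, 6])
DT[:, f.A.extend(f.B)]    # columnset consisting of columns A and B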
datatable.FExpr.first()¶
Equivalent to dt.first(self)
.
datatable.FExpr.last()¶
Equivalent to dt.last(self)
.
datatable.FExpr.len()¶
This method is deprecated and will be removed in version 1.1.0.
Please use dt.str.len()
instead.
datatable.FExpr.max()¶
Equivalent to dt.max(self)
.
datatable.FExpr.mean()¶
Equivalent to dt.mean(self)
.
datatable.FExpr.median()¶
Equivalent to dt.median(self)
.
datatable.FExpr.min()¶
Equivalent to dt.min(self)
.
datatable.FExpr.re_match()¶
This method is deprecated and will be removed in version 1.1.0.
Please use dt.re.match()
instead.
datatable.FExpr.remove()¶
Remove columns arg
from the current FExpr.
Each FExpr
represents a collection of columns, or a columnset. Some
of those columns are computed while others are specified “by reference”,
for example f.A
, f[:3]
or f[int]
. This method allows you to
remove by-reference columns from an existing FExpr.
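For instance (sketch), removing one column from the all-columns set:

import datatable as dt
from datatable import f

DT = dt.Frame(A=[1, 2], B=[3, 4], C=[5, 6])
DT[:, f[:].remove(f.A)]   # all columns except A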
datatable.FExpr.rowall()¶
Equivalent to dt.rowall(self)
.
datatable.FExpr.rowany()¶
Equivalent to dt.rowany(self)
.
datatable.FExpr.rowcount()¶
Equivalent to dt.rowcount(self)
.
datatable.FExpr.rowfirst()¶
Equivalent to dt.rowfirst(self)
.
datatable.FExpr.rowlast()¶
Equivalent to dt.rowlast(self)
.
datatable.FExpr.rowmax()¶
Equivalent to dt.rowmax(self)
.
datatable.FExpr.rowmean()¶
Equivalent to dt.rowmean(self)
.
datatable.FExpr.rowmin()¶
Equivalent to dt.rowmin(self)
.
datatable.FExpr.rowsd()¶
Equivalent to dt.rowsd(self)
.
datatable.FExpr.rowsum()¶
Equivalent to dt.rowsum(self)
.
datatable.FExpr.sd()¶
Equivalent to dt.sd(self)
.
datatable.FExpr.shift()¶
Equivalent to dt.shift(self, n)
.
datatable.FExpr.sum()¶
Equivalent to dt.sum(self)
.
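A short sketch (made-up frame) showing the method-style spelling next to its functional equivalent:

import datatable as dt
from datatable import f, by

DT = dt.Frame(g=[1, 1, 2], x=[10, 20, 30])
DT[:, f.x.sum(), by(f.g)]    # same as DT[:, dt.sum(f.x), by(f.g)]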
datatable.Frame¶
Two-dimensional column-oriented container of data. This is the primary
data structure in the datatable module.
A Frame is two-dimensional in the sense that it is comprised of
rows and columns of data. Each data cell can be located via a pair
of its coordinates: (irow, icol)
. We do not support frames with
more or less than two dimensions.
A Frame is column-oriented in the sense that internally the data is stored separately for each column. Each column has its own name and type. Types may be different for different columns but cannot vary within each column.
Thus, the dimensions of a Frame are not symmetrical: a Frame is not a matrix. Internally the class is optimized for the use case when the number of rows significantly exceeds the number of columns.
A Frame can be viewed as a list
of columns: standard Python
function len()
will return the number of columns in the Frame,
and frame[j]
will return the column at index j
(each “column”
will be a Frame with ncols == 1
). Similarly, you can iterate over
the columns of a Frame in a loop, or use it in a *
-expansion:
for column in frame:
    # do something
    ...

list_of_columns = [*frame]
A Frame can also be viewed as a dict
of columns, where the key
associated with each column is its name. Thus, frame[name]
will
return the column with the requested name. A Frame can also work with
standard python **
-expansion:
dict_of_columns = {**frame}
Construction¶
Frame() – Construct the frame from various Python sources.
dt.fread() – Read an external file and convert into a Frame.
.copy() – Create a copy of the frame.
Properties¶
.key – The primary key for the Frame, if any.
.ltypes – Logical types (ltypes) of all columns.
.meta – The frame’s meta information.
.names – The names of all columns in the frame.
.ncols – Number of columns in the frame.
.nrows – Number of rows in the frame.
.shape – A tuple (number of rows, number of columns).
.source – Where this frame was loaded from.
.stype – The common stype for all columns.
.stypes – Storage types (stypes) of all columns.
.type – The common type (dt.Type) of all columns.
.types – Types (dt.Type) of all columns.
Frame manipulation¶
DT[i, j, ...] – Primary method for extracting data from a frame.
DT[i, j, ...] = R – Update data within the frame.
del DT[i, j, ...] – Remove rows/columns/values from the frame.
.cbind() – Append columns of other frames to this frame.
.rbind() – Append other frames at the bottom of the current.
.replace() – Search and replace values in the frame.
.sort() – Sort the frame by the specified columns.
Convert into other formats¶
.to_arrow() – Convert the frame into an Arrow table.
.to_csv() – Write the frame’s data into CSV format.
.to_dict() – Convert the frame into a Python dictionary, by columns.
.to_jay() – Store the frame’s data into a binary file in Jay format.
.to_list() – Return the frame’s data as a list of lists, by columns.
.to_numpy() – Convert the frame into a numpy array.
.to_pandas() – Convert the frame into a pandas DataFrame.
.to_tuples() – Return the frame’s data as a list of tuples, by rows.
Statistical methods¶
.countna() – Count missing values for each column in the frame.
.countna1() – Count missing values for a one-column frame and return it as a scalar.
.kurt() – Calculate excess kurtosis for each column in the frame.
.kurt1() – Calculate excess kurtosis for a one-column frame and return it as a scalar.
.max() – Find the largest element for each column in the frame.
.max1() – Find the largest element for a one-column frame and return it as a scalar.
.mean() – Calculate the mean value for each column in the frame.
.mean1() – Calculate the mean value for a one-column frame and return it as a scalar.
.min() – Find the smallest element for each column in the frame.
.min1() – Find the smallest element for a one-column frame and return it as a scalar.
.mode() – Find the mode value for each column in the frame.
.mode1() – Find the mode value for a one-column frame and return it as a scalar.
.nmodal() – Calculate the modal frequency for each column in the frame.
.nmodal1() – Calculate the modal frequency for a one-column frame and return it as a scalar.
.nunique() – Count the number of unique values for each column in the frame.
.nunique1() – Count the number of unique values for a one-column frame and return it as a scalar.
.sd() – Calculate the standard deviation for each column in the frame.
.sd1() – Calculate the standard deviation for a one-column frame and return it as a scalar.
.skew() – Calculate skewness for each column in the frame.
.skew1() – Calculate skewness for a one-column frame and return it as a scalar.
.sum() – Calculate the sum of all values for each column in the frame.
.sum1() – Calculate the sum of all values for a one-column frame and return it as a scalar.
Miscellaneous methods¶
.colindex() – Find the position of a column in the frame by its name.
.export_names() – Create python variables for each column of the frame.
.head() – Return the first few rows of the frame.
.materialize() – Make sure all frame’s data is physically written to memory.
.tail() – Return the last few rows of the frame.
Special methods¶
These methods are not intended to be called manually, instead they provide
a way for datatable
to interoperate with other Python modules or
builtin functions.
.__copy__() – Used by Python module copy.
.__deepcopy__() – Used by Python module copy.
.__delitem__() – Method that implements the del DT[...] call.
.__getitem__() – Method that implements the DT[...] call.
.__getstate__() – Used by Python module pickle.
.__init__() – The constructor function.
.__iter__() – Used by Python function iter().
.__len__() – Used by Python function len().
.__repr__() – Used by Python function repr().
.__reversed__() – Used by Python function reversed().
.__setitem__() – Method that implements the DT[...] = R call.
.__setstate__() – Used by Python module pickle.
.__sizeof__() – Used by sys.getsizeof().
.__str__() – Used by Python function str().
._repr_html_() – Used to display the frame in Jupyter Lab.
._repr_pretty_() – Used to display the frame in an IPython console.
datatable.Frame.__init__()¶
Create a new Frame from a single or multiple sources.
Argument _data
(or **cols
) contains the source data for Frame’s
columns. Column names are either derived from the data, given
explicitly via the names
argument, or generated automatically.
Either way, the constructor ensures that column names are unique,
non-empty, and do not contain certain special characters (see
Name mangling for details).
Parameters¶
_data: Any
    The first argument to the constructor represents the source from
    which to construct the Frame. If this argument is given, then the
    varkwd arguments **cols should not be used.
    This argument can accept a wide range of data types; see the
    “Details” section below.
names: List[str|None]
    Explicit list (or tuple) of column names. The number of elements
    in the list must be the same as the number of columns being
    constructed.
    This parameter should not be used when constructing the frame
    from **cols.
types: List[Type] | Dict[str, Type]
    Explicit list (or dict) of column types. The number of elements
    in the list must be the same as the number of columns being
    constructed.
return: Frame
    A Frame object is constructed and returned.
Details¶
The shape of the constructed Frame depends on the type of the source
argument _data
(or **cols
). The argument _data
and varkwd
arguments **cols
are mutually exclusive: they cannot be used at the
same time. However, it is possible to use neither and construct an
empty frame:
dt.Frame() # empty 0x0 frame
dt.Frame(None) # same
dt.Frame([]) # same
The varkwd arguments **cols
can be used to construct a Frame by
columns. In this case the keys become column names, and the values
are column initializers. This form is mostly used for convenience;
it is equivalent to converting cols
into a dict
and passing as
the first argument:
dt.Frame(A = range(7),
B = [0.1, 0.3, 0.5, 0.7, None, 1.0, 1.5],
C = ["red", "orange", "yellow", "green", "blue", "indigo", "violet"])
# equivalent to
dt.Frame({"A": range(7),
"B": [0.1, 0.3, 0.5, 0.7, None, 1.0, 1.5],
"C": ["red", "orange", "yellow", "green", "blue", "indigo", "violet"]})
   | A      B        C
   | int32  float64  str32
-- + -----  -------  ------
 0 | 0      0.1      red
 1 | 1      0.3      orange
 2 | 2      0.5      yellow
 3 | 3      0.7      green
 4 | 4      NA       blue
 5 | 5      1        indigo
 6 | 6      1.5      violet
The argument _data
accepts a wide range of input types. The
following list describes possible choices:
List[List | Frame | np.array | pd.DataFrame | pd.Series | range | typed_list]
When the source is a non-empty list containing other lists or compound objects, then each item will be interpreted as a column initializer, and the resulting frame will have as many columns as the number of items in the list.
Each element in the list must produce a single column. Thus, it is not allowed to use multi-column
Frame
s, or multi-dimensional numpy arrays or pandasDataFrame
s.dt.Frame([[1, 3, 5, 7, 11], [12.5, None, -1.1, 3.4, 9.17]])
   | C0     C1
   | int32  float64
-- + -----  -------
 0 | 1      12.5
 1 | 3      NA
 2 | 5      -1.1
 3 | 7      3.4
 4 | 11     9.17

Note that unlike
pandas
andnumpy
, we treat a list of lists as a list of columns, not a list of rows. If you need to create a Frame from a row-oriented store of data, you can use a list of dictionaries or a list of tuples as described below.List[Dict]
If the source is a list of
dict
objects, then each element in this list is interpreted as a single row. The keys in each dictionary are column names, and the values contain contents of each individual cell.The rows don’t have to have the same number or order of entries: all missing elements will be filled with NAs:
dt.Frame([{"A": 3, "B": 7}, {"A": 0, "B": 11, "C": -1}, {"C": 5}])
A B C int32 int32 int32 0 3 7 NA 1 0 11 -1 2 NA NA 5 If the
names
parameter is given, then only the keys given in the list of names will be taken into account, all extra fields will be discarded.List[Tuple]
If the source is a list of
tuple
s, then each tuple represents a single row. The tuples must have the same size, otherwise an exception will be raised:dt.Frame([(39, "Mary"), (17, "Jasmine"), (23, "Lily")], names=['age', 'name'])
age name int32 str32 0 39 Mary 1 17 Jasmine 2 23 Lily If the tuples are in fact
namedtuple
s, then the field names will be used for the column names in the resulting Frame. No check is made whether the named tuples in fact belong to the same class.List[Any]
If the list’s first element does not match any of the cases above, then it is considered a “list of primitives”. Such list will be parsed as a single column.
The entries are typically
bool
s,int
s,float
s,str
s, orNone
s; numpy scalars are also allowed. If the list has elements of heterogeneous types, then we will attempt to convert them to the smallest common stype.If the list contains only boolean values (or
None
s), then it will create a column of typebool8
.If the list contains only integers (or
None
s), then the resulting column will beint8
if all integers are 0 or 1; orint32
if all entries are less than \(2^{31}\) in magnitude; otherwiseint64
if all entries are less than \(2^{63}\) in magnitude; or otherwisefloat64
.If the list contains floats, then the resulting column will have stype
float64
. BothNone
andmath.nan
can be used to input NA values.Finally, if the list contains strings then the column produced will have stype
str32
if the total size of the character data is less than 2GB, or str64
otherwise.typed_list
A typed list can be created by taking a regular list and dividing it by an stype. It behaves similarly to a simple list of primitives, except that it is parsed into the specific stype.
dt.Frame([1.5, 2.0, 3.87] / dt.float32).type
Type.float32

Dict[str, Any]
The keys are column names, and values can be any objects from which a single-column frame can be constructed: list, range, np.array, single-column Frame, pandas series, etc.
Constructing a frame from a dictionary
d
is exactly equivalent to callingdt.Frame(list(d.values()), names=list(d.keys()))
.range
Same as if the range was expanded into a list of integers, except that the column created from a range is virtual and its creation time is nearly instant regardless of the range’s length.
Frame
If the argument is a
Frame
, then a shallow copy of that frame will be created, same as.copy()
.str
If the source is a simple string, then the frame is created by
fread
-ing this string. In particular, if the string contains the name of a file, the data will be loaded from that file; if it is a URL, the data will be downloaded and parsed from that URL. Lastly, the string may simply contain a table of data.DT1 = dt.Frame("train.csv") dt.Frame(""" Name Age Mary 39 Jasmine 17 Lily 23 """)
   | Name     Age
   | str32    int32
-- + -------  -----
 0 | Mary     39
 1 | Jasmine  17
 2 | Lily     NA

pd.DataFrame | pd.Series
A pandas DataFrame (Series) will be converted into a datatable Frame. Column names will be preserved.
Column types will generally be the same, assuming they have a corresponding stype in datatable. If not, the column will be converted. For example, pandas date/time column will get converted into string, while
float16
will be converted intofloat32
.If a pandas frame has an object column, we will attempt to refine it into a more specific stype. In particular, we can detect a string or boolean column stored as object in pandas.
np.array
A numpy array will get converted into a Frame of the same shape (provided that it is 2- or less- dimensional) and the same type.
If possible, we will create a Frame without copying the data (however, this is subject to numpy’s approval). The resulting frame will have a copy-on-write semantics.
pyarrow.Table
An arrow table will be converted into a datatable Frame, preserving column names and types.
If the arrow table has columns of types not supported by datatable (for example lists or structs), an exception will be raised.
None
When the source is not given at all, then a 0x0 frame will be created; unless a
names
parameter is provided, in which case the resulting frame will have 0 rows but as many columns as given in thenames
list.
datatable.Frame.__copy__()¶
This method facilitates copying of a Frame via the python standard module
copy
. See .copy()
for more details.
datatable.Frame.__delitem__()¶
This method deletes rows and columns that would have been selected from
the frame if not for the del keyword.
All parameters have the same meaning as in the getter
DT[i, j, ...]
, with the only
restriction that j
must select columns from the main frame only (i.e. not
from the joined frame(s)), and it must select them by reference. Selecting
by reference means it should be possible to tell where each column was in
the original frame.
There are several modes of delete operation, depending on whether i or
j are “slice-all” symbols; a sketch of these modes is given below.
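The following is a small illustrative sketch (the frame is made up, and edge-case behavior should be checked against the full documentation):

import datatable as dt

DT = dt.Frame(A=range(5), B=[0.0, 0.1, 0.2, 0.3, 0.4], C=list("abcde"))
del DT[:, "C"]    # i is slice-all: remove column C from the frame
del DT[0, :]      # j is slice-all: remove row 0 from the frame
del DT[0, "A"]    # neither is slice-all: replace the selected cells with NAs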
datatable.Frame.__getitem__()¶
The main method for accessing data and computing on the frame. Sometimes
we also refer to it as the DT[i, j, ...]
call.
Since Python does not support keyword arguments inside square brackets,
all arguments are positional. The first is the row selector i
, the
second is the column selector j
, and the rest are optional. Thus,
DT[i, j]
selects rows i
and columns j
from frame DT
.
If an additional by
argument is present, then the selectors i
and
j
work within groups generated by the by()
expression. The sort
argument reorders the rows of the frame, and the join
argument allows
performing SQL joins between several frames.
The signature listed here is the most generic. But there are also
special-case signatures DT[j]
and DT[i, j]
described below.
Parameters¶
i: int | slice | Frame | FExpr | List[bool] | List[Any]
    The row selector.
If this is an integer or a slice, then the behavior is the same as in
Python when working on a list with .nrows
elements. In particular, the integer value must be within the range
[-nrows; nrows)
. On the other hand when i
is a slice, then either
its start or end or both may be safely outside the row-range of the
frame. The trivial slice :
always selects all rows.
i
may also be a single-column boolean Frame. It must have the
same number of rows as the current frame, and it serves as a mask for
which rows are to be selected: True
indicates that the row should
be included in the result, while False
and None
skips the row.
i
may also be a single-column integer Frame. Such column specifies
directly which row indices are to be selected. This is more flexible
compared to a boolean column: the rows may be repeated, reordered,
omitted, etc. All values in the column i
must be in the range
[0; nrows)
or an error will be thrown. In particular, negative
indices are not allowed. Also, if the column contains NA values, then
it would produce an “invalid row”, i.e. a row filled with NAs.
i
may also be an expression, which must evaluate into a single
column, either boolean or integer. In this case the result is the
same as described above for a single-column frame.
When i
is a list of booleans, then it is equivalent to a single-column
boolean frame. In particular, the length of the list must be equal to
.nrows
.
Finally, i
can be a list of any of the above (integers, slices, frames,
expressions, etc), in which case each element of the list is evaluated
separately and then all selected rows are put together. The list may
contain None
s, which will be simply skipped.
j: int | str | slice | list | dict | type | FExpr | update
    This argument may either select columns, or perform computations with the columns.
int
Select a single column at the specified index. A
dt.exceptions.IndexError
is raised ifj
is not in the range[-ncols; ncols)
.str
Select a single column by name. A
dt.exceptions.KeyError
is raised if the column with such a name does not exist.:
This is a trivial slice, and it means “select everything”, and is roughly equivalent to SQL’s
*
. In the simple case ofDT[i, j]
call “selecting everything” means all columns from frameDT
. However, when theby()
clause is added, then:
will now select all columns except those used in the groupby. And if the expression has ajoin()
, then “selecting everything” will produce all columns from all frames, excluding those that were duplicate during a natural join.slice[int]
An integer slice can be used to select a subset of columns. The behavior of a slice is exactly the same as in base Python.
slice[str]
A string slice is an expression like
"colA":"colZ"
. In this case all columns from"colA"
to"colZ"
inclusive are selected. And if"colZ"
appears before"colA
” in the frame, then the returned columns will be in the reverse order.Both endpoints of the slice must be valid columns (or omitted), or otherwise a
dt.exceptions.KeyError
will be raised.type
|stype
|ltype
Select only columns of the matching type.
FExpr
An expression formula is computed within the current evaluation context (i.e. it takes into account the current frame, the filter i, the presence of groupby/join parameters, etc). The result of this evaluation is used as if that column existed in the frame.
If
j
is a list of boolean values, then it must have the length of.ncols
, and it describes which columns are to be selected into the result.List[Any]
The
j
can also be a list of elements of any other type listed above, with the only restriction that the items must be homogeneous. For example, you can mixint
s andslice[int]
s, but notint
s andFExpr
s, orint
s andstr
s.Each item in the list will be evaluated separately (as if each was the sole element in
j
), and then all the results will be put together.Dict[str, FExpr]
A dictionary can be used to select columns/expressions similarly to a list, but assigning them explicit names.
update
As a special case, the
j
argument may be theupdate()
function, which turns the selection operation into an update. That is, instead of returning the chosen rows/columns, they will be updated instead with the user-supplied values.
by
When by()
clause is present in the square brackets, the rest of the
computations are carried out within the “context of a groupby”. This
should generally be equivalent to (a) splitting the frame into separate
sub-frames corresponding to each group, (b) applying DT[i, j]
separately within each group, (c) row-binding the results for each
group. In practice the following operations are affected:
all reduction operators such as
dt.min()
ordt.sum()
now work separately within each group. Thus, instead of computing sum over the entire column, it is computed separately within each group inby()
, and the resulting column will have as many rows as the number of groups.certain
i
expressions are re-interpreted as being applied within each group. For example, ifi
is an integer or a slice, then it will now be selecting row(s) within each group.certain functions (such as
dt.shift()
) are also “group-aware”, and produce results that take into account the groupby context. Check documentation for each individual function to find out whether it has special treatment for groupby contexts.
In addition, by()
also affects the order of columns in the output
frame. Specifically, all columns listed as the groupby keys will be
automatically placed at the front of the resulting frame, and also
excluded from :
or f[:]
within j
.
sort
This argument can be used to rearrange rows in the resulting frame.
See sort()
for details.
join
Performs a JOIN operation with another frame. The
join()
clause will calculate how the rows
of the current frame match against the rows of the joined frame, and
allow you to refer to the columns of the joined frame within i
, j
or by
. In order to access columns of the joined frame use
namespace g.
.
This parameter may be listed multiple times if you need to join with several frames.
Details¶
The order of evaluation of expressions is that first the join
clause(s)
are computed, creating a mapping between the rows of the current frame and
the joined frame(s). After that we evaluate by
+sort
. Next, the i
filter is applied creating the final index of rows that will be selected.
Lastly, we evaluate the j
part, taking into account the current groupby
and row index(es).
When evaluating j
, it is essentially converted into a tree (DAG) of
expressions, where each expression is evaluated from the bottom up. That
is, we start evaluating from the leaf nodes (which are usually column
selectors such as f[0]
), and then at each node convert the set of columns
into a new set. Importantly, each subexpression node may produce columns
of 3 types: “scalar”, “grouped”, and “full-size”. Whenever subexpressions
of different levels are mixed together, they are upgraded to the highest
level. Thus, a scalar may be reused for each group, and a grouped column
can interoperate with a regular column by auto-expanding in such a way
that it becomes constant within each group.
If, after the j
is fully evaluated, it produces a column set of type
“grouped”, then the resulting frame will have as many rows as there are
groups. If, on the other hand, the column set is “full-size”, then the
resulting frame will have as many rows as the original frame.
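For instance (a minimal sketch with a made-up frame):

import datatable as dt
from datatable import f, by

DT = dt.Frame(g=[1, 1, 2], x=[3.0, 4.0, 5.0])
DT[:, dt.sum(f.x), by(f.g)]           # "grouped" column set: one row per group
DT[:, f.x - dt.mean(f.x), by(f.g)]    # grouped mean auto-expands to full size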
See Also¶
DT[i, j, ...] = R
– update values in the frame.del DT[i, j, ...]
– delete rows/columns of the frame.
Extract a single column j
from the frame.
The single-argument version of DT[i, j]
works only for j
being
either an integer (indicating column index) or a string (column name).
If you need any other way of addressing column(s) of the frame, use the
more versatile DT[:, j]
form.
Parameters¶
j: int | str
    The index or name of a column to retrieve.
return: Frame
    Single-column frame containing the column at the specified index or with the given name.
KeyError
    raised if the column with the given name does not exist in the frame.
IndexError
    raised if the column does not exist at the provided index.
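For instance (sketch):

import datatable as dt

DT = dt.Frame(A=[1, 2, 3], B=["x", "y", "z"])
DT["A"]    # single-column frame, selected by name
DT[0]      # the same column, selected by index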
datatable.Frame.__getstate__()¶
This method allows the frame to be pickle
-able.
Pickling a Frame involves saving it into a bytes
object in Jay format,
but may be less efficient than saving into a file directly because Python
creates a copy of the data for the bytes object.
See .to_jay()
for more details and caveats about saving into Jay
format.
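A minimal sketch of pickling a frame:

import pickle
import datatable as dt

DT = dt.Frame(A=range(3))
blob = pickle.dumps(DT)    # internally saved as a bytes object in Jay format
DT2 = pickle.loads(blob)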
datatable.Frame.__iter__()¶
Returns an iterator over the frame’s columns.
The iterator is a light-weight object of type frame_iterator
,
which yields consequent columns of the frame with each iteration.
Thus, the iterator produces the sequence frame[0], frame[1],
frame[2], ...
until the end of the frame. This works even if
the user adds or deletes columns in the frame while iterating.
Be careful when inserting/deleting columns at an index that was
already iterated over, as it will cause some columns to be
skipped or visited more than once.
This method is not intended for manual use. Instead, it is
invoked by Python runtime either when you call iter()
,
or when you use the frame in a loop:
for column in frame:
    # column is a Frame of shape (frame.nrows, 1)
    ...
datatable.Frame.__len__()¶
Returns the number of columns in the frame; this method is used by
Python’s built-in function len().
datatable.Frame.__repr__()¶
Returns a simple representation of the frame as a string. This
method is used by Python’s built-in function repr()
.
The returned string has the following format:
f"<Frame#{ID} {nrows}x{ncols}>"
where {ID}
is the value of id(frame)
in hex format. Thus,
each frame has its own unique id, though after one frame is
deleted its id may be reused by another frame.
datatable.Frame.__reversed__()¶
Returns an iterator over the frame’s columns in reverse order.
This is similar to .__iter__()
, except that the columns
are returned in the reverse order, i.e. frame[-1]
, frame[-2]
,
frame[-3]
, etc.
This function is not intended for manual use. Instead, it is
invoked by Python builtin function reversed()
.
datatable.Frame.__setitem__()¶
This method updates values within the frame, or adds new columns to the frame.
All parameters have the same meaning as in the getter
DT[i, j, ...]
, with the only
restriction that j
must select columns by reference (i.e. there
could not be any computed columns there). On the other hand, j
may
contain columns that do not exist in the frame yet: these columns will be
created.
Parameters¶
i: ...
    Row selector.
j: ...
    Column selector. Computed columns are forbidden, but not-existing (new) columns are allowed.
by: by
    Groupby condition.
join: join
    Join criterion.
R: FExpr | List[FExpr] | Frame | type | None | bool | int | float | str
    The replacement for the selection on the left-hand-side.
None
|bool
|int
|float
|str
A simple python scalar can be assigned to any-shape selection on the LHS. If
i
selects all rows (i.e. the assignment is of the formDT[:, j] = R
), then each column inj
will be replaced with a constant column containing the valueR
.If, on the other hand,
i
selects only some rows, then the type ofR
must be consistent with the type of column(s) selected inj
. In this case only cells in subset[i, j]
will be updated with the value ofR
; the columns may be promoted within their ltype if the value ofR
is large in magnitude.type
|stype
|ltype
Assigning a type to one or more columns will change the types of those columns. The row selector
i
must be “slice-all”:
.Frame
|FExpr
|List[FExpr]
When a frame or an expression is assigned, then the shape of the RHS must match the shape of the LHS. Similarly to the assignment of scalars, types must be compatible when assigning to a subset of rows.
See Also¶
dt.update()
– An alternative way to update values in the frame withinDT[i, j]
getter..replace()
– Search and replace for certain values within the entire frame.
A simplified form of the setter, suitable for a single-column replacement.
In this case j
may only be an integer or a string.
datatable.Frame.__sizeof__()¶
Return the size of this Frame in memory.
The function attempts to compute the total memory size of the frame as precisely as possible. In particular, it takes into account not only the size of data in columns, but also sizes of all auxiliary internal structures.
Special cases: if frame is a view (say, d2 = DT[:1000, :]
), then
the reported size will not contain the size of the data, because that
data “belongs” to the original datatable and is not copied. However if
a frame selects only a subset of columns (say, d3 = DT[:, :5]
),
then a view is not created and instead the columns are copied by
reference. Frame d3
will report the “full” size of its columns,
even though they do not occupy any extra memory compared to DT
.
This behavior may be changed in the future.
This function is not intended for manual use. Instead, in order to
get the size of a frame DT
, call sys.getsizeof(DT)
.
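For example (sketch):

import sys
import datatable as dt

DT = dt.Frame(A=range(1000))
print(sys.getsizeof(DT))    # estimated total memory footprint, in bytes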
datatable.Frame.__str__()¶
Returns a string with the Frame’s data formatted as a table, i.e. the same representation as displayed when trying to inspect the frame from Python console.
Different aspects of the stringification process can be controlled
via dt.options.display
options; but under the default settings
the returned string will be sufficiently small to fit into a
typical terminal window. If the frame has too many rows/columns,
then only a small sample near the start+end of the frame will be
rendered.
datatable.Frame.cbind()¶
Append columns of one or more frames
to the current Frame.
For example, if the current frame has n
columns, and you are
appending another frame with k
columns, then after this method
succeeds, the current frame will have n + k
columns. Thus, this
method is roughly equivalent to pandas.concat(axis=1)
.
The frames being cbound must all either have the same number of rows, or some of them may have only a single row. Such single-row frames will be automatically expanded, replicating the value as needed. This makes it easy to create constant columns or to append reduction results (such as min/max/mean/etc) to the current Frame.
If some of the frames
have an incompatible number of rows, then the
operation will fail with an dt.exceptions.InvalidOperationError
.
However, if you set the flag force
to True, then the error will no
longer be raised - instead all frames that are shorter than the others
will be padded with NAs.
If the frames being appended have the same column names as the current frame, then those names will be mangled to ensure that the column names in the current frame remain unique. A warning will also be issued in this case.
Parameters¶
frames: Frame | List[Frame] | None
    The list/tuple/sequence/generator expression of Frames to append to the current frame. The list may also contain None values, which will be simply skipped.
force: bool
    If True, allows Frames to be appended even if they have unequal number of rows. The resulting Frame will have number of rows equal to the largest among all Frames. Those Frames which have less than the largest number of rows, will be padded with NAs (with the exception of Frames having just 1 row, which will be replicated instead of filling with NAs).
return: None
    This method alters the current frame in-place, and doesn’t return anything.
InvalidOperationError
If trying to cbind frames with the number of rows different from
the current frame’s, and the option force
is not set.
Notes¶
Cbinding frames is a very cheap operation: the columns are copied by
reference, which means the complexity of the operation depends only
on the number of columns, not on the number of rows. Still, if you
are planning to cbind a large number of frames, it will be beneficial
to collect them in a list first and then call a single cbind()
instead of cbinding them one-by-one.
It is possible to cbind frames using the standard DT[i,j]
syntax:
df[:, update(**frame1, **frame2, ...)]
Or, if you need to append just a single column:
df["newcol"] = frame1
Examples¶
DT = dt.Frame(A=[1, 2, 3], B=[4, 7, 0])
frame1 = dt.Frame(N=[-1, -2, -5])
DT.cbind(frame1)
DT
   | A      B      N
   | int32  int32  int32
-- + -----  -----  -----
 0 | 1      4      -1
 1 | 2      7      -2
 2 | 3      0      -5
See also¶
dt.cbind()
– function for cbinding frames “out-of-place” instead of in-place;.rbind()
– method for row-binding frames.
datatable.Frame.colindex()¶
Return the position of the column
in the Frame.
The index of the first column is 0
, just as with regular python
lists.
Parameters¶
column: str | int | FExpr
If string, then this is the name of the column whose index you want to find.
If integer, then this represents a column’s index. The return
value is thus the same as the input argument column
, provided
that it is in the correct range. If the column
argument is
negative, then it is interpreted as counting from the end of the
frame. In this case the positive value column + ncols
is
returned.
Lastly, column
argument may also be an
f-expression such as f.A
or f[3]
. This
case is treated as if the argument was simply "A"
or 3
. More
complicated f-expressions are not allowed and will result in a
TypeError
.
int
The numeric index of the provided column
. This will be an
integer between 0
and self.ncols - 1
.
KeyError
    raised if the column with the requested name does not exist in the frame.
IndexError
    raised if the column index is outside of the range [-ncols; ncols).
Examples¶
df = dt.Frame(A=[3, 14, 15], B=["enas", "duo", "treis"],
C=[0, 0, 0])
df.colindex("B")
df.colindex(-1)
from datatable import f
df.colindex(f.A)
datatable.Frame.copy()¶
Make a copy of the frame.
The returned frame will be an identical copy of the original, including column names, types, and keys.
By default, copying is shallow with copy-on-write semantics. This means that only the minimal information about the frame is copied, while all the internal data buffers are shared between the copies. Nevertheless, due to the copy-on-write semantics, any changes made to one of the frames will not propagate to the other; instead, the data will be copied whenever the user attempts to modify it.
It is also possible to explicitly request a deep copy of the frame
by setting the parameter deep
to True
. With this flag, the
returned copy will be truly independent from the original. The
returned frame will also be fully materialized in this case.
Parameters¶
deep: bool
    Flag indicating whether to return a “shallow” (default), or a “deep” copy of the original frame.
return: Frame
    A new Frame, which is the copy of the current frame.
Examples¶
DT1 = dt.Frame(range(5))
DT2 = DT1.copy()
DT2[0, 0] = -1
DT2
   | C0
   | int32
-- + -----
 0 | -1
 1 | 1
 2 | 2
 3 | 3
 4 | 4
DT1
   | C0
   | int32
-- + -----
 0 | 0
 1 | 1
 2 | 2
 3 | 3
 4 | 4
Notes¶
Non-deep frame copy is a very low-cost operation: its speed depends on the number of columns only, not on the number of rows. On a regular laptop copying a 100-column frame takes about 30-50µs.
Deep copying is more expensive, since the data has to be physically written to new memory, and if the source columns are virtual, then they need to be computed too.
Another way to create a copy of the frame is using a
DT[i, j]
expression (however, this will not copy the key property):DT[:, :]
Frame
class also supports copying via the standard Python librarycopy
import copy

DT_shallow_copy = copy.copy(DT)
DT_deep_copy = copy.deepcopy(DT)
datatable.Frame.countna()¶
Report the number of NA values in each column of the frame.
Parameters¶
return: Frame
    The frame will have one row and the same number/names of columns as in the current frame. All columns will have stype int64.
Examples¶
import math
DT = dt.Frame(A=[1, 5, None], B=[math.nan]*3, C=[None, None, 'bah!'])
DT.countna()
   | A      B      C
   | int64  int64  int64
-- + -----  -----  -----
 0 | 1      3      2
DT.countna().to_tuples()[0]
(1, 3, 2)
See Also¶
.countna1()
– similar to this method, but operates on a single-column frame only, and returns a number instead of a Frame.dt.count()
– function for counting non-NA (“valid”) values in a column; can also be applied per-group.
datatable.Frame.countna1()¶
Return the number of NA values in a single-column Frame.
This function is a shortcut for:
DT.countna()[0, 0]
See Also¶
.countna()
– similar to this method, but can be applied to a Frame with an arbitrary number of columns.dt.count()
– function for counting non-NA (“valid”) values in a column; can also be applied per-group.
datatable.Frame.export_names()¶
Return a tuple of f-expressions for all columns of the frame.
For example, if the frame has columns “A”, “B”, and “C”, then this
method will return a tuple of expressions (f.A, f.B, f.C)
. If you
assign these to, say, variables A
, B
, and C
, then you
will be able to write column expressions using the column names
directly, without using the f
symbol:
A, B, C = DT.export_names()
DT[A + B > C, :]
The variables that are “exported” refer to each column by name. This means that you can use the variables even after reordering the columns. In addition, the variables will work not only for the frame they were exported from, but also for any other frame that has columns with the same names.
Parameters¶
return: Tuple[Expr, ...]
    The length of the tuple is equal to the number of columns in the frame. Each element of the tuple is a datatable expression, and can be used primarily with the DT[i,j] notation.
Notes¶
This method is effectively equivalent to:
def export_names(self):
    return tuple(f[name] for name in self.names)
If you want to export only a subset of column names, then you can either subset the frame first, or use
*
-notation to ignore the names that you do not plan to use:A, B = DT[:, :2].export_names() # export the first two columns A, B, *_ = DT.export_names() # same
Variables that you use in code do not have to have the same names as the columns:
Price, Quantity = DT[:, ["sale price", "quant"]].export_names()
datatable.Frame.head()¶
Return the first n
rows of the frame.
If the number of rows in the frame is less than n
, then all rows
are returned.
This is a convenience function and it is equivalent to DT[:n, :]
.
Parameters¶
n: int
    The maximum number of rows to return, 10 by default. This number cannot be negative.
return: Frame
    A frame containing the first up to n rows from the original frame, and same columns.
Examples¶
DT = dt.Frame(A=["apples", "bananas", "cherries", "dates",
"eggplants", "figs", "grapes", "kiwi"])
DT.head(4)
A | ||
---|---|---|
str32 | ||
0 | apples | |
1 | bananas | |
2 | cherries | |
3 | dates |
datatable.Frame.key¶
The tuple of column names that are the primary key for this frame.
If the frame has no primary key, this property returns an empty tuple.
The primary key columns are always located at the beginning of the frame, and therefore the following holds:
DT.key == DT.names[:len(DT.key)]
Assigning to this property will make the Frame keyed by the specified column(s). The key columns will be moved to the front, and the Frame will be sorted. The values in the key columns must be unique.
Parameters¶
Tuple[str, ...]
When used as a getter, returns the tuple of names of the primary key columns.
str
| List[str]
| Tuple[str, ...]
| None
Specify a column or a list of columns that will become the new primary key of the Frame. Object columns cannot be used for a key. The values in the key column must be unique; if multiple columns are assigned as the key, then their combined (tuple-like) values must be unique.
If new_key
is None
, then this is equivalent to deleting the
key. When the key is deleted, the key columns remain in the frame,
they merely stop being marked as “key”.
ValueError
Raised when the values in the key column(s) are not unique.
KeyError
Raised when new_key
contains a column name that doesn’t exist
in the Frame.
Examples¶
DT = dt.Frame(A=range(5), B=['one', 'two', 'three', 'four', 'five'])
DT.key = 'B'
DT
B | A | |
---|---|---|
str32 | int32 | |
five | 4 | |
four | 3 | |
one | 0 | |
three | 2 | |
two | 1 |
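A key can also span several columns, or be removed by assigning None, per the setter semantics described above. A short sketch continuing this example:
DT.key = ['B', 'A']  # composite key: the combined (B, A) values must be unique
DT.key = None        # delete the key; columns B and A remain in the frame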
datatable.Frame.keys()¶
datatable.Frame.kurt()¶
Calculate the excess kurtosis for each column in the frame.
Parameters¶
Frame
The frame will have one row and the same number/names
of columns as in the current frame. All the columns
will have float64
stype. For non-numeric columns
this function returns NA values.
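Examples¶
A minimal usage sketch (output values omitted); per the description above, the non-numeric column yields an NA:
DT = dt.Frame(A=[1.0, 5.0, 2.5, 4.0], B=["a", "b", "c", "d"])
DT.kurt()           # one-row frame: float64 kurtosis for A, NA for B
DT[:, 'A'].kurt1()  # the same statistic for a single column, as a scalar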
datatable.Frame.kurt1()¶
Calculate the excess kurtosis for a one-column frame and return it as a scalar.
This function is a shortcut for:
DT.kurt()[0, 0]
Parameters¶
None
| float
None
is returned for non-numeric columns.
ValueError
If called on a Frame that has more or less than one column.
datatable.Frame.ltypes¶
datatable.Frame.materialize()¶
Force all data in the Frame to be laid out physically.
In datatable, a Frame may contain “virtual” columns, i.e. columns whose data is computed on-the-fly. This allows us to have better performance for certain types of computations, while also reducing the total memory footprint. The use of virtual columns is generally transparent to the user, and datatable will materialize them as needed.
However, there could be situations where you might want to materialize your Frame explicitly. In particular, materialization will carry out all delayed computations and break internal references on other Frames’ data. Thus, for example if you subset a large frame to create a smaller subset, then the new frame will carry an internal reference to the original, preventing it from being garbage-collected. However, if you materialize the small frame, then the data will be physically copied, allowing the original frame’s memory to be freed.
Parameters¶
bool
If True, then, in addition to de-virtualizing all columns, this method will also copy all memory-mapped columns into the RAM.
When you open a Jay file, the Frame that is created will contain
memory-mapped columns whose data still resides on disk. Calling
.materialize(to_memory=True)
will force the data to be loaded
into the main memory. This may be beneficial if you are concerned
about the disk speed, or if the file is on a removable drive, or
if you want to delete the source file.
None
This operation modifies the frame in-place.
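Examples¶
A sketch of the subsetting scenario described above:
DT = dt.Frame(A=range(10**6))
small = DT[:5, :]                  # may hold an internal reference to DT's data
small.materialize()                # physically copies the 5 rows
small.materialize(to_memory=True)  # additionally loads memory-mapped data into RAM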
datatable.Frame.max()¶
Find the largest value in each column of the frame.
Parameters¶
Frame
The frame will have one row and the same number, names and stypes of columns as in the current frame. For string/object columns this function returns NA values.
datatable.Frame.max1()¶
Return the largest value in a single-column Frame. The frame’s stype must be numeric.
This function is a shortcut for:
DT.max()[0, 0]
Parameters¶
bool
| int
| float
The returned value corresponds to the stype of the frame.
ValueError
If called on a Frame that has more or less than one column.
datatable.Frame.mean()¶
Calculate the mean value for each column in the frame.
Parameters¶
Frame
The frame will have one row and the same number/names
of columns as in the current frame. All columns will have float64
stype. For string/object columns this function returns NA values.
datatable.Frame.mean1()¶
Calculate the mean value for a single-column Frame.
This function is a shortcut for:
DT.mean()[0, 0]
Parameters¶
None
| float
None
is returned for string/object columns.
ValueError
If called on a Frame that has more or less than one column.
datatable.Frame.meta¶
Frame’s meta information.
This property contains meta information, if any, as set by datatable functions and methods. It is a settable property, so that users can also update it with any information relevant to a particular frame.
It is not guaranteed that the existing meta information will be preserved by the functions and methods called on the frame. In particular, it is not preserved when exporting data into a Jay file or pickling the data. This behavior may change in the future.
The default value for this property is None
.
Parameters¶
dict
| None
If the frame carries any meta information, the corresponding meta
information dictionary is returned, None
is returned otherwise.
dict
| None
New meta information.
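Examples¶
A minimal sketch of reading and updating the property (the dictionary contents here are arbitrary):
DT = dt.Frame(A=[1, 2, 3])
DT.meta                               # None by default
DT.meta = {"origin": "manual entry"}
DT.meta["origin"]                     # 'manual entry'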
datatable.Frame.min()¶
Find the smallest value in each column of the frame.
Parameters¶
Frame
The frame will have one row and the same number, names and stypes of columns as in the current frame. For string/object columns this function returns NA values.
datatable.Frame.min1()¶
Find the smallest value in a single-column Frame. The frame’s stype must be numeric.
This function is a shortcut for:
DT.min()[0, 0]
Parameters¶
bool
| int
| float
The returned value corresponds to the stype of the frame.
ValueError
If called on a Frame that has more or less than one column.
datatable.Frame.mode()¶
datatable.Frame.mode1()¶
Find the mode for a single-column Frame.
This function is a shortcut for:
DT.mode()[0, 0]
Parameters¶
bool
| int
| float
| str
| object
The returned value corresponds to the stype of the column.
ValueError
If called on a Frame that has more or less than one column.
datatable.Frame.names¶
The tuple of names of all columns in the frame.
Each name is a non-empty string not containing any ASCII control characters, and jointly the names are unique within the frame.
This property is also assignable: setting DT.names
has the effect
of renaming the frame’s columns without changing their order. When
renaming, the length of the new list of names must be the same as the
number of columns in the frame. It is also possible to rename just a
few of the columns by assigning a dictionary {oldname: newname}
.
Any column not listed in the dictionary will keep its old name.
When setting new column names, we will verify whether they satisfy the requirements mentioned above. If not, a warning will be emitted and the names will be automatically mangled.
Parameters¶
Tuple[str, ...]
When used in getter form, this property returns the names of all
frame’s columns, as a tuple. The length of the tuple is equal to
the number of columns in the frame, .ncols
.
List[str?]
| Tuple[str?, ...]
| Dict[str, str?]
| None
The most common form is to assign the list or tuple of new
column names. The length of the new list must be equal to the
number of columns in the frame. Some (or all) elements in the list
may be None, indicating that the corresponding column should have
an auto-generated name.
If new_names is a dictionary, then it provides a mapping from
old to new column names. The dictionary may contain fewer entries
than the number of columns in the frame: the columns not mentioned
in the dictionary will retain their names.
Setting the .names
to None
is equivalent to using the
del
keyword: the names will be set to their default values,
which are usually C0, C1, ...
.
ValueError
Raised if the length of the new list/tuple of names does not match the number of columns in the frame.
KeyError
Raised if new_names is a dictionary containing a column name that does not exist in the frame.
Examples¶
DT = dt.Frame([[1], [2], [3]])
DT.names = ['A', 'B', 'C']
DT.names
DT.names = {'B': 'middle'}
DT.names
del DT.names
DT.names
datatable.Frame.ncols¶
datatable.Frame.nmodal()¶
Calculate the modal frequency for each column in the frame.
Parameters¶
Frame
The frame will have one row and the same number/names
of columns as in the current frame. All the columns
will have int64
stype.
See Also¶
.nmodal1()
– similar to this method, but operates on a single-column frame only, and returns a scalar value instead of a Frame.
datatable.Frame.nmodal1()¶
Calculate the modal frequency for a single-column Frame.
This function is a shortcut for:
DT.nmodal()[0, 0]
datatable.Frame.nrows¶
Number of rows in the Frame.
Assigning to this property will change the height of the Frame, either by truncating if the new number of rows is smaller than the current, or filling with NAs if the new number of rows is greater.
Increasing the number of rows of a keyed Frame is not allowed.
Parameters¶
int
The number of rows can be either zero or a positive integer.
int
The new number of rows for the frame; this must be a non-negative integer.
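Examples¶
A short sketch of resizing a frame through this property:
DT = dt.Frame(A=[1, 2, 3, 4, 5])
DT.nrows = 3   # truncate to the first 3 rows
DT.nrows = 6   # grow back: the 3 new rows are filled with NAs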
datatable.Frame.nunique()¶
Count the number of unique values for each column in the frame.
Parameters¶
Frame
The frame will have one row and the same number/names
of columns as in the current frame. All the columns
will have int64
stype.
See Also¶
.nunique1()
– similar to this method, but operates on a single-column frame only, and returns a scalar value instead of a Frame.
datatable.Frame.nunique1()¶
Count the number of unique values for a one-column frame and return it as a scalar.
This function is a shortcut for:
DT.nunique()[0, 0]
See Also¶
.nunique()
– similar to this method, but can be applied to a Frame with an arbitrary number of columns.
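Examples¶
For illustration, a minimal sketch covering both methods:
DT = dt.Frame(A=[1, 1, 2, 3], B=["x", "x", "y", "x"])
DT.nunique()           # one-row frame with an int64 count per column
DT[:, 'B'].nunique1()  # scalar count for a single column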
datatable.Frame.rbind()¶
Append rows of frames
to the current frame.
This is equivalent to list.extend()
in Python: the frames are
combined by rows, i.e. rbinding a frame of shape [n x k] to a Frame
of shape [m x k] produces a frame of shape [(m + n) x k].
This method modifies the current frame in-place. If you do not want
the current frame modified, then use the dt.rbind()
function.
If frame(s) being appended have columns of types different from the current frame, then these columns will be promoted according to the standard promotion rules. In particular, booleans can be promoted into integers, which in turn get promoted into floats. However, they are not promoted into strings or objects.
If frames have columns of incompatible types, a TypeError will be raised.
If you need to append multiple frames, then it is more efficient to
collect them into an array first and then do a single rbind()
, than
it is to append them one-by-one in a loop.
Appending data to a frame opened from disk will force loading the current frame into memory, which may fail with an OutOfMemory exception if the frame is sufficiently big.
Parameters¶
Frame
| List[Frame]
One or more frames to append. These frames should have the same
columnar structure as the current frame (unless option force
is
used).
bool
If True, then the frames are allowed to have mismatching set of columns. Any gaps in the data will be filled with NAs.
bool
If True (default), the columns in frames are matched by their
names. For example, if one frame has columns [“colA”, “colB”,
“colC”] and the other [“colB”, “colA”, “colC”] then we will swap
the order of the first two columns of the appended frame before
performing the append. However if bynames
is False, then the
column names will be ignored, and the columns will be matched
according to their order, i.e. i-th column in the current frame
to the i-th column in each appended frame.
None
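Examples¶
A minimal sketch of in-place appending, including the force option described above:
DT = dt.Frame(A=[1, 2], B=["x", "y"])
DT.rbind(dt.Frame(A=[3], B=["z"]))     # DT now has 3 rows
DT.rbind(dt.Frame(A=[4]), force=True)  # the missing column B is filled with NA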
datatable.Frame.replace()¶
Replace given value(s) replace_what
with replace_with
in the entire Frame.
For each replace value, this method operates only on columns of types
appropriate for that value. For example, if replace_what
is a list
[-1, math.inf, None, "??"]
, then the value -1
will be replaced in integer
columns only, math.inf
only in real columns, None
in columns of all types,
and finally "??"
only in string columns.
The replacement value must match the type of the target being replaced,
otherwise an exception will be thrown. That is, a bool must be replaced with a
bool, an int with an int, a float with a float, and a string with a string.
The None
value (representing NA) matches any column type, and therefore can
be used as either replacement target, or replace value for any column. In
particular, the following is valid: DT.replace(None, [-1, -1.0, ""])
. This
will replace NA values in int columns with -1
, in real columns with -1.0
,
and in string columns with an empty string.
The replace operation never causes a column to change its logical type. Thus,
an integer column will remain integer, a string column will remain string, etc.
However, replacing may cause a column to change its stype, provided that
ltype remains constant. For example, replacing 0
with -999
within an int8
column will cause that column to be converted into the int32
stype.
Parameters¶
None
| bool
| int
| float
| list
| dict
Value(s) to search for and replace.
single value
| list
The replacement value(s). If replace_what
is a single value, then this
must be a single value too. If replace_what
is a list, then this could
be either a single value, or a list of the same length. If replace_what
is a dict, then this value should not be passed.
None
Nothing is returned, the replacement is performed in-place.
Examples¶
df = dt.Frame([1, 2, 3] * 3)
df.replace(1, -1)
df
C0 | ||
---|---|---|
int32 | ||
0 | -1 | |
1 | 2 | |
2 | 3 | |
3 | -1 | |
4 | 2 | |
5 | 3 | |
6 | -1 | |
7 | 2 | |
8 | 3 |
df.replace({-1: 100, 2: 200, "foo": None})
df
C0 | ||
---|---|---|
int32 | ||
0 | 100 | |
1 | 200 | |
2 | 3 | |
3 | 100 | |
4 | 200 | |
5 | 3 | |
6 | 100 | |
7 | 200 | |
8 | 3 |
datatable.Frame.sd()¶
Calculate the standard deviation for each column in the frame.
Parameters¶
Frame
The frame will have one row and the same number/names
of columns as in the current frame. All the columns
will have float64
stype. For non-numeric columns
this function returns NA values.
datatable.Frame.sd1()¶
Calculate the standard deviation for a one-column frame and return it as a scalar.
This function is a shortcut for:
DT.sd()[0, 0]
Parameters¶
None
| float
None
is returned for non-numeric columns.
ValueError
If called on a Frame that has more or less than one column.
datatable.Frame.skew()¶
Calculate the skewness for each column in the frame.
Parameters¶
Frame
The frame will have one row and the same number/names
of columns as in the current frame. All the columns
will have float64
stype. For non-numeric columns
this function returns NA values.
datatable.Frame.skew1()¶
Calculate the skewness for a one-column frame and return it as a scalar.
This function is a shortcut for:
DT.skew()[0, 0]
Parameters¶
None
| float
None
is returned for non-numeric columns.
ValueError
If called on a Frame that has more or less than one column.
datatable.Frame.shape¶
datatable.Frame.sort()¶
Sort frame by the specified column(s).
Parameters¶
List[str | int]
Names or indices of the columns to sort by. If no columns are given, the Frame will be sorted on all columns.
Frame
New Frame sorted by the provided column(s). The current frame remains unmodified.
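Examples¶
A minimal sketch; the original frame is left unmodified:
DT = dt.Frame(A=[3, 1, 2], B=["c", "a", "b"])
DT.sort("A")  # new frame ordered by column A
DT.sort(0)    # the same, selecting the sort column by its index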
datatable.Frame.source¶
The name of the file where this frame was loaded from.
This is a read-only property that describes the origin of the frame. When a frame is loaded from a Jay or CSV file, this property will contain the name of that file. Similarly, if the frame was opened from a URL or from a shell command, the source will report the original URL / the command.
Certain sources may be converted into a Frame only partially,
in such case the source
property will attempt to reflect this
fact. For example, when opening a multi-file zip archive, the
source will contain the name of the file within the archive.
Similarly, when opening an XLS file with several worksheets, the
source property will contain the name of the XLS file, the name of
the worksheet, and possibly even the range of cells that were read.
Parameters¶
str
| None
If the frame was loaded from a file or similar resource, the
name of that file is returned. If the frame was computed, or its
data modified, the property will return None
.
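Examples¶
A minimal sketch, assuming a file data.csv exists in the current directory:
DT = dt.fread("data.csv")
DT.source               # the name of the file the data was read from
dt.Frame(A=[1]).source  # None: this frame was constructed, not loaded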
datatable.Frame.stype¶
This property is deprecated and will be removed in version 1.2.0.
Please use .types
instead.
The common dt.stype
for all columns.
This property is well-defined only for frames where all columns have the same stype.
Parameters¶
stype
| None
For frames where all columns have the same stype, this common
stype is returned. If a frame has 0 columns, None
will be
returned.
InvalidOperationError
This exception will be raised if the columns in the frame have different stypes.
datatable.Frame.stypes¶
datatable.Frame.sum()¶
Calculate the sum of all values for each column in the frame.
Parameters¶
Frame
The frame will have one row and the same number/names
of columns as in the current frame. All the columns
will have float64
stype. For non-numeric columns
this function returns NA values.
datatable.Frame.sum1()¶
Calculate the sum of all values for a one-column frame and return it as a scalar.
This function is a shortcut for:
DT.sum()[0, 0]
Parameters¶
None
| float
None
is returned for non-numeric columns.
ValueError
If called on a Frame that has more or less than one column.
datatable.Frame.tail()¶
Return the last n
rows of the frame.
If the number of rows in the frame is less than n
, then all rows
are returned.
This is a convenience function and it is equivalent to DT[-n:, :]
(except when n
is 0).
Parameters¶
int
The maximum number of rows to return, 10 by default. This number cannot be negative.
Frame
A frame containing the last up to n
rows from the original
frame, and same columns.
Examples¶
DT = dt.Frame(A=["apples", "bananas", "cherries", "dates",
"eggplants", "figs", "grapes", "kiwi"])
DT.tail(3)
A | ||
---|---|---|
str32 | ||
0 | figs | |
1 | grapes | |
2 | kiwi |
datatable.Frame.to_arrow()¶
Convert this frame into a pyarrow.Table
object. The pyarrow
module must be installed.
The conversion is multi-threaded and done in C++, but it does involve creating a copy of the data, except for the cases when the data was originally imported from Arrow. This is caused by differences in the data storage formats of datatable and Arrow.
Parameters¶
pyarrow.Table
A Table
object is always returned, even if the source is a
single-column datatable Frame.
ImportError
If the pyarrow
module is not installed.
datatable.Frame.to_csv()¶
Write the contents of the Frame into a CSV file.
This method uses multiple threads to serialize the Frame’s data. The
number of threads can be configured using the global option
dt.options.nthreads
.
The method supports simple writing to file, appending to an existing file, or creating a python string if no filename was provided. Optionally, the output could be gzip-compressed.
Parameters¶
str
Path to the output CSV file that will be created. If the file already exists, it will be overwritten. If no path is given, then the Frame will be serialized into a string, and that string will be returned.
csv.QUOTE_*
| "minimal"
| "all"
| "nonnumeric"
| "none"
"minimal"
|csv.QUOTE_MINIMAL
quote the string fields only as necessary, i.e. if the string starts or ends with whitespace, or contains quote characters, the separator, or any of the C0 control characters (including newlines, etc).
"all"
|csv.QUOTE_ALL
all fields will be quoted, both string, numeric, and boolean.
"nonnumeric"
|csv.QUOTE_NONNUMERIC
all string fields will be quoted.
"none"
|csv.QUOTE_NONE
none of the fields will be quoted. This option must be used at user’s own risk: the file produced may not be valid CSV.
bool
| "auto"
This option controls whether or not to write headers into the
output file. If this option is not given (or equal to …), then
the headers will be written unless the option append
is True
and the file path
already exists. Thus, by default the headers
will be written in all cases except when appending content into
an existing file.
bool
If True, then insert the byte-order mark into the output file (the option is False by default). Even if the option is True, the BOM will not be written when appending data to an existing file.
According to Unicode standard, including BOM into text files is “neither required nor recommended”. However, some programs (e.g. Excel) may not be able to recognize file encoding without this mark.
bool
If True, then all floating-point values will be printed in hex
format (equivalent to %a format in C printf
). This format is
around 3 times faster to write/read compared to usual decimal
representation, so its use is recommended if you need maximum
speed.
None
| "gzip"
| "auto"
Which compression method to use for the output stream. The default
is “auto”, which tries to infer the compression method from the
output file’s name. The only compression format currently supported
is “gzip”. Compression may not be used when append
is True.
bool
If True, some extra information will be printed to the console, which may help to debug the inner workings of the algorithm.
"mmap"
| "write"
| "auto"
Which method to use for writing to disk. On certain systems ‘mmap’ gives a better performance; on other OSes ‘mmap’ may not work at all.
None
| str
| bytes
None if path
is non-empty. This is the most common case: the
output is written to the file provided.
String containing the CSV text as if it would have been written to a file, if the path is empty or None. If the compression is turned on, a bytes object will be returned instead.
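Examples¶
A minimal sketch of the three output modes described above (plain file, compressed file, string):
DT = dt.Frame(A=[1, 2, 3], B=["x", "y", "z"])
DT.to_csv("out.csv")     # write to a file; returns None
DT.to_csv("out.csv.gz")  # gzip compression inferred from the file name
text = DT.to_csv()       # no path: the CSV text is returned as a string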
datatable.Frame.to_dict()¶
Convert the frame into a dictionary of lists, by columns.
In Python 3.6+ the order of records in the dictionary will be the same as the order of columns in the frame.
Parameters¶
Dict[str, List]
Dictionary with .ncols
records. Each record
represents a single column: the key is the column’s name, and the
value is the list with the column’s data.
Examples¶
DT = dt.Frame(A=[1, 2, 3], B=["aye", "nay", "tain"])
DT.to_dict()
See also¶
.to_list()
: convert the frame into a list of lists.to_tuples()
: convert the frame into a list of tuples by rows
datatable.Frame.to_jay()¶
Save this frame to a binary file on disk, in .jay
format.
Parameters¶
str
| None
The destination file name. Although not necessary, we recommend
using extension “.jay” for the file. If the file exists, it will
be overwritten.
If this argument is omitted, the file will be created in memory
instead, and returned as a bytes
object.
'mmap'
| 'write'
| 'auto'
Which method to use for writing the file to disk. The “write”
method is more portable across different operating systems, but
may be slower. This parameter has no effect when path
is
omitted.
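Examples¶
A minimal round-trip sketch:
DT = dt.Frame(A=[1, 2, 3])
DT.to_jay("data.jay")       # save to disk
blob = DT.to_jay()          # no path: an in-memory bytes object is returned
DT2 = dt.fread("data.jay")  # a Jay file opens via memory-mapping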
datatable.Frame.to_list()¶
Convert the frame into a list of lists, by columns.
Parameters¶
List[List]
A list of .ncols
lists, each inner list
representing one column of the frame.
Examples¶
DT = dt.Frame(A=[1, 2, 3], B=["aye", "nay", "tain"])
DT.to_list()
dt.Frame(id=range(10)).to_list()
datatable.Frame.to_numpy()¶
Convert frame into a 2D numpy array, optionally forcing it into the specified type.
In a limited set of circumstances the returned numpy array will be created as a data view, avoiding copying the data. This happens if all of these conditions are met:
the frame has only 1 column, which is not virtual;
the column’s type is not string;
the type argument was not used.
In all other cases the returned numpy array will have a copy of the frame’s data. If the frame has multiple columns of different stypes, then the values will be upcasted into the smallest common stype.
If the frame has any NA values, then the returned numpy array will
be an instance of numpy.ma.masked_array
.
Parameters¶
Type
| <type-like>
Cast frame into this type before converting it into a numpy
array. Here “type-like” can be any value that is acceptable to the
dt.Type
constructor.
int
Convert a single column instead of the whole frame. This column index can be negative, indicating columns counted from the end of the frame.
numpy.ndarray
| numpy.ma.core.MaskedArray
The returned array will be 2-dimensional with the same .shape
as the original frame. However, if the option column
was used,
then the returned array will be 1-dimensional with the length of
.nrows
.
A masked array is returned if the frame contains NA values but the corresponding numpy array does not support NAs.
ImportError
If the numpy
module is not installed.
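Examples¶
A minimal sketch; passing the single-column selector as the keyword column= is an assumption based on the parameter description above:
DT = dt.Frame(A=[1, 2, 3], B=[0.5, None, 2.5])
arr = DT.to_numpy()          # 2-D; a masked array, since column B contains an NA
col = DT.to_numpy(column=0)  # 1-D array with the data of the first column only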
datatable.Frame.to_pandas()¶
Convert this frame into a pandas DataFrame.
If the frame being converted has one or more key columns, those columns will become the index in the pandas DataFrame.
Parameters¶
pandas.DataFrame
Pandas dataframe of shape (nrows, ncols-nkeys)
.
ImportError
If the pandas
module is not installed.
datatable.Frame.to_tuples()¶
datatable.Frame.type¶
The common dt.Type
for all columns.
This property is well-defined only for frames where all columns have the same type.
Parameters¶
Type
| None
For frames where all columns have the same type, this common
type is returned. If a frame has 0 columns, None
will be
returned.
InvalidOperationError
This exception will be raised if the columns in the frame have different types.
datatable.Frame.types¶
datatable.ltype¶
This class is deprecated and will be removed in version 1.2.0.
Please use dt.Type
instead.
Enumeration of possible “logical” types of a column.
Logical type is the type stripped away from the details of its physical
storage. For example, ltype.int
represents an integer. Under the hood,
this integer can be stored in several “physical” formats: from
stype.int8
to stype.int64
. Thus, there is a one-to-many relationship
between ltypes and stypes.
Values¶
The following ltype values are currently available:
ltype.bool
ltype.int
ltype.real
ltype.str
ltype.time
ltype.obj
Methods¶
Examples¶
dt.ltype.bool
dt.ltype("int32")
For each ltype, you can find the set of stypes that correspond to it:
dt.ltype.real.stypes
dt.ltype.time.stypes
datatable.ltype.__new__()¶
Find an ltype corresponding to value
.
This method is similar to dt.stype.__new__()
, except that it
returns an ltype instead of an stype.
datatable.Namespace¶
A namespace is an environment that provides lazy access to columns of
a frame when performing computations within
DT[i,j,...]
.
This class should not be instantiated directly, instead use the
singleton instances f
and g
exported from the datatable
module.
Special methods¶
.__getattribute__() – access columns as attributes.
.__getitem__() – access columns by their names / indices.
datatable.Namespace.__getitem__()¶
Retrieve column(s) by their indices/names/types.
By “retrieve” we actually mean that an expression is created
such that when that expression is used within the
DT[i,j]
call, it would locate and
return the specified column(s).
Parameters¶
int
| str
| slice
| None
| type
| stype
| ltype
| list
| tuple
The column selector:
int
Retrieve the column at the specified index. For example,
f[0]
denotes the first column, whilef[-1]
is the last.str
Retrieve a column by name.
slice
Retrieve a slice of columns from the namespace. Both integer and string slices are supported.
Note that for string slicing, both the start and stop column names are included, unlike integer slicing, where the stop value is not included. Have a look at the examples below for more clarity.
None
Retrieve no columns (an empty columnset).
type
|stype
|ltype
Retrieve columns matching the specified type.
list/tuple
Retrieve columns matching the column names/column positions/column types within the list/tuple.
For example,
f[0, -1]
will return the first and last columns. Have a look at the examples below for more clarity.
FExpr
An expression that selects the specified column from a frame.
See also¶
f-expressions – user guide on using f-expressions.
Notes¶
f-expressions containing a list/tuple of
column names/column positions/column types are
accepted within the j
selector.
Examples¶
from datatable import dt, f, by
df = dt.Frame({'A': [1, 2, 3, 4],
'B': ["tolu", "sammy", "ogor", "boondocks"],
'C': [9.0, 10.0, 11.0, 12.0]})
df
A | B | C | ||
---|---|---|---|---|
int32 | str32 | float64 | ||
0 | 1 | tolu | 9 | |
1 | 2 | sammy | 10 | |
2 | 3 | ogor | 11 | |
3 | 4 | boondocks | 12 |
Select by column position:
df[:, f[0]]
A | ||
---|---|---|
int32 | ||
0 | 1 | |
1 | 2 | |
2 | 3 | |
3 | 4 |
Select by column name:
df[:, f["A"]]
A | ||
---|---|---|
int32 | ||
0 | 1 | |
1 | 2 | |
2 | 3 | |
3 | 4 |
Select a slice:
df[:, f[0 : 2]]
A | B | ||
---|---|---|---|
int32 | str32 | ||
0 | 1 | tolu | |
1 | 2 | sammy | |
2 | 3 | ogor | |
3 | 4 | boondocks |
Slicing with column names:
df[:, f["A" : "C"]]
A | B | C | ||
---|---|---|---|---|
int32 | str32 | float64 | ||
0 | 1 | tolu | 9 | |
1 | 2 | sammy | 10 | |
2 | 3 | ogor | 11 | |
3 | 4 | boondocks | 12 |
Note
For string slicing, both the start and stop are included; for integer slicing the stop is not included.
Select by data type:
df[:, f[dt.str32]]
B | ||
---|---|---|
str32 | ||
0 | tolu | |
1 | sammy | |
2 | ogor | |
3 | boondocks |
df[:, f[float]]
C | ||
---|---|---|
float64 | ||
0 | 9 | |
1 | 10 | |
2 | 11 | |
3 | 12 |
Select a list/tuple of columns by position:
df[:, f[0, 1]]
A | B | ||
---|---|---|---|
int32 | str32 | ||
0 | 1 | tolu | |
1 | 2 | sammy | |
2 | 3 | ogor | |
3 | 4 | boondocks |
Or by column names:
df[:, f[("A", "B")]]
A | B | ||
---|---|---|---|
int32 | str32 | ||
0 | 1 | tolu | |
1 | 2 | sammy | |
2 | 3 | ogor | |
3 | 4 | boondocks |
Note that in the code above, the parentheses are unnecessary, since tuples in python are defined by the presence of a comma. So the below code works as well:
df[:, f["A", "B"]]
A | B | ||
---|---|---|---|
int32 | str32 | ||
0 | 1 | tolu | |
1 | 2 | sammy | |
2 | 3 | ogor | |
3 | 4 | boondocks |
Select a list/tuple of data types:
df[:, f[int, float]]
A | C | ||
---|---|---|---|
int32 | float64 | ||
0 | 1 | 9 | |
1 | 2 | 10 | |
2 | 3 | 11 | |
3 | 4 | 12 |
Passing None
within an f-expression returns an empty columnset:
df[:, f[None]]
0 | |
1 | |
2 | |
3 |
datatable.Namespace.__getattribute__()¶
Retrieve a column from the namespace by name
.
This is a convenience form that can be used to access simply-named
columns. For example: f.Age
denotes a column called "Age"
,
and is exactly equivalent to f['Age']
.
Parameters¶
str
Name of the column to select.
FExpr
An expression that selects the specified column from a frame.
See also¶
.__getitem__()
– retrieving columns via the[]
-notation.
datatable.stype¶
This class is deprecated and will be removed in version 1.2.0.
Please use dt.Type
instead.
Enumeration of possible “storage” types of columns in a Frame.
Each column in a Frame is a vector of values of the same type. We call
this column’s type the “stype”. Most stypes correspond to primitive C types,
such as int32_t
or double
. However some stypes (corresponding to
strings and categoricals) have a more complicated underlying structure.
Notably, datatable
does not support arbitrary structures as
elements of a Column, so the set of stypes is small.
Values¶
The following stype values are currently available:
stype.bool8
stype.int8
stype.int16
stype.int32
stype.int64
stype.float32
stype.float64
stype.str32
stype.str64
stype.obj64
They are available either as properties of the dt.stype
class,
or directly as constants in the dt.
namespace.
For example:
dt.stype.int32
dt.int64
Methods¶
__new__() – find the stype corresponding to a value.
__call__() – cast a column into this specific stype.
.ctype – ctypes type corresponding to this stype.
.dtype – numpy dtype corresponding to this stype.
.ltype – dt.ltype corresponding to this stype.
.min – the smallest numeric value for this stype.
.max – the largest numeric value for this stype.
datatable.stype.__call__()¶
Cast column col
into the new stype.
An stype can be used as a function that converts columns into that
specific stype. In the same way as you could write int(3.14)
in
Python to convert a float value into integer, you can likewise
write dt.int32(f.A)
to convert column A
into stype int32
.
Parameters¶
FExpr
A single- or multi- column expression. All columns will be converted into the desired stype.
FExpr
Expression that converts its inputs into the current stype.
Examples¶
from datatable import dt, f
df = dt.Frame({'A': ['1', '1', '2', '1', '2'],
'B': [None, '2', '3', '4', '5'],
'C': [1, 2, 1, 1, 2]})
df
A | B | C | ||
---|---|---|---|---|
str32 | str32 | int32 | ||
0 | 1 | NA | 1 | |
1 | 1 | 2 | 2 | |
2 | 2 | 3 | 1 | |
3 | 1 | 4 | 1 | |
4 | 2 | 5 | 2 |
Convert column A from string stype to integer stype:
df[:, dt.int32(f.A)]
A | ||
---|---|---|
int32 | ||
0 | 1 | |
1 | 1 | |
2 | 2 | |
3 | 1 | |
4 | 2 |
Convert multiple columns to different stypes:
df[:, [dt.int32(f.A), dt.str32(f.C)]]
A | C | ||
---|---|---|---|
int32 | str32 | ||
0 | 1 | 1 | |
1 | 1 | 2 | |
2 | 2 | 1 | |
3 | 1 | 1 | |
4 | 2 | 2 |
See Also¶
dt.as_type()
– equivalent method of casting a column into another stype.
datatable.stype.__new__()¶
Find an stype corresponding to value
.
This method is called when you attempt to construct a new
dt.stype
object, for example dt.stype(int)
. Instead of
actually creating any new stypes, we return one of the existing
stype values.
Parameters¶
str
| type
| np.dtype
An object that will be converted into an stype. This could be
a string such as "integer"
or "int"
or "int8"
, a python
type such as bool
or float
, or a numpy dtype.
ValueError
Raised if value
does not correspond to any stype.
Examples¶
dt.stype(str)
dt.stype("double")
dt.stype(numpy.dtype("object"))
dt.stype("int64")
datatable.stype.ctype¶
ctypes
class that describes the C-level type of each element
in a column with this stype.
For non-fixed-width columns (such as str32
) this will return the ctype
of only the fixed-width component of that column. Thus,
stype.str32.ctype == ctypes.c_int32
.
datatable.stype.dtype¶
numpy.dtype
object that corresponds to this stype.
datatable.stype.ltype¶
dt.ltype
corresponding to this stype. Several stypes may map to
the same ltype, whereas each stype is described by exactly one ltype.
datatable.stype.max¶
The largest finite value that this stype can represent.
datatable.stype.min¶
The smallest finite value that this stype can represent.
datatable.Type¶
Type of data stored in a single column of a Frame.
The type describes both the logical meaning of the data (i.e. an integer, a floating point number, a string, etc.), as well as storage requirement of that data (the number of bits per element). Some types may carry additional properties, such as a timezone or precision.
Values¶
The following types are currently available:
datatable.Type.bool8¶
The type of a column with boolean data.
In a column of this type each data element is stored as 1 byte. NA values are also supported.
The boolean type is considered numeric, where True
is 1 and False
is 0.
Examples¶
DT = dt.Frame([True, False, None])
DT.type
DT
C0 | ||
---|---|---|
bool8 | ||
0 | 1 | |
1 | 0 | |
2 | NA |
datatable.Type.date32¶
The date32
type represents a particular calendar date without a time
component. Internally, this type is stored as a 32-bit signed integer
counting the number of days since 1970-01-01 (“the epoch”). Thus, this
type accommodates dates within the range of approximately ±5.8 million
years.
The calendar used for this type is proleptic Gregorian, meaning that it extends the modern-day Gregorian calendar into the past before this calendar was first adopted, and into the future, long after it will have been replaced.
This type corresponds to datetime.date
in Python, pa.date32()
in
pyarrow, and np.dtype('<M8[D]')
in numpy.
Note
Python’s datetime.date
object can accommodate dates from year 1 to
year 9999, which is much smaller than what our date32
type allows.
As a consequence, when date32
values that are outside of year range
1-9999 are converted to python, they become integers instead of
datetime.date
objects.
For the same reason the .min
and .max
properties of this
type also return integers.
Examples¶
from datetime import date
DT = dt.Frame([date(2020, 1, 30), date(2020, 3, 11), None, date(2021, 6, 15)])
DT.type
DT
C0 | ||
---|---|---|
date32 | ||
0 | 2020-01-30 | |
1 | 2020-03-11 | |
2 | NA | |
3 | 2021-06-15 |
dt.Type.date32.min
dt.Type.date32.max
dt.Frame([dt.Type.date32.min, date(2021, 6, 15), dt.Type.date32.max], stype='date32')
C0 | ||
---|---|---|
date32 | ||
0 | -5877641-06-24 | |
1 | 2021-06-15 | |
2 | 5879610-09-09 |
datatable.Type.float32¶
Single-precision floating point type. This corresponds to C type float
.
Each element of this type is 4 bytes long.
datatable.Type.float64¶
Double-precision IEEE-754 floating point type. This corresponds to C type
double
. Each element of this type is 8 bytes long.
datatable.Type.int8¶
Integer type that uses 1 byte per data element and can store values in the
range -127 .. 127
.
This type corresponds to int8_t
in C/C++, int
in python,
np.dtype('int8')
in numpy, and pa.int8()
in pyarrow.
Most arithmetic operations involving this type will produce a result of
type int32
, which follows the convention of the C language.
Examples¶
dt.Type.int8
dt.Type('int8')
dt.Frame([1, 0, 1, 1, 0]).types
datatable.Type.int16¶
Integer type, corresponding to int16_t
in C. This type uses 2 bytes per
data element, and can store values in the range -32767 .. 32767
.
Most arithmetic operations involving this type will produce a result of
type int32
, which follows the convention of the C language.
datatable.Type.int32¶
Integer type, corresponding to int32_t
in C. This type uses 4 bytes per
data element, and can store values in the range -2,147,483,647 ..
2,147,483,647
.
This is the most common type for handling integer data. When a python list of integers is converted into a Frame, a column of this type will usually be created.
Examples¶
DT = dt.Frame([None, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55])
DT
C0 | ||
---|---|---|
int32 | ||
0 | NA | |
1 | 1 | |
2 | 1 | |
3 | 2 | |
4 | 3 | |
5 | 5 | |
6 | 8 | |
7 | 13 | |
8 | 21 | |
9 | 34 | |
10 | 55 |
datatable.Type.int64¶
Integer type which corresponds to int64_t
in C. This type uses 8 bytes per
data element, and can store values in the range -(2**63-1) .. (2**63-1)
.
datatable.Type.obj64¶
This type can be used to store arbitrary Python objects.
datatable.Type.str32¶
The type of a column with string data.
Internally, this column stores data using 2 buffers: a character buffer,
where all values in the column are stored together as a single large
concatenated string in UTF8 encoding, and an int32
array of offsets
into the character buffer. Consequently, this type can only store up to
2Gb of total character data per column.
Whenever any operation on a string column exceeds the 2Gb limit, this type
will be silently replaced with dt.Type.str64
.
A virtual column that produces string data may have either str32
or
str64
type regardless of how it stores its data.
This column converts to str
type in Python, pa.string()
in
pyarrow, and dtype('object')
in numpy and pandas.
Examples¶
DT = dt.Frame({"to persist": ["one teaspoon", "at a time,",
"the rain turns", "mountains", "into valleys"]})
DT
to persist | ||
---|---|---|
str32 | ||
0 | one teaspoon | |
1 | at a time, | |
2 | the rain turns | |
3 | mountains | |
4 | into valleys |
datatable.Type.str64¶
String type, where the offsets buffer uses 64-bit integers.
datatable.Type.time64¶
The time64
type is used to represent a specific moment in time. This
corresponds to datetime
in Python, or timestamp
in Arrow or pandas.
Internally, this type is stored as a 64-bit integer containing the number of
nanoseconds since the epoch (Jan 1, 1970) in UTC.
This type is not leap-seconds aware, meaning that it assumes that each day
has exactly 24×3600 seconds. In practice it means that calculating time
difference between two time64
moments may be off by the number of leap
seconds that have occurred between them.
Currently, time64
type is not timezone-aware, addition of time zones is
planned for the next release.
A time64 column converts into datetime.datetime
objects in python, a
pa.timestamp('ns')
type in pyarrow and dtype('datetime64[ns]')
in
numpy and pandas.
Examples¶
DT = dt.Frame(["2018-01-31 03:16:57", "2021-06-15 15:44:23.951", None, "1965-11-25 19:29:00"])
DT[0] = dt.Type.time64
DT
C0 | ||
---|---|---|
time64 | ||
0 | 2018-01-31T03:16:57 | |
1 | 2021-06-15T15:44:23.951 | |
2 | NA | |
3 | 1965-11-25T19:29:00 |
dt.Type.time64.min
dt.Type.time64.max
datatable.Type.void¶
The type of a column where all values are NAs.
In datatable, any column can have NA values in it. There is, however,
a special type that can be assigned for a column where all values are
NAs: void
. This type’s special property is that it can be used in
any place where another type would be expected.
A column of this type does not occupy any storage space. Unlike other types, it does not use the validity buffer either: all values are known to be invalid.
It converts into pyarrow’s pa.null()
type, or '|V0'
dtype in numpy.
Examples¶
DT = dt.Frame([None, None, None])
DT.type
DT
C0 | ||
---|---|---|
void | ||
0 | NA | |
1 | NA | |
2 | NA |
datatable.Type.max¶
The largest finite value that this type can represent, if applicable.
Parameters¶
Any
The type of the returned value corresponds to the Type object: an int
for integer types, a float for floating-point types, etc. If the type
has no well-defined max value then None
is returned.
Examples¶
dt.Type.int32.max
dt.Type.float64.max
dt.Type.date32.max
datatable.Type.min¶
The smallest finite value that this type can represent, if applicable.
Parameters¶
Any
The type of the returned value corresponds to the Type object: an int
for integer types, a float for floating-point types, etc. If the type
has no well-defined min value then None
is returned.
Examples¶
dt.Type.int8.min
dt.Type.float32.min
dt.Type.date32.min
datatable.as_type()¶
Convert columns cols
into the prescribed stype.
This function does not modify the data in the original column. Instead it returns a new column which converts the values into the new type on the fly.
Parameters¶
FExpr
Single or multiple columns that need to be converted.
Type
| stype
Target type.
FExpr
The output will have the same number of rows and columns as the input; column names will be preserved too.
Examples¶
from datatable import dt, f, as_type
df = dt.Frame({'A': ['1', '1', '2', '1', '2'],
'B': [None, '2', '3', '4', '5'],
'C': [1, 2, 1, 1, 2]})
df
A | B | C | ||
---|---|---|---|---|
str32 | str32 | int32 | ||
0 | 1 | NA | 1 | |
1 | 1 | 2 | 2 | |
2 | 2 | 3 | 1 | |
3 | 1 | 4 | 1 | |
4 | 2 | 5 | 2 |
Convert column A from string to integer type:
df[:, as_type(f.A, int)]
A | ||
---|---|---|
int64 | ||
0 | 1 | |
1 | 1 | |
2 | 2 | |
3 | 1 | |
4 | 2 |
The exact dtype can be specified:
df[:, as_type(f.A, dt.Type.int32)]
A | ||
---|---|---|
int32 | ||
0 | 1 | |
1 | 1 | |
2 | 2 | |
3 | 1 | |
4 | 2 |
Convert multiple columns to different types:
df[:, [as_type(f.A, int), as_type(f.C, dt.str32)]]
A | C | ||
---|---|---|---|
int64 | str32 | ||
0 | 1 | 1 | |
1 | 1 | 2 | |
2 | 2 | 1 | |
3 | 1 | 1 | |
4 | 2 | 2 |
datatable.build_info¶
This is a python struct that contains information about the installed datatable module. The following fields are available:
str
The version string of the current build. Several formats of the version string are possible:
{MAJOR}.{MINOR}.{MICRO} – the release version string, such as "0.11.0".
{RELEASE}a{DEVNUM} – version string for a development build of datatable, where {RELEASE} is the normal release string and {DEVNUM} is an integer that is incremented with each build. For example: "0.11.0a1776".
{RELEASE}a0+{SUFFIX} – version string for a PR build of datatable, where the {SUFFIX} is formed from the PR number and the build sequence number. For example, "0.11.0a0+pr2602.13".
{RELEASE}a0+{FLAVOR}.{TIMESTAMP}.{USER} – version string used for local builds. This contains the "flavor" of the build (normal, debug, coverage, etc.); the unix timestamp of the build; and lastly the system user name of the user who made the build.
str
UTC timestamp (date + time) of the build.
str
The type of datatable build. Usually this will be "release"
, but
may also be "debug"
if datatable was built in debug mode. Other
build modes exist, or may be added in the future.
str
The version of the compiler used to build the C++ datatable extension. This will include both the name and the version of the compiler.
str
Git-hash of the revision from which the build was made, as
obtained from git rev-parse HEAD
.
str
Name of the git branch from where the build was made. This will
be obtained from environment variable CHANGE_BRANCH
if defined,
or from command git rev-parse --abbrev-ref HEAD
otherwise.
str
Timestamp of the git commit from which the build was made.
str
If the source tree contains any uncommitted changes (compared
to the checked out git revision), then the summary of these
changes will be in this field, as reported by
git diff HEAD --stat --no-color
. Otherwise, this field
is an empty string.
datatable.by()¶
Group-by clause for use in Frame’s square-bracket selector.
Whenever a by()
object is present inside a DT[i, j, ...]
expression, it makes all other expressions to be evaluated in
group-by mode. This mode causes the following changes to the
evaluation semantics:
A “Groupby” object will be computed for the frame
DT
, grouping it by columns specified as the arguments to theby()
call. This object keeps track of which rows of the frame belong to which group.If an
i
expression is present (row filter), it will be interpreted within each group. For example, ifi
is a slice, then the slice will be applied separately to each group. Similarly, ifi
expression contains a formula with reduce functions, then those functions will be evaluated for each group. For example:DT[f.A == max(f.A), :, by(f.group_id)]
will select those rows where column A reaches its peak value within each group (there could be multiple such rows within each group).
Before
j
is evaluated, theby()
clause adds all its columns at the start ofj
(unlessadd_columns
argument is False). Ifj
is a “select-all” slice (i.e.:
), then those columns will also be excluded from the list of all columns so that they will be present in the output only once.During evaluation of
j
, the reducer functions, such asmin()
,sum()
, etc, will be evaluated by-group, that is they will find the minimal value in each group, the sum of values in each group, and so on. If a reducer expression is combined with a regular column expression, then the reduced column will be auto-expanded into a column that is constant within each group.Note that if both
i
andj
contain reducer functions, then those functions will have a slightly different notion of groups: the reducers ini
will see each group “in full”, whereas the reducers inj
will see each group after it was filtered by the expression ini
(and possibly not even see some of the groups at all, if they were filtered out completely).If
j
contains only reducer expressions, then the final result will be a Frame containing just a single row for each group. This resulting frame will also be keyed by the grouped-by columns.
The by()
function expects a single column or a sequence of columns
as the argument(s). It accepts either a column name, or an
f-expression. In particular, you can perform a group-by on a
dynamically computed expression:
DT[:, :, by(dt.math.floor(f.A/100))]
The default behavior of groupby is to sort the groups in the ascending
order, with NA values appearing before any other values. As a special
case, if you group by an expression -f.A
, then it will be
treated as if you requested to group by the column “A” sorting it in
the descending order. This will work even with column types that are
not arithmetic, for example “A” could be a string column here.
Examples¶
from datatable import dt, f, by
df = dt.Frame({"group1": ["A", "A", "B", "B", "A"],
"group2": [1, 0, 1, 1, 1],
"var1": [343, 345, 567, 345, 212]})
df
group1 | group2 | var1 | ||
---|---|---|---|---|
str32 | int8 | int32 | ||
0 | A | 1 | 343 | |
1 | A | 0 | 345 | |
2 | B | 1 | 567 | |
3 | B | 1 | 345 | |
4 | A | 1 | 212 |
Group by a single column:
df[:, dt.count(), by("group1")]
group1 | count | ||
---|---|---|---|
str32 | int64 | ||
0 | A | 3 | |
1 | B | 2 |
Group by multiple columns:
df[:, dt.sum(f.var1), by("group1", "group2")]
group1 | group2 | var1 | ||
---|---|---|---|---|
str32 | int8 | int64 | ||
0 | A | 0 | 345 | |
1 | A | 1 | 555 | |
2 | B | 1 | 912 |
Return grouping result without the grouping column(s) by setting the
add_columns
parameter to False
:
df[:, dt.sum(f.var1), by("group1", "group2", add_columns=False)]
var1 | ||
---|---|---|
int64 | ||
0 | 345 | |
1 | 555 | |
2 | 912 |
f-expressions can be passed to by()
:
df[:, dt.count(), by(f.var1 < 400)]
C0 | count | ||
---|---|---|---|
bool8 | int64 | ||
0 | 0 | 1 | |
1 | 1 | 4 |
By default, the groups are sorted in ascending order. The inverse is
possible by negating the f-expressions in by()
:
df[:, dt.count(), by(-f.group1)]
group1 | count | ||
---|---|---|---|
str32 | int64 | ||
0 | B | 2 | |
1 | A | 3 |
An integer can be passed to the i
section:
df[0, :, by("group1")]
group1 | group2 | var1 | ||
---|---|---|---|---|
str32 | int8 | int32 | ||
0 | A | 1 | 343 | |
1 | B | 1 | 567 |
A slice is also acceptable within the i
section:
df[-1:, :, by("group1")]
group1 | group2 | var1 | ||
---|---|---|---|---|
str32 | int8 | int32 | ||
0 | A | 1 | 212 | |
1 | B | 1 | 345 |
Note
f-expressions are not implemented yet for the i
section in a
groupby. Also, a sequence cannot be passed to the i
section in the
presence of by()
.
See Also¶
Grouping with by() user guide for more examples.
datatable.cbind()¶
Create a new Frame by appending columns from several frames
.
This function is exactly equivalent to:
dt.Frame().cbind(*frames, force=force)
See also¶
rbind()
– function for row-binding several frames.dt.Frame.cbind()
– Frame method for cbinding some frames to another.
Examples¶
from datatable import dt, f
DT = dt.Frame(A=[1, 2, 3], B=[4, 7, 0])
DT
A | B | ||
---|---|---|---|
int32 | int32 | ||
0 | 1 | 4 | |
1 | 2 | 7 | |
2 | 3 | 0 |
frame1 = dt.Frame(N=[-1, -2, -5])
frame1
N | ||
---|---|---|
int32 | ||
0 | -1 | |
1 | -2 | |
2 | -5 |
dt.cbind([DT, frame1])
A | B | N | ||
---|---|---|---|---|
int32 | int32 | int32 | ||
0 | 1 | 4 | -1 | |
1 | 2 | 7 | -2 | |
2 | 3 | 0 | -5 |
If the number of rows are not equal, you can force the binding by setting
the force
parameter to True
:
frame2 = dt.Frame(N=[-1, -2, -5, -20])
frame2
N | ||
---|---|---|
int32 | ||
0 | -1 | |
1 | -2 | |
2 | -5 | |
3 | -20 |
dt.cbind([DT, frame2], force=True)
A | B | N | ||
---|---|---|---|---|
int32 | int32 | int32 | ||
0 | 1 | 4 | -1 | |
1 | 2 | 7 | -2 | |
2 | 3 | 0 | -5 | |
3 | NA | NA | -20 |
datatable.corr()¶
Calculate the Pearson correlation between col1 and col2.
Parameters¶
Expr
Input columns.
Examples¶
from datatable import dt, f
DT = dt.Frame(A = [0, 1, 2, 3], B = [0, 2, 4, 6])
DT
A | B | ||
---|---|---|---|
int32 | int32 | ||
0 | 0 | 0 | |
1 | 1 | 2 | |
2 | 2 | 4 | |
3 | 3 | 6 |
DT[:, dt.corr(f.A, f.B)]
C0 | ||
---|---|---|
float64 | ||
0 | 1 |
datatable.count()¶
Calculate the number of non-missing values for each column from cols
.
Parameters¶
Expr
Input columns.
Expr
f-expression having one row, and the same names and number of columns
as in cols
. All the returned column stypes are int64
.
TypeError
The exception is raised when one of the columns from cols
has a non-numeric and non-string type.
Examples¶
from datatable import dt, f
df = dt.Frame({'A': [1, 1, 2, 1, 2],
'B': [None, 2, 3, 4, 5],
'C': [1, 2, 1, 1, 2]})
df
A | B | C | ||
---|---|---|---|---|
int32 | int32 | int32 | ||
0 | 1 | NA | 1 | |
1 | 1 | 2 | 2 | |
2 | 2 | 3 | 1 | |
3 | 1 | 4 | 1 | |
4 | 2 | 5 | 2 |
Get the count of all rows:
df[:, dt.count()]
count | ||
---|---|---|
int32 | ||
0 | 5 |
Get the count of column B
(note how the null row is excluded from the
count result):
df[:, dt.count(f.B)]
B | ||
---|---|---|
int64 | ||
0 | 4 |
datatable.cov()¶
Calculate the covariance between col1 and col2.
Parameters¶
Expr
Input columns.
Examples¶
from datatable import dt, f
DT = dt.Frame(A = [0, 1, 2, 3], B = [0, 2, 4, 6])
DT
A | B | ||
---|---|---|---|
int32 | int32 | ||
0 | 0 | 0 | |
1 | 1 | 2 | |
2 | 2 | 4 | |
3 | 3 | 6 |
DT[:, dt.cov(f.A, f.B)]
C0 | ||
---|---|---|
float64 | ||
0 | 3.33333 |
datatable.cut()¶
For each column from cols
bin its values into equal-width intervals,
when nbins
is specified, or into arbitrary-width intervals,
when interval edges are provided as bins
.
Parameters¶
FExpr
Input data for equal-width interval binning.
int
| List[int]
bool
Each binning interval is half-open. This flag indicates whether the right edge of the interval is closed, or not.
FExpr
f-expression that converts input columns into the columns filled with the respective bin ids.
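Examples¶
A minimal sketch using the parameters described above:
from datatable import dt, f
DT = dt.Frame(A=[1, 3, 5, 7, 9])
DT[:, dt.cut(f.A, nbins=3)]                      # equal-width bins with ids 0 .. 2
DT[:, dt.cut(f.A, nbins=3, right_closed=False)]  # use left-closed intervals instead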
datatable.dt¶
This is the datatable
module itself.
The purpose of exporting this symbol is so that you can easily import all the things you need from the datatable module in one go:
from datatable import dt, f, g, by, join, mean
Note: while it is possible to write
test = dt.dt.dt.dt.dt.dt.dt.dt.dt.fread('test.jay')
train = dt.dt.dt.dt.dt.dt.dt.dt.dt.dt.dt.dt.dt.fread('train.jay')
we do not in fact recommend doing so (except possibly on April 1st).
datatable.f¶
The main Namespace
object.
The function of this object is that during the evaluation of a
DT[i,j]
call, the variable f
represents the columns of frame DT
.
Specifically, within expression DT[i, j]
the following
is true:
f.A
means “column A” of frameDT
;f[2]
means “3rd colum” of frameDT
;f[int]
means “all integer columns” ofDT
;f[:]
means “all columns” ofDT
.
datatable.first()¶
Return the first row for each column from cols
.
Parameters¶
Expr
Input columns.
Expr
f-expression having one row, and the same names, stypes and
number of columns as in cols
.
Examples¶
first()
returns the first column in a frame:
from datatable import dt, f, by, sort, first
df = dt.Frame({"A": [1, 1, 2, 1, 2],
"B": [None, 2, 3, 4, 5]})
df
A | B | ||
---|---|---|---|
0 | 1 | NA | |
1 | 1 | 2 | |
2 | 2 | 3 | |
3 | 1 | 4 | |
4 | 2 | 5 |
dt.first(df)
A | ||
---|---|---|
0 | 1 | |
1 | 1 | |
2 | 2 | |
3 | 1 | |
4 | 2 |
Within a frame, it returns the first row:
df[:, first(f[:])]
A | B | ||
---|---|---|---|
0 | 1 | NA |
Of course, you can replicate this by passing 0 to the i
section instead:
df[0, :]
A | B | ||
---|---|---|---|
0 | 1 | NA |
first()
comes in handy if you wish to get the first non null value in a
column:
df[f.B != None, first(f.B)]
B | ||
---|---|---|
0 | 2 |
first()
returns the first row per group in a by()
operation:
df[:, first(f[:]), by("A")]
A | B | ||
---|---|---|---|
0 | 1 | NA | |
1 | 2 | 3 |
To get the first non-null value per row in a by()
operation, you can
use the sort()
function, and set the na_position
argument as
last
:
df[:, first(f[:]), by("A"), sort("B", na_position="last")]
A | B | ||
---|---|---|---|
0 | 1 | 2 | |
1 | 2 | 3 |
datatable.fread()¶
This function is capable of reading data from a variety of input formats,
producing a Frame
as the result. The recognized formats are:
CSV, Jay, XLSX, and plain text. In addition, the data may be inside an
archive such as .tar
, .gz
, .zip
, .gz2
, and .tgz
.
Parameters¶
anysource: str | bytes | file | Pathlike | List
    The first (unnamed) argument to fread is the input source. Multiple
    types of sources are supported, and they can be named explicitly:
    file, text, cmd, and url. When the source is not named, fread will
    attempt to guess its type. The most common type is file, but
    sometimes the argument is resolved as text (if the string contains
    newlines) or url (if the string starts with https:// or similar).
    Only one argument out of anysource, file, text, cmd or url can be
    specified at once.
file: str | file | Pathlike
    A file source can be either the name of the file on disk, or a
    python “file-like” object – i.e. any object having method .read().
    Generally, specifying a file name should be preferred, since
    reading from a Python file can only be done in single-threaded
    mode. This argument also supports addressing files inside an
    archive, or sheets inside an Excel workbook. Simply write the name
    of the file as if the archive was a folder: "data.zip/train.csv".
text: str | bytes
    Instead of reading data from a file, this argument provides the
    data as a simple in-memory blob.
cmd: str
    A command that will be executed in the shell and its output then
    read as text.
url: str
    This parameter can be used to specify the URL of the input file.
    The data will first be downloaded into a temporary directory and
    then read from there. In the end the temporary files will be
    removed. We use the standard urllib.request module to download the
    data. Changing the settings of that module, for example installing
    proxy, password, or cookie managers, will allow you to customize
    the download process.
columns: ...
    Limit which columns to read from the input file.
sep: str | None
    Field separator in the input file. If this value is None (default)
    then the separator will be auto-detected. Otherwise it must be a
    single-character string. When sep='\n', the data will be read in
    single-column mode. The characters ["'`0-9a-zA-Z], as well as any
    non-ASCII characters, are not allowed as the separator.
dec: "." | ","
    Decimal point symbol for floating-point numbers.
max_nrows: int
    The maximum number of rows to read from the file. Setting this
    parameter to any negative number is equivalent to having no limit
    at all. Currently this parameter doesn't always work correctly.
header: bool | None
    If True then the first line of the CSV file contains the header.
    If False then there is no header. By default the presence of the
    header is heuristically determined from the contents of the file.
na_strings: List[str]
    The list of strings that were used in the input file to represent
    NA values.
fill: bool
    If True then the lines of the CSV file are allowed to have an
    uneven number of fields. All missing fields will be filled with
    NAs in the resulting frame.
encoding: str | None
    If this parameter is provided, then the input will be recoded from
    this encoding into UTF-8 before reading. Any encoding registered
    with the python codec module can be used.
skip_to_string: str | None
    Start reading the file from the line containing this string. All
    previous lines will be skipped and discarded. This parameter
    cannot be used together with skip_to_line.
skip_to_line: int
    If this setting is given, then this many lines in the file will be
    skipped before we start to parse the file. This can be used, for
    example, when several first lines in the file contain non-CSV data
    and therefore must be skipped. This parameter cannot be used
    together with skip_to_string.
skip_blank_lines: bool
    If True, then any empty lines in the input will be skipped. If
    this parameter is False then: (a) in single-column mode empty
    lines are kept as empty lines; otherwise (b) if fill=True then
    empty lines produce a single line filled with NAs in the output;
    otherwise (c) a dt.exceptions.IOError is raised.
strip_whitespace: bool
    If True, then the leading/trailing whitespace will be stripped
    from unquoted string fields. Whitespace is always skipped from
    numeric fields.
quotechar: '"' | "'" | "`"
    The character that was used to quote fields in the CSV file. By
    default the double-quote mark '"' is assumed.
tempdir: str | None
    Use this directory for storing temporary files as needed. If not
    provided then the system temporary directory will be used, as
    determined via the tempfile Python module.
nthreads: int | None
    Number of threads to use when reading the file. This number cannot
    exceed the number of threads in the pool dt.options.nthreads. If 0
    or a negative number of threads is requested, then it will be
    treated as that many threads less than the maximum. By default all
    threads in the thread pool are used.
verbose: bool
    If True, then print detailed information about the internal
    workings of fread to stdout (or to logger if provided).
logger: object
    Logger object that will receive verbose information about fread's
    progress. When this parameter is specified, verbose mode will be
    turned on automatically.
multiple_sources: "warn" | "error" | "ignore"
    Action that should be taken when the input resolves to multiple
    distinct sources. By default ("warn") a warning will be issued and
    only the first source will be read and returned as a Frame. The
    "ignore" action is similar, except that the extra sources will be
    discarded without a warning. Lastly, a dt.exceptions.IOError can
    be raised if the value of this parameter is "error". If you want
    all sources to be read instead of only the first one, then
    consider using iread().
memory_limit: int
    Try not to exceed this amount of memory allocation (in bytes) when
    reading the data. This limit is advisory and not enforced very
    strictly. This setting is useful when reading data from a file
    that is substantially larger than the amount of RAM available on
    your machine. When this parameter is specified and fread sees that
    it needs more RAM than the limit in order to read the input file,
    it will dump the data that was read so far into a temporary file
    in binary format. In the end the returned Frame will be partially
    composed from data located on disk, and partially from the data in
    memory. It is advised to either store this data as a Jay file or
    filter and materialize the frame (if not, the performance may be
    slow).
except: dt.exceptions.IOError
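As a quick illustration, here is a minimal sketch of calling fread with differently named sources (the file name and URL are hypothetical):

from datatable import dt

DT1 = dt.fread("data.csv")                          # source type guessed as file
DT2 = dt.fread(text="A,B\n1,x\n2,y")                # in-memory text source
DT3 = dt.fread(url="https://example.com/data.csv")  # downloaded, then parsed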
See Also¶
Fread Examples user guide for usage examples.
datatable.g¶
Secondary Namespace object.

The function of this object is that during the evaluation of a
DT[..., join(X)] call, the variable g represents the columns of the
joined frame X. In SQL this would have been equivalent to
... JOIN tableX AS g ....
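For example, here is a minimal sketch of using g together with join() (the frame and column names are illustrative):

from datatable import dt, f, g, join

X = dt.Frame(id=[1, 2], factor=[10, 20])
X.key = "id"                          # the joined frame must be keyed
DT = dt.Frame(id=[1, 1, 2], value=[3, 4, 5])
DT[:, f.value * g.factor, join(X)]    # g.factor refers to a column of X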
datatable.ifelse()¶
An expression that chooses its value based on one or more conditions.
This is roughly equivalent to the following Python code:
result = value1 if condition1 else \
value2 if condition2 else \
... else \
default
For every row this function evaluates the smallest number of expressions
necessary to get the result. Thus, it evaluates condition1, condition2,
and so on until it finds the first condition that evaluates to True. It
then computes and returns the corresponding value. If all conditions
evaluate to False, then the default value is computed and returned.

Also, if any of the conditions produces NA, then the result of the
expression also becomes NA without evaluating any further conditions or
values.
Parameters¶
condition1, condition2, ...: FExpr[bool]
    Expressions each producing a single boolean column. These
    conditions will be evaluated in order until we find the one equal
    to True.
value1, value2, ...: FExpr
    Values that will be used when the corresponding condition
    evaluates to True. These must be single columns.
default: FExpr
    Value that will be used when all conditions evaluate to False.
    This must be a single column.
Examples¶
Single condition¶
Task: Create a new column Colour
, where if Set
is 'Z'
then the
value should be 'Green'
, else 'Red'
:
from datatable import dt, f, by, ifelse, update
df = dt.Frame("""Type Set
A Z
B Z
B X
C Y""")
df[:, update(Colour = ifelse(f.Set == "Z", # condition
"Green", # if condition is True
"Red")) # if condition is False
]
df
Type | Set | Colour | ||
---|---|---|---|---|
str32 | str32 | str32 | ||
0 | A | Z | Green | |
1 | B | Z | Green | |
2 | B | X | Red | |
3 | C | Y | Red |
Multiple conditions¶
Task: Create new column value
whose value is taken from columns a
,
b
, or c
– whichever is nonzero first:
df = dt.Frame({"a": [0,0,1,2],
"b": [0,3,4,5],
"c": [6,7,8,9]})
df
a | b | c | ||
---|---|---|---|---|
int32 | int32 | int32 | ||
0 | 0 | 0 | 6 | |
1 | 0 | 3 | 7 | |
2 | 1 | 4 | 8 | |
3 | 2 | 5 | 9 |
df['value'] = ifelse(f.a > 0, f.a, # first condition and result
f.b > 0, f.b, # second condition and result
f.c) # default if no condition is True
df
a | b | c | value | ||
---|---|---|---|---|---|
int32 | int32 | int32 | int32 | ||
0 | 0 | 0 | 6 | 6 | |
1 | 0 | 3 | 7 | 3 | |
2 | 1 | 4 | 8 | 1 | |
3 | 2 | 5 | 9 | 2 |
datatable.init_styles()¶
Inject datatable’s stylesheets into the Jupyter notebook. This function does nothing when it runs in a normal Python environment outside of Jupyter.
When datatable runs in a Jupyter notebook, it renders its Frames as HTML tables. The appearance of these tables is enhanced using a custom stylesheet, which must be injected into the notebook at any point on the page. This is exactly what this function does.
Normally, this function is called automatically when datatable is
imported. However, in some circumstances Jupyter erases these
stylesheets (for example, if you run the "import datatable" cell
twice). In such cases, you may need to call this method manually.
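A minimal sketch of calling it manually from a notebook cell:

import datatable as dt
dt.init_styles()   # re-inject datatable's stylesheets into the notebook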
datatable.intersect()¶
Find the intersection of sets of values in the frames.

Each frame should have only a single column or be empty. The values in
each frame will be treated as a set, and this function will perform the
intersection operation on these sets, returning those values that are
present in each of the provided frames.
Parameters¶
*frames: Frame | Frame | ...
    Input single-column frames.
return: Frame
    A single-column frame. The column stype is the smallest common
    stype of columns in the frames.
except: ValueError
    raised when one of the input frames has more than one column.
except: NotImplementedError
    raised when one of the columns has stype obj64.
Examples¶
from datatable import dt
s1 = dt.Frame([4, 5, 6, 20, 42])
s2 = dt.Frame([1, 2, 3, 5, 42])
s1
C0 | ||
---|---|---|
int32 | ||
0 | 4 | |
1 | 5 | |
2 | 6 | |
3 | 20 | |
4 | 42 |
s2
C0 | ||
---|---|---|
int32 | ||
0 | 1 | |
1 | 2 | |
2 | 3 | |
3 | 5 | |
4 | 42 |
Intersection of the two frames:
dt.intersect([s1, s2])
C0 | ||
---|---|---|
int32 | ||
0 | 5 | |
1 | 42 |
datatable.iread()¶
This function is similar to fread()
, but allows reading
multiple sources at once. For example, this can be used when the
input is a list of files, or a glob pattern, or a multi-file archive,
or multi-sheet XLSX file, etc.
Parameters¶
...: ...
    Most parameters are the same as in fread(). All parse parameters
    will be applied to all input files.
errors: "warn" | "raise" | "ignore" | "store"
    What action to take when one of the input sources produces an
    error. Possible actions are: "warn" – each error is converted into
    a warning and emitted to the user, and the source that produced
    the error is skipped; "raise" – the errors are raised immediately
    and the iteration stops; "ignore" – the erroneous sources are
    silently ignored; "store" – when an error is raised, it is
    captured and returned to the user, then the iterator continues
    reading the subsequent sources.
return: Iterator[Frame] | Iterator[Frame|Exception]
    The returned object is an iterator that produces Frames. The
    iterator is lazy: each frame is read only as needed, after the
    previous frame was “consumed” by the user. Thus, the user can
    interrupt the iterator without having to read all the frames.
    Each Frame produced by the iterator has a .source attribute that
    describes the source of each frame as best as possible. Each
    source depends on the type of the input: either a file name, or a
    URL, or the name of the file in an archive, etc.
    If the errors parameter is "store" then the iterator may produce
    either Frames or exception objects.
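For example, here is a minimal sketch of iterating over several sources (the file names are hypothetical):

from datatable import dt

for DT in dt.iread(["train.csv", "test.csv"]):
    print(DT.source, DT.shape)   # .source tells where each frame came from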
datatable.join()¶
Join clause for use in Frame's square-bracket selector.

This clause is equivalent to the SQL JOIN, though for the moment
datatable only supports left outer joins. In order to join, the frame X
must be keyed first, and then joined to another frame DT as:

DT[:, :, join(X)]

provided that DT has the column(s) with the same name(s) as the key in
X.
Parameters¶
frame: Frame
    An input keyed frame to be joined to the current one.
return: Join Object
    In most cases the returned object is used directly in the Frame's
    square-bracket selector.
except: ValueError
    The exception is raised if frame is not keyed.
Examples¶
from datatable import dt, f, g, join, update
df1 = dt.Frame(""" date X1 X2
01-01-2020 H 10
01-02-2020 H 30
01-03-2020 Y 15
01-04-2020 Y 20""")
df2 = dt.Frame("""X1 X3
H 5
Y 10""")
First, create a key on the right frame (df2
). Note that the join key
(X1
) has unique values and has the same name in the left frame (df1
):
df2.key = "X1"
Join is now possible:
df1[:, :, join(df2)]
date | X1 | X2 | X3 | ||
---|---|---|---|---|---|
str32 | str32 | int32 | int32 | ||
0 | 01-01-2020 | H | 10 | 5 | |
1 | 01-02-2020 | H | 30 | 5 | |
2 | 01-03-2020 | Y | 15 | 10 | |
3 | 01-04-2020 | Y | 20 | 10 |
You can refer to columns of the joined frame using prefix g.
, similar to how columns of the left frame can be accessed using prefix f.
:
df1[:, update(X2=f.X2 * g.X3), join(df2)]
df1
date | X1 | X2 | ||
---|---|---|---|---|
str32 | str32 | int32 | ||
0 | 01-01-2020 | H | 50 | |
1 | 01-02-2020 | H | 150 | |
2 | 01-03-2020 | Y | 150 | |
3 | 01-04-2020 | Y | 200 |
datatable.last()¶
Return the last row for each column from cols.
Parameters¶
cols: Expr
    Input columns.
return: Expr
    f-expression having one row, and the same names, stypes and number
    of columns as in cols.
Examples¶
last()
returns the last column in a frame:
from datatable import dt, f, by, sort, last
df = dt.Frame({"A": [1, 1, 2, 1, 2],
"B": [None, 2, 3, 4, None]})
df
A | B | ||
---|---|---|---|
int32 | int32 | ||
0 | 1 | NA | |
1 | 1 | 2 | |
2 | 2 | 3 | |
3 | 1 | 4 | |
4 | 2 | NA |
dt.last(df)
B | ||
---|---|---|
int32 | ||
0 | NA | |
1 | 2 | |
2 | 3 | |
3 | 4 | |
4 | NA |
Within a frame, it returns the last row:
df[:, last(f[:])]
A | B | ||
---|---|---|---|
int32 | int32 | ||
0 | 2 | NA |
The above code can be replicated by passing -1 to the i
section instead:
df[-1, :]
A | B | ||
---|---|---|---|
int32 | int32 | ||
0 | 2 | NA |
Like first(), last() can be handy if you wish to get the last non-null
value in a column:
df[f.B != None, dt.last(f.B)]
B | ||
---|---|---|
int32 | ||
0 | 4 |
last()
returns the last row per group in a by()
operation:
df[:, last(f[:]), by("A")]
A | B | ||
---|---|---|---|
int32 | int32 | ||
0 | 1 | 4 | |
1 | 2 | NA |
To get the last non-null value per group in a by() operation, you can
use the sort() function and set the na_position argument to "first"
(this will move the NAs to the top of the column):
df[:, last(f[:]), by("A"), sort("B", na_position="first")]
A | B | ||
---|---|---|---|
int32 | int32 | ||
0 | 1 | 4 | |
1 | 2 | 3 |
datatable.max()¶
Calculate the maximum value for each column from cols. It is
recommended to use it as dt.max() to prevent conflict with the Python
built-in max() function.
Parameters¶
cols: Expr
    Input columns.
return: Expr
    f-expression having one row and the same names, stypes and number
    of columns as in cols.
except: TypeError
    The exception is raised when one of the columns from cols has a
    non-numeric type.
Examples¶
from datatable import dt, f, by
df = dt.Frame({'A': [1, 1, 1, 2, 2, 2, 3, 3, 3],
'B': [3, 2, 20, 1, 6, 2, 3, 22, 1]})
df
A | B | ||
---|---|---|---|
int32 | int32 | ||
0 | 1 | 3 | |
1 | 1 | 2 | |
2 | 1 | 20 | |
3 | 2 | 1 | |
4 | 2 | 6 | |
5 | 2 | 2 | |
6 | 3 | 3 | |
7 | 3 | 22 | |
8 | 3 | 1 |
Get the maximum from column B:
df[:, dt.max(f.B)]
B | ||
---|---|---|
int32 | ||
0 | 22 |
Get the maximum of all columns:
df[:, [dt.max(f.A), dt.max(f.B)]]
A | B | ||
---|---|---|---|
int32 | int32 | ||
0 | 3 | 22 |
Same as above, but more convenient:
df[:, dt.max(f[:])]
A | B | ||
---|---|---|---|
int32 | int32 | ||
0 | 3 | 22 |
In the presence of by(), it returns the maximum value per group:
df[:, dt.max(f.B), by("A")]
A | B | ||
---|---|---|---|
int32 | int32 | ||
0 | 1 | 20 | |
1 | 2 | 6 | |
2 | 3 | 22 |
datatable.mean()¶
Calculate the mean value for each column from cols.
Parameters¶
cols: Expr
    Input columns.
return: Expr
    f-expression having one row, and the same names and number of
    columns as in cols. The column stypes are float32 for float32
    columns, and float64 for all the other numeric types.
except: TypeError
    The exception is raised when one of the columns from cols has a
    non-numeric type.
Examples¶
from datatable import dt, f, by
df = dt.Frame({'A': [1, 1, 2, 1, 2],
'B': [None, 2, 3, 4, 5],
'C': [1, 2, 1, 1, 2]})
df
A | B | C | ||
---|---|---|---|---|
int32 | int32 | int32 | ||
0 | 1 | NA | 1 | |
1 | 1 | 2 | 2 | |
2 | 2 | 3 | 1 | |
3 | 1 | 4 | 1 | |
4 | 2 | 5 | 2 |
Get the mean from column A:
df[:, dt.mean(f.A)]
A | ||
---|---|---|
float64 | ||
0 | 1.4 |
Get the mean of multiple columns:
df[:, dt.mean([f.A, f.B])]
A | B | ||
---|---|---|---|
float64 | float64 | ||
0 | 1.4 | 3.5 |
Same as above, but applying to a column slice:
df[:, dt.mean(f[:2])]
A | B | ||
---|---|---|---|
float64 | float64 | ||
0 | 1.4 | 3.5 |
You can pass in a dictionary with new column names:
df[:, dt.mean({"A_mean": f.A, "C_avg": f.C})]
A_mean | C_avg | ||
---|---|---|---|
float64 | float64 | ||
0 | 1.4 | 1.4 |
In the presence of by()
, it returns the average of each column per group:
df[:, dt.mean({"A_mean": f.A, "B_mean": f.B}), by("C")]
C | A_mean | B_mean | ||
---|---|---|---|---|
int32 | float64 | float64 | ||
0 | 1 | 1.33333 | 3.5 | |
1 | 2 | 1.5 | 3.5 |
datatable.median()¶
Calculate the median value for each column from cols.
Parameters¶
cols: Expr
    Input columns.
return: Expr
    f-expression having one row, and the same names, stypes and number
    of columns as in cols.
except: TypeError
    The exception is raised when one of the columns from cols has a
    non-numeric type.
Examples¶
from datatable import dt, f, by
df = dt.Frame({'A': [1, 1, 2, 1, 2],
'B': [None, 2, 3, 4, 5],
'C': [1, 2, 1, 1, 2]})
df
A | B | C | ||
---|---|---|---|---|
int32 | int32 | int32 | ||
0 | 1 | NA | 1 | |
1 | 1 | 2 | 2 | |
2 | 2 | 3 | 1 | |
3 | 1 | 4 | 1 | |
4 | 2 | 5 | 2 |
Get the median from column A:
df[:, dt.median(f.A)]
A | ||
---|---|---|
float64 | ||
0 | 1 |
Get the median of multiple columns:
df[:, dt.median([f.A, f.B])]
A | B | ||
---|---|---|---|
float64 | float64 | ||
0 | 1 | 3.5 |
Same as above, but more convenient:
df[:, dt.median(f[:2])]
A | B | ||
---|---|---|---|
float64 | float64 | ||
0 | 1 | 3.5 |
You can pass in a dictionary with new column names:
df[:, dt.median({"A_median": f.A, "C_mid": f.C})]
A_median | C_mid | ||
---|---|---|---|
float64 | float64 | ||
0 | 1 | 1 |
In the presence of by()
, it returns the median of each column
per group:
df[:, dt.median({"A_median": f.A, "B_median": f.B}), by("C")]
C | A_median | B_median | ||
---|---|---|---|---|
int32 | float64 | float64 | ||
0 | 1 | 1 | 3.5 | |
1 | 2 | 1.5 | 3.5 |
datatable.min()¶
Calculate the minimum value for each column from cols. It is
recommended to use it as dt.min() to prevent conflict with the Python
built-in min() function.
Parameters¶
cols: Expr
    Input columns.
return: Expr
    f-expression having one row and the same names, stypes and number
    of columns as in cols.
except: TypeError
    The exception is raised when one of the columns from cols has a
    non-numeric type.
Examples¶
from datatable import dt, f, by
df = dt.Frame({'A': [1, 1, 1, 2, 2, 2, 3, 3, 3],
'B': [3, 2, 20, 1, 6, 2, 3, 22, 1]})
df
A | B | ||
---|---|---|---|
int32 | int32 | ||
0 | 1 | 3 | |
1 | 1 | 2 | |
2 | 1 | 20 | |
3 | 2 | 1 | |
4 | 2 | 6 | |
5 | 2 | 2 | |
6 | 3 | 3 | |
7 | 3 | 22 | |
8 | 3 | 1 |
Get the minimum from column B:
df[:, dt.min(f.B)]
B | ||
---|---|---|
int32 | ||
0 | 1 |
Get the minimum of all columns:
df[:, [dt.min(f.A), dt.min(f.B)]]
A | B | ||
---|---|---|---|
int32 | int32 | ||
0 | 1 | 1 |
Same as above, but using the slice notation:
df[:, dt.min(f[:])]
A | B | ||
---|---|---|---|
int32 | int32 | ||
0 | 1 | 1 |
In the presence of by(), it returns the minimum value per group:
df[:, dt.min(f.B), by("A")]
A | B | ||
---|---|---|---|
int32 | int32 | ||
0 | 1 | 2 | |
1 | 2 | 1 | |
2 | 3 | 1 |
datatable.qcut()¶
Bin all the columns from cols
into intervals with approximately
equal populations. Thus, the intervals are chosen according to
the sample quantiles of the data.
If there are duplicate values in the data, they will all be placed into the same bin. In extreme cases this may cause the bins to be highly unbalanced.
Parameters¶
cols: FExpr
    Input data for quantile binning.
nbins: int | List[int]
    Number of quantile bins; a list gives the number of bins for each
    column in cols separately.
return: FExpr
    f-expression that converts input columns into the columns filled
    with the respective quantile ids.
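For example, a minimal sketch of quantile binning (assuming the nbins parameter as listed above):

from datatable import dt, f

DT = dt.Frame(A=[1, 2, 3, 4, 5, 6, 7, 8])
DT[:, dt.qcut(f.A, nbins=4)]   # bin ids 0..3, two values per bin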
datatable.rbind()¶
Produce a new frame by appending rows of several frames.

This function is equivalent to:

dt.Frame().rbind(*frames, force=force, bynames=bynames)
Examples¶
from datatable import dt
DT1 = dt.Frame({"Weight": [5, 4, 6], "Height": [170, 172, 180]})
DT1
Weight | Height | ||
---|---|---|---|
int32 | int32 | ||
0 | 5 | 170 | |
1 | 4 | 172 | |
2 | 6 | 180 |
DT2 = dt.Frame({"Height": [180, 181, 169], "Weight": [4, 4, 5]})
DT2
Height | Weight | ||
---|---|---|---|
int32 | int32 | ||
0 | 180 | 4 | |
1 | 181 | 4 | |
2 | 169 | 5 |
dt.rbind(DT1, DT2)
Weight | Height | ||
---|---|---|---|
int32 | int32 | ||
0 | 5 | 170 | |
1 | 4 | 172 | |
2 | 6 | 180 | |
3 | 4 | 180 | |
4 | 4 | 181 | |
5 | 5 | 169 |
rbind()
by default combines frames by names. The frames can also be
bound by column position by setting the bynames
parameter to False
:
dt.rbind(DT1, DT2, bynames = False)
Weight | Height | ||
---|---|---|---|
int32 | int32 | ||
0 | 5 | 170 | |
1 | 4 | 172 | |
2 | 6 | 180 | |
3 | 180 | 4 | |
4 | 181 | 4 | |
5 | 169 | 5 |
If the number of columns is not equal or the column names are
different, you can force the row binding by setting the force
parameter to True:
DT2["Age"] = dt.Frame([25, 50, 67])
DT2
Height | Weight | Age | ||
---|---|---|---|---|
int32 | int32 | int32 | ||
0 | 180 | 4 | 25 | |
1 | 181 | 4 | 50 | |
2 | 169 | 5 | 67 |
dt.rbind(DT1, DT2, force = True)
Weight | Height | Age | ||
---|---|---|---|---|
int32 | int32 | int32 | ||
0 | 5 | 170 | NA | |
1 | 4 | 172 | NA | |
2 | 6 | 180 | NA | |
3 | 4 | 180 | 25 | |
4 | 4 | 181 | 50 | |
5 | 5 | 169 | 67 |
See also¶
- cbind() – function for col-binding several frames.
- dt.Frame.rbind() – Frame method for rbinding some frames to another.
datatable.repeat()¶
Concatenate n copies of the frame by rows and return the result.

This is equivalent to dt.rbind([frame] * n).
Example¶
from datatable import dt
DT = dt.Frame({"A": [1, 1, 2, 1, 2],
"B": [None, 2, 3, 4, 5]})
DT
A | B | ||
---|---|---|---|
int32 | int32 | ||
0 | 1 | NA | |
1 | 1 | 2 | |
2 | 2 | 3 | |
3 | 1 | 4 | |
4 | 2 | 5 |
dt.repeat(DT, 2)
A | B | ||
---|---|---|---|
int32 | int32 | ||
0 | 1 | NA | |
1 | 1 | 2 | |
2 | 2 | 3 | |
3 | 1 | 4 | |
4 | 2 | 5 | |
5 | 1 | NA | |
6 | 1 | 2 | |
7 | 2 | 3 | |
8 | 1 | 4 | |
9 | 2 | 5 |
datatable.rowall()¶
For each row in cols return True if all values in that row are True,
or otherwise return False.
Parameters¶
cols: FExpr[bool]
    Input boolean columns.
return: FExpr[bool]
    f-expression consisting of one boolean column that has the same
    number of rows as in cols.
except: TypeError
    The exception is raised when one of the columns from cols has a
    non-boolean type.
Examples¶
from datatable import dt, f
DT = dt.Frame({"A": [True, True],
"B": [True, False],
"C": [True, True]})
DT
A | B | C | ||
---|---|---|---|---|
bool8 | bool8 | bool8 | ||
0 | 1 | 1 | 1 | |
1 | 1 | 0 | 1 |
DT[:, dt.rowall(f[:])]
C0 | ||
---|---|---|
bool8 | ||
0 | 1 | |
1 | 0 |
datatable.rowany()¶
For each row in cols return True if any of the values in that row are
True, or otherwise return False. The function uses shortcut
evaluation: if a True value is found in one of the columns, then the
subsequent columns are skipped.
Parameters¶
cols: FExpr[bool]
    Input boolean columns.
return: FExpr[bool]
    f-expression consisting of one boolean column that has the same
    number of rows as in cols.
except: TypeError
    The exception is raised when one of the columns from cols has a
    non-boolean type.
Examples¶
from datatable import dt, f
DT = dt.Frame({"A":[True, True],
"B":[True, False],
"C":[True, True]})
DT
A | B | C | ||
---|---|---|---|---|
bool8 | bool8 | bool8 | ||
0 | 1 | 1 | 1 | |
1 | 1 | 0 | 1 |
DT[:, dt.rowany(f[:])]
C0 | ||
---|---|---|
bool8 | ||
0 | 1 | |
1 | 1 |
datatable.rowcount()¶
For each row, count the number of non-missing values in cols.
Parameters¶
cols: FExpr
    Input columns.
return: FExpr
    f-expression consisting of one int32 column and the same number of
    rows as in cols.
Examples¶
from datatable import dt, f
DT = dt.Frame({"A": [1, 1, 2, 1, 2],
"B": [None, 2, 3, 4, None],
"C":[True, False, False, True, True]})
DT
A | B | C | ||
---|---|---|---|---|
int32 | int32 | bool8 | ||
0 | 1 | NA | 1 | |
1 | 1 | 2 | 0 | |
2 | 2 | 3 | 0 | |
3 | 1 | 4 | 1 | |
4 | 2 | NA | 1 |
Note the exclusion of null values in the count:
DT[:, dt.rowcount(f[:])]
C0 | ||
---|---|---|
int32 | ||
0 | 2 | |
1 | 3 | |
2 | 3 | |
3 | 3 | |
4 | 2 |
datatable.rowfirst()¶
For each row, find the first non-missing value in cols. If all values
in a row are missing, then this function will also produce a missing
value.
Parameters¶
cols: FExpr
    Input columns.
return: FExpr
    f-expression consisting of one column and the same number of rows
    as in cols.
except: TypeError
    The exception is raised when input columns have incompatible
    types.
Examples¶
from datatable import dt, f
DT = dt.Frame({"A": [1, 1, 2, 1, 2],
"B": [None, 2, 3, 4, None],
"C": [True, False, False, True, True]})
DT
A | B | C | ||
---|---|---|---|---|
int32 | int32 | bool8 | ||
0 | 1 | NA | 1 | |
1 | 1 | 2 | 0 | |
2 | 2 | 3 | 0 | |
3 | 1 | 4 | 1 | |
4 | 2 | NA | 1 |
DT[:, dt.rowfirst(f[:])]
C0 | ||
---|---|---|
int32 | ||
0 | 1 | |
1 | 1 | |
2 | 2 | |
3 | 1 | |
4 | 2 |
DT[:, dt.rowfirst(f['B', 'C'])]
C0 | ||
---|---|---|
int32 | ||
0 | 1 | |
1 | 2 | |
2 | 3 | |
3 | 4 | |
4 | 1 |
datatable.rowlast()¶
For each row, find the last non-missing value in cols. If all values
in a row are missing, then this function will also produce a missing
value.
Parameters¶
cols: Expr
    Input columns.
return: Expr
    f-expression consisting of one column and the same number of rows
    as in cols.
except: TypeError
    The exception is raised when input columns have incompatible
    types.
Examples¶
from datatable import dt, f
DT = dt.Frame({"A": [1, 1, 2, 1, 2],
"B": [None, 2, 3, 4, None],
"C":[True, False, False, True, True]})
DT
A | B | C | ||
---|---|---|---|---|
int32 | int32 | bool8 | ||
0 | 1 | NA | 1 | |
1 | 1 | 2 | 0 | |
2 | 2 | 3 | 0 | |
3 | 1 | 4 | 1 | |
4 | 2 | NA | 1 |
DT[:, dt.rowlast(f[:])]
C0 | ||
---|---|---|
int32 | ||
0 | 1 | |
1 | 0 | |
2 | 0 | |
3 | 1 | |
4 | 1 |
DT[[1, 3], 'C'] = None
DT
A | B | C | ||
---|---|---|---|---|
int32 | int32 | bool8 | ||
0 | 1 | NA | 1 | |
1 | 1 | 2 | NA | |
2 | 2 | 3 | 0 | |
3 | 1 | 4 | NA | |
4 | 2 | NA | 1 |
DT[:, dt.rowlast(f[:])]
C0 | ||
---|---|---|
int32 | ||
0 | 1 | |
1 | 2 | |
2 | 0 | |
3 | 4 | |
4 | 1 |
See Also¶
- rowfirst() – find the first non-missing value row-wise.
datatable.rowmax()¶
For each row, find the largest value among the columns from cols.
Parameters¶
cols: FExpr
    Input columns.
except: TypeError
    The exception is raised when cols has non-numeric columns.
Examples¶
from datatable import dt, f
DT = dt.Frame({"A": [1, 1, 2, 1, 2],
"B": [None, 2, 3, 4, None],
"C":[True, False, False, True, True]})
DT
A | B | C | ||
---|---|---|---|---|
int32 | int32 | bool8 | ||
0 | 1 | NA | 1 | |
1 | 1 | 2 | 0 | |
2 | 2 | 3 | 0 | |
3 | 1 | 4 | 1 | |
4 | 2 | NA | 1 |
DT[:, dt.rowmax(f[:])]
C0 | ||
---|---|---|
int32 | ||
0 | 1 | |
1 | 2 | |
2 | 3 | |
3 | 4 | |
4 | 2 |
datatable.rowmean()¶
For each row, find the mean value among the columns from cols,
skipping missing values. If a row contains only missing values, this
function will produce a missing value too.
Parameters¶
cols: FExpr
    Input columns.
except: TypeError
    The exception is raised when cols has non-numeric columns.
Examples¶
from datatable import dt, f, rowmean
DT = dt.Frame({'a': [None, True, True, True],
'b': [2, 2, 1, 0],
'c': [3, 3, 1, 0],
'd': [0, 4, 6, 0],
'q': [5, 5, 1, 0]})
DT
a | b | c | d | q | ||
---|---|---|---|---|---|---|
bool8 | int32 | int32 | int32 | int32 | ||
0 | NA | 2 | 3 | 0 | 5 | |
1 | 1 | 2 | 3 | 4 | 5 | |
2 | 1 | 1 | 1 | 6 | 1 | |
3 | 1 | 0 | 0 | 0 | 0 |
Get the row mean of all columns:
DT[:, rowmean(f[:])]
C0 | ||
---|---|---|
float64 | ||
0 | 2.5 | |
1 | 3 | |
2 | 2 | |
3 | 0.2 |
Get the row mean of specific columns:
DT[:, rowmean(f['a', 'b', 'd'])]
C0 | ||
---|---|---|
float64 | ||
0 | 1 | |
1 | 2.33333 | |
2 | 2.66667 | |
3 | 0.333333 |
datatable.rowmin()¶
For each row, find the smallest value among the columns from cols,
excluding missing values.
Parameters¶
cols: FExpr
    Input columns.
except: TypeError
    The exception is raised when cols has non-numeric columns.
Examples¶
from datatable import dt, f
DT = dt.Frame({"A": [1, 1, 2, 1, 2],
"B": [None, 2, 3, 4, None],
"C":[True, False, False, True, True]})
DT
A | B | C | ||
---|---|---|---|---|
int32 | int32 | bool8 | ||
0 | 1 | NA | 1 | |
1 | 1 | 2 | 0 | |
2 | 2 | 3 | 0 | |
3 | 1 | 4 | 1 | |
4 | 2 | NA | 1 |
DT[:, dt.rowmin(f[:])]
C0 | ||
---|---|---|
int32 | ||
0 | 1 | |
1 | 0 | |
2 | 0 | |
3 | 1 | |
4 | 1 |
datatable.rowsd()¶
For each row, find the standard deviation among the columns from cols,
skipping missing values. If a row contains only missing values, this
function will produce a missing value too.
Parameters¶
cols: FExpr
    Input columns.
except: TypeError
    The exception is raised when cols has non-numeric columns.
Examples¶
from datatable import dt, f, rowsd
DT = dt.Frame({'name': ['A', 'B', 'C', 'D', 'E'],
'group': ['mn', 'mn', 'kl', 'kl', 'fh'],
'S1': [1, 4, 5, 6, 7],
'S2': [2, 3, 8, 5, 1],
'S3': [8, 5, 2, 5, 3]})
DT
name | group | S1 | S2 | S3 | ||
---|---|---|---|---|---|---|
str32 | str32 | int32 | int32 | int32 | ||
0 | A | mn | 1 | 2 | 8 | |
1 | B | mn | 4 | 3 | 5 | |
2 | C | kl | 5 | 8 | 2 | |
3 | D | kl | 6 | 5 | 5 | |
4 | E | fh | 7 | 1 | 3 |
Get the row standard deviation for all integer columns:
DT[:, rowsd(f[int])]
C0 | ||
---|---|---|
float64 | ||
0 | 3.78594 | |
1 | 1 | |
2 | 3 | |
3 | 0.57735 | |
4 | 3.05505 |
Get the row standard deviation for some columns:
DT[:, rowsd(f[2, 3])]
C0 | ||
---|---|---|
float64 | ||
0 | 0.707107 | |
1 | 0.707107 | |
2 | 2.12132 | |
3 | 0.707107 | |
4 | 4.24264 |
datatable.rowsum()¶
For each row, calculate the sum of all values in cols. Missing values
are treated as zeros and skipped during the calculation.
Parameters¶
cols: FExpr
    Input columns.
except: TypeError
    The exception is raised when one of the columns from cols has a
    non-numeric type.
Examples¶
from datatable import dt, f, rowsum
DT = dt.Frame({'a': [1,2,3],
'b': [2,3,4],
'c':['dd','ee','ff'],
'd':[5,9,1]})
DT
a | b | c | d | ||
---|---|---|---|---|---|
int32 | int32 | str32 | int32 | ||
0 | 1 | 2 | dd | 5 | |
1 | 2 | 3 | ee | 9 | |
2 | 3 | 4 | ff | 1 |
DT[:, rowsum(f[int])]
C0 | ||
---|---|---|
int32 | ||
0 | 8 | |
1 | 14 | |
2 | 8 |
DT[:, rowsum(f.a, f.b)]
C0 | ||
---|---|---|
int32 | ||
0 | 3 | |
1 | 5 | |
2 | 7 |
The above code could also be written as:
DT[:, f.a + f.b]
C0 | ||
---|---|---|
int32 | ||
0 | 3 | |
1 | 5 | |
2 | 7 |
See Also¶
- rowcount() – count non-missing values row-wise.
datatable.sd()¶
Calculate the standard deviation for each column from cols.
Parameters¶
cols: Expr
    Input columns.
return: Expr
    f-expression having one row, and the same names and number of
    columns as in cols. The column stypes are float32 for float32
    columns, and float64 for all the other numeric types.
except: TypeError
    The exception is raised when one of the columns from cols has a
    non-numeric type.
Examples¶
from datatable import dt, f
DT = dt.Frame(A = [0, 1, 2, 3], B = [0, 2, 4, 6])
DT
A | B | ||
---|---|---|---|
int32 | int32 | ||
0 | 0 | 0 | |
1 | 1 | 2 | |
2 | 2 | 4 | |
3 | 3 | 6 |
Get the standard deviation of column A:
DT[:, dt.sd(f.A)]
A | ||
---|---|---|
float64 | ||
0 | 1.29099 |
Get the standard deviation of columns A and B:
DT[:, dt.sd([f.A, f.B])]
A | B | ||
---|---|---|---|
float64 | float64 | ||
0 | 1.29099 | 2.58199 |
datatable.setdiff()¶
Find the set difference between frame0 and the other frames.

Each frame should have only a single column or be empty. The values in
each frame will be treated as a set, and this function will compute
the set difference between frame0 and the union of the other frames,
returning those values that are present in frame0, but not present in
any of the frames.
Parameters¶
frame0: Frame
    Input single-column frame.
*frames: Frame | Frame | ...
    Input single-column frames.
return: Frame
    A single-column frame. The column stype is the smallest common
    stype of columns from the frames.
Examples¶
from datatable import dt
s1 = dt.Frame([4, 5, 6, 20, 42])
s2 = dt.Frame([1, 2, 3, 5, 42])
s1
C0 | ||
---|---|---|
int32 | ||
0 | 4 | |
1 | 5 | |
2 | 6 | |
3 | 20 | |
4 | 42 |
s2
C0 | ||
---|---|---|
int32 | ||
0 | 1 | |
1 | 2 | |
2 | 3 | |
3 | 5 | |
4 | 42 |
Set difference of the two frames:
dt.setdiff(s1, s2)
C0 | ||
---|---|---|
int32 | ||
0 | 4 | |
1 | 6 | |
2 | 20 |
See Also¶
- intersect() – calculate the set intersection of values in the frames.
- symdiff() – calculate the symmetric difference between the sets of values in the frames.
- union() – calculate the union of values in the frames.
- unique() – find unique values in a frame.
datatable.shift()¶
Produce a column obtained from col by shifting it n rows forward.

The shift amount, n, can be either positive or negative. If positive,
a “lag” column is created; if negative, it will be a “lead” column.

The shifted column will have the same number of rows as the original
column, with n observations at the beginning becoming missing, and n
observations at the end discarded.

This function is group-aware, i.e. in the presence of a groupby it
will perform the shift separately within each group.
Examples¶
from datatable import dt, f, by
DT = dt.Frame({"object": [1, 1, 1, 2, 2],
"period": [1, 2, 4, 4, 23],
"value": [24, 67, 89, 5, 23]})
DT
object | period | value | ||
---|---|---|---|---|
int32 | int32 | int32 | ||
0 | 1 | 1 | 24 | |
1 | 1 | 2 | 67 | |
2 | 1 | 4 | 89 | |
3 | 2 | 4 | 5 | |
4 | 2 | 23 | 23 |
Shift forward - Create a “lag” column:
DT[:, dt.shift(f.period, n = 3)]
period | ||
---|---|---|
int32 | ||
0 | NA | |
1 | NA | |
2 | NA | |
3 | 1 | |
4 | 2 |
Shift backwards - Create “lead” columns:
DT[:, dt.shift(f[:], n = -3)]
object | period | value | ||
---|---|---|---|---|
int32 | int32 | int32 | ||
0 | 2 | 4 | 5 | |
1 | 2 | 23 | 23 | |
2 | NA | NA | NA | |
3 | NA | NA | NA | |
4 | NA | NA | NA |
Shift in the presence of by()
:
DT[:, f[:].extend({"prev_value": dt.shift(f.value)}), by("object")]
object | period | value | prev_value | ||
---|---|---|---|---|---|
int32 | int32 | int32 | int32 | ||
0 | 1 | 1 | 24 | NA | |
1 | 1 | 2 | 67 | 24 | |
2 | 1 | 4 | 89 | 67 | |
3 | 2 | 4 | 5 | NA | |
4 | 2 | 23 | 23 | 5 |
datatable.sort()¶
Sort clause for use in Frame’s square-bracket selector.
When a sort()
object is present inside a DT[i, j, ...]
expression, it will sort the rows of the resulting Frame according
to the columns cols
passed as the arguments to sort()
.
When used together with by()
, the sort clause applies after the
group-by, i.e. we sort elements within each group. Note, however,
that because we use stable sorting, the operations of grouping and
sorting are commutative: the result of applying groupby and then sort
is the same as the result of sorting first and then doing groupby.
When used together with i
(row filter), the i
filter is
applied after the sorting. For example:
DT[:10, :, sort(f.Highscore, reverse=True)]
will select the first 10 records from the frame DT
ordered by
the Highscore column.
Examples¶
from datatable import dt, f, by
DT = dt.Frame({"col1": ["A", "A", "B", None, "D", "C"],
"col2": [2, 1, 9, 8, 7, 4],
"col3": [0, 1, 9, 4, 2, 3],
"col4": [1, 2, 3, 3, 2, 1]})
DT
col1 | col2 | col3 | col4 | ||
---|---|---|---|---|---|
str32 | int32 | int32 | int32 | ||
0 | A | 2 | 0 | 1 | |
1 | A | 1 | 1 | 2 | |
2 | B | 9 | 9 | 3 | |
3 | NA | 8 | 4 | 3 | |
4 | D | 7 | 2 | 2 | |
5 | C | 4 | 3 | 1 |
Sort by a single column:
DT[:, :, dt.sort("col1")]
col1 | col2 | col3 | col4 | ||
---|---|---|---|---|---|
str32 | int32 | int32 | int32 | ||
0 | NA | 8 | 4 | 3 | |
1 | A | 2 | 0 | 1 | |
2 | A | 1 | 1 | 2 | |
3 | B | 9 | 9 | 3 | |
4 | C | 4 | 3 | 1 | |
5 | D | 7 | 2 | 2 |
Sort by multiple columns:
DT[:, :, dt.sort("col2", "col3")]
col1 | col2 | col3 | col4 | ||
---|---|---|---|---|---|
str32 | int32 | int32 | int32 | ||
0 | A | 1 | 1 | 2 | |
1 | A | 2 | 0 | 1 | |
2 | C | 4 | 3 | 1 | |
3 | D | 7 | 2 | 2 | |
4 | NA | 8 | 4 | 3 | |
5 | B | 9 | 9 | 3 |
Sort in descending order:
DT[:, :, dt.sort(-f.col1)]
col1 | col2 | col3 | col4 | ||
---|---|---|---|---|---|
str32 | int32 | int32 | int32 | ||
0 | NA | 8 | 4 | 3 | |
1 | D | 7 | 2 | 2 | |
2 | C | 4 | 3 | 1 | |
3 | B | 9 | 9 | 3 | |
4 | A | 2 | 0 | 1 | |
5 | A | 1 | 1 | 2 |
The frame can also be sorted in descending order by setting the reverse
parameter to True
:
DT[:, :, dt.sort("col1", reverse=True)]
col1 | col2 | col3 | col4 | ||
---|---|---|---|---|---|
str32 | int32 | int32 | int32 | ||
0 | NA | 8 | 4 | 3 | |
1 | D | 7 | 2 | 2 | |
2 | C | 4 | 3 | 1 | |
3 | B | 9 | 9 | 3 | |
4 | A | 2 | 0 | 1 | |
5 | A | 1 | 1 | 2 |
By default, when sorting, null values are placed at the top; to relocate null values to the bottom, pass last
to the na_position
parameter:
DT[:, :, dt.sort("col1", na_position="last")]
col1 | col2 | col3 | col4 | ||
---|---|---|---|---|---|
str32 | int32 | int32 | int32 | ||
0 | A | 2 | 0 | 1 | |
1 | A | 1 | 1 | 2 | |
2 | B | 9 | 9 | 3 | |
3 | C | 4 | 3 | 1 | |
4 | D | 7 | 2 | 2 | |
5 | NA | 8 | 4 | 3 |
Passing remove
to na_position
completely excludes any row with null values from the sorted output:
DT[:, :, dt.sort("col1", na_position="remove")]
col1 | col2 | col3 | col4 | ||
---|---|---|---|---|---|
str32 | int32 | int32 | int32 | ||
0 | A | 2 | 0 | 1 | |
1 | A | 1 | 1 | 2 | |
2 | B | 9 | 9 | 3 | |
3 | C | 4 | 3 | 1 | |
4 | D | 7 | 2 | 2 |
Sort by multiple columns, descending and ascending order:
DT[:, :, dt.sort(-f.col2, f.col3)]
col1 | col2 | col3 | col4 | ||
---|---|---|---|---|---|
str32 | int32 | int32 | int32 | ||
0 | B | 9 | 9 | 3 | |
1 | NA | 8 | 4 | 3 | |
2 | D | 7 | 2 | 2 | |
3 | C | 4 | 3 | 1 | |
4 | A | 2 | 0 | 1 | |
5 | A | 1 | 1 | 2 |
The code above can be replicated by passing a list of booleans to
reverse:
DT[:, :, dt.sort("col2", "col3", reverse=[True, False])]
col1 | col2 | col3 | col4 | ||
---|---|---|---|---|---|
str32 | int32 | int32 | int32 | ||
0 | B | 9 | 9 | 3 | |
1 | NA | 8 | 4 | 3 | |
2 | D | 7 | 2 | 2 | |
3 | C | 4 | 3 | 1 | |
4 | A | 2 | 0 | 1 | |
5 | A | 1 | 1 | 2 |
In the presence of by()
, sort()
sorts within each group:
DT[:, :, by("col4"), dt.sort(f.col2)]
col4 | col1 | col2 | col3 | ||
---|---|---|---|---|---|
int32 | str32 | int32 | int32 | ||
0 | 1 | A | 2 | 0 | |
1 | 1 | C | 4 | 3 | |
2 | 2 | A | 1 | 1 | |
3 | 2 | D | 7 | 2 | |
4 | 3 | NA | 8 | 4 | |
5 | 3 | B | 9 | 9 |
datatable.split_into_nhot()¶
This function is deprecated and will be removed in version 1.1.0.
Please use dt.str.split_into_nhot()
instead.
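A minimal migration sketch, assuming the replacement keeps the calling convention of the deprecated function (a single-column frame as input):

from datatable import dt

DT = dt.Frame(A=["cat,dog", "dog", "cat"])
dt.str.split_into_nhot(DT)   # assumption: same single-column-frame argument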
datatable.sum()¶
Calculate the sum of values for each column from cols.
Parameters¶
cols: Expr
    Input columns.
return: Expr
    f-expression having one row, and the same names and number of
    columns as in cols. The column stypes are int64 for boolean and
    integer columns, float32 for float32 columns and float64 for
    float64 columns.
except: TypeError
    The exception is raised when one of the columns from cols has a
    non-numeric type.
Examples¶
from datatable import dt, f, by
df = dt.Frame({'A': [1, 1, 2, 1, 2],
'B': [None, 2, 3, 4, 5],
'C': [1, 2, 1, 1, 2]})
df
A | B | C | ||
---|---|---|---|---|
int32 | int32 | int32 | ||
0 | 1 | NA | 1 | |
1 | 1 | 2 | 2 | |
2 | 2 | 3 | 1 | |
3 | 1 | 4 | 1 | |
4 | 2 | 5 | 2 |
Get the sum of column A:
df[:, dt.sum(f.A)]
A | ||
---|---|---|
int64 | ||
0 | 7 |
Get the sum of multiple columns:
df[:, [dt.sum(f.A), dt.sum(f.B)]]
A | B | ||
---|---|---|---|
int64 | int64 | ||
0 | 7 | 14 |
Same as above, but more convenient:
df[:, dt.sum(f[:2])]
A | B | ||
---|---|---|---|
int64 | int64 | ||
0 | 7 | 14 |
In the presence of by()
, it returns the sum of the specified columns per group:
df[:, [dt.sum(f.A), dt.sum(f.B)], by(f.C)]
C | A | B | ||
---|---|---|---|---|
int32 | int64 | int64 | ||
0 | 1 | 4 | 7 | |
1 | 2 | 3 | 7 |
datatable.symdiff()¶
Find the symmetric difference between the sets of values in all frames.

Each frame should have only a single column or be empty. The values in
each frame will be treated as a set, and this function will perform
the symmetric difference operation on these sets.

The symmetric difference of two frames consists of those values that
are present in either of the frames, but not in both. The symmetric
difference of more than two frames consists of those values that are
present in an odd number of frames.
Parameters¶
*frames: Frame | Frame | ...
    Input single-column frames.
return: Frame
    A single-column frame. The column stype is the smallest common
    stype of columns from the frames.
except: ValueError
    raised when one of the input frames has more than one column.
except: NotImplementedError
    raised when one of the columns has stype obj64.
Examples¶
from datatable import dt
df = dt.Frame({'A': [1, 1, 2, 1, 2],
'B': [None, 2, 3, 4, 5],
'C': [1, 2, 1, 1, 2]})
df
A | B | C | ||
---|---|---|---|---|
int32 | int32 | int32 | ||
0 | 1 | NA | 1 | |
1 | 1 | 2 | 2 | |
2 | 2 | 3 | 1 | |
3 | 1 | 4 | 1 | |
4 | 2 | 5 | 2 |
Symmetric difference of all the columns in the entire frame; note that
each column is treated as a separate frame:
dt.symdiff(*df)
A | ||
---|---|---|
int32 | ||
0 | NA | |
1 | 2 | |
2 | 3 | |
3 | 4 | |
4 | 5 |
Symmetric difference between two frames:
dt.symdiff(df["A"], df["B"])
A | ||
---|---|---|
int32 | ||
0 | NA | |
1 | 1 | |
2 | 3 | |
3 | 4 | |
4 | 5 |
See Also¶
- intersect() – calculate the set intersection of values in the frames.
- setdiff() – calculate the set difference between the frames.
- union() – calculate the union of values in the frames.
- unique() – find unique values in a frame.
datatable.union()¶
Find the union of values in all frames.

Each frame should have only a single column or be empty. The values in
each frame will be treated as a set, and this function will perform
the union operation on these sets.

The dt.union(*frames) operation is equivalent to
dt.unique(dt.rbind(*frames)).
Parameters¶
*frames: Frame | Frame | ...
    Input single-column frames.
return: Frame
    A single-column frame. The column stype is the smallest common
    stype of columns in the frames.
except: ValueError
    raised when one of the input frames has more than one column.
except: NotImplementedError
    raised when one of the columns has stype obj64.
Examples¶
from datatable import dt
df = dt.Frame({'A': [1, 1, 2, 1, 2],
'B': [None, 2, 3, 4, 5],
'C': [1, 2, 1, 1, 2]})
df
A | B | C | ||
---|---|---|---|---|
int32 | int32 | int32 | ||
0 | 1 | NA | 1 | |
1 | 1 | 2 | 2 | |
2 | 2 | 3 | 1 | |
3 | 1 | 4 | 1 | |
4 | 2 | 5 | 2 |
Union of all the columns in a frame:
dt.union(*df)
A | ||
---|---|---|
int32 | ||
0 | NA | |
1 | 1 | |
2 | 2 | |
3 | 3 | |
4 | 4 | |
5 | 5 |
Union of two frames:
dt.union(df["A"], df["C"])
A | ||
---|---|---|
int32 | ||
0 | 1 | |
1 | 2 |
See Also¶
- intersect() – calculate the set intersection of values in the frames.
- setdiff() – calculate the set difference between the frames.
- symdiff() – calculate the symmetric difference between the sets of values in the frames.
- unique() – find unique values in a frame.
datatable.unique()¶
Find the unique values in all the columns of the frame.
This function sorts the values in order to find the uniques, so the return values will be ordered. However, this should be considered an implementation detail: in the future datatable may switch to a different algorithm, such as hash-based, which may return the results in a different order.
Parameters¶
frame: Frame
    Input frame.
except: NotImplementedError
    The exception is raised when one of the frame columns has stype
    obj64.
Examples¶
from datatable import dt
df = dt.Frame({'A': [1, 1, 2, 1, 2],
'B': [None, 2, 3, 4, 5],
'C': [1, 2, 1, 1, 2]})
df
A | B | C | ||
---|---|---|---|---|
int32 | int32 | int32 | ||
0 | 1 | NA | 1 | |
1 | 1 | 2 | 2 | |
2 | 2 | 3 | 1 | |
3 | 1 | 4 | 1 | |
4 | 2 | 5 | 2 |
Unique values in the entire frame:
dt.unique(df)
C0 | ||
---|---|---|
int32 | ||
0 | NA | |
1 | 1 | |
2 | 2 | |
3 | 3 | |
4 | 4 | |
5 | 5 |
Unique values in a frame with a single column:
dt.unique(df["A"])
A | ||
---|---|---|
int32 | ||
0 | 1 | |
1 | 2 |
See Also¶
- intersect() – calculate the set intersection of values in the frames.
- setdiff() – calculate the set difference between the frames.
- symdiff() – calculate the symmetric difference between the sets of values in the frames.
- union() – calculate the union of values in the frames.
datatable.update()¶
Create new or update existing columns within a frame.
This expression is intended to be used in the “j” place of a DT[i, j]
call. It takes an arbitrary number of key/value pairs, each describing
a column name and the expression for how that column has to be
created/updated.
Examples¶
from datatable import dt, f, by, update
DT = dt.Frame([range(5), [4, 3, 9, 11, -1]], names=("A", "B"))
DT
A | B | ||
---|---|---|---|
int32 | int32 | ||
0 | 0 | 4 | |
1 | 1 | 3 | |
2 | 2 | 9 | |
3 | 3 | 11 | |
4 | 4 | -1 |
Create new columns and update existing columns:
DT[:, update(C = f.A * 2,
D = f.B // 3,
A = f.A * 4,
B = f.B + 1)]
DT
A | B | C | D | ||
---|---|---|---|---|---|
int32 | int32 | int32 | int32 | ||
0 | 0 | 5 | 0 | 1 | |
1 | 4 | 4 | 2 | 1 | |
2 | 8 | 10 | 4 | 3 | |
3 | 12 | 12 | 6 | 3 | |
4 | 16 | 0 | 8 | -1 |
Add a new column with unpacking; this can be handy for dynamically
adding columns with dictionary comprehensions, or if the names are not
valid python keywords:
DT[:, update(**{"extra column": f.A + f.B + f.C + f.D})]
DT
A | B | C | D | extra column | ||
---|---|---|---|---|---|---|
int32 | int32 | int32 | int32 | int32 | ||
0 | 0 | 5 | 0 | 1 | 6 | |
1 | 4 | 4 | 2 | 1 | 11 | |
2 | 8 | 10 | 4 | 3 | 25 | |
3 | 12 | 12 | 6 | 3 | 33 | |
4 | 16 | 0 | 8 | -1 | 23 |
You can update a subset of data:
DT[f.A > 10, update(A = f.A * 5)]
DT
A | B | C | D | extra column | ||
---|---|---|---|---|---|---|
int32 | int32 | int32 | int32 | int32 | ||
0 | 0 | 5 | 0 | 1 | 6 | |
1 | 4 | 4 | 2 | 1 | 11 | |
2 | 8 | 10 | 4 | 3 | 25 | |
3 | 60 | 12 | 6 | 3 | 33 | |
4 | 80 | 0 | 8 | -1 | 23 |
You can also add a new column or update an existing column in a groupby
operation, similar to SQL’s window
operation, or pandas transform()
:
df = dt.Frame("""exporter assets liabilities
False 5 1
True 10 8
False 3 1
False 24 20
False 40 2
True 12 11""")
# Get the ratio for each row per group
df[:,
update(ratio = dt.sum(f.liabilities) * 100 / dt.sum(f.assets)),
by(f.exporter)]
df
exporter | assets | liabilities | ratio | ||
---|---|---|---|---|---|
bool8 | int32 | int32 | float64 | ||
0 | 0 | 5 | 1 | 33.3333 | |
1 | 1 | 10 | 8 | 86.3636 | |
2 | 0 | 3 | 1 | 33.3333 | |
3 | 0 | 24 | 20 | 33.3333 | |
4 | 0 | 40 | 2 | 33.3333 | |
5 | 1 | 12 | 11 | 86.3636 |
Development¶
Contributing¶
datatable
is an open-source project released under the Mozilla Public
License v2. Open source projects live by their user and developer communities.
We welcome and encourage your contributions of any kind!
No matter what your skill set or level of engagement is with
datatable, you can help others by improving the ecosystem of
documentation, bug reports and feature request tickets, and code.
We invite anyone who is interested to contribute, whether through pull requests, tests, GitHub issues, feature suggestions, or even generic discussion.
If you have questions about using datatable
, post them on Stack Overflow
using the [py-datatable]
tag.
Preparing local copy of datatable repository¶
If this is the first time you’re contributing to datatable
, then follow
these steps in order to set up your local development environment:
1. Make sure you have the command-line tools git and make installed.
   You should also have a text editor or an IDE of your choice.

2. Go to https://github.com/h2oai/datatable and click the “fork”
   button in the top right corner. You may need to create a GitHub
   account if you don’t have one already.

3. Clone the repository on your local computer:

   $ git clone https://github.com/your_user_name/datatable

4. Lastly, add the original datatable repository as the upstream:

   $ cd datatable
   $ git remote add upstream https://github.com/h2oai/datatable
   $ git fetch upstream
   $ git config branch.main.remote upstream
   $ git config branch.main.merge refs/heads/main
This completes the setup of your local datatable fork. Make sure to note
the location of the datatable/
directory that you created in step 3.
You will need to return there when issuing any of the subsequent git
commands detailed further.
Creating a contribution¶
Start by fetching any changes that might have occurred since the last time you were working with the repository:
$ git checkout main
$ git pull
Then create a new local branch where you will be working on your changes. The name of the branch should be a short identifier that will help you recognize what this branch is about. It’s a good idea to prefix the branch name with your initials so that it doesn’t conflict with branches from other developers:
$ git checkout -b your_branch_name
After this it is time to make the desired changes to the project. There are
separate guides on how to work with documentation and how to work with core
code changes. It is also a good idea to commit your work frequently,
using git add and git commit.
Note: While many projects ask for detailed and informative commit messages, we don’t. Our policy is to squash all commits when merging a pull request, and therefore the only detailed message that is needed is the PR description.
When you think your proposed change is ready, verify that everything is in
order by running git status
– it should say “nothing to commit, working
tree clean”. At this point the changes need to be pushed into the “origin”,
which is your repository fork:
$ git push origin your_branch_name
Then go back to the GitHub website to your fork of the datatable repository
https://github.com/your_user_name/datatable. There you should see a pop-up
that notifies about the changes pushed to your_branch_name
. There will also
be a green button “Compare & pull request”. Pressing that button you will see
an “Open a pull request” form.
When opening a pull request, make sure to provide an informative title and a detailed description of the proposed changes. If the pull request directly addresses one of the issues, make sure to note that in the text of the PR description.
Make sure the checkbox “Allow edits by maintainers” is turned on, and then press “Create pull request”.
At this point your Pull Request will be scheduled for review at the main datatable repository. Once reviewed, you may be asked to change something, in which case you can make the necessary modifications locally, then commit and push them.
Contributing documentation¶
The documentation for the datatable project is written entirely in the
ReStructured Text (RST) format and rendered using the Sphinx engine.
These technologies are standard for Python projects.
The basic workflow for developing documentation, after
setting up a local datatable repository, is to go into
the docs/
directory and run
$ make html
After that, if there were no errors, the documentation can be viewed locally
by opening the file docs/_build/html/index.html
in a browser.
The make html
command needs to be re-run after every change you make.
Occasionally you may also need to make clean
if something doesn’t seem to
work properly.
Basic formatting¶
At the most basic level, an RST document is plain text, where
paragraphs are separated by empty lines:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut
aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in
voluptate velit esse cillum dolore eu fugiat nulla pariatur.
The line breaks within each paragraph are ignored; on the rendered
page the lines will be as wide as is necessary to fill the page. With
that in mind, we ask that you avoid lines longer than 80 characters
where possible. This makes it much easier to work with the source on
small-screen devices.
Page headings are a simple line of underlined text:
Heading Level 1
===============
Heading Level 2
---------------
Heading Level 3
~~~~~~~~~~~~~~~
Each document must have exactly one level-1 heading; otherwise the page would not render properly.
Basic bold text, italic text and literal text are written as follows
(note that literals use double backticks, which is a frequent cause of
formatting errors):
**bold text**
*italic text*
``literal text``
Bulleted and ordered lists are done similarly to Markdown:
- list item 1;
- list item 2;
- a longer list item, that might need to be
be carried over to the next line.
1. ordered list item 1
2. ordered list item 2
This is the next paragraph of list item 2.
The content of each list item can be arbitrarily complex, as long as it is properly indented.
Code blocks¶
There are two main ways to format a block of code. The simplest way is to finish
a paragraph with a double-colon ::
and then start the next paragraph (code)
indented with 4 spaces:
Here is a code example::
>>> print("Hello, world!", flush=True)
In this case the code will be highlighted assuming it is a python sample. If the
code corresponds to some other language, you’ll need to use an explicit
code-block
directive:
.. code-block:: shell
$ pip install datatable
This directive allows you to explicitly select the language of your code snippet,
which will affect how it is highlighted. The code inside code-block
must be
indented, and there has to be an empty line between the .. code-block::
declaration and the actual code.
When writing python code examples, the best practice is to use python console
format, i.e. prepend all input lines with >>>
(or ...
for continuation
lines), and keep all output lines without a prefix. When documenting an error,
remove all traceback information and leave only the error message:
>>> import datatable as dt
>>> DT = dt.Frame(A=[5], B=[17], D=['zelo'])
>>> DT
| A B D
| int32 int32 str32
-- + ----- ----- -----
0 | 5 17 zelo
[1 row x 3 columns]
>>> DT.hello()
AttributeError: 'datatable.Frame' object has no attribute 'hello'
This code snippet will be rendered as follows:
import datatable as dt
DT = dt.Frame(A=[5], B=[17], D=['zelo'])
DT
A | B | D | ||
---|---|---|---|---|
int32 | int32 | str32 | ||
0 | 5 | 17 | zelo |
DT.hello()
Hyperlinks¶
If you want to create a link to an external website, then you would do it in two parts. First, the text of the link should be surrounded by backticks and followed by an underscore. Second, the URL of the link is declared at the bottom of the page (or section) via a special directive:
Say you want to create a link to an `example website`_. The text "example
website" in the previous sentence will be turned into a link, whose URL is
declared somewhere later on the page.
And here we declare the target (the backticks are actually optional):
.. _`example website`: https://example.com/
If you want to create a link to another page or section of this documentation, then it is done similarly: first you create the target, then refer to that target within the document.
Creating the target is done similarly to how we declared an external URL, only this time you simply omit the URL. The RST engine will then assume that the target points to the following element on the page (which should usually be a section heading, an image, a table, etc):
.. _`hello world`:
Hello world example
~~~~~~~~~~~~~~~~~~~
Then you can refer to this target the same way that you referred to an external
URL in the previous example. However, this would only work if you refer to this
anchor within the same page. If you want to refer to this anchor within another
rst document, then you would need to use the :ref:
role:
We can refer to "hello world example" even from a different document
like this: :ref:`hello world`. Also, you can use the following syntax to
refer to the same anchor but change its description text:
:ref:`the simplest program <hello world>`.
Lastly, there are also special auto-generated targets in the API Reference
part of the documentation. These targets describe each class, function, method,
and other exported symbols of the datatable
module. In order to refer to
these targets, special syntax is used:
:mod:`datatable`
:class:`datatable.Frame`
:meth:`datatable.Frame.to_csv`
:func:`datatable.fread`
which will be rendered as datatable
, dt.Frame
,
dt.Frame.to_csv()
, dt.fread()
.
The “renamed link” syntax can also be used:
:func:`fread(input) <datatable.fread>`
If repeating the datatable.
part is tedious, then you can add the following
declaration at the top of the page:
.. py:currentmodule:: datatable
Note that some of these links may render in red. It means the documentation for
the referenced function/class/object is missing and still needs to be added:
datatable.missing_function()
.
Advanced directives¶
All rst documents are arranged into a tree. All non-leaf nodes of this tree
must include a .. toctree::
directive, which may also be declared hidden:
.. toctree::
:hidden:
child_doc_1
Explicit name <child_doc_2>
The .. image::
directive can be used to insert an image, which may also be
a link:
.. image:: <image URL>
:target: <target URL if the image is a link>
In order to note that some functionality was added or changed in a specific version, use:
.. x-version-added:: 0.10.0
.. x-version-deprecated:: 1.0.0
.. x-version-changed:: 0.11.0
Here's what changed: blah-blah-blah
The .. seealso:: directive adds a Wikipedia-style “see also:” entry at
the beginning of a section. The argument of this directive should
contain a link to the content that you want the user to see. This
directive is best included immediately after a heading:
.. seealso:: :ref:`columnsets`
The .. x-comparison-table:: directive allows you to create a
two-column table specifically designed for comparing two entities
across multiple comparison points. It is primarily used to create the
“compare datatable with another library” manual pages. The content of
this directive is comprised of multiple “sections” separated with
====, and each section has 2 or 3 parts (separated with ----): an
optional common header, then the content of the first column, and then
the second:
.. x-comparison-table::
:header1: datatable
:header2: another-library
Section 1 header
----
Column 1
----
Column 2
====
Section 2 header
----
Column 1
----
Column 2
Changelog support¶
RST is a language that supports extensions, and one of the custom extensions that we use helps us maintain a changelog. First, the .. changelog:: directive, used in the releases/vN.N.N.rst files, declares that each of those files describes a particular release of datatable. The format is as follows:
.. changelog::
:version: <version number>
:released: <release date>
:wheels: URL1
URL2
etc.
changelog content...
.. contributors::
N @username <full name>
--
N @username <full name>
The effect of this declaration is the following:
The title of the page is automatically inserted, together with an anchor that can be used to refer to this page;
A Wikipedia-style infobox is added on the right side of the page. This infobox contains the release date, links to the previous/next release, and links to all wheels that were released for that version. The wheels are grouped by python version / operating system. An sdist link may also be included as one of the “wheels”.
Within the .. changelog:: directive, a special form of list items is supported:

-[new] New feature that was added
-[enh] Improvement of an existing feature or function
-[fix] Bug fix
-[api] API change

In addition, if any such item ends with text of the form [#333], then this will be automatically converted into a link to the GitHub issue/PR with that number. A short example combining these markers is given below.

The .. contributors:: directive can only be used inside a changelog, and it should list the contributors who participated in the creation of this particular release. The list of contributors is prepared using the script ci/gh.py.
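For example, a changelog fragment combining these item markers with an issue reference might look like this (a made-up entry, shown only to illustrate the syntax):

-[new] Added function gcd() to compute the greatest common divisor of two columns. [#333]
-[fix] Fixed a crash when rbinding an empty frame. [#334]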
Documenting API¶
When it comes to documenting specific functions/classes/methods of the datatable module, we use another extension: .. xfunction:: (or .. xclass::, .. xmethod::, etc). This is because this part of the documentation is declared within the C++ code, so that it can be available from within a regular python session.
Inside the documentation tree, each function/method/etc that has to be documented is declared as follows:
.. xfunction:: datatable.rbind
:src: src/core/frame/rbind.cc py_rbind
:doc: src/core/frame/rbind.cc doc_py_rbind
:tests: tests/munging/test-rbind.py
Here we declare the function dt.rbind(), whose source code is located in the file src/core/frame/rbind.cc in the function py_rbind(). The docstring of this function is located in the same file in the variable static const char* doc_py_rbind. The content of the latter variable will be pre-processed and then rendered as RST. The :doc: parameter is optional; if omitted, the directive will attempt to find the docstring automatically.
The optional :tests:
parameter should point to a file where the tests for
this function are located. This will be included as a link in the rendered
output.
In order to document a getter/setter property of a class, use the following:
.. xdata:: datatable.Frame.key
:src: src/core/frame/key.cc Frame::get_key Frame::set_key
:doc: src/core/frame/key.cc doc_key
:tests: tests/test-keys.py
:settable: new_key
:deletable:
The :src: parameter can now accept two function names: the getter and the setter. In addition, the :settable: parameter gives the name of the setter value as it will be displayed in the docs. Lastly, :deletable: marks this class property as deletable.
The docstring of the function/method/etc is preprocessed before it is rendered into the RST document. This processing includes the following steps:
The “Parameters” section is parsed and the definitions of all function parameters are extracted.
The contents of the “Examples” section are parsed as if they were a literal block, converting from python-console format into jupyter-style code blocks. In addition, if the output of any command contains a datatable Frame, it will also be converted into a Jupyter-style table.
All other sections are displayed as-is.
Here’s an example of a docstring:
static const char* doc_rbind =
R"(rbind(self, *frames, force=False, bynames=True)
--
Append rows of `frames` to the current frame.
This method modifies the current frame in-place. If you do not want
the current frame modified, then use the :func:`dt.rbind()` function.
Parameters
----------
frames: Frame | List[Frame]
One or more frames to append.
force: bool
If True, then the frames are allowed to have mismatching set of
columns. Any gaps in the data will be filled with NAs.
bynames: bool
If True (default), the columns in frames are matched by their
names, otherwise by their order.
Examples
--------
>>> DT = dt.Frame(A=[1, 2, 3], B=[4, 7, 0])
>>> frame1 = dt.Frame(A=[-1], B=[None])
>>> DT.rbind(frame1)
>>> DT
| A B
-- + -- --
0 | 1 4
1 | 2 7
2 | 3 0
3 | -1 NA
--
[4 rows x 2 columns]
)";
Creating a new FExpr¶
The majority of functions available in the datatable module are implemented via the FExpr mechanism. These functions share a common API: they accept one or more FExprs (or fexpr-like objects) as arguments and produce an FExpr as the output. The resulting FExprs can then be used inside the DT[...] call to apply these expressions to a particular frame.
In this document we describe how to create such an FExpr-based function. In particular, we describe adding the gcd(a, b) function for computing the greatest common divisor of two integers.
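To give a sense of the end result before diving into the implementation, here is how the finished function would be used from python (a hypothetical session, for orientation only):

>>> from datatable import dt, f, gcd
>>> DT = dt.Frame(A=[12, 15, 42], B=[8, 10, 28])
>>> DT[:, gcd(f.A, f.B)]   # produces a one-column frame with values [4, 5, 14]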
C++ “backend” class¶
The core of the functionality will reside within a class derived from the class dt::expr::FExpr. So let’s create the file expr/fexpr_gcd.cc and declare the skeleton of our class:
#include "expr/fexpr_func.h"
#include "expr/eval_context.h"
#include "expr/workframe.h"
namespace dt {
namespace expr {
class FExpr_Gcd : public FExpr_Func {
private:
ptrExpr a_;
ptrExpr b_;
public:
FExpr_Gcd(ptrExpr&& a, ptrExpr&& b)
: a_(std::move(a)), b_(std::move(b)) {}
std::string repr() const override;
Workframe evaluate_n(EvalContext& ctx) const override;
};
}}
In this example we are inheriting from FExpr_Func, which is a slightly more specialized version of FExpr.
You can also see that the two arguments in gcd(a, b) are stored within the class as ptrExpr a_, b_. This ptrExpr is actually a typedef for std::shared_ptr<FExpr>, which means that the arguments to our FExpr are themselves FExprs.
The first method that needs to be implemented is repr(), which is more-or-less equivalent to python’s __repr__. The returned string should not have the name of the class in it; instead, it must be ready to be combined with the reprs of other expressions:
std::string repr() const override {
std::string out = "gcd(";
out += a_->repr();
out += ", ";
out += b_->repr();
out += ')';
return out;
}
We construct our repr out of the reprs of a_ and b_. They are joined with a comma, which has the lowest precedence in python. For some other FExprs we may need to take into account the precedence of the arguments as well, in order to properly set up parentheses around subexpressions.
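For instance, once the function is wired up to python (as described further below), the composed reprs would display along these lines; the FExpr<...> wrapper shown here is an assumption about how the generic machinery formats the result:

>>> gcd(f.A, f.B + 1)
FExpr<gcd(f.A, f.B + 1)>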
The second method to implement is evaluate_n(). The _n suffix here stands for “normal”. If you look into the source of the FExpr class, you’ll see that there are other evaluation methods too: evaluate_i(), evaluate_j(), etc. However, none of those are needed when implementing a simple function.
The method evaluate_n() takes an EvalContext object as its argument. This object contains information about the current evaluation environment. The output from evaluate_n() should be a Workframe object. A workframe can be thought of as a “work-in-progress” frame. In our case it is sufficient to treat it as a simple vector of columns.
We begin implementing evaluate_n() by evaluating the arguments a_ and b_ and then making sure that those frames are compatible with each other (i.e. have the same number of columns and rows). After that we compute the result by iterating through the columns of both frames and calling a simple method evaluate1(Column&&, Column&&) (that we still need to implement):
Workframe evaluate_n(EvalContext& ctx) const override {
Workframe awf = a_->evaluate_n(ctx);
Workframe bwf = b_->evaluate_n(ctx);
if (awf.ncols() == 1) awf.repeat_column(bwf.ncols());
if (bwf.ncols() == 1) bwf.repeat_column(awf.ncols());
if (awf.ncols() != bwf.ncols()) {
throw TypeError() << "Incompatible number of columns in " << repr()
<< ": the first argument has " << awf.ncols() << ", while the "
<< "second has " << bwf.ncols();
}
awf.sync_grouping_mode(bwf);
auto gmode = awf.get_grouping_mode();
Workframe outputs(ctx);
for (size_t i = 0; i < awf.ncols(); ++i) {
Column rescol = evaluate1(awf.retrieve_column(i),
bwf.retrieve_column(i));
outputs.add_column(std::move(rescol), std::string(), gmode);
}
return outputs;
}
The method evaluate1() will take a pair of columns and produce the output column containing the result of the gcd(a, b) calculation. We must take into account the stypes of both columns, and decide which stypes are acceptable for our function:
Column evaluate1(Column&& a, Column&& b) const {
SType stype1 = a.stype();
SType stype2 = b.stype();
SType stype0 = common_stype(stype1, stype2);
switch (stype0) {
case SType::BOOL:
case SType::INT8:
case SType::INT16:
case SType::INT32: return make<int32_t>(std::move(a), std::move(b), SType::INT32);
case SType::INT64: return make<int64_t>(std::move(a), std::move(b), SType::INT64);
default:
throw TypeError() << "Invalid columns of types " << stype1 << " and "
<< stype2 << " in " << repr();
}
}
template <typename T>
Column make(Column&& a, Column&& b, SType stype0) const {
a.cast_inplace(stype0);
b.cast_inplace(stype0);
return Column(new Column_Gcd<T>(std::move(a), std::move(b)));
}
As you can see, the job of the FExpr_Gcd class is to produce a workframe containing one or more Column_Gcd virtual columns. This is where the actual calculation of GCD values will take place, and we shall declare this class too. It can be done either in a separate file in the core/column/ folder, or inside the current file expr/fexpr_gcd.cc.
#include "column/virtual.h"
template <typename T>
class Column_Gcd : public Virtual_ColumnImpl {
private:
Column acol_;
Column bcol_;
public:
Column_Gcd(Column&& a, Column&& b)
: Virtual_ColumnImpl(a.nrows(), a.stype()),
acol_(std::move(a)), bcol_(std::move(b))
{
xassert(acol_.nrows() == bcol_.nrows());
xassert(acol_.stype() == bcol_.stype());
xassert(acol_.can_be_read_as<T>());
}
ColumnImpl* clone() const override {
return new Column_Gcd(Column(acol_), Column(bcol_));
}
  size_t n_children() const noexcept override { return 2; }
  const Column& child(size_t i) const override { return i==0? acol_ : bcol_; }

  bool get_element(size_t i, T* out) const override {
    T a, b;
    bool avalid = acol_.get_element(i, &a);
    bool bvalid = bcol_.get_element(i, &b);
    if (avalid && bvalid) {
      // Euclid's algorithm: repeatedly replace (a, b) with (b, a % b)
      // until b becomes zero; at that point a holds the GCD.
      while (b) {
        T tmp = b;
        b = a % b;
        a = tmp;
      }
      *out = a;
      return true;
    }
    return false;
  }
};
Python-facing gcd() function¶
Now that we have created the FExpr_Gcd class, we also need a python function responsible for creating these objects. This is done in 4 steps:
First, declare a function with the signature py::oobj(const py::XArgs&). The py::XArgs object here encapsulates all parameters that were passed to the function, and the function returns a py::oobj, which is a simple wrapper around python’s PyObject*.
static py::oobj py_gcd(const py::XArgs& args) {
auto a = args[0].to_oobj();
auto b = args[1].to_oobj();
return PyFExpr::make(new FExpr_Gcd(as_fexpr(a), as_fexpr(b)));
}
This function takes the python arguments, validates them and converts them into C++ objects if necessary, then creates a new FExpr_Gcd object, and returns it wrapped into a PyFExpr (the python equivalent of the generic FExpr class).
In the second step, we declare the signature and the docstring of this python function:
DECLARE_PYFN(&py_gcd)
->name("gcd")
->docs(dt::doc_gcd)
->arg_names({"a", "b"})
->n_positional_args(2)
->n_required_args(2);
The variable doc_gcd must be declared in the common “documentation.h” file:
extern const char* doc_gcd;
The actual documentation should be written in a separate .rst file (more on this later), and then it will be added into the code during the compilation stage via the auto-generated file “documentation.cc”.
At this point the function will be visible from python in the _datatable module. So the next step is to import it into the main datatable module. To do this, go to src/datatable/__init__.py and write:
from .lib._datatable import (
...
gcd,
...
)
...
__all__ = (
...
"gcd",
...
)
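After rebuilding the package, a quick sanity check could look like this (illustrative only):

>>> import datatable as dt
>>> from datatable import f
>>> expr = dt.gcd(f.A, f.B)   # should construct an FExpr without raising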
Tests¶
Any functionality must be properly tested. We recommend creating a dedicated test file for each new function. Thus, create the file tests/expr/test-gcd.py and add some tests to it. We use the pytest framework for testing. In this framework, each test is a single function (whose name starts with test_) which performs some actions and then asserts the validity of the results.
import pytest
import random
from datatable import dt, f, gcd
from tests import assert_equals # checks equality of Frames
from math import gcd as math_gcd
def test_equal_columns():
DT = dt.Frame(A=[1, 2, 3, 4, 5])
RES = DT[:, gcd(f.A, f.A)]
assert_equals(RES, dt.Frame([1, 1, 1, 1, 1]/dt.int32))
@pytest.mark.parametrize("seed", [random.getrandbits(63)])
def test_random(seed):
random.seed(seed)
n = 100
src1 = [random.randint(1, 1000) for i in range(n)]
src2 = [random.randint(1, 100) for i in range(n)]
DT = dt.Frame(A=src1, B=src2)
RES = DT[:, gcd(f.A, f.B)]
assert_equals(RES, dt.Frame([math_gcd(src1[i], src2[i])
for i in range(n)]))
When writing tests, try to cover any corner cases that you can think of. For example, what if one of the numbers is 0? Negative? Add tests for various column types, including invalid ones. A sketch of one such corner-case test is shown below.
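For instance, a corner-case test for zeros could look like this (a sketch reusing the imports from the test file above; it assumes gcd(0, 0) evaluates to 0, matching the behavior of math.gcd):

def test_gcd_zeros():
    DT = dt.Frame(A=[0, 0, 5], B=[0, 3, 0])
    RES = DT[:, gcd(f.A, f.B)]
    # gcd(0, 0) == 0, gcd(0, 3) == 3, gcd(5, 0) == 5
    assert_equals(RES, dt.Frame([0, 3, 5]/dt.int32))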
Documentation¶
The final piece of the puzzle is the documentation. We’ve already created the variable doc_gcd earlier, which will ensure that the documentation is visible from python when you run help(gcd). However, the primary place where people look for documentation is the dedicated readthedocs website, and this is where we will be adding the actual content.
So, create the file docs/api/dt/gcd.rst. The content of the file could be something like this:
.. xfunction:: datatable.gcd
:src: src/core/fexpr/fexpr_gcd.cc py_gcd
:tests: tests/expr/test-gcd.py
:cvar: doc_gcd
:signature: gcd(a, b)
Compute the greatest common divisor of `a` and `b`.
Parameters
----------
a, b: FExpr
Only integer columns are supported.
return: FExpr
The returned column will have stype int64 if either `a` or `b` are
of type int64, or otherwise it will be int32.
In these lines we declare:
the name of the function which provides the gcd functionality (this is presented to the user as the “src” link in the generated docs);
the name of the file dedicated to testing this functionality; this will also become a link in the generated documentation;
the name of the C variable declared in “documentation.h” which should be given a copy of the documentation, so that it can be embedded into python;
the main signature of the function: its name and parameters (with defaults if necessary).
This RST file now needs to be added to the toctree: open the file docs/api/index-api.rst and add it into the .. toctree:: list at the bottom, and also add it to the table of all functions.
Lastly, open docs/releases/v{LATEST}.rst (this is our changelog) and write a brief paragraph about the new function:
Frame
-----
...
-[new] Added new function :func:`gcd()` to compute the greatest common
divisor of two columns. [#NNNN]
The [#NNNN] is a link to the GitHub issue where the gcd() function was requested.
Submodules¶
Some functions are declared within submodules of the datatable module. For example, math-related functions can be found in dt.math, string functions in dt.str, etc. Declaring such functions is not much different from what is described above. For example, if we wanted our gcd() function to be in the dt.math submodule, we’d make the following changes:
Create the file expr/math/fexpr_gcd.cc instead of expr/fexpr_gcd.cc;
Instead of importing the function in src/datatable/__init__.py, we’d import it from src/datatable/math.py (see the sketch below);
The test file name would be tests/math/test-gcd.py instead of tests/expr/test-gcd.py;
The doc file name would be docs/api/math/gcd.rst instead of docs/api/dt/gcd.rst, and it should be added to the toctree in docs/api/math.rst.
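For illustration, the corresponding import in src/datatable/math.py might look like this (a sketch only; the actual layout of that file may differ):

from .lib._datatable import (
    ...
    gcd,
    ...
)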
Test page¶
This is a test page; it has no useful content. Its sole purpose is to collect various elements of markup to ensure that they render properly in the current theme. For developers, we recommend visually checking this page after any tweak to the accompanying CSS files.
If you notice any visual irregularities somewhere else within the documentation, please add those examples to this file as a kind of “manual test”.
Inline markup¶
Bold text is not actually courageous, it merely looks thicker.
Partially bold text;
Italic text is still English, except the letters are slanted.
Literal text is no more literal than any other text, but uses the monospace font. ABCs, or ABC’s?
Ctrl+Alt+Del is a keyboard shortcut (the :kbd: role)
subscript text can be used if you need to go low (:sub:)
superscript text but if they go low, we go high! (:sup:)
label may come in handy too (:guilabel:)
The smartquotes Sphinx plugin is responsible for converting “straight” quotes ("") into “typographic” quotes (“”). Similarly for the ‘single’ quotes ('' into ‘’). Don’t forget about single quotes in the middle of a word, or at the end’ of a word. Lastly, double-dash (--) should be rendered as an n-dash – like this, and triple-dash (---) as an m-dash — like this.
Hyperlinks may come in a variety of different flavors. In particular, links leading outside of this website must have a clear indicator that they are external. The internal links should not have such an indicator:
A plain hyperlink: https://www.example.com/
A labeled hyperlink: example website
Link to a PEP document: PEP 574 (use the :PEP: role)
Link to a target within this page: Bill of Rights
Link to a target on another page: join(…)
Links to API objects: Frame, fread(), dt.math.tau
Headers¶
This section is dedicated to headers of various levels. Note that there can be only one level-1 header (at the top of the page). All other headers are therefore level-2 or smaller. At the same time, headers below level 4 are not supported.
Sub-header A¶
Paragraph within a subheader. Please check that the spacing between the headers and the text looks reasonable.
Sub-sub header A.1¶
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Sub-sub header A.2¶
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
Sub-header B¶
Nothing to see here, move along citizen.
Lists¶
Embedding lists into text may be somewhat tricky; for example, here is a list that contains a short enumeration of items. It is supposed to be rendered in a “compact” style (.simple in CSS):
one
two
three
Same list, but ordered:
one
two
three
Finally, a more complicated list that is still considered “simple” by docutils
(see SimpleListChecker
: a list is simple if every list item contains either
a single paragraph, or a paragraph followed by a simple list). Here we exhibit
four variants of the same list, altering ordered/unordered property:
[Four renderings of the same nested list appear here, covering each combination of ordered/unordered outer and inner lists.]
Compare this to the following list, which is supposed to be rendered with more spacing between the elements:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
Vitae purus faucibus ornare suspendisse sed. Sit amet mauris commodo quis imperdiet. Id velit ut tortor pretium viverra suspendisse potenti nullam ac. Enim eu turpis egestas pretium aenean.
Neque laoreet suspendisse interdum consectetur libero. Tellus elementum sagittis vitae et leo duis ut. Vel pretium lectus quam id leo. Eget nunc scelerisque viverra mauris in. Integer enim neque volutpat ac tincidunt vitae semper quis lectus. Urna molestie at elementum eu facilisis sed.
Molestie at elementum eu facilisis sed. Nisi vitae suscipit tellus mauris a diam maecenas sed enim. Morbi tincidunt ornare massa eget egestas. Condimentum lacinia quis vel eros. Viverra accumsan in nisl nisi scelerisque. Lorem sed risus ultricies tristique. Phasellus egestas tellus rutrum tellus pellentesque eu tincidunt tortor aliquam. Semper feugiat nibh sed pulvinar. Quis hendrerit dolor magna eget est lorem ipsum dolor.
Amet commodo nulla facilisi nullam vehicula ipsum a arcu cursus. Pellentesque elit eget gravida cum sociis natoque. Sit amet risus nullam eget felis eget nunc lobortis mattis.
Tellus rutrum tellus pellentesque eu tincidunt tortor. Eget arcu dictum varius duis. Eleifend mi in nulla posuere sollicitudin aliquam ultrices sagittis orci.
Ut ornare lectus sit amet est placerat in. Leo urna molestie at elementum. At auctor urna nunc id. Risus at ultrices mi tempus imperdiet nulla malesuada.
The next section demonstrates how different kinds of lists nest within each other.
Bill of Rights¶
The Conventions of a number of the States having at the time of their adopting the Constitution, expressed a desire, in order to prevent misconstruction or abuse of its powers, that further declaratory and restrictive clauses should be added: And as extending the ground of public confidence in the Government, will best insure the beneficent ends of its institution
Resolved by the Senate and House of Representatives of the United States of America, in Congress assembled, two thirds of both Houses concurring, that the following Articles be proposed to the Legislatures of the several States, as Amendments to the Constitution of the United States, all or any of which Articles, when ratified by three fourths of the said Legislatures, to be valid to all intents and purposes, as part of the said Constitution; viz.:
Articles in addition to, and Amendment of the Constitution of the United States of America, proposed by Congress, and ratified by the Legislatures of the several States, pursuant to the fifth Article of the original Constitution.
Congress shall make no law respecting
an establishment of religion, or prohibiting the free exercise thereof; or
abridging the freedom of speech, or of the press; or
the right of the people peaceably to assemble, and to petition the Government for a redress of grievances.
A well regulated Militia, being necessary to the security of a free State, the right of the people to keep and bear Arms, shall not be infringed.
No Soldier shall, in time of peace be quartered in any house, without the consent of the Owner, nor in time of war, but in a manner to be prescribed by law.
The right of the people to be secure in their persons, houses, papers, and effects, against unreasonable searches and seizures, shall not be violated, and no Warrants shall issue, but upon probable cause, supported by Oath or affirmation, and particularly describing the place to be searched, and the persons or things to be seized.
No person shall be
held to answer for a capital, or otherwise infamous crime, unless on a presentment or indictment of a Grand Jury, except in cases arising in the land or naval forces, or in the Militia, when in actual service in time of War or public danger; nor shall any person be
subject for the same offence to be twice put in jeopardy of life or limb; nor shall be
compelled in any criminal case to be a witness against himself, nor be
deprived of
life,
liberty, or
property,
without due process of law;
nor shall private property be taken for public use, without just compensation.
In all criminal prosecutions, the accused shall enjoy the right to a speedy and public trial, by an impartial jury of the State and district wherein the crime shall have been committed, which district shall have been previously ascertained by law, and to be informed of the nature and cause of the accusation; to be confronted with the witnesses against him; to have compulsory process for obtaining witnesses in his favor, and to have the Assistance of Counsel for his defence.
In Suits at common law, where the value in controversy shall exceed twenty dollars, the right of trial by jury shall be preserved, and no fact tried by a jury, shall be otherwise re-examined in any Court of the United States, than according to the rules of the common law.
Excessive bail shall not be required, nor excessive fines imposed, nor cruel and unusual punishments inflicted.
The enumeration in the Constitution, of certain rights, shall not be construed to deny or disparage others retained by the people.
The powers not delegated to the United States by the Constitution, nor prohibited by it to the States, are reserved to the States respectively, or to the people.
Code samples¶
Literal block after a paragraph. The spacing between this text and the code block below should be small, similar to regular spacing between lines:
import datatable as dt
DT = dt.Frame(A = [3, 1, 4, 1, 5])
DT.shape
repr(DT)
# This is how a simple frame would be rendered:
DT
   | A
-- + --
 0 | 3
 1 | 1
 2 | 4
 3 | 1
 4 | 5
DT + DT
This is a paragraph after the code block. The spacing should be roughly the same as between regular paragraphs.
And here’s an example with a keyed frame:
DT = dt.Frame({"A": [1, 2, 3, 4, 5],
"B": [4, 5, 6, 7, 8],
"C": [7, 8, 9, 10, 11],
"D": [5, 7, 2, 9, -1],
"E": ['a','b','c','d','e']})
DT.key = ['E', 'D']
DT
E      D     | A      B      C
str32  int32 | int32  int32  int32
------ ----- + ------ ------ ------
a      5     | 1      4      7
b      7     | 2      5      8
c      2     | 3      6      9
d      9     | 4      7      10
e      -1    | 5      8      11
The following is a test for multi-line output from code samples:
for i in range(5):
print(1/(4 - i))
The following is a plain piece of python code (i.e. without input/output sections):
#!/usr/bin/python
import everything as nothing
class MyClass(object):
r"""
Just some sample code
"""
def __init__(self, param1, param2):
assert isinstance(param1, str)
self.x = param1.lower() + "!!!"
self.y = param2
@classmethod
def enjoy(cls, item):
print(str(cls) + " likes " + item)
if __name__ == "__main__":
data = [MyClass('abc', 2)]
data += [1, 123, -14, +297, 2_300_000]
data += [True, False, None]
data += [0x123, 0o123, 0b111]
data += [2.71, 1.23e+45, -1.0001e-11, -math.inf]
data += ['abc', "def", """ghijk""", b"lmnop"]
data += [f"This is an f-string {len(data)}.\n"]
data += [r"\w+\n?\x0280\\ [abc123]+$",
"\w+\n?\x0280\\ [abc123]+$",]
data += [..., Ellipsis]
# This cannot happen:
if data and not data:
assert AssertionError
Languages other than python are also supported. For example, the following
is a shell code sample (console
“language”):
$ # list all files in a directory
$ ls -l
total 48
-rw-r--r-- 1 pasha staff 804B Dec 10 09:14 Makefile
drwxr-xr-x 24 pasha staff 768B Dec 10 09:14 api/
-rw-r--r-- 1 pasha staff 7.1K Dec 11 14:10 conf.py
drwxr-xr-x 7 pasha staff 224B Dec 10 09:14 develop/
-rw-r--r-- 1 pasha staff 62B Jul 29 14:02 docutils.conf
-rw-r--r-- 1 pasha staff 4.2K Dec 10 09:14 index.rst
drwxr-xr-x 12 pasha staff 384B Dec 10 09:14 manual/
drwxr-xr-x 19 pasha staff 608B Dec 10 13:19 releases/
drwxr-xr-x 6 pasha staff 192B Dec 10 13:19 start/
$ # Here are some more advanced syntax elements:
$ echo "PYTHONHOME = $PYTHONHOME"
$ export TMP=/tmp
$ docker run -it --init -v `pwd`:/cwd ${DOCKER_CONTAINER}
RST code sample:
Heading
+++++++
Da-da-da-daaaaa, **BOOM**.
Da-da-da-da-da-dA-da-da-da-da-da-da-da-da,
DA-DA-DA-DA, DA-DA DA Da, BOOM,
DAA-DA-DA-*DAAAAAAAAAAAA*, ty-dum
- item 1 (:func:`foo() <dt.foo>`);
- item 2;
- ``item 3`` is a :ref:`reference`.
.. custom-directive-here:: not ok
:option1: value1
:option2: value2
Here be dragons
.. _`plain target`:
There could be other |markup| elements too, such as `links`_, and
various inline roles, eg :sup:`what's up!`. Plain URLs probably
won't be highlighted: http://datatable.com/.
..
just a comment |here|
.. yep:: still a comment
.. _`example website`: https://example.com/
C++ code sample:
#include <cstring>
int main() {
return 0;
}
SQL code sample:
SELECT * FROM students WHERE name='Robert'; DROP TABLE students; --'
Special care must be taken in case the code in the samples has very long line lengths. Generally, the code should not overflow its container block, or make the page wider than normal. Instead, the code block should get a horizontal scroll bar:
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday", "Extra Sunday", "Fourth Weekend"]
print(days * 2)
Same, but for a non-python code:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
Admonitions¶
First, the .. note::
block, which should display in a prominent box:
Note
Please pay attention!
There’s usually some other content coming after the note. It should be properly spaced from the admonition.
Note
Here’s a note with more content. Why does it have so much content? Nobody knows. In theory admonitions should be short and to the point, but this one is not playing by the rules. It just goes on and on and on and on, and it seems like it would never end. Even as you think that maybe at last the end is near, a new paragraph suddenly appears:
And that new paragraph just repeats the same nonsense all over again. Really, there is no good reason for it to keep going, but it does nevertheless, as if trying to stretch the limits of how many words can be spilled without any of them making any sense.
Note
First, this is a note with a list
Second, it may contain several list items
Third is here just to make things look more impressive
Fourth is the last one (or is it?)
X-versions¶
The .. x-version-added:: directive usually comes as the first item after a header, so it has reduced margins at the top, and has to have adequate margins at the bottom to compensate.
The .. x-version-changed:: directive is paragraph-level, and it usually has some additional content describing what exactly has changed.
Nobody knows what exactly changed, but most observers agree that something did.
While we’re trying to figure out what it was, please visit the release notes (linked above) and see if it makes sense to you.
Release History¶
Version 0.2.1¶
Release date: 2017-09-11
General¶
Created the CHANGELOG file.
sys.getsizeof(DT) can now be used to query the size of the datatable in memory.
Added a framework for computing and storing per-column summary statistics.
Implemented statistics min, max, mean, stdev, countna for numeric and boolean columns.
Getter df.internal.rowindex allows access to the RowIndex on the DataTable (for inspection/reuse).
In addition to the LLVM4 environmental variable, datatable will now also look for the llvm4 folder within the package’s directory.
If d0 is a DataTable, then d1 = DataTable(d0) will create its shallow copy.
Environmental variable DTNOOPENMP will cause the datatable to be built without OpenMP support.
Filter function when applied to a view DataTable now produces correct result.
Contributors¶
This page lists all people who have contributed to the development of datatable. We take into account both code and documentation contributions, as well as contributions in the form of bug reports and feature requests.
More specifically, a code contribution is considered any PR (pull request) that was merged into the codebase. The “complexity” of the PR is not taken into account as it is highly subjective. Next, an issue contribution is any closed issue except for those that are tagged as “question”, “wont-fix” or “cannot-reproduce”. Issues are attributed according to their closing date, not their creation date.
In the table, the contributors are sorted according to their total contribution score, which is the weighted sum of the count of each user’s code and issue contributions. Code contributions have more weight than issue contributions, and more recent contributions more weight than the older ones.
Developer’s note: This table is auto-generated based on the contributor lists in each of the version files, specified via the .. contributors:: directive. In turn, the list of contributors for each version has to be generated via the script ci/gh.py at the time of each release. The issues/PRs will be filtered according to their milestone; thus, issues/PRs that are not tagged with any milestone will not be taken into account.