Datatable is a python library for manipulating tabular data. It supports out-of-memory datasets, multi-threaded data processing, and a flexible API.

Getting Started

Installation

This page describes how to install datatable on various systems.

Prerequisites

Python 3.6+ is required. Generally, we will support each version of Python until its official end of life. You can verify your python version via

$ python --version
Python 3.6.6

In addition, we recommend using pip version 20.3+, especially if you're planning to install datatable from source, or if you are on a Unix machine.

$ pip install pip --upgrade
Collecting pip
  Using cached pip-21.1.2-py3-none-any.whl (1.5 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 20.2.4
    Uninstalling pip-20.2.4:
      Successfully uninstalled pip-20.2.4
Successfully installed pip-21.1.2

There are no other prerequisites. Datatable does not depend on any other python module [1], nor on any non-standard system library.

Basic installation

On most platforms datatable can be installed directly from PyPI using pip:

$ pip install datatable

The following platforms are supported:

  • macOS

    Datatable has been tested to work on macOS 10.12.5 (Sierra), macOS 10.13.6 (High Sierra), macOS 10.15.7 (Catalina), and macOS 11.2.3 (Big Sur). The produced wheels are tagged as macosx_10_9, so they should work on earlier versions of macOS as well.

  • Linux x86_64 / ppc64le

    We produce binary wheels that are tagged as manylinux_2_12 (for x86_64 architecture) and manylinux2014 (for ppc64le). Consequently, they will work with your Linux distribution if it is compatible with one of these tags. Please refer to PEP 600 for details.

  • Windows

    Windows wheels are available for Windows 10 or later.

Install latest dev version

If you wish to test the latest version of datatable before it has been officially released, then you can use one of the binary wheels that we build as part of our Continuous Integration process.

If you are on Windows, then pre-built wheels are available on AppVeyor. Click on a green main build of your choice, then navigate to the “Artifacts” tab, copy the wheel URL that corresponds to your Python version, and finally install it as:

C:\> pip install YOUR_WHEEL_URL

For macOS and Linux, development wheels can be found at our S3 repository. Scroll to the bottom of the page to find the latest links, and then download or copy the URL of a wheel that corresponds to your Python version and platform. This wheel can be installed with pip as usual:

$ pip install YOUR_WHEEL_URL

Alternatively, you can instruct pip to go to that repository directly and find the latest version automatically:

$ pip install --trusted-host h2o-release.s3-website-us-east-1.amazonaws.com \
      -i http://h2o-release.s3-website-us-east-1.amazonaws.com/ datatable

Build from source

In order to build and install the latest development version of datatable directly from GitHub, run the following command:

$ pip install git+https://github.com/h2oai/datatable

Since datatable is written mostly in C++, your computer must be set up for compiling C++ code. The build script will attempt to find the compiler automatically, searching for GCC or Clang (or MSVC on Windows). If it fails, or if you want to use some other compiler, then set the environment variable CXX before building the code.
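For example, to build with a particular compiler you can prepend the CXX variable to the install command (the compiler path below is illustrative):

$ CXX=/usr/bin/clang++ pip install git+https://github.com/h2oai/datatable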

Datatable uses the C++14 language standard, which means you must use a compiler that fully implements this standard. The following compiler versions are known to work:

  • Clang 5+;

  • GCC 6+;

  • MSVC 19.14+.

Install datatable in editable mode

If you want to tweak certain features of datatable, or even add your own functionality, you are welcome to do so. This section describes how to install datatable for development.

  1. First, you need to fork the repository and then clone it locally:

    $ git clone https://github.com/your_user_name/datatable
    $ cd datatable
  2. Build the _datatable core library. The two most common options are:

    $ # build a "production mode" datatable
    $ make build

    $ # build datatable in "debug" mode, without optimizations and with
    $ # internal asserts enabled
    $ make debug

    Note that you would need to have a C++ compiler in order to compile and link the code. Please refer to the previous section for compiler requirements.

    On macOS you may also need to install Xcode Command Line Tools.

    On Linux, if you see an error that the 'Python.h' file cannot be found, it means you need to install a “development” version of Python, i.e. one that includes the python header files.

  3. After the previous step succeeds, you will have a _datatable.*.so file in the src/datatable/lib folder. Now, in order to make datatable usable from Python, run

    $ echo "`pwd`/src" >> ${VIRTUAL_ENV}/lib/python*/site-packages/easy-install.pth

    (This assumes that you are using a virtualenv-based python. If not, then you’ll need to adjust the path to your python’s site-packages directory).
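    If you are not sure where your python's site-packages directory is located, python itself can tell you (a generic query, not specific to datatable):

    $ python -c "import site; print(site.getsitepackages())"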

  4. Install additional libraries that are needed to test datatable:

    $ pip install -r requirements_tests.txt
    $ pip install -r requirements_extra.txt
    $ pip install -r requirements_docs.txt
  5. Check that everything works correctly by running the test suite:

    $ make test

Once these steps are completed, the subsequent development process is much simpler. After any change to C++ files, re-run make build (or make debug) and then restart python for the changes to take effect.

Datatable only recompiles the files that were modified since the last build, which means the compile step usually takes only a few seconds. Also note that you can switch between the “build” and “debug” versions of the library without performing make clean.

Troubleshooting

Despite our best efforts to keep the installation process hassle-free, sometimes problems may still arise. Here we list some of the more frequent ones, together with known resolutions. If none of these help you, please ask a question on StackOverflow (tagging with [py-datatable]), or file an issue on GitHub.

pip._vendor.pep517.wrappers.BackendUnavailable

This error occurs when you have an old version of pip in your environment. Upgrade pip to version 20.3+ and the error should disappear.

ImportError: cannot import name '_datatable'

This means the internal core library _datatable.*.so is either missing entirely, is in the wrong location, or has the wrong name. The first step is therefore to find where that file actually is. Use the system find tool, limiting the search to your python directory.
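For example, on a Unix machine (the exact location will differ from system to system):

$ find "$(python -c 'import sys; print(sys.prefix)')" -name '_datatable*'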

If the file is missing entirely, then it was either deleted, or installation used a broken wheel file. In either case, the only solution is to rebuild or reinstall the library completely.

If the file is present but not within the site-packages/datatable/lib/ directory, then moving it there should solve the issue.

If the file is present and is in the correct directory, then there must be a name conflict. In python run:

import sysconfig
sysconfig.get_config_var("SOABI")
'cpython-36m-ppc64le-linux-gnu'

The reported suffix should match the suffix of the _datatable.*.so file. If it doesn’t, then renaming the file will fix the problem.

Python.h: no such file or directory when compiling from source

Your Python distribution was shipped without the Python.h header file. This has been observed on certain Linux machines. You would need to install a Python package with a -dev suffix, for example python3.6-dev.

fatal error: 'sys/mman.h' file not found on macOS

In order to compile from source on mac computers, you need to have Xcode Command Line Tools installed. Run:

$ xcode-select --install

ImportError: This package should not be accessible

The most likely cause of this error is a misconfigured PYTHONPATH environment variable. Unset that variable and try again.
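For example, in a Unix shell:

$ unset PYTHONPATH
$ python -c "import datatable"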

Footnotes

[1] Since version v0.11.0.

Getting started

Install datatable

Let’s begin by installing the latest stable version of datatable from PyPI:

$ pip install datatable

If this didn’t work for you, or if you want to install the bleeding edge version of the library, please check the Installation page.

Assuming the installation was successful, you can now import the library in a JupyterLab notebook or in a Python console:

import datatable as dt
print(dt.__version__)
1.0.0

Loading data

The fundamental unit of analysis in datatable is a data Frame. It is the same notion as a pandas DataFrame or SQL table: data arranged in a two-dimensional array with rows and columns.

You can create a Frame object from a variety of data sources: from a python list or dictionary, from a numpy array, or from a pandas DataFrame:

import math

DT1 = dt.Frame(A=range(5),
               B=[1.7, 3.4, 0, None, -math.inf],
               stypes={"A": dt.int64})
DT2 = dt.Frame(pandas_dataframe)
DT3 = dt.Frame(numpy_array)

You can also load a CSV/text/Excel file, or open a previously saved binary .jay file:

DT4 = dt.fread("~/Downloads/dataset_01.csv")
DT5 = dt.open("data.jay")

The fread() function shown above is both powerful and extremely fast. It can automatically detect parse parameters for the majority of text files, load data from .zip archives or URLs, read Excel files, and much more.

Data manipulation

Once the data is loaded into a Frame, you may want to do certain operations with it: extract/remove/modify subsets of the data, perform calculations, reshape, group, join with other datasets, etc. In datatable, the primary vehicle for all these operations is the square-bracket notation inspired by traditional matrix indexing but overcharged with power (this notation was pioneered in R data.table and is the main axis of intersection between these two libraries).

In short, almost all operations with a Frame can be expressed as

DT[i, j, ...]

where i is the row selector, j is the column selector, and ... indicates that additional modifiers might be added. If this looks familiar to you, that’s because it is. Exactly the same DT[i, j] notation is used in mathematics when indexing matrices, in C/C++, in R, in pandas, in numpy, etc. The only difference that datatable introduces is that it allows i to be anything that can conceivably be interpreted as a row selector: an integer to select just one row, a slice, a range, a list of integers, a list of slices, an expression, a boolean-valued Frame, an integer-valued Frame, an integer numpy array, a generator, and so on.
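For illustration, here are a few i selectors; the frame DT and its column A are hypothetical:

DT[0, :]            # the first row
DT[-1, :]           # the last row
DT[2:5, :]          # rows 2, 3 and 4
DT[[0, 3, 7], :]    # rows 0, 3 and 7
DT[f.A > 100, :]    # rows where the expression is true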

The j column selector is even more versatile. In the simplest case, you can select just a single column by its index or name. But also accepted are a list of columns, a slice, a string slice (of the form "A":"Z"), a list of booleans indicating which columns to pick, an expression, a list of expressions, and a dictionary of expressions. (The keys will be used as new names for the columns being selected.) The j expression can even be a python type (such as int or dt.float32), selecting all columns matching that type.
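A few j selectors, for illustration (the column names A, B, C are hypothetical):

DT[:, "A"]                    # one column, by name
DT[:, 2]                      # one column, by index
DT[:, ["A", "B"]]             # a list of columns
DT[:, "A":"C"]                # a string slice
DT[:, {"total": f.A + f.B}]   # a dict: new column "total" from an expression
DT[:, int]                    # all integer columns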

In addition to the selector expression shown above, we support the update and delete statements too:

DT[i, j] = r
del DT[i, j]

The first expression will replace values in the subset [i, j] of Frame DT with the values from r, which could be either a constant, or a suitably-sized Frame, or an expression that operates on frame DT.

The second expression deletes values in the subset [i, j]. This is interpreted as follows: if i selects all rows, then the columns given by j are removed from the Frame; if j selects all columns, then the rows given by i are removed; if neither i nor j span all rows/columns of the Frame, then the elements in the subset [i, j] are replaced with NAs.
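A short sketch of these statements (the column names are hypothetical):

DT[f.A < 0, "A"] = 0     # replace negative values of A with 0
DT[:, "B"] = f.A * 2     # create or update column B from column A
del DT[:, "C"]           # i selects all rows: column C is removed
del DT[f.A == None, :]   # j selects all columns: rows where A is NA are removed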

What the f.?

You may have noticed already that we mentioned several times the possibility of using expressions in i or j and in other places. In the simplest form an expression looks like:

f.ColA

which indicates a column ColA in some Frame. Here f is a variable that has to be imported from the datatable module. This variable provides a convenient way to reference any column in a Frame. In addition to the notation above, the following is also supported:

f[3]
f["ColB"]

denoting the fourth column and the column ColB respectively.

These f-expressions support arithmetic operations as well as various mathematical and aggregate functions. For example, in order to select the values from column A normalized to the range [0; 1], we can write the following:

from datatable import f, min, max
DT[:, (f.A - min(f.A)) / (max(f.A) - min(f.A))]

This is equivalent to the following SQL query:

SELECT (f.A - MIN(f.A))/(MAX(f.A) - MIN(f.A)) FROM DT AS f

So, what exactly is f? We call it a “frame proxy”, as it becomes a simple way to refer to the Frame that we currently operate on. More precisely, whenever DT[i, j] is evaluated and we encounter an f-expression there, that f is replaced with the frame DT, and the columns are looked up on that Frame. The same expression can later be applied to a different Frame, and it will refer to the columns of that other Frame.

At some point you may notice that datatable also exports the symbol g. This g is also a frame proxy; however, it refers to the second frame in the evaluated expression. This second frame appears when you join two or more frames together (more on that later). When that happens, the symbol g is used to refer to the columns of the joined frame.

Groupbys/joins

In the Data Manipulation section we mentioned that the DT[i, j, ...] selector can take zero or more modifiers, which we denoted as “...”. The available modifiers are by(), join() and sort(). Thus, the full form of the square-bracket selector is:

DT[i, j, by(), sort(), join()]

by(…)

This modifier splits the frame into groups by the provided column(s), and then applies i and j within each group. This mostly affects aggregator functions such as sum(), min() or sd(), but may also apply in other circumstances. For example, if i is a slice that takes the first 5 rows of a frame, then in the presence of the by() modifier it will take the first 5 rows of each group.
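For instance, a sketch of the slice case (the column name is illustrative):

from datatable import f, by
DT[:5, :, by(f.product_id)]   # the first 5 rows of each product_id group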

For example, in order to find the total amount of each product sold, write:

from datatable import f, by, sum
DT = dt.fread("transactions.csv")
DT[:, sum(f.quantity), by(f.product_id)]

sort(…)

This modifier controls the order of the rows in the result, much like the SQL clause ORDER BY. If used in conjunction with by(), it will order the rows within each group.
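A sketch (the column names are illustrative):

from datatable import f, by, sort
DT[:, :, sort(f.quantity)]                     # order all rows by quantity
DT[:, :, by(f.product_id), sort(f.quantity)]   # order rows within each group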

join(…)

As the name suggests, this operator lets you join another frame to the current one, similar to the SQL JOIN operator. Currently we support only left outer joins.

In order to join frame X, it must be keyed. A keyed frame is conceptually similar to a SQL table with a unique primary key. This key may be either a single column, or several columns:

X.key = "id"

Once a frame is keyed, it can be joined to another frame DT, provided that DT has the column(s) with the same name(s) as the key in X:

DT[:, :, join(X)]

This has the semantics of a natural left outer join. The X frame can be considered as a dictionary, where the key column contains the keys and all other columns are the corresponding values. During the join, each row of DT is matched against the row of X with the same value of the key column, or, if there is no such value in X, against an all-NA row.

The columns of the joined frame can be used in expressions using the g. prefix, for example:

DT[:, sum(f.quantity * g.price), join(products)]

Note

In the future, we will expand the syntax of the join operator to allow other kinds of joins and also to remove the limitation that only keyed frames can be joined.

Offloading data

Just as our work has started with loading some data into datatable, eventually you will want to do the opposite: store or move the data somewhere else. We support multiple mechanisms for this.

First, the data can be converted into a pandas DataFrame or into a numpy array (obviously, you need to have the pandas or numpy library installed):

DT.to_pandas()
DT.to_numpy()

A frame can also be converted into python native data structures: a dictionary, keyed by the column names; a list of columns, where each column is itself a list of values; or a list of rows, where each row is a tuple of values:

DT.to_dict()
DT.to_list()
DT.to_tuples()

You can also save a frame into a CSV file, or into a binary .jay file:

DT.to_csv("out.csv")
DT.to_jay("data.jay")

Using datatable

This section describes common functionality and commands that you can run in datatable.

Create Frame

You can create a Frame from a variety of sources, including numpy arrays, pandas DataFrames, raw Python objects, etc:

import datatable as dt
import numpy as np
np.random.seed(1)
dt.Frame(np.random.randn(1000000))

          C0
          float64
      0    1.62435
      1   -0.611756
      2   -0.528172
      3   -1.07297
      4    0.865408
      5   -2.30154
      6    1.74481
      7   -0.761207
      8    0.319039
      9   -0.24937
     10    1.46211
     11   -2.06014
     12   -0.322417
     13   -0.384054
     14    1.13377
      …    …
 999995    0.0595784
 999996    0.140349
 999997   -0.596161
 999998    1.18604
 999999    0.313398
import pandas as pd
pf = pd.DataFrame({"A": range(1000)})
dt.Frame(pf)

       A
       int64
   0   0
   1   1
   2   2
   3   3
   4   4
   5   5
   6   6
   7   7
   8   8
   9   9
  10   10
  11   11
  12   12
  13   13
  14   14
   …   …
 995   995
 996   996
 997   997
 998   998
 999   999
dt.Frame({"n": [1, 3], "s": ["foo", "bar"]})
     n      s
     int32  str32
 0   1      foo
 1   3      bar

Convert a Frame

Convert an existing Frame into a numpy array, a pandas DataFrame, or a pure Python object:

nparr = DT.to_numpy()
pddfr = DT.to_pandas()
pyobj = DT.to_list()

Parse Text (csv) Files

datatable provides fast and convenient parsing of text (csv) files:

DT = dt.fread("train.csv")

The datatable parser

  • Automatically detects separators, headers, column types, quoting rules, etc.

  • Reads from files, URLs, shell commands, raw text, archives, and glob patterns

  • Provides multi-threaded file reading for maximum speed

  • Includes a progress indicator when reading large files

  • Reads both RFC4180-compliant and non-compliant files

Write the Frame

Write the Frame’s content into a csv file (also multi-threaded):

DT.to_csv("out.csv")

Save a Frame

Save a Frame into a binary format on disk, then open it later instantly, regardless of the data size:

DT.to_jay("out.jay")
DT2 = dt.open("out.jay")

Basic Frame Properties

Basic Frame properties include:

print(DT.shape)    # (nrows, ncols)
print(DT.names)    # column names
print(DT.stypes)   # column types

Compute Per-Column Summary Stats

Compute per-column summary stats using:

DT.sum()
DT.max()
DT.min()
DT.mean()
DT.sd()
DT.mode()
DT.nmodal()
DT.nunique()

Select Subsets of Rows/Columns

Select subsets of rows and/or columns using:

DT[:, "A"]          # select 1 column
DT[:10, :]          # first 10 rows
DT[::-1, "A":"D"]   # reverse rows order, columns from A to D
DT[27, 3]           # single element in row 27, column 3 (0-based)

Delete Rows/Columns

Delete rows and/or columns using:

del DT[:, "D"]        # delete column D
del DT[f.A < 0, :]    # delete rows where column A has negative values

Filter Rows

Filter rows via an expression. In the following example mean, sd, and f are all symbols imported from datatable:

DT[(f.x > mean(f.y) + 2.5 * sd(f.y)) | (f.x < -mean(f.y) - sd(f.y)), :]

Compute Columnar Expressions

Compute columnar expressions using:

DT[:, {"x": f.x, "y": f.y, "x+y": f.x + f.y, "x-y": f.x - f.y}]

Sort Columns

Sort columns using:

DT.sort("A")
DT[:, :, sort(f.A)]

Perform Groupby Calculations

Perform groupby calculations using:

DT[:, mean(f.x), by("y")]

Append Rows/Columns

Append columns or rows to a Frame using Frame.cbind() and Frame.rbind():

DT1.cbind(DT2, DT3)
DT1.rbind(DT4, force=True)

User Guide

Name mangling

Column names in a Frame satisfy several invariants:

  • they are all non-empty strings;

  • within a single Frame column names must be unique;

  • no column name may contain characters from the ASCII C0 control block. This set of forbidden characters includes: the NULL character \0, TAB character \t, newline \n, and similar.

If the user attempts to create a Frame that would violate any of these invariants, then instead of failing we will attempt to mangle the provided names, forcing them to satisfy the above requirements.

Given a list of column names requested by the user, the following algorithm is used:

  1. First, we check all the non-empty names in the list, from left to right. If a name contains characters in the range \x00-\x1F, then every run of 1 or more such characters is replaced with a single dot.

  2. Once the special characters are removed from the name, we check it against the set of names that were already encountered. If the current name hasn’t been seen before, then we add it to the final list of names and proceed to consider the next name in the list. However, if the name was seen before, then it goes into the deduplication stage.

  3. When a name needs to be deduplicated, we do the following:

    • If the name ends with a number, then split it into two parts: the stem and the numeric suffix. Let count be the value of the numeric suffix plus 1;

    • If the name does not end with a number, then append a dot (.) to the name and consider this the stem. For the count variable, take the value of option dt.options.frame.name_auto_index.

    • Concatenate stem and count, and check whether this name has been seen before. If it was, then increment count by 1, and repeat this step.

    • Use stem + count as this column’s final name. Continue processing other columns.

  4. Finally, re-scan the list of column names once again, this time replacing all the empty names. For each empty name we proceed exactly as in (3), using dt.options.frame.name_auto_prefix as the stem, and dt.options.frame.name_auto_index as the initial count.

Examples

The default value of dt.options.frame.name_auto_prefix is "C", and the default value of dt.options.frame.name_auto_index is 0. This means that if no column names are given, they will be named as C0, C1, C2, ...:

dt.Frame([[]] * 5).names
('C0', 'C1', 'C2', 'C3', 'C4')

If the column names contain duplicates, then they will gain a numeric suffix (or reuse the existing suffix, if any):

dt.Frame(names=["A", "A", "A"]).names
('A', 'A.0', 'A.1')
dt.Frame(names=["R3"] * 4).names
('R3', 'R4', 'R5', 'R6')

If some of the column names are given, while others are missing, then the missing names will be filled as C0, C1, ...:

dt.Frame(names=["A", None, "B", None]).names
('A', 'C0', 'B', 'C1')

When replacing the missing names, explicitly given names will have a higher precedence and tend to retain their names:

dt.Frame(names=["A", None, "C0", "C1"]).names
('A', 'C2', 'C0', 'C1')

However, deduplication of the existing names happens from left to right, which may affect subsequent columns:

dt.Frame(names=["A1", "A1", "A2", "A3"]).names
('A1', 'A2', 'A3', 'A4')

f-expressions

The datatable module exports a special symbol f, which can be used to refer to the columns of a frame currently being operated on. If this sounds cryptic, consider that the most common way to operate on a frame is via the square-bracket call DT[i, j, by, ...]. It is often the case that within this expression you would want to refer to individual columns of the frame: either to create a filter, a transform, or specify a grouping variable, etc. In all such cases the f symbol is used, and it is considered to be evaluated within the context of the frame DT.

For example, consider the expression:

f.price

By itself, it just means a column named “price”, in an unspecified frame. This expression becomes concrete, however, when used with a particular frame. For example:

train_dt[f.price > 0, :]

selects all rows in train_dt where the price is positive. Thus, within the call to train_dt[...], the symbol f refers to the frame train_dt.

The standalone f-expression may occasionally be useful too: it can be saved in a variable and then re-applied to several different frames. Each time f will refer to the frame to which it is being applied:

price_filter = (f.price > 0)
train_filtered = train_dt[price_filter, :]
test_filtered = test_dt[price_filter, :]

The simple expression f.price can be saved in a variable too. In fact, there is a Frame helper method .export_names() which does exactly that: returns a tuple of variables for each column name in the frame, allowing you to omit the f. prefix:

Id, Price, Quantity = DT.export_names()
DT[:, [Id, Price, Quantity, Price * Quantity]]

Single-column selector

As you have seen, the expression f.NAME refers to a column called “NAME”. This notation is handy, but not universal. What do you do if the column’s name contains spaces or unicode characters? Or if a column’s name is not known, only its index? Or if the name is in a variable? For these purposes f supports the square-bracket selectors:

f[-1]           # select the last column
f["Price ($)"]  # select the column named "Price ($)"

Generally, f[i] means either the column at index i if i is an integer, or the column with name i if i is a string.

Using an integer index follows the standard Python rule for list subscripts: negative indices are interpreted as counting from the end of the frame, and requesting a column with an index outside of [-ncols; ncols) will raise an error.

This square-bracket form is also useful when you want to access a column dynamically, i.e. if its name is not known in advance. For example, suppose there is a frame with columns "2017_01", "2017_02", …, "2019_12". Then all these columns can be addressed as:

[f["%d_%02d" % (year, month)] for month in range(1, 13) for year in [2017, 2018, 2019]]

Multi-column selector

In the previous section you have seen that f[i] refers to a single column when i is either an integer or a string. However, we also support the case when i is a slice or a type:

f[:]          # select all columns
f[::-1]       # select all columns in reverse order
f[:5]         # select the first 5 columns
f[3:4]        # select the fourth column
f["B":"H"]    # select columns from B to H, inclusive
f[int]        # select all integer columns
f[float]      # select all floating-point columns
f[dt.str32]   # select all columns with stype `str32`
f[None]       # select no columns (empty columnset)

In all these cases a columnset is returned. This columnset may contain a variable number of columns or even no columns at all, depending on the frame to which this f-expression is applied.

Applying a slice to the symbol f follows the same semantics as if f was a list of columns. Thus f[:10] means the first 10 columns of a frame, or all columns if the frame has fewer than 10. Similarly, f[9:10] selects the 10th column of a frame if it exists, or nothing if the frame has fewer than 10 columns. Compare this to the selector f[9], which also selects the 10th column of a frame if it exists, but throws an exception if it doesn’t.

Besides the usual numeric ranges, you can also use name ranges. These ranges include the first named column, the last named column, and all columns in between. It is not possible to mix positional and named columns in a range, and it is not possible to specify a step. If the range is x:y, yet column x comes after y in the frame, then the columns will be selected in the reverse order: first x, then the column preceding x, and so on, until column y is selected last:

f["C1":"C9"]   # Select columns from C1 up to C9
f["C9":"C1"]   # Select columns C9, C8, C7, ..., C2, C1
f[:"C3"]       # Select all columns up to C3
f["C5":]       # Select all columns after C5

Finally, you can select all columns of a particular type by using that type as an f-selector. You can pass either common python types bool, int, float, str; or you can pass an stype such as dt.int32, or an ltype such as dt.ltype.obj. You can also pass None to not select any columns. By itself this may not be very useful, but occasionally you may need this as a fallback in conditional expressions:

f[int if select_types == "integer" else
  float if select_types == "floating" else
  None]   # otherwise don't select any columns

A columnset can be used in situations where a sequence of columns is expected, such as:

  • the j node of DT[i,j,...];

  • within by() and sort() functions;

  • with certain functions that operate on sequences of columns: rowsum(), rowmean(), rowmin(), etc.;

  • many other functions that normally operate on a single column will automatically map over all columns in columnset:

    sum(f[:])        # equivalent to [sum(f[i]) for i in range(DT.ncols)]
    f[:3] + f[-3:]   # same as [f[0]+f[-3], f[1]+f[-2], f[2]+f[-1]]
Added in version 0.10.0

Modifying a columnset

Columnsets support operations that either add or remove elements from the set. This is done using methods .extend() and .remove().

The .extend() method takes a columnset as an argument (also a list, or dict, or sequence of columns) and produces a new columnset containing both the original and the new columns. The columns need not be unique: the same column may appear multiple times in a columnset. This method also allows you to add transformed columns into the columnset:

f[int].extend(f[float])            # integer and floating-point columns
f[:3].extend(f[-3:])               # the first and the last 3 columns
f.A.extend(f.B)                    # columns "A" and "B"
f[str].extend(dt.str32(f[int]))    # string columns, and also all integer
                                   # columns converted to strings

# All columns, and then one additional column named 'cost', which contains
# column `price` multiplied by `quantity`:
f[:].extend({"cost": f.price * f.quantity})

When a columnset is extended, the order of the elements is preserved. Thus, a columnset is closer in functionality to a python list than to a set. In addition, some of the elements in a columnset can have names if the columnset is created from a dictionary. The names may be non-unique too.

The .remove() method is the opposite of .extend(): it takes an existing columnset and then removes all columns that are passed as the argument:

f[:].remove(f[str])    # all columns except columns of type string
f[:10].remove(f.A)     # the first 10 columns without column "A"
f[:].remove(f[3:-3])   # same as `f[:3].extend(f[-3:])`, at least in the
                       # context of a frame with 6+ columns

Removing a column that is not in the columnset is not considered an error, similar to how set-difference operates. Thus, f[:].remove(f.A) may be safely applied to a frame that doesn’t have column “A”: the columns that cannot be removed are simply ignored.

If a columnset includes some column several times, and you then request to remove that column, then only the first occurrence in the sequence will be removed. Generally, the multiplicity of some column “A” in the columnset cs1.remove(cs2) will be equal to the multiplicity of “A” in cs1 minus the multiplicity of “A” in cs2, or 0 if that difference would be negative. Thus:

f[:].extend(f[int]).remove(f[int])

will have the effect of moving all integer columns to the end of the columnset (since .remove() removes the first occurrence of a column it finds).
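A small sketch of this behavior (the frame and its columns are illustrative; per the rules above, the result should contain the columns in the order B, A, C):

from datatable import dt, f
DT = dt.Frame(A=[1], B=[2.5], C=[3])
# f[:] is A,B,C; extending with f[int] gives A,B,C,A,C; removing f[int]
# drops the first occurrences of A and C, leaving B,A,C
DT[:, f[:].extend(f[int]).remove(f[int])]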

It is not possible to remove a transformed column from a columnset. An error will be thrown if the argument of .remove() contains any transformed columns.

Added in version 0.10.0

Fread Examples

This function is capable of reading data from a variety of input formats (text files, plain text, files embedded in archives, excel files, …), producing a Frame as the result. You can even read in data from the command line.

See fread() for all the available parameters.

Note: If you wish to read in multiple files, use iread(); it returns an iterator of Frames.

Read data

Read from a text file:

from datatable import dt, fread
fread('iris.csv')

       sepal_length  sepal_width  petal_length  petal_width  species
       float64       float64      float64       float64      str32
   0   5.1           3.5          1.4           0.2          setosa
   1   4.9           3            1.4           0.2          setosa
   2   4.7           3.2          1.3           0.2          setosa
   3   4.6           3.1          1.5           0.2          setosa
   4   5             3.6          1.4           0.2          setosa
   5   5.4           3.9          1.7           0.4          setosa
   6   4.6           3.4          1.4           0.3          setosa
   7   5             3.4          1.5           0.2          setosa
   8   4.4           2.9          1.4           0.2          setosa
   9   4.9           3.1          1.5           0.1          setosa
  10   5.4           3.7          1.5           0.2          setosa
  11   4.8           3.4          1.6           0.2          setosa
  12   4.8           3            1.4           0.1          setosa
  13   4.3           3            1.1           0.1          setosa
  14   5.8           4            1.2           0.2          setosa
   …   …
 145   6.7           3            5.2           2.3          virginica
 146   6.3           2.5          5             1.9          virginica
 147   6.5           3            5.2           2            virginica
 148   6.2           3.4          5.4           2.3          virginica
 149   5.9           3            5.1           1.8          virginica

Read text data directly:

data = ('col1,col2,col3\n'
        'a,b,1\n'
        'a,b,2\n'
        'c,d,3')
fread(data)

     col1   col2   col3
     str32  str32  int32
 0   a      b      1
 1   a      b      2
 2   c      d      3

Read from a URL:

url = "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"
fread(url)

          year   month  day    dep_delay  arr_delay  carrier  origin  dest   air_time  distance  hour
          int32  int32  int32  int32      int32      str32    str32   str32  int32     int32     int32
      0   2014   1      1       14         13        AA       JFK     LAX    359       2475      9
      1   2014   1      1       -3         13        AA       JFK     LAX    363       2475      11
      2   2014   1      1        2          9        AA       JFK     LAX    351       2475      19
      3   2014   1      1       -8        -26        AA       LGA     PBI    157       1035      7
      4   2014   1      1        2          1        AA       JFK     LAX    350       2475      13
      5   2014   1      1        4          0        AA       EWR     LAX    339       2454      18
      6   2014   1      1       -2        -18        AA       JFK     LAX    338       2475      21
      7   2014   1      1       -3        -14        AA       JFK     LAX    356       2475      15
      8   2014   1      1       -1        -17        AA       JFK     MIA    161       1089      15
      9   2014   1      1       -2        -14        AA       JFK     SEA    349       2422      18
     10   2014   1      1       -5        -17        AA       EWR     MIA    161       1085      16
     11   2014   1      1        7         -5        AA       JFK     SFO    365       2586      17
     12   2014   1      1        3          1        AA       JFK     BOS     39        187      12
     13   2014   1      1      142        133        AA       JFK     LAX    345       2475      19
     14   2014   1      1       -5        -26        AA       JFK     BOS     35        187      17
      …   …
 253311   2014   10     31       1        -30        UA       LGA     IAH    201       1416      14
 253312   2014   10     31      -5        -14        UA       EWR     IAH    189       1400      8
 253313   2014   10     31      -8         16        MQ       LGA     RDU     83        431      11
 253314   2014   10     31      -4         15        MQ       LGA     DTW     75        502      11
 253315   2014   10     31      -5          1        MQ       LGA     SDF    110        659      8

Read from an archive (if there are multiple files, only the first will be read; you can specify the path to the specific file you are interested in):

fread("data.zip/mtcars.csv")

Note: Use iread() if you wish to read in multiple files in an archive; an iterator of Frames is returned.

Read from .xls or .xlsx files

fread("excel.xlsx")

For excel files, you can specify the sheet to be read:

fread("excel.xlsx/Sheet1")
Note:
  • xlrd must be installed to read in excel files.

  • Use iread() if you wish to read in multiple sheets; an iterator of Frames is returned.

Read in data from the command line. Simply pass the command line statement to the cmd parameter:

# https://blog.jpalardy.com/posts/awk-tutorial-part-2/
# You specify the `cmd` parameter;
# here we filter data for the year 2015
fread(cmd="""cat netflix.tsv | awk 'NR==1; /^2015-/'""")

The command line can be very handy with large data; you can do some of the preprocessing before reading the data into datatable.

Detect Thousand Separator

Fread handles thousands separators, under the assumption that the separator is a comma:

fread("""Name|Salary|Position
James|256,000|evangelist
Ragnar|1,000,000|conqueror
Loki|250360|trickster""")

     Name    Salary   Position
     str32   int32    str32
 0   James   256000   evangelist
 1   Ragnar  1000000  conqueror
 2   Loki    250360   trickster

Specify the Delimiter

You can specify the delimiter via the sep parameter. Note that the separator must be a single-character string; non-ASCII characters are not allowed as the separator, nor are any characters in ["'`0-9a-zA-Z]:

data = """
1:2:3:4
5:6:7:8
9:10:11:12
"""
fread(data, sep=":")

     C0     C1     C2     C3
     int32  int32  int32  int32
 0   1      2      3      4
 1   5      6      7      8
 2   9      10     11     12

Dealing with Null Values and Blank Rows

You can pass a list of values to be treated as null, via the na_strings parameter:

data = """
ID|Charges|Payment_Method
634-VHG|28|Cheque
365-DQC|33.5|Credit card
264-PPR|631|--
845-AJO|42.3|
789-KPO|56.9|Bank Transfer
"""
fread(data, na_strings=['--', ''])

     ID       Charges  Payment_Method
     str32    float64  str32
 0   634-VHG  28       Cheque
 1   365-DQC  33.5     Credit card
 2   264-PPR  631      NA
 3   845-AJO  42.3     NA
 4   789-KPO  56.9     Bank Transfer

For rows with fewer values than the other rows, you can set fill=True; fread will fill the missing cells with NA:

data = ('a,b,c,d\n'
        '1,2,3,4\n'
        '5,6,7,8\n'
        '9,10,11')
fread(data, fill=True)

     a      b      c      d
     int32  int32  int32  int32
 0   1      2      3      4
 1   5      6      7      8
 2   9      10     11     NA

You can skip empty lines:

data = ('a,b,c,d\n'
        '\n'
        '1,2,3,4\n'
        '5,6,7,8\n'
        '\n'
        '9,10,11,12')
fread(data, skip_blank_lines=True)

     a      b      c      d
     int32  int32  int32  int32
 0   1      2      3      4
 1   5      6      7      8
 2   9      10     11     12

Dealing with Column Names

If the data has no headers, fread will assign default column names:

data = ('1,2\n'
        '3,4\n')
fread(data)

     C0     C1
     int32  int32
 0   1      2
 1   3      4

You can pass in column names via the columns parameter:

fread(data, columns=['A','B'])
     A      B
     int32  int32
 0   1      2
 1   3      4

You can change column names:

data = ('a,b,c,d\n'
        '1,2,3,4\n'
        '5,6,7,8\n'
        '9,10,11,12')
fread(data, columns=["A", "B", "C", "D"])

     A      B      C      D
     int32  int32  int32  int32
 0   1      2      3      4
 1   5      6      7      8
 2   9      10     11     12

You can change some of the column names via a dictionary:

fread(data, columns={"a":"A", "b":"B"})
     A      B      c      d
     int32  int32  int32  int32
 0   1      2      3      4
 1   5      6      7      8
 2   9      10     11     12

Fread uses heuristics to determine whether the first row is data or not; occasionally it may guess incorrectly, in which case, you can set the header parameter to False:

fread(data, header=False)
     C0     C1     C2     C3
     str32  str32  str32  str32
 0   a      b      c      d
 1   1      2      3      4
 2   5      6      7      8
 3   9      10     11     12

You can pass a new list of column names as well:

fread(data, header=False, columns=["A","B","C","D"])
     A      B      C      D
     str32  str32  str32  str32
 0   a      b      c      d
 1   1      2      3      4
 2   5      6      7      8
 3   9      10     11     12

Row Selection

Fread has a skip_to_line parameter, where you can specify what line to read the data from:

data = ('skip this line\n'
        'a,b,c,d\n'
        '1,2,3,4\n'
        '5,6,7,8\n'
        '9,10,11,12')
fread(data, skip_to_line=2)

     a      b      c      d
     int32  int32  int32  int32
 0   1      2      3      4
 1   5      6      7      8
 2   9      10     11     12

You can also skip to a line containing a particular string with the skip_to_string parameter, and start reading data from that line. Note that skip_to_string and skip_to_line cannot be combined; you can only use one:

data = ('skip this line\n'
        'a,b,c,d\n'
        'first, second, third, last\n'
        '1,2,3,4\n'
        '5,6,7,8\n'
        '9,10,11,12')
fread(data, skip_to_string='first')

     first  second  third  last
     int32  int32   int32  int32
 0   1      2       3      4
 1   5      6       7      8
 2   9      10      11     12

You can set the maximum number of rows to read with the max_nrows parameter:

data = ('a,b,c,d\n'
        '1,2,3,4\n'
        '5,6,7,8\n'
        '9,10,11,12')
fread(data, max_nrows=2)

     a      b      c      d
     int32  int32  int32  int32
 0   1      2      3      4
 1   5      6      7      8

data = ('skip this line\n'
        'a,b,c,d\n'
        '1,2,3,4\n'
        '5,6,7,8\n'
        '9,10,11,12')
fread(data, skip_to_line=2, max_nrows=2)

     a      b      c      d
     int32  int32  int32  int32
 0   1      2      3      4
 1   5      6      7      8

Setting Column Type

You can specify the column data types via the columns parameter:

data = ('a,b,c,d\n'
        '1,2,3,4\n'
        '5,6,7,8\n'
        '9,10,11,12')
# this is useful when you are interested in only a subset of the columns
fread(data, columns={"a": dt.float32, "b": dt.str32})

     a        b      c      d
     float64  str32  int32  int32
 0   1        2      3      4
 1   5        6      7      8
 2   9        10     11     12

You can also pass in the data types by position:

fread(data, columns = [dt.int32, dt.str32, None, dt.float32])
     a      b      d
     int32  str32  float64
 0   1      2      4
 1   5      6      8
 2   9      10     12

You can also change all the column data types with a single assignment:

fread(data, columns = dt.float32)
     a        b        c        d
     float64  float64  float64  float64
 0   1        2        3        4
 1   5        6        7        8
 2   9        10       11       12

You can change the data type for a slice of the columns (here slice(3) is equivalent to [:3]):

# this changes the data type to float for the first three columns
fread(data, columns={float: slice(3)})

     a        b        c        d
     float64  float64  float64  int32
 0   1        2        3        4
 1   5        6        7        8
 2   9        10       11       12

Selecting Columns

There are various ways to select columns in fread:

  • Select with a dictionary:

    data = ('a,b,c,d\n'
            '1,2,3,4\n'
            '5,6,7,8\n'
            '9,10,11,12')
    # pass ``Ellipsis : None`` or ``... : None``,
    # to discard any columns that are not needed
    fread(data, columns={"a": "a", ...: None})

         a
         int32
     0   1
     1   5
     2   9

Selecting via a dictionary makes more sense when selecting and renaming columns at the same time.

  • Select columns with a set:

    fread(data, columns={"a","b"})
         a      b
         int32  int32
     0   1      2
     1   5      6
     2   9      10
  • Select range of columns with slice:

    # select the second and third column
    fread(data, columns=slice(1, 3))

         b      c
         int32  int32
     0   2      3
     1   6      7
     2   10     11

    # select the first column,
    # jump two hoops, and
    # select the third column
    fread(data, columns=slice(None, 3, 2))

         a      c
         int32  int32
     0   1      3
     1   5      7
     2   9      11
  • Select range of columns with range:

    fread(data, columns = range(1,3))
         b      c
         int32  int32
     0   2      3
     1   6      7
     2   10     11
  • Boolean Selection:

    fread(data, columns=[False, False, True, True])
         c      d
         int32  int32
     0   3      4
     1   7      8
     2   11     12
  • Select with a list comprehension:

    fread(data, columns=lambda cols:[col.name in ("a","c") for col in cols])
         a      c
         int32  int32
     0   1      3
     1   5      7
     2   9      11
  • Exclude columns with None:

    fread(data, columns = ['a',None,None,'d'])
         a      d
         int32  int32
     0   1      4
     1   5      8
     2   9      12
  • Exclude columns with list comprehension:

    fread(data, columns=lambda cols:[col.name not in ("a","c") for col in cols])
         b      d
         int32  int32
     0   2      4
     1   6      8
     2   10     12
  • Drop columns by assigning None to the columns via a dictionary:

    data = ("A,B,C,D\n"
            "1,3,5,7\n"
            "2,4,6,8\n")
    fread(data, columns={"B": None, "D": None})

         A      C
         int32  int32
     0   1      5
     1   2      6
  • Drop a column and change data type:

    fread(data, columns={"B":None, "C":str})
         A      C      D
         int32  str32  int32
     0   1      5      7
     1   2      6      8
  • Change column name and type, and drop a column:

    # pass a tuple, where the first item in the tuple is the new column name,
    # and the other item is the new data type.
    fread(data, columns={"A": ("first", float), "B": None, "D": None})

         first    C
         float64  int32
     0   1        5
     1   2        6

You can also select which columns to read dynamically, based on the names/types of the columns in the file:

def colfilter(columns):
    return [col.name == 'species' or "length" in col.name
            for col in columns]

fread('iris.csv', columns=colfilter, max_nrows=5)

     sepal_length  petal_length  species
     float64       float64       str32
 0   5.1           1.4           setosa
 1   4.9           1.4           setosa
 2   4.7           1.3           setosa
 3   4.6           1.5           setosa
 4   5             1.4           setosa

The same approach can be used to auto-rename columns as they are read from the file:

def rename(columns):
    return [col.name.upper() for col in columns]

fread('iris.csv', columns=rename, max_nrows=5)

     SEPAL_LENGTH  SEPAL_WIDTH  PETAL_LENGTH  PETAL_WIDTH  SPECIES
     float64       float64      float64       float64      str32
 0   5.1           3.5          1.4           0.2          setosa
 1   4.9           3            1.4           0.2          setosa
 2   4.7           3.2          1.3           0.2          setosa
 3   4.6           3.1          1.5           0.2          setosa
 4   5             3.6          1.4           0.2          setosa

Selecting Data

Selecting Data – Columns

Column selection is via the j section in the DT[i, j, ...] syntax. First, let’s construct a simple Frame:

from datatable import dt, f
from datetime import date

source = {"dates"    : [date(2000, 1, 5), date(2010, 11, 23),
                        date(2020, 2, 29), None],
          "integers" : range(1, 5),
          "floats"   : [10.0, 11.5, 12.3, -13],
          "strings"  : ['A', 'B', None, 'D']}
DT = dt.Frame(source)
DT

     dates       integers  floats   strings
     date32      int32     float64  str32
 0   2000-01-05  1         10       A
 1   2010-11-23  2         11.5     B
 2   2020-02-29  3         12.3     NA
 3   NA          4         -13      D

Column selection is possible via a number of options:

By column name

DT[:, 'dates']
     dates
     date32
 0   2000-01-05
 1   2010-11-23
 2   2020-02-29
 3   NA

When selecting all rows, the i section can also be ....

By position

DT[..., 2] # 3rd column
     floats
     float64
 0   10
 1   11.5
 2   12.3
 3   -13

With position, you can select with a negative number – the column will be selected from the end; this is similar to indexing a python list:

DT[:, -2] # 2nd column from the end
     floats
     float64
 0   10
 1   11.5
 2   12.3
 3   -13

For a single column, it is possible to skip the : in the i section and pass only the column name or position:

DT['dates']

     dates
     date32
 0   2000-01-05
 1   2010-11-23
 2   2020-02-29
 3   NA

DT[0]

     dates
     date32
 0   2000-01-05
 1   2010-11-23
 2   2020-02-29
 3   NA

When selecting via column name or position, an error is returned if the name or position does not exist:

DT[:, 5]
ValueError: Column index 5 is invalid for a Frame with 4 columns
DT[:, 'categoricals']
KeyError: Column categoricals does not exist in the Frame

By data type

Column selection is possible by using python’s built-in types that correspond to one of the datatable’s types:

DT[:, int]
     integers
     int32
 0   1
 1   2
 2   3
 3   4

Or datatable’s stype/ltype:

DT[:, dt.float64]

     floats
     float64
 0   10
 1   11.5
 2   12.3
 3   -13

DT[:, dt.ltype.time]

     dates
     date32
 0   2000-01-05
 1   2010-11-23
 2   2020-02-29
 3   NA

A list of types can be selected as well:

DT[:, [date, str]]
     dates       strings
     date32      str32
 0   2000-01-05  A
 1   2010-11-23  B
 2   2020-02-29  NA
 3   NA          D

By list

Using a list allows for selection of multiple columns:

DT[:, ['integers', 'strings']]
     integers  strings
     int32     str32
 0   1         A
 1   2         B
 2   3         NA
 3   4         D

A tuple of selectors is also allowed, although not recommended from a stylistic perspective:

DT[:, (-3, 2, 3)]
     integers  floats   strings
     int32     float64  str32
 0   1         10       A
 1   2         11.5     B
 2   3         12.3     NA
 3   4         -13      D

Selection via list comprehension/generator expression is possible:

DT[:, [num for num in range(DT.ncols) if num % 2 == 0]]
     dates       floats
     date32      float64
 0   2000-01-05  10
 1   2010-11-23  11.5
 2   2020-02-29  12.3
 3   NA          -13

Selecting columns via a mix of column names and positions (integers) is not allowed:

DT[:, ['dates', 2]]
TypeError: Mixed selector types are not allowed. Element 1 is of type integer, whereas the previous element(s) were of type string

Via slicing

When slicing with strings, both the start and end column names are included in the returned frame:

DT[:, 'dates':'strings']
     dates       integers  floats   strings
     date32      int32     float64  str32
 0   2000-01-05  1         10       A
 1   2010-11-23  2         11.5     B
 2   2020-02-29  3         12.3     NA
 3   NA          4         -13      D

However, when slicing via position, the columns are returned up to, but not including the final position; this is similar to the slicing pattern for Python’s sequences:

DT[:, 1:3]

     integers  floats
     int32     float64
 0   1         10
 1   2         11.5
 2   3         12.3
 3   4         -13

DT[:, ::-1]

     strings  floats   integers  dates
     str32    float64  int32     date32
 0   A        10       1         2000-01-05
 1   B        11.5     2         2010-11-23
 2   NA       12.3     3         2020-02-29
 3   D        -13      4         NA

It is possible to select columns via slicing, even if the indices are not in the Frame:

DT[:, 3:10] # there are only four columns in the Frame
     strings
     str32
 0   A
 1   B
 2   NA
 3   D

Unlike with integer slicing, providing a name of the column that is not in the Frame will result in an error:

DT[:, "integers" : "categoricals"]
KeyError: Column categoricals does not exist in the Frame

Slicing is also possible with the standard slice function:

DT[:, slice('integers', 'strings')]
     integers  floats   strings
     int32     float64  str32
 0   1         10       A
 1   2         11.5     B
 2   3         12.3     NA
 3   4         -13      D

With the slice function, multiple slicing on the columns is possible:

DT[:, [slice("dates", "integers"), slice("floats", "strings")]]

     dates       integers  floats   strings
     date32      int32     float64  str32
 0   2000-01-05  1         10       A
 1   2010-11-23  2         11.5     B
 2   2020-02-29  3         12.3     NA
 3   NA          4         -13      D

DT[:, [slice("integers", "dates"), slice("strings", "floats")]]

     integers  dates       strings  floats
     int32     date32      str32    float64
 0   1         2000-01-05  A        10
 1   2         2010-11-23  B        11.5
 2   3         2020-02-29  NA       12.3
 3   4         NA          D        -13

Slicing on strings can be combined with column names during selection:

DT[:, [slice("integers", "dates"), "strings"]]
     integers  dates       strings
     int32     date32      str32
 0   1         2000-01-05  A
 1   2         2010-11-23  B
 2   3         2020-02-29  NA
 3   4         NA          D

But not with integers:

DT[:, [slice("integers", "dates"), 1]]
TypeError: Mixed selector types are not allowed. Element 1 is of type integer, whereas the previous element(s) were of type string

Slicing on position can be combined with column position:

DT[:, [slice(1, 3), 0]]
     integers  floats   dates
     int32     float64  date32
 0   1         10       2000-01-05
 1   2         11.5     2010-11-23
 2   3         12.3     2020-02-29
 3   4         -13      NA

But not with strings:

DT[:, [slice(1, 3), "dates"]]
TypeError: Mixed selector types are not allowed. Element 1 is of type string, whereas the previous element(s) were of type integer

Via booleans

When selecting via booleans, the sequence length must be equal to the number of columns in the frame:

DT[:, [True, True, False, False]]
     dates       integers
     date32      int32
 0   2000-01-05  1
 1   2010-11-23  2
 2   2020-02-29  3
 3   NA          4

Booleans generated from a list comprehension/generator expression allow for nifty selections:

DT[:, ["i" in name for name in DT.names]]
     integers  strings
     int32     str32
 0   1         A
 1   2         B
 2   3         NA
 3   4         D

In this example we want to select columns that are numeric (integers or floats) and whose average is greater than 3:

DT[:, [column.stype.ltype.name in ("real", "int") and column.mean1() > 3
       for column in DT]]

     floats
     float64
 0   10
 1   11.5
 2   12.3
 3   -13

Via f-expressions

All the selection options above (except boolean) are also possible via f-expressions:

DT[:, f.dates]

     dates
     date32
 0   2000-01-05
 1   2010-11-23
 2   2020-02-29
 3   NA

DT[:, f[-1]]

     strings
     str32
 0   A
 1   B
 2   NA
 3   D

DT[:, f['integers':'strings']]

     integers  floats   strings
     int32     float64  str32
 0   1         10       A
 1   2         11.5     B
 2   3         12.3     NA
 3   4         -13      D

DT[:, f['integers':]]

     integers  floats   strings
     int32     float64  str32
 0   1         10       A
 1   2         11.5     B
 2   3         12.3     NA
 3   4         -13      D

DT[:, f[1::-1]]

     integers  dates
     int32     date32
 0   1         2000-01-05
 1   2         2010-11-23
 2   3         2020-02-29
 3   4         NA

DT[:, f[date, int, float]]

     dates       integers  floats
     date32      int32     float64
 0   2000-01-05  1         10
 1   2010-11-23  2         11.5
 2   2020-02-29  3         12.3
 3   NA          4         -13

DT[:, f["dates":"integers", "floats":"strings"]]

     dates       integers  floats   strings
     date32      int32     float64  str32
 0   2000-01-05  1         10       A
 1   2010-11-23  2         11.5     B
 2   2020-02-29  3         12.3     NA
 3   NA          4         -13      D

Note

If the column names are python keywords (def, del, …), the dot notation is not possible with f-expressions; you have to use the bracket notation to access these columns.

Note

Selecting columns with DT[:, f[None]] returns an empty Frame. This is different from DT[:, None], which currently returns all the columns. The behavior of DT[:, None] may change in the future:

DT[:, None]

     dates       integers  floats   strings
     date32      int32     float64  str32
 0   2000-01-05  1         10       A
 1   2010-11-23  2         11.5     B
 2   2020-02-29  3         12.3     NA
 3   NA          4         -13      D

DT[:, f[None]]

 0
 1
 2
 3

Selecting Data – Rows

There are a number of ways to select rows of data via the i section.

Note

The index labels in a Frame are just for aesthetics; they serve no actual purpose during selection.

By Position

Only integer values are acceptable:

DT[0, :]

     dates       integers  floats   strings
     date32      int32     float64  str32
 0   2000-01-05  1         10       A

DT[-1, :]  # last row

     dates  integers  floats   strings
     date32 int32     float64  str32
 0   NA     4         -13      D

Via Sequence of Positions

Any acceptable sequence of positions is applicable here. Listed below are some of these sequences.

  • List (tuple):

    DT[[1, 2, 3], :]
         dates       integers  floats   strings
         date32      int32     float64  str32
     0   2010-11-23  2         11.5     B
     1   2020-02-29  3         12.3     NA
     2   NA          4         -13      D
  • An integer numpy 1-D Array:

    DT[np.arange(3), :]
         dates       integers  floats   strings
         date32      int32     float64  str32
     0   2000-01-05  1         10       A
     1   2010-11-23  2         11.5     B
     2   2020-02-29  3         12.3     NA
  • A one column integer Frame:

    DT[dt.Frame([1, 2, 3]), :]
         dates       integers  floats   strings
         date32      int32     float64  str32
     0   2010-11-23  2         11.5     B
     1   2020-02-29  3         12.3     NA
     2   NA          4         -13      D
  • An integer pandas Series:

    DT[pd.Series([1, 2, 3]), :]
         dates       integers  floats   strings
         date32      int32     float64  str32
     0   2010-11-23  2         11.5     B
     1   2020-02-29  3         12.3     NA
     2   NA          4         -13      D
  • A python range:

    DT[range(1, 3), :]
         dates       integers  floats   strings
         date32      int32     float64  str32
     0   2010-11-23  2         11.5     B
     1   2020-02-29  3         12.3     NA
  • A generator expression:

    DT[(num for num in range(4)), :]
         dates       integers  floats   strings
         date32      int32     float64  str32
     0   2000-01-05  1         10       A
     1   2010-11-23  2         11.5     B
     2   2020-02-29  3         12.3     NA
     3   NA          4         -13      D

If the position passed to i does not exist, an error is raised

DT[(num for num in range(7)), :]
ValueError: Index 4 is invalid for a Frame with 4 rows

The set sequence is not acceptable in the i or j sections.

Except for lists/tuples, all the other sequence types passed into the i section can only contain positive integers.

Via booleans

When selecting rows via boolean sequence, the length of the sequence must be the same as the number of rows:

DT[[True, True, False, False], :]

     dates       integers  floats   strings
     date32      int32     float64  str32
 0   2000-01-05  1         10       A
 1   2010-11-23  2         11.5     B

DT[(n % 2 == 0 for n in range(DT.nrows)), :]

     dates       integers  floats   strings
     date32      int32     float64  str32
 0   2000-01-05  1         10       A
 1   2020-02-29  3         12.3     NA

Via slicing

Slicing works similarly to slicing a python list:

DT[1:3, :]

     dates       integers  floats   strings
     date32      int32     float64  str32
 0   2010-11-23  2         11.5     B
 1   2020-02-29  3         12.3     NA

DT[::-1, :]

     dates       integers  floats   strings
     date32      int32     float64  str32
 0   NA          4         -13      D
 1   2020-02-29  3         12.3     NA
 2   2010-11-23  2         11.5     B
 3   2000-01-05  1         10       A

DT[-1:-3:-1, :]

     dates       integers  floats   strings
     date32      int32     float64  str32
 0   NA          4         -13      D
 1   2020-02-29  3         12.3     NA

Slicing is also possible with the slice function:

DT[slice(1, 3), :]
     dates       integers  floats   strings
     date32      int32     float64  str32
 0   2010-11-23  2         11.5     B
 1   2020-02-29  3         12.3     NA

It is possible to select rows with multiple slices. Let’s increase the number of rows in the Frame:

DT = dt.repeat(DT, 3)
DT

      dates       integers  floats   strings
      date32      int32     float64  str32
  0   2000-01-05  1         10       A
  1   2010-11-23  2         11.5     B
  2   2020-02-29  3         12.3     NA
  3   NA          4         -13      D
  4   2000-01-05  1         10       A
  5   2010-11-23  2         11.5     B
  6   2020-02-29  3         12.3     NA
  7   NA          4         -13      D
  8   2000-01-05  1         10       A
  9   2010-11-23  2         11.5     B
 10   2020-02-29  3         12.3     NA
 11   NA          4         -13      D

DT[[slice(1, 3), slice(5, 8)], :]

     dates       integers  floats   strings
     date32      int32     float64  str32
 0   2010-11-23  2         11.5     B
 1   2020-02-29  3         12.3     NA
 2   2010-11-23  2         11.5     B
 3   2020-02-29  3         12.3     NA
 4   NA          4         -13      D

DT[[slice(5, 8), 1, 3, slice(10, 12)], :]

     dates       integers  floats   strings
     date32      int32     float64  str32
 0   2010-11-23  2         11.5     B
 1   2020-02-29  3         12.3     NA
 2   NA          4         -13      D
 3   2010-11-23  2         11.5     B
 4   NA          4         -13      D
 5   2020-02-29  3         12.3     NA
 6   NA          4         -13      D

Via f-expressions

f-expressions return booleans that can be used to filter/select the appropriate rows:

DT[f.dates < dt.Frame([date(2020, 1, 1)]), :]

     dates       integers  floats   strings
     date32      int32     float64  str32
 0   2000-01-05  1         10       A
 1   2010-11-23  2         11.5     B

DT[f.integers % 2 != 0, :]

     dates       integers  floats   strings
     date32      int32     float64  str32
 0   2000-01-05  1         10       A
 1   2020-02-29  3         12.3     NA

DT[(f.integers == 3) & (f.strings == None), ...]

     dates       integers  floats   strings
     date32      int32     float64  str32
 0   2020-02-29  3         12.3     NA
 1   2020-02-29  3         12.3     NA
 2   2020-02-29  3         12.3     NA

Selection is possible via the data types:

DT[f[float] < 1, :]

     dates   integers  floats   strings
     date32  int32     float64  str32
 0   NA      4         -13      D
 1   NA      4         -13      D
 2   NA      4         -13      D

DT[dt.rowsum(f[int, float]) > 12, :]

     dates       integers  floats   strings
     date32      int32     float64  str32
 0   2010-11-23  2         11.5     B
 1   2020-02-29  3         12.3     NA
 2   2010-11-23  2         11.5     B
 3   2020-02-29  3         12.3     NA
 4   2010-11-23  2         11.5     B
 5   2020-02-29  3         12.3     NA

Select rows and columns

Specific selections can occur in rows and columns simultaneously:

DT[0, slice(1, 3)]

     integers  floats
     int32     float64
 0   1         10

DT[2:6, ["i" in name for name in DT.names]]

     integers  strings
     int32     str32
 0   3         NA
 1   4         D
 2   1         A
 3   2         B

DT[f.integers > dt.mean(f.floats) - 3, f['strings':'integers']]

     strings  floats   integers
     str32    float64  int32
 0   NA       12.3     3
 1   D        -13      4
 2   NA       12.3     3
 3   D        -13      4
 4   NA       12.3     3
 5   D        -13      4

Single value access

Passing single integers into the i and j sections returns a scalar value:

DT[0, 0]
datetime.date(2000, 1, 5)
DT[0, 2]
10.0
DT[-3, 'strings']
'B'

Deselect rows/columns

Deselection of rows/columns is possible via a list comprehension or generator expression:

  • Deselect a single column/row:

    # The list comprehension returns the specific column names
    DT[:, [name for name in DT.names if name != "integers"]]

          dates       floats   strings
          date32      float64  str32
      0   2000-01-05  10       A
      1   2010-11-23  11.5     B
      2   2020-02-29  12.3     NA
      3   NA          -13      D
      4   2000-01-05  10       A
      5   2010-11-23  11.5     B
      6   2020-02-29  12.3     NA
      7   NA          -13      D
      8   2000-01-05  10       A
      9   2010-11-23  11.5     B
     10   2020-02-29  12.3     NA
     11   NA          -13      D

    # A boolean sequence is returned by the list comprehension
    DT[[num != 5 for num in range(DT.nrows)], 'dates']

          dates
          date32
      0   2000-01-05
      1   2010-11-23
      2   2020-02-29
      3   NA
      4   2000-01-05
      5   2020-02-29
      6   NA
      7   2000-01-05
      8   2010-11-23
      9   2020-02-29
     10   NA
  • Deselect multiple columns/rows:

    DT[:, [name not in ("integers", "dates") for name in DT.names]]

          floats   strings
          float64  str32
      0   10       A
      1   11.5     B
      2   12.3     NA
      3   -13      D
      4   10       A
      5   11.5     B
      6   12.3     NA
      7   -13      D
      8   10       A
      9   11.5     B
     10   12.3     NA
     11   -13      D

    DT[(num not in range(3, 8) for num in range(DT.nrows)), ['integers', 'floats']]

         integers  floats
         int32     float64
     0   1         10
     1   2         11.5
     2   3         12.3
     3   1         10
     4   2         11.5
     5   3         12.3
     6   4         -13

    DT[:, [num not in (2, 3) for num in range(DT.ncols)]]

          dates       integers
          date32      int32
      0   2000-01-05  1
      1   2010-11-23  2
      2   2020-02-29  3
      3   NA          4
      4   2000-01-05  1
      5   2010-11-23  2
      6   2020-02-29  3
      7   NA          4
      8   2000-01-05  1
      9   2010-11-23  2
     10   2020-02-29  3
     11   NA          4

    # an alternative to the previous example
    DT[:, [num not in (2, 3) for num, _ in enumerate(DT.names)]]

          dates       integers
          date32      int32
      0   2000-01-05  1
      1   2010-11-23  2
      2   2020-02-29  3
      3   NA          4
      4   2000-01-05  1
      5   2010-11-23  2
      6   2020-02-29  3
      7   NA          4
      8   2000-01-05  1
      9   2010-11-23  2
     10   2020-02-29  3
     11   NA          4
  • Deselect by data type:

    # This selects columns that are not numeric
    DT[2:7, (dtype.name not in ("real", "int") for dtype in DT.ltypes)]

         dates       strings
         date32      str32
     0   2020-02-29  NA
     1   NA          D
     2   2000-01-05  A
     3   2010-11-23  B
     4   2020-02-29  NA

Slicing could be used to exclude rows/columns. The code below excludes rows from position 3 to 6:

DT[[slice(None, 3), slice(7, None)], :]

     dates       integers  floats   strings
     date32      int32     float64  str32
 0   2000-01-05  1         10       A
 1   2010-11-23  2         11.5     B
 2   2020-02-29  3         12.3     NA
 3   NA          4         -13      D
 4   2000-01-05  1         10       A
 5   2010-11-23  2         11.5     B
 6   2020-02-29  3         12.3     NA
 7   NA          4         -13      D

Columns can also be deselected via the remove() method, where the column name, column position, or data type is passed to the f symbol:

DT[:, f[:].remove(f.dates)]

      integers  floats   strings
      int32     float64  str32
  0   1         10       A
  1   2         11.5     B
  2   3         12.3     NA
  3   4         -13      D
  4   1         10       A
  5   2         11.5     B
  6   3         12.3     NA
  7   4         -13      D
  8   1         10       A
  9   2         11.5     B
 10   3         12.3     NA
 11   4         -13      D

DT[:, f[:].remove(f[0])]

      integers  floats   strings
      int32     float64  str32
  0   1         10       A
  1   2         11.5     B
  2   3         12.3     NA
  3   4         -13      D
  4   1         10       A
  5   2         11.5     B
  6   3         12.3     NA
  7   4         -13      D
  8   1         10       A
  9   2         11.5     B
 10   3         12.3     NA
 11   4         -13      D

DT[:, f[:].remove(f[1:3])]

      dates       strings
      date32      str32
  0   2000-01-05  A
  1   2010-11-23  B
  2   2020-02-29  NA
  3   NA          D
  4   2000-01-05  A
  5   2010-11-23  B
  6   2020-02-29  NA
  7   NA          D
  8   2000-01-05  A
  9   2010-11-23  B
 10   2020-02-29  NA
 11   NA          D

DT[:, f[:].remove(f['strings':'integers'])]

      dates
      date32
  0   2000-01-05
  1   2010-11-23
  2   2020-02-29
  3   NA
  4   2000-01-05
  5   2010-11-23
  6   2020-02-29
  7   NA
  8   2000-01-05
  9   2010-11-23
 10   2020-02-29
 11   NA

DT[:, f[:].remove(f[int, float])]

      dates       strings
      date32      str32
  0   2000-01-05  A
  1   2010-11-23  B
  2   2020-02-29  NA
  3   NA          D
  4   2000-01-05  A
  5   2010-11-23  B
  6   2020-02-29  NA
  7   NA          D
  8   2000-01-05  A
  9   2010-11-23  B
 10   2020-02-29  NA
 11   NA          D

DT[:, f[:].remove(f[:])]

  0
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11

Delete rows/columns

To actually delete a row (or a column), use the del statement; this is an in-place operation, and as such no reassignment is needed

  • Delete multiple rows:

del DT[3:7, :]
DT
    datesintegersfloatsstrings
    date32int32float64str32
    02000-01-05110A
    12010-11-23211.5B
    22020-02-29312.3NA
    3NA4-13D
    42000-01-05110A
    52010-11-23211.5B
    62020-02-29312.3NA
    7NA4-13D
  • Delete a single row:

del DT[3, :]
DT
    datesintegersfloats
    date32int32float64
    02000-01-05110
    12010-11-23211.5
    22020-02-29NANA
    32000-01-05NANA
    42010-11-23211.5
    52020-02-29312.3
    6NA4-13
  • Delete a column:

del DT['strings']
DT
    datesintegersfloats
    date32int32float64
    02000-01-05110
    12010-11-23211.5
    22020-02-29312.3
    3NA4-13
    42000-01-05110
    52010-11-23211.5
    62020-02-29312.3
    7NA4-13
  • Delete multiple columns:

del DT[:, ['dates', 'floats']]
DT
    integers
    int32
    01
    12
    2NA
    3NA
    42
    53
    64

Grouping with by()

The by() modifier splits a dataframe into groups, either via the provided column(s) or f-expressions, and then applies i and j within each group. This split-apply-combine strategy allows for a number of operations:

  • Aggregations per group,

  • Transformation of a column or columns, where the shape of the dataframe is maintained,

  • Filtration, where some data are kept and the others discarded, based on a condition or conditions.

Aggregation

The aggregate function is applied in the j section.

Group by one column:

from datatable import (dt, f, by, ifelse, update, sort,
                       count, min, max, mean, sum, rowsum)

df = dt.Frame("""Fruit    Date       Name   Number
                 Apples   10/6/2016  Bob     7
                 Apples   10/6/2016  Bob     8
                 Apples   10/6/2016  Mike    9
                 Apples   10/7/2016  Steve  10
                 Apples   10/7/2016  Bob     1
                 Oranges  10/7/2016  Bob     2
                 Oranges  10/6/2016  Tom    15
                 Oranges  10/6/2016  Mike   57
                 Oranges  10/6/2016  Bob    65
                 Oranges  10/7/2016  Tony    1
                 Grapes   10/7/2016  Bob     1
                 Grapes   10/7/2016  Tom    87
                 Grapes   10/7/2016  Bob    22
                 Grapes   10/7/2016  Bob    12
                 Grapes   10/7/2016  Tony   15""")

df[:, sum(f.Number), by('Fruit')]
FruitNumber
str32int64
0Apples35
1Grapes137
2Oranges140

Group by multiple columns:

df[:, sum(f.Number), by('Fruit', 'Name')]
FruitNameNumber
str32str32int64
0ApplesBob16
1ApplesMike9
2ApplesSteve10
3GrapesBob35
4GrapesTom87
5GrapesTony15
6OrangesBob67
7OrangesMike57
8OrangesTom15
9OrangesTony1

By column position:

df[:, sum(f.Number), by(f[0])]
FruitNumber
str32int64
0Apples35
1Grapes137
2Oranges140

By boolean expression:

df[:, sum(f.Number), by(f.Fruit == "Apples")]
C0Number
bool8int64
00277
1135

Combination of column and boolean expression:

df[:, sum(f.Number), by(f.Name, f.Fruit == "Apples")]
NameC0Number
str32bool8int64
0Bob0102
1Bob116
2Mike057
3Mike19
4Steve110
5Tom0102
6Tony016

The grouping column can be excluded from the final output:

df[:, sum(f.Number), by('Fruit', add_columns=False)]
Number
int64
035
1137
2140

Note

  • The resulting dataframe has the grouping column(s) as the first column(s).

  • The grouping columns are excluded from j, unless explicitly included.

  • The grouping columns are sorted in ascending order.

Apply multiple aggregate functions to a column in the j section:

df[:, {"min": min(f.Number), "max": max(f.Number)}, by('Fruit','Date')]
FruitDateminmax
str32str32int32int32
0Apples10/6/201679
1Apples10/7/2016110
2Grapes10/7/2016187
3Oranges10/6/20161565
4Oranges10/7/201612

Functions can be applied across a columnset. Task: get the sum of col3 and col4, grouped by col1 and col2:

df = dt.Frame(""" col1 col2 col3 col4 col5 a c 1 2 f a c 1 2 f a d 1 2 f b d 1 2 g b e 1 2 g b e 1 2 g""") df[:, sum(f["col3":"col4"]), by('col1', 'col2')]
col1col2col3col4
str32str32int64int64
0ac24
1ad12
2bd12
3be24

Apply different aggregate functions to different columns:

df[:, [max(f.col3), min(f.col4)], by('col1', 'col2')]
col1col2col3col4
str32str32int8int32
0ac12
1ad12
2bd12
3be12

Nested aggregations in j. Task: group by column cat and get the row sum of A and B, and of C and D:

df = dt.Frame(""" idx A B C D cat J 1 2 3 1 x K 4 5 6 2 x L 7 8 9 3 y M 1 2 3 4 y N 4 5 6 5 z O 7 8 9 6 z""") df[:, {"AB" : sum(rowsum(f['A':'B'])), "CD" : sum(rowsum(f['C':'D']))}, by('cat') ]
catABCD
str32int64int64
0x1212
1y1819
2z2426

Computation between aggregated columns. Task: get the difference between the largest and smallest value within each group:

df = dt.Frame("""GROUP VALUE 1 5 2 2 1 10 2 20 1 7""") df[:, max(f.VALUE) - min(f.VALUE), by('GROUP')]
GROUPC0
int32int32
015
1218

Null values are not excluded from the grouping column:

df = dt.Frame(""" a b c 1 2.0 3 1 NaN 4 2 1.0 3 1 2.0 2""") df[:, sum(f[:]), by('b')]
bac
float64int64int64
0NA14
1123
2225

If you wish to ignore null values, first filter them out:

df[f.b != None, :][:, sum(f[:]), by('b')]
bac
float64int64int64
0123
1225

Filtration

This occurs in the i section of the groupby, where only a subset of the data per group is needed; selection is limited to integers or slicing.

Note

  • i is applied after the grouping, not before.

  • f-expressions in the i section are not yet implemented for groupby.

Select the first row per group:

df = dt.Frame("""A B 1 10 1 20 2 30 2 40 3 10""") # passing 0 as index gets the first row after the grouping # note that python's index starts from 0, not 1 df[0, :, by('A')]
AB
int32int32
0110
1230
2310

Select the last row per group:

df[-1, :, by('A')]
AB
int32int32
0120
1240
2310

Select the nth row per group. Task : select the second row per group:

df[1, :, by('A')]
AB
int32int32
0120
1240

Note

Filtering this way can be used to drop duplicates; you can decide to keep the first or the last occurrence of each duplicate.
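As a sketch, full-row duplicates could be dropped by grouping on every column and keeping the first row per group (this particular df has no fully duplicated rows, so all rows are kept):

# group on both columns; full-row duplicates collapse into one row
df[0, :, by(f.A, f.B)]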

Select the latest entry per group:

df = dt.Frame(""" id product date 220 6647 2014-09-01 220 6647 2014-09-03 220 6647 2014-10-16 826 3380 2014-11-11 826 3380 2014-12-09 826 3380 2015-05-19 901 4555 2014-09-01 901 4555 2014-10-05 901 4555 2014-11-01""") df[-1, :, by('id'), sort('date')]
idproductdate
int32int32str32
022066472014-10-16
182633802015-05-19
290145552014-11-01

Note

If sort and by modifiers are present, the sorting occurs after the grouping, and occurs within each group.

Replicate SQL’s HAVING clause. Task: Filter for groups where the length/count is greater than 1:

df = dt.Frame([[1, 1, 5], [2, 3, 6]], names=['A', 'B'])
df
AB
int32int32
012
113
256
# Get the count of each group,
# and assign to a new column, using the update method.
# Note that the update operation is in-place;
# there is no need to assign back to the dataframe.
df[:, update(filter_col = count()), by('A')]

# The new column will be added to the end.
# We use an f-expression to return rows
# in each group where the count is greater than 1.
df[f.filter_col > 1, f[:-1]]
AB
int32int32
012
113

Keep only rows per group where diff is the minimum:

df = dt.Frame(""" item diff otherstuff 1 2 1 1 1 2 1 3 7 2 -1 0 2 1 3 2 4 9 2 -6 2 3 0 0 3 2 9""") df[:, #get boolean for rows where diff column is minimum for each group update(filter_col = f.diff == min(f.diff)), by('item')] df[f.filter_col == 1, :-1]
itemdiffotherstuff
int32int32int32
0112
12-62
2300

Keep only entries where make has both 0 and 1 in sales:

df = dt.Frame(""" make country other_columns sale honda tokyo data 1 honda hirosima data 0 toyota tokyo data 1 toyota hirosima data 0 suzuki tokyo data 0 suzuki hirosima data 0 ferrari tokyo data 1 ferrari hirosima data 0 nissan tokyo data 1 nissan hirosima data 0""") df[:, update(filter_col = sum(f.sale)), by('make')] df[f.filter_col == 1, :-1]
makecountryother_columnssale
str32str32str32bool8
0hondatokyodata1
1hondahirosimadata0
2toyotatokyodata1
3toyotahirosimadata0
4ferraritokyodata1
5ferrarihirosimadata0
6nissantokyodata1
7nissanhirosimadata0

Transformation

This is when a function is applied to a column after a groupby and the resulting column is appended back to the dataframe. The number of rows of the dataframe is unchanged. The update() method makes this possible and easy.

Get the minimum and maximum of column c per group, and append to dataframe:

df = dt.Frame(""" c y 9 0 8 0 3 1 6 2 1 3 2 3 5 3 4 4 0 4 7 4""") # Assign the new columns via the update method df[:, update(min_col = min(f.c), max_col = max(f.c)), by('y')] df
cymin_colmax_col
int32int32int32int32
09089
18089
23133
36266
41315
52315
65315
74407
80407
97407

Fill missing values by group mean:

df = dt.Frame({'value': [1, None, None, 2, 3, 1, 3, None, 3],
               'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C']})
df
valuename
float64str32
01A
1NAA
2NAB
32B
43B
51B
63C
7NAC
83C
# This uses a combination of the update and ifelse functions:
df[:, update(value = ifelse(f.value == None, mean(f.value), f.value)),
   by('name')]
df
valuename
float64str32
01A
11A
22B
32B
43B
51B
63C
73C
83C

Transform and Aggregate on multiple columns

Task: get the sum of the grouped sums of columns a and b, grouped by c and d, and append the result to the dataframe:

df = dt.Frame({'a': [1, 2, 3, 4, 5, 6],
               'b': [1, 2, 3, 4, 5, 6],
               'c': ['q', 'q', 'q', 'q', 'w', 'w'],
               'd': ['z', 'z', 'z', 'o', 'o', 'o']})
df
abcd
int32int32str32str32
011qz
122qz
233qz
344qo
455wo
566wo
df[:, update(e = sum(f.a) + sum(f.b)), by('c', 'd')]
df
abcde
int32int32str32str32int64
011qz12
122qz12
233qz12
344qo8
455wo22
566wo22

Replicate R’s groupby mutate

Task: get a ratio by dividing column c by the grouped sum of the product of columns c and d, grouped by a and b:

df = dt.Frame(dict(a = (1, 1, 0, 1, 0),
                   b = (1, 0, 0, 1, 0),
                   c = (10, 5, 1, 5, 10),
                   d = (3, 1, 2, 1, 2)))
df
abcd
int8int8int32int32
011103
11051
20012
31151
400102
df[:, update(ratio = f.c / sum(f.c * f.d)), by('a', 'b')]
df
abcdratio
int8int8int32int32float64
0111030.285714
110511
200120.0454545
311510.142857
4001020.454545

Groupby on boolean expressions

Conditional sum with groupby

Task: sum the data1 column, grouped by key1, over the rows where key2 == "one":

df = dt.Frame("""data1 data2 key1 key2 0.361601 0.375297 a one 0.069889 0.809772 a two 1.468194 0.272929 b one -1.138458 0.865060 b two -0.268210 1.250340 a one""")
df[:, sum(f.data1), by(f.key2 == "one", f.key1)][f.C0 == 1, 1:]
key1data1
str32float64
0a0.093391
1b1.46819

Conditional sums based on various criteria

df = dt.Frame(""" A_id B C a1 "up" 100 a2 "down" 102 a3 "up" 100 a3 "up" 250 a4 "left" 100 a5 "right" 102""") df[:, {"sum_up": sum(f.B == "up"), "sum_down" : sum(f.B == "down"), "over_200_up" : sum((f.B == "up") & (f.C > 200)) }, by('A_id')]
A_idsum_upsum_downover_200_up
str32int64int64int64
0a1100
1a2010
2a3201
3a4000
4a5000

More Examples

Aggregation on values in a column

Task: group by Day and find minimum Data_Value for elements of type TMIN and maximum Data_Value for elements of type TMAX:

df = dt.Frame(""" Day Element Data_Value 01-01 TMAX 112 01-01 TMAX 101 01-01 TMIN 60 01-01 TMIN 0 01-01 TMIN 25 01-01 TMAX 113 01-01 TMAX 115 01-01 TMAX 105 01-01 TMAX 111 01-01 TMIN 44 01-01 TMIN 83 01-02 TMAX 70 01-02 TMAX 79 01-02 TMIN 0 01-02 TMIN 60 01-02 TMAX 73 01-02 TMIN 31 01-02 TMIN 26 01-02 TMAX 71 01-02 TMIN 26""") df[:, {"TMAX": max(ifelse(f.Element=="TMAX", f.Data_Value, None)), "TMIN": min(ifelse(f.Element=="TMIN", f.Data_Value, None))}, by(f.Day)]
DayTMAXTMIN
str32int32int32
001-011150
101-02790

Group-by and conditional sum and add back to data frame

Task: sum the Count value for each ID when Num is 17 or 12 and Letter is 'D', and then add the result back to the original data frame as column 'Total':

df = dt.Frame(""" ID Num Letter Count 1 17 D 1 1 12 D 2 1 13 D 3 2 17 D 4 2 12 A 5 2 16 D 1 3 16 D 1""") expression = ((f.Num==17) | (f.Num==12)) & (f.Letter == "D") df[:, update(Total = sum(expression * f.Count)), by(f.ID)] df
IDNumLetterCountTotal
int32int32str32int32int64
0117D13
1112D23
2113D33
3217D44
4212A54
5216D14
6316D10

Indexing with multiple min and max in one aggregate

Task: per group, find col1 where col2 is at its maximum, col2 where col3 is at its minimum, and col1 where col3 is at its maximum:

df = dt.Frame({ "id" : [1, 1, 1, 2, 2, 2, 2, 3, 3, 3], "col1" : [1, 3, 5, 2, 5, 3, 6, 3, 67, 7], "col2" : [4, 6, 8, 3, 65, 3, 5, 4, 4, 7], "col3" : [34, 64, 53, 5, 6, 2, 4, 6, 4, 67], }) df
idcol1col2col3
int32int32int32int32
011434
113664
215853
32235
425656
52332
62654
73346
836744
937767
df[:, {'col1': max(ifelse(f.col2 == max(f.col2), f.col1, None)),
       'col2': max(ifelse(f.col3 == min(f.col3), f.col2, None)),
       'col3': max(ifelse(f.col3 == max(f.col3), f.col1, None))},
   by('id')]
idcol1col2col3
int32int32int32int32
01543
12535
23747

Filter rows based on aggregate value

Task: for every word find the tag that has the most count:

df = dt.Frame("""word tag count a S 30 the S 20 a T 60 an T 5 the T 10""") # The solution builds on the knowledge that sorting # while grouping sorts within each group. df[0, :, by('word'), sort(-f.count)]
wordtagcount
str32str32int32
0aT60
1anT5
2theS20

Get the rows where the value column is minimum, and rename columns:

df = dt.Frame({"category": ["A"]*3 + ["B"]*3, "date": ["9/6/2016", "10/6/2016", "11/6/2016", "9/7/2016", "10/7/2016", "11/7/2016"], "value": [7,8,9,10,1,2]}) df
categorydatevalue
str32str32int32
0A9/6/20167
1A10/6/20168
2A11/6/20169
3B9/7/201610
4B10/7/20161
5B11/7/20162
df[0, {"value_date": f.date, "value_min": f.value}, by("category"), sort('value')]
categoryvalue_datevalue_min
str32str32int32
0A9/6/20167
1B10/7/20161

Get the rows where the value column is maximum, and rename columns:

df[0, {"value_date": f.date, "value_max": f.value}, by("category"), sort(-f.value)]
categoryvalue_datevalue_max
str32str32int32
0A11/6/20169
1B9/7/201610

Get the average of the last three instances per group:

import random
random.seed(3)

df = dt.Frame({"Student": ["Bob", "Bill", "Bob", "Bob", "Bill",
                           "Joe", "Joe", "Bill", "Bob", "Joe"],
               "Score": random.sample(range(10, 30), 10)})
df
StudentScore
str32int32
0Bob17
1Bill28
2Bob27
3Bob14
4Bill21
5Joe24
6Joe19
7Bill29
8Bob20
9Joe23
df[-3:, mean(f[:]), f.Student]
StudentScore
str32float64
0Bill26
1Bob20.3333
2Joe22

Group by on a condition

Get the sum of Amount for Number in the range 1 to 4, and for Number 5 and above:

df = dt.Frame("""Number, Amount 1, 5 2, 10 3, 11 4, 3 5, 5 6, 8 7, 9 8, 6""") df[:, sum(f.Amount), by(ifelse(f.Number>=5, "B", "A"))]
C0Amount
str32int64
0A29
1B28

Row Functions

The functions rowall(), rowany(), rowcount(), rowfirst(), rowlast(), rowmax(), rowmean(), rowmin(), rowsd() and rowsum() aggregate across rows instead of columns, and each returns a single column. They are the equivalents of pandas aggregation functions with axis=1.

These functions make it easy to compute rowwise aggregations. For instance, to get the sum of columns A, B, C and D you could write f.A + f.B + f.C + f.D; rowsum() makes it easier: dt.rowsum(f['A':'D']).
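For instance, both lines below compute the same per-row total (a minimal sketch on a small, made-up frame):

from datatable import dt, f
DT = dt.Frame(A=[1, 2], B=[3, 4], C=[5, 6], D=[7, 8])
DT[:, f.A + f.B + f.C + f.D]   # manual per-row sum
DT[:, dt.rowsum(f['A':'D'])]   # same result, via rowsum()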

rowall, rowany

These work only on boolean expressions – rowall() checks if all the values in the row are True, while rowany() checks if any value in the row is True. They are similar to pandas' all() or any() with axis=1. A single boolean column is returned:

from datatable import dt, f, by

df = dt.Frame({'A': [True, True], 'B': [True, False]})
df
AB
bool8bool8
011
110
# rowall:
df[:, dt.rowall(f[:])]
C0
bool8
01
10
# rowany:
df[:, dt.rowany(f[:])]
C0
bool8
01
11

The single boolean column that is returned can be very handy when filtering in the i section.

Filter for rows where at least one cell is greater than 0:

df = dt.Frame({'a': [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
               'b': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
               'c': [0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0],
               'd': [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
               'e': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
               'f': [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]})
df
abcdef
int8int8int32int8int8int8
0000000
1000001
2000000
3000000
4000000
5005000
6100000
7000000
8000100
9100000
10000000
df[dt.rowany(f[:] > 0), :]
abcdef
int8int8int32int8int8int8
0000001
1005000
2100000
3000100
4100000

Filter for rows where all the cells are 0:

df[dt.rowall(f[:] == 0), :]
abcdef
int8int8int32int8int8int8
0000000
1000000
2000000
3000000
4000000
5000000

Filter for rows where all the columns’ values are the same:

df = dt.Frame("""Name A1 A2 A3 A4 deff 0 0 0 0 def1 0 1 0 0 def2 0 0 0 0 def3 1 0 0 0 def4 0 0 0 0""") # compare the first integer column with the rest, # use rowall to find rows where all is True # and filter with the resulting boolean df[dt.rowall(f[1]==f[1:]), :]
NameA1A2A3A4
str32bool8bool8bool8bool8
0deff0000
1def20000
2def40000

Filter for rows where the values are increasing:

df = dt.Frame({"A": [1, 2, 6, 4], "B": [2, 4, 5, 6], "C": [3, 5, 4, 7], "D": [4, -3, 3, 8], "E": [5, 1, 2, 9]}) df
ABCDE
int32int32int32int32int32
012345
1245-31
265432
346789
df[dt.rowall(f[1:] >= f[:-1]), :]
ABCDE
int32int32int32int32int32
012345
146789

rowfirst, rowlast

These look for the first and last non-missing value in a row respectively:

df = dt.Frame({'A': [1, None, None, None],
               'B': [None, 3, 4, None],
               'C': [2, None, 5, None]})
df
ABC
int8int32int32
01NA2
1NA3NA
2NA45
3NANANA
# rowfirst:
df[:, dt.rowfirst(f[:])]
C0
int32
01
13
24
3NA
# rowlast:
df[:, dt.rowlast(f[:])]
C0
int32
02
13
25
3NA

Get rows where the last value in the row is greater than the first value in the row:

df = dt.Frame({'a': [50, 40, 30, 20, 10],
               'b': [60, 10, 40, 0, 5],
               'c': [40, 30, 20, 30, 40]})
df
abc
int32int32int32
0506040
1401030
2304020
320030
410540
df[dt.rowlast(f[:]) > dt.rowfirst(f[:]), :]
abc
int32int32int32
020030
110540

rowmax, rowmin

These get the maximum and minimum values per row, respectively:

df = dt.Frame({"C": [2, 5, 30, 20, 10], "D": [10, 8, 20, 20, 1]}) df
CD
int32int32
0210
158
23020
32020
4101
# rowmax
df[:, dt.rowmax(f[:])]
C0
int32
010
18
230
320
410
# rowmin
df[:, dt.rowmin(f[:])]
C0
int32
02
15
220
320
41

Find the difference between the maximum and minimum of each row:

df = dt.Frame("""Value1 Value2 Value3 Value4 5 4 3 2 4 3 2 1 3 3 5 1""") df[:, dt.update(max_min = dt.rowmax(f[:]) - dt.rowmin(f[:]))] df
Value1Value2Value3Value4max_min
int32int32int32int32int32
054323
143213
233514

rowsum, rowmean, rowcount, rowsd

rowsum() and rowmean() get the sum and mean of rows respectively; rowcount() counts the number of non-missing values in a row, while rowsd() computes a row's standard deviation.

Get the count, sum, mean and standard deviation for each row:

df = dt.Frame("""ORD A B C D 198 23 45 NaN 12 138 25 NaN NaN 62 625 52 36 49 35 457 NaN NaN NaN 82 626 52 32 39 45""") df[:, dt.update(rowcount = dt.rowcount(f[:]), rowsum = dt.rowsum(f[:]), rowmean = dt.rowmean(f[:]), rowsd = dt.rowsd(f[:]) )] df
ORDABCDrowcountrowsumrowmeanrowsd
int32float64float64float64int32int32float64float64float64
01982345NA12427869.586.7583
113825NANA6232257557.6108
2625523649355797159.4260.389
3457NANANA822539269.5265.165
4626523239455794158.8261.277

Find rows where the number of nulls is greater than 3:

df = dt.Frame({'city': ["city1", "city2", "city3", "city4"], 'state': ["state1", "state2", "state3", "state4"], '2005': [144, 205, 123, None], '2006': [173, 211, 123, 124], '2007': [None, None, None, None], '2008': [None, 206, None, None], '2009': [None, None, 124, 123], '2010': [128, 273, None, None]}) df
citystate200520062007200820092010
str32str32int32int32voidint32int32int32
0city1state1144173NANANA128
1city2state2205211NA206NA273
2city3state3123123NANA124NA
3city4state4NA124NANA123NA
# get a boolean frame of the null cells, then sum across the rows,
# and finally filter where the sum is greater than 3
df[dt.rowsum(dt.isna(f[:])) > 3, :]
citystate200520062007200820092010
str32str32int32int32voidint32int32int32
0city4state4NA124NANA123NA

Rowwise sum of the float columns:

df = dt.Frame("""ID W_1 W_2 W_3 1 0.1 0.2 0.3 1 0.2 0.4 0.5 2 0.3 0.3 0.2 2 0.1 0.3 0.4 2 0.2 0.0 0.5 1 0.5 0.3 0.2 1 0.4 0.2 0.1""") df[:, dt.update(sum_floats = dt.rowsum(f[float]))] df
IDW_1W_2W_3sum_floats
int32float64float64float64float64
010.10.20.30.6
110.20.40.51.1
220.30.30.20.8
320.10.30.40.8
420.200.50.7
510.50.30.21
610.40.20.10.7

More Examples

Divide columns A, B, C, D by the total column, square it and sum rowwise:

df = dt.Frame({'A': [2, 3], 'B': [1, 2], 'C': [0, 1],
               'D': [1, 0], 'total': [4, 6]})
df
ABCDtotal
int32int32int8int8int32
021014
132106
df[:, update(result = dt.rowsum((f[:-1] / f[-1]) ** 2))]
df
ABCDtotalresult
int32int32int8int8int32float64
0210140.375
1321060.388889

Get the row sum of the COUNT columns:

df = dt.Frame("""USER OBSERVATION COUNT.1 COUNT.2 COUNT.3 A 1 0 1 1 A 2 1 1 2 A 3 3 0 0""") columns = [f[column] for column in df.names if column.startswith("COUNT")] df[:, update(total = dt.rowsum(columns))] df
USEROBSERVATIONCOUNT.1COUNT.2COUNT.3total
str32int32int32bool8int32int32
0A10112
1A21124
2A33003

Sum selected columns rowwise:

df = dt.Frame({'location': ("a", "b", "c", "d"),
               'v1': (3, 4, 3, 3),
               'v2': (4, 56, 3, 88),
               'v3': (7, 6, 2, 9),
               'v4': (7, 6, 1, 9),
               'v5': (4, 4, 7, 9),
               'v6': (2, 8, 4, 6)})
df
locationv1v2v3v4v5v6
str32int32int32int32int32int32int32
0a347742
1b4566648
2c332174
3d3889996
df[:, {"x1": dt.rowsum(f[1:4]), "x2": dt.rowsum(f[4:])}]
x1x2
int32int32
01413
16618
2812
310024

Comparison with pandas

A lot of potential datatable users are likely to have some familiarity with pandas; as such, this page provides examples of how various pandas operations can be performed in datatable. The datatable module emphasizes speed and big-data support (an area where pandas struggles); it also has an expressive and concise syntax, which makes datatable useful for small datasets as well.

Note: in pandas, there are two fundamental data structures: Series and DataFrame. In datatable, there is only one fundamental data structure — the Frame. Most of the comparisons will be between pandas DataFrame and datatable Frame.

import pandas as pd
import numpy as np
from datatable import dt, f, by, g, join, sort, update, ifelse

data = {"A": [1, 2, 3, 4, 5],
        "B": [4, 5, 6, 7, 8],
        "C": [7, 8, 9, 10, 11],
        "D": [5, 7, 2, 9, -1]}

# datatable
DT = dt.Frame(data)
# pandas
df = pd.DataFrame(data)

Row and Column Selection

pandas
datatable

Select a single row

df.loc[2]
DT[2, :]

Select several rows by their indices

df.iloc[[2, 3, 4]]
DT[[2, 3, 4], :]

Select a slice of rows by position

df.iloc[2:5] # or df.iloc[range(2, 5)]
DT[2:5, :] # or DT[range(2, 5), :]

Select every second row

df.iloc[::2]
DT[::2, :]

Select rows using a boolean mask

df.iloc[[True, True, False, False, True]]
DT[[True, True, False, False, True], :]

Select rows on a condition

df.loc[df['A']>3]
DT[f.A>3, :]

Select rows on multiple conditions, using OR

df.loc[(df['A'] > 3) | (df['B']<5)]
DT[(f.A>3) | (f.B<5), :]

Select rows on multiple conditions, using AND

df.loc[(df['A'] > 3) & (df['B']<8)]
DT[(f.A>3) & (f.B<8), :]

Select a single column by column name

df['A'] df.loc[:, 'A']
DT['A'] DT[:, 'A']

Select a single column by position

df.iloc[:, 1]
DT[1] DT[:, 1]

Select multiple columns by column names

df.loc[:, ["A", "B"]]
DT[:, ["A", "B"]]

Select multiple columns by position

df.iloc[:, [0, 1]]
DT[:, [0, 1]]

Select multiple columns by slicing

df.loc[:, "A":"B"]
DT[:, "A":"B"]

Select multiple columns by position

df.iloc[:, 1:3]
DT[:, 1:3]

Select columns by Boolean mask

df.loc[:, [True,False,False,True]]
DT[:, [True,False,False,True]]

Select multiple rows and columns

df.loc[2:5, "A":"B"]
DT[2:5, "A":"B"]

Select multiple rows and columns by position

df.iloc[2:5, :2]
DT[2:5, :2]

Select a single value (returns a scalar)

df.at[2, 'A'] df.loc[2, 'A']
DT[2, "A"]

Select a single value by position

df.iat[2, 0] df.iloc[2, 0]
DT[2, 0]

Select a single value, return as Series

df.loc[2, ["A"]]
DT[2, ["A"]]

Select a single value (return as Series/Frame) by position

df.iloc[2, [0]]
DT[2, [0]]

In pandas every frame has a row index, and if a filtration is executed, the row numbers are kept:

df.loc[df['A'] > 3]
A B C D 3 4 7 10 9 4 5 8 11 -1

Datatable has no notion of a row index; the row numbers displayed are just for convenience:

DT[f.A > 3, :]
ABCD
int32int32int32int32
047109
15811-1

In pandas, the index can be numbers, or characters, or intervals, or even MultiIndexes; you can subset rows on these labels:

df1 = df.set_index(pd.Index(['a','b','c','d','e']))
A B C D a 1 4 7 5 b 2 5 8 7 c 3 6 9 2 d 4 7 10 9 e 5 8 11 -1
df1.loc["a":"c"]
A B C D a 1 4 7 5 b 2 5 8 7 c 3 6 9 2

Datatable has the key property, which is meant as an equivalent of pandas indices, but its purpose at the moment is for joins, not for subsetting data:

data = {"A": [1, 2, 3, 4, 5], "B": [4, 5, 6, 7, 8], "C": [7, 8, 9, 10, 11], "D": [5, 7, 2, 9, -1], "E": ['a','b','c','d','e']} DT1 = dt.Frame(data) DT1.key = 'E' DT1
EABCD
str32int32int32int32int32
a1475
b2587
c3692
d47109
e5811-1
DT1["a":"c", :] # this will fail
TypeError: A string slice cannot be used as a row selector

Pandas’ .loc notation works on labels, while .iloc works on actual positions. This is noticeable during row selection. Datatable, however, works only on positions:

df1 = df.set_index('C')
A B D C 7 1 4 5 8 2 5 7 9 3 6 2 10 4 7 9 11 5 8 -1

Selecting with .loc for the row with number 7 returns no error:

df1.loc[7]
A 1 B 4 D 5 Name: 7, dtype: int64

However, selecting with iloc for the row with number 7 returns an error, because positionally, there is no row 7:

df.iloc[7]
IndexError: single positional indexer is out-of-bounds

Datatable has the dt.Frame.key property, which is used for joins, not row subsetting, and as such selection similar to loc with the row label is not possible:

DT.key = 'C'
DT
CABD
int32int32int32int32
7145
8257
9362
10479
1158-1
DT[7, :] # this will fail
ValueError: Row 7 is invalid for a frame with 5 rows

Add new/update existing columns

pandas
datatable

Add a new column with a scalar value

df['new_col'] = 2 df = df.assign(new_col = 2)
DT['new_col'] = 2 DT[:, update(new_col=2)]

Add a new column with a list of values

df['new_col'] = range(len(df)) # or df = df.assign(new_col = range(len(df)))
DT['new_col'] = range(DT.nrows) # or DT[:, update(new_col = range(DT.nrows))]

Update a single value

df.at[2, 'new_col'] = 200
DT[2, 'new_col'] = 200

Update an entire column

df.loc[:, "A"] = 5 # or df["A"] = 5 df = df.assign(A = 5)
DT["A"] = 5 DT[:, update(A = 5)]

Update multiple columns

df.loc[:, "A":"C"] = np.arange(15).reshape(-1,3)
DT[:, "A":"C"] = np.arange(15).reshape(-1,3)

Note

In datatable, the update() method is in-place; reassignment to the Frame DT is not required.
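A quick sketch of the difference (the column name E here is just for illustration):

DT[:, update(new_col = 2)]          # in-place; DT itself is modified
DT = DT[:, f[:].extend({"E": 0})]   # extend() returns a new frame, so assign it back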

Rename columns

pandas
datatable

Rename a column

df = df.rename(columns={"A": "col_A"})
DT.names = {"A": "col_A"}

Rename multiple columns

df = df.rename(columns={"A": "col_A", "B": "col_B"})
DT.names = {"A": "col_A", "B": "col_B"}

In datatable, you can select and rename columns at the same time, by passing a dictionary of f-expressions into the j section:

# datatable DT[:, {"A": f.A, "box": f.B, "C": f.C, "D": f.D * 2}]
AboxCD
int32int32int32int32
014710
125814
23694
3471018
45811-2

Delete Columns

pandas
datatable

Delete a column

del df['B']
del DT['B']

Same as above

df = df.drop('B', axis=1)
DT = DT[:, f[:].remove(f.B)]

Remove multiple columns

df = df.drop(['B', 'C'], axis=1)
del DT[: , ['B', 'C']] # or DT = DT[:, f[:].remove([f.B, f.C])]

Sorting

pandas
datatable

Sort by a column – default ascending

df.sort_values('A')
DT.sort('A') # or DT[:, : , sort('A')]

Sort by a column – descending

df.sort_values('A',ascending=False)
DT.sort(-f.A) # or DT[:, :, sort(-f.A)] # or DT[:, :, sort('A', reverse=True)]

Sort by multiple columns – default ascending

df.sort_values(['A', 'C'])
DT.sort('A', 'C') # or DT[:, :, sort('A', 'C')]

Sort by multiple columns – both descending

df.sort_values(['A','C'],ascending=[False,False])
DT.sort(-f.A, -f.C) # or DT[:, :, sort(-f.A, -f.C)] # or DT[:, :, sort('A', 'C', reverse=[True, True])]

Sort by multiple columns – different sort directions

df.sort_values(['A', 'C'], ascending=[True, False])
DT.sort(f.A, -f.C) # or DT[:, :, sort(f.A, -f.C)] # or DT[:, :, sort('A', 'C', reverse=[False, True])]

Note

By default, pandas puts NAs last in the sorted data, while datatable puts them first.
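A minimal sketch of the difference, using a column that contains a missing value:

# datatable: the NA row comes first
dt.Frame(A=[3, None, 1])[:, :, sort('A')]
# pandas: NaN is placed last by default
pd.Series([3, None, 1]).sort_values()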

Note

In pandas, there is an option to sort with a Callable; this option is not supported in datatable.

Note

In pandas, you can sort on the rows or columns; in datatable sorting is column-wise only.

Grouping and Aggregation

data = {"a": [1, 1, 2, 1, 2], "b": [2, 20, 30, 2, 4], "c": [3, 30, 50, 33, 50]} # pandas df = pd.DataFrame(data) # datatable DT = dt.Frame(data) DT
abc
int32int32int32
0123
112030
223050
31233
42450
pandas
datatable

Group by column a and sum the other columns

df.groupby("a").agg("sum")
DT[:, dt.sum(f[:]), by("a")]

Group by a and b and sum c

df.groupby(["a", "b"]).agg("sum")
DT[:, dt.sum(f.c), by("a", "b")]

Get size per group

df.groupby("a").size()
DT[:, dt.count(), by("a")]

Grouping with multiple aggregation functions

df.groupby("a").agg({"b": "sum", "c": "mean"})
DT[:, {"b": dt.sum(f.b), "c": dt.mean(f.c)}, by("a")]

Get the first row per group

df.groupby("a").first()
DT[0, :, by("a")]

Get the last row per group

df.groupby('a').last()
DT[-1, :, by("a")]

Get the first two rows per group

df.groupby("a").head(2)
DT[:2, :, by("a")]

Get the last two rows per group

df.groupby("a").tail(2)
DT[-2:, :, by("a")]

Transformations within groups in pandas are done using the transform() method:

# pandas
grouping = df.groupby("a")["b"].transform("min")
df.assign(min_b=grouping)
a b c min_b 0 1 2 3 2 1 1 20 30 2 2 2 30 50 4 3 1 2 33 2 4 2 4 50 4

In datatable, transformations occur within the j section; in the presence of by(), the computations within j are per group:

# datatable
DT[:, f[:].extend({"min_b": dt.min(f.b)}), by("a")]
abcmin_b
int32int32int32int32
01232
1120302
212332
3230504
424504

Note that the result above is sorted by the grouping column. If you want the data to maintain the same shape as the source data, then update() is a better option (and usually faster):

# datatable
DT[:, update(min_b = dt.min(f.b)), by("a")]
DT
abcmin_b
int32int32int32int32
01232
1120302
2230504
312332
424504

In pandas, some computations might require creating the column first before aggregation within a groupby. Take the example below, where we need to calculate the revenue per group:

data = {'shop': ['A', 'B', 'A'],
        'item_price': [123, 921, 28],
        'item_sold': [1, 2, 4]}

df1 = pd.DataFrame(data)   # pandas
DT1 = dt.Frame(data)       # datatable
DT1
shopitem_priceitem_sold
str32int32int32
0A1231
1B9212
2A284

To get the total revenue, we first need to create a revenue column, then sum it in the groupby:

# pandas
df1['revenue'] = df1['item_price'] * df1['item_sold']
df1.groupby("shop")['revenue'].sum().reset_index()
shop revenue 0 A 235 1 B 1842

In datatable, there is no need to create a temporary column; you can easily nest your computations in the j section; the computations will be executed per group:

# datatable DT1[:, {"revenue": dt.sum(f.item_price * f.item_sold)}, by("shop")]
shoprevenue
str32int64
0A235
1B1842

You can learn more about the by() function at the Grouping with by() documentation.

Concatenate

In pandas you can combine multiple dataframes using the concat() function; the concatenation is based on the indices:

# pandas df1 = pd.DataFrame({"A": ["a", "a", "a"], "B": range(3)}) df2 = pd.DataFrame({"A": ["b", "b", "b"], "B": range(4, 7)})

By default, pandas concatenates the rows, with one dataframe on top of the other:

pd.concat([df1, df2], axis = 0)
A B 0 a 0 1 a 1 2 a 2 0 b 4 1 b 5 2 b 6

The same functionality can be replicated in datatable using the dt.Frame.rbind() method:

# datatable
DT1 = dt.Frame(df1)
DT2 = dt.Frame(df2)
dt.rbind(DT1, DT2)
AB
str32int64
0a0
1a1
2a2
3b4
4b5
5b6

Notice how in pandas the indices are preserved (you can get rid of them with the ignore_index argument), whereas datatable does not use indices at all.
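For instance, a fresh 0-based index can be requested during the concatenation:

# pandas: reset the index while concatenating
pd.concat([df1, df2], ignore_index=True)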

To combine data across the columns, in pandas you set the axis argument to 1 (or "columns"):

# pandas df1 = pd.DataFrame({"A": ["a", "a", "a"], "B": range(3)}) df2 = pd.DataFrame({"C": ["b", "b", "b"], "D": range(4, 7)}) df3 = pd.DataFrame({"E": ["c", "c", "c"], "F": range(7, 10)}) pd.concat([df1, df2, df3], axis = 1)
A B C D E F 0 a 0 b 4 c 7 1 a 1 b 5 c 8 2 a 2 b 6 c 9

In datatable, you combine frames along the columns using the dt.Frame.cbind() method:

DT1 = dt.Frame(df1)
DT2 = dt.Frame(df2)
DT3 = dt.Frame(df3)

dt.cbind([DT1, DT2, DT3])
ABCDEF
str32int64str32int64str32int64
0a0b4c7
1a1b5c8
2a2b6c9

In pandas, if you concatenate dataframes along the rows, and the columns do not match, a dataframe of all the columns is returned, with null values for the missing rows:

# pandas
pd.concat([df1, df2, df3], axis=0)
A B C D E F 0 a 0.0 NaN NaN NaN NaN 1 a 1.0 NaN NaN NaN NaN 2 a 2.0 NaN NaN NaN NaN 0 NaN NaN b 4.0 NaN NaN 1 NaN NaN b 5.0 NaN NaN 2 NaN NaN b 6.0 NaN NaN 0 NaN NaN NaN NaN c 7.0 1 NaN NaN NaN NaN c 8.0 2 NaN NaN NaN NaN c 9.0

In datatable, if you concatenate along the rows and the columns in the frames do not match, you get an error message; you can, however, force the row combination by passing force=True:

# datatable
dt.rbind([DT1, DT2, DT3], force=True)
ABCDEF
str32int64str32int64str32int64
0a0NANANANA
1a1NANANANA
2a2NANANANA
3NANAb4NANA
4NANAb5NANA
5NANAb6NANA
6NANANANAc7
7NANANANAc8
8NANANANAc9

Note

rbind() and cbind() also exist as Frame methods, which operate in-place.
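A minimal sketch of the in-place method form (the frames here are made up for illustration):

DT_a = dt.Frame(A=["a"], B=[0])
DT_b = dt.Frame(A=["b"], B=[1])
DT_a.rbind(DT_b)   # DT_a itself now has two rows; no reassignment needed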

Join/merge

pandas has a variety of options for joining dataframes, using the join() or merge() methods. In datatable, only the left join is possible, and there are certain limitations: you have to set a key on the dataframe to be joined, and for that, the keyed columns must be unique. The main function in datatable for joining dataframes based on column values is join(). As such, our comparison will be limited to left joins.

In pandas, you can join dataframes easily with the merge method:

df1 = pd.DataFrame({"x" : ["b"]*3 + ["a"]*3 + ["c"]*3, "y" : [1, 3, 6] * 3, "v" : range(1, 10)}) df2 = pd.DataFrame({"x": ('c','b'), "v": (8,7), "foo": (4,2)}) df1.merge(df2, on="x", how="left")
x y v_x v_y foo 0 b 1 1 7.0 2.0 1 b 3 2 7.0 2.0 2 b 6 3 7.0 2.0 3 a 1 4 NaN NaN 4 a 3 5 NaN NaN 5 a 6 6 NaN NaN 6 c 1 7 8.0 4.0 7 c 3 8 8.0 4.0 8 c 6 9 8.0 4.0

In datatable, there are limitations currently. First, the joining dataframe must be keyed. Second, the values in the column(s) used as the joining key(s) must be unique, otherwise the keying operation will fail. Third, the join columns must have the same name.

DT1 = dt.Frame(df1)
DT2 = dt.Frame(df2)
# set the key on DT2
DT2.key = 'x'

DT1[:, :, join(DT2)]
xyvv.0foo
str32int64int64int64int64
0b1172
1b3272
2b6372
3a14NANA
4a35NANA
5a66NANA
6c1784
7c3884
8c6984

More details about joins in datatable can be found in the join() API documentation; also have a look at the Tutorial on the join operator.

More examples

This section shows how some solutions in pandas can be translated to datatable; the examples used here, as well as the pandas solutions, are from the pandas cookbook.

Feel free to submit a pull request on github for examples you would like to share with the community.

if-then-else

# Initial data frame:
df = pd.DataFrame({"AAA": [4, 5, 6, 7],
                   "BBB": [10, 20, 30, 40],
                   "CCC": [100, 50, -30, -50]})
df
AAA BBB CCC 0 4 10 100 1 5 20 50 2 6 30 -30 3 7 40 -50

In pandas this can be achieved using numpy’s where():

df['logic'] = np.where(df['AAA'] > 5, 'high', 'low')
AAA BBB CCC logic 0 4 10 100 low 1 5 20 50 low 2 6 30 -30 high 3 7 40 -50 high

In datatable, this can be solved using the ifelse() function:

# datatable
DT = dt.Frame(df)
DT["logic"] = ifelse(f.AAA > 5, "high", "low")
DT
AAABBBCCClogic
int64int64int64str32
0410100low
152050low
2630-30high
3740-50high

Select rows with data closest to certain value

# pandas df = pd.DataFrame({"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}) aValue = 43.0

Solution in pandas, using argsort:

df.loc[(df.CCC - aValue).abs().argsort()]
AAA BBB CCC 1 5 20 50 0 4 10 100 2 6 30 -30 3 7 40 -50

In datatable, the sort() function can be used to rearrange rows in the desired order:

DT = dt.Frame(df)
DT[:, :, sort(dt.math.abs(f.CCC - aValue))]
AAABBBCCC
int64int64int64
052050
1410100
2630-30
3740-50

Efficiently and dynamically creating new columns using applymap

# pandas df = pd.DataFrame({"AAA": [1, 2, 1, 3], "BBB": [1, 1, 2, 2], "CCC": [2, 1, 3, 1]})
AAA BBB CCC 0 1 1 2 1 2 1 1 2 1 2 3 3 3 2 1
source_cols = df.columns
new_cols = [str(x) + "_cat" for x in source_cols]
categories = {1: 'Alpha', 2: 'Beta', 3: 'Charlie'}

df[new_cols] = df[source_cols].applymap(categories.get)
df
AAA BBB CCC AAA_cat BBB_cat CCC_cat 0 1 1 2 Alpha Alpha Beta 1 2 1 1 Beta Alpha Alpha 2 1 2 3 Alpha Beta Charlie 3 3 2 1 Charlie Beta Alpha

We can replicate the solution above in datatable:

# datatable
import itertools as it

DT = dt.Frame(df)
mixer = it.product(DT.names, categories)
conditions = [(name, f[name] == value, categories[value])
              for name, value in mixer]
for name, cond, value in conditions:
    DT[cond, f"{name}_cat"] = value
AAABBBCCCAAA_catBBB_catCCC_cat
int64int64int64str32str32str32
0112AlphaAlphaBeta
1211BetaAlphaAlpha
2123AlphaBetaCharlie
3321CharlieBetaAlpha

Keep other columns when using min() with groupby

# pandas
df = pd.DataFrame({'AAA': [1, 1, 1, 2, 2, 2, 3, 3],
                   'BBB': [2, 1, 3, 4, 5, 1, 2, 3]})
df
AAA BBB 0 1 2 1 1 1 2 1 3 3 2 4 4 2 5 5 2 1 6 3 2 7 3 3

Solution in pandas:

df.loc[df.groupby("AAA")["BBB"].idxmin()]
AAA BBB 1 1 1 5 2 1 6 3 2

In datatable, you can sort() within a group, to achieve the same result above:

# datatable
DT = dt.Frame(df)
DT[0, :, by("AAA"), sort(f.BBB)]
AAABBB
int64int64
011
121
232

Apply to different items in a group

# pandas
df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(),
                   'size': list('SSMMMLL'),
                   'weight': [8, 10, 11, 1, 20, 12, 12],
                   'adult': [False] * 5 + [True] * 2})
df
animal size weight adult 0 cat S 8 False 1 dog S 10 False 2 cat M 11 False 3 fish M 1 False 4 dog M 20 False 5 cat L 12 True 6 cat L 12 True

Solution in pandas:

def GrowUp(x):
    avg_weight = sum(x[x['size'] == 'S'].weight * 1.5)
    avg_weight += sum(x[x['size'] == 'M'].weight * 1.25)
    avg_weight += sum(x[x['size'] == 'L'].weight)
    avg_weight /= len(x)
    return pd.Series(['L', avg_weight, True],
                     index=['size', 'weight', 'adult'])

gb = df.groupby('animal')
expected_df = gb.apply(GrowUp)
size weight adult animal cat L 12.4375 True dog L 20.0000 True fish L 1.2500 True

In datatable, we can use the ifelse() function to replicate the solution above, since it is based on a series of conditions:

DT = dt.Frame(df)
conditions = ifelse(f.size == "S", f.weight * 1.5,
                    f.size == "M", f.weight * 1.25,
                    f.size == "L", f.weight,
                    None)

DT[:, {"size": "L",
       "avg_wt": dt.sum(conditions) / dt.count(),
       "adult": True},
   by("animal")]
animalsizeavg_wtadult
str32str32float64bool8
0catL12.43751
1dogL201
2fishL1.251

Note

ifelse() can take multiple conditions, along with a default return value.

Note

Custom functions are not supported in datatable yet.

Sort groups by aggregated data

# pandas
df = pd.DataFrame({'code': ['foo', 'bar', 'baz'] * 2,
                   'data': [0.16, -0.21, 0.33, 0.45, -0.59, 0.62],
                   'flag': [False, True] * 3})
code data flag 0 foo 0.16 False 1 bar -0.21 True 2 baz 0.33 False 3 foo 0.45 True 4 bar -0.59 False 5 baz 0.62 True

Solution in pandas:

code_groups = df.groupby('code')
agg_n_sort_order = code_groups[['data']].transform(sum).sort_values(by='data')
sorted_df = df.loc[agg_n_sort_order.index]
sorted_df
code data flag 1 bar -0.21 True 4 bar -0.59 False 0 foo 0.16 False 3 foo 0.45 True 2 baz 0.33 False 5 baz 0.62 True

The solution above sorts the data based on the sum of the data column per group in the code column.

We can replicate this in datatable:

DT = dt.Frame(df)
DT[:, update(sum_data = dt.sum(f.data)), by("code")]
DT[:, :-1, sort(f.sum_data)]
codedataflag
str32float64bool8
0bar-0.211
1bar-0.590
2foo0.160
3foo0.451
4baz0.330
5baz0.621

Create a value counts column and reassign back to the DataFrame

# pandas
df = pd.DataFrame({'Color': 'Red Red Red Blue'.split(),
                   'Value': [100, 150, 50, 50]})
df
Color Value 0 Red 100 1 Red 150 2 Red 50 3 Blue 50

Solution in pandas:

df['Counts'] = df.groupby(['Color']).transform(len)
df
Color Value Counts 0 Red 100 3 1 Red 150 3 2 Red 50 3 3 Blue 50 1

In datatable, you can replicate the solution above with the count() function:

DT = dt.Frame(df)
DT[:, update(Counts = dt.count()), by("Color")]
DT
ColorValueCounts
str32int64int64
0Red1003
1Red1503
2Red503
3Blue501

Shift groups of the values in a column based on the index

# pandas
df = pd.DataFrame({'line_race': [10, 10, 8, 10, 10, 8],
                   'beyer': [99, 102, 103, 103, 88, 100]},
                  index=['Last Gunfighter', 'Last Gunfighter',
                         'Last Gunfighter', 'Paynter', 'Paynter',
                         'Paynter'])
df
line_race beyer Last Gunfighter 10 99 Last Gunfighter 10 102 Last Gunfighter 8 103 Paynter 10 103 Paynter 10 88 Paynter 8 100

Solution in pandas:

df['beyer_shifted'] = df.groupby(level=0)['beyer'].shift(1)
df
line_race beyer beyer_shifted Last Gunfighter 10 99 NaN Last Gunfighter 10 102 99.0 Last Gunfighter 8 103 102.0 Paynter 10 103 NaN Paynter 10 88 103.0 Paynter 8 100 88.0

Datatable has an equivalent shift() function:

DT = dt.Frame(df.reset_index())
DT[:, update(beyer_shifted = dt.shift(f.beyer)), by("index")]
DT
indexline_racebeyerbeyer_shifted
str32int64int64int64
0Last Gunfighter1099NA
1Last Gunfighter1010299
2Last Gunfighter8103102
3Paynter10103NA
4Paynter1088103
5Paynter810088

Frequency table like plyr in R

grades = [48, 99, 75, 80, 42, 80, 72, 68, 36, 78]
df = pd.DataFrame({'ID': ["x%d" % r for r in range(10)],
                   'Gender': ['F', 'M', 'F', 'M', 'F',
                              'M', 'F', 'M', 'M', 'M'],
                   'ExamYear': ['2007', '2007', '2007', '2008', '2008',
                                '2008', '2008', '2009', '2009', '2009'],
                   'Class': ['algebra', 'stats', 'bio', 'algebra',
                             'algebra', 'stats', 'stats', 'algebra',
                             'bio', 'bio'],
                   'Participated': ['yes', 'yes', 'yes', 'yes', 'no',
                                    'yes', 'yes', 'yes', 'yes', 'yes'],
                   'Passed': ['yes' if x > 50 else 'no' for x in grades],
                   'Employed': [True, True, True, False, False,
                                False, False, True, True, False],
                   'Grade': grades})
df
ID Gender ExamYear Class Participated Passed Employed Grade 0 x0 F 2007 algebra yes no True 48 1 x1 M 2007 stats yes yes True 99 2 x2 F 2007 bio yes yes True 75 3 x3 M 2008 algebra yes yes False 80 4 x4 F 2008 algebra no no False 42 5 x5 M 2008 stats yes yes False 80 6 x6 F 2008 stats yes yes False 72 7 x7 M 2009 algebra yes yes True 68 8 x8 M 2009 bio yes no True 36 9 x9 M 2009 bio yes yes False 78

Solution in pandas:

df.groupby('ExamYear').agg({'Participated': lambda x: x.value_counts()['yes'],
                            'Passed': lambda x: sum(x == 'yes'),
                            'Employed': lambda x: sum(x),
                            'Grade': lambda x: sum(x) / len(x)})
Participated Passed Employed Grade ExamYear 2007 3 2 3 74.000000 2008 3 3 0 68.500000 2009 3 2 2 60.666667

In datatable you can nest conditions within aggregations:

DT = dt.Frame(df)
DT[:, {"Participated": dt.sum(f.Participated == "yes"), "Passed": dt.sum(f.Passed == "yes"), "Employed": dt.sum(f.Employed), "Grade": dt.mean(f.Grade)}, by("ExamYear")]
ExamYearParticipatedPassedEmployedGrade
str32int64int64int64float64
0200732374
1200833068.5
2200932260.6667

Missing functionality

Listed below are some functions in pandas that do not have an equivalent in datatable yet, and are likely to be implemented:

If there are any functions that you would like to see in datatable, please head over to github and raise a feature request.

Comparison with R’s data.table

datatable is closely related to R’s data.table and attempts to mimic its API; however, there are differences due to language constraints.

This page shows how to perform similar basic operations in R’s data.table versus datatable.

Subsetting Rows

The examples used here are from the examples data in R’s data.table.

library(data.table)
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
from datatable import dt, f, g, by, update, join, sort

DT = dt.Frame(x = ["b"]*3 + ["a"]*3 + ["c"]*3,
              y = [1, 3, 6] * 3,
              v = range(1, 10))
DT
xyv
str32int32int32
0b11
1b32
2b63
3a14
4a35
5a66
6c17
7c38
8c69

Action

data.table

datatable

Select 2nd row

DT[2]

DT[1, :]

Select 2nd and 3rd row

DT[2:3]

DT[1:3, :]

Select 3rd and 2nd row

DT[3:2]

DT[[2,1], :]

Select 2nd and 5th rows

DT[c(2,5)]

DT[[1,4], :]

Select all rows from 2nd to 5th

DT[2:5]

DT[1:5, :]

Select rows in reverse from 5th to the 1st

DT[5:1]

DT[4::-1, :]

Select the last row

DT[.N]

DT[-1, :]

All rows where y > 2

DT[y>2]

DT[f.y>2, :]

Compound logical expressions

DT[y>2 & v>5]

DT[(f.y>2) & (f.v>5), :]

All rows other than rows 2,3,4

DT[!2:4] or DT[-(2:4)]

DT[[0, slice(4, None)], :]

Sort by column x, ascending

DT[order(x), ]

DT.sort("x") or
DT[:, :, sort("x")]

Sort by column x, descending

DT[order(-x)]

DT.sort(-f.x) or
DT[:, :, sort(-f.x)]

Sort by column x ascending, y descending

DT[order(x, -y)]

DT.sort(f.x, -f.y) or
DT[:, :, sort(f.x, -f.y)]

Note

Note the use of the f symbol when performing computations or sorting in descending order. You can read more about f-expressions.

Note

In R, DT[2] would mean 2nd row, whereas in python DT[2] would select the 3rd column.

In data.table, when selecting rows you do not need to indicate the columns. So, something like the code below works fine:

# data.table
DT[y==3]
   x y v
1: b 3 2
2: a 3 5
3: c 3 8

In datatable, however, when selecting rows there has to be a column selector, or you get an error:

DT[f.y == 3]
TypeError: Column selector must be an integer or a string, not <class 'datatable.FExpr'>

The code above fails because datatable only allows single-column selection using the style above:

DT['y']
y
int32
01
13
26
31
43
56
61
73
86

As such, when datatable sees an f-expression, it thinks you are selecting a column, and appropriately errors out.

Since, in this case, we are selecting all columns, we can use either a colon (:) or the Ellipsis symbol (...):

DT[f.y==3, :]
DT[f.y==3, ...]

Selecting columns

Action

data.table

datatable

Select column v

DT[, .(v)]

DT[:, 'v'] or DT['v']

Select multiple columns

DT[, .(x,v)]

DT[:, ['x', 'v']]

Rename and select column

DT[, .(m = x)]

DT[:, {"m" : f.x}]

Sum column v and rename as sv

DT[, .(sv=sum(v))]

DT[:, {"sv": dt.sum(f.v)}]

Return two columns, v and v doubled

DT[, .(v, v*2)]

DT[:, [f.v, f.v*2]]

Select the second column

DT[, 2]

DT[:, 1] or DT[1]

Select last column

DT[, ncol(DT), with=FALSE]

DT[:, -1]

Select columns x through y

DT[, .SD, .SDcols=x:y]

DT[:, f["x":"y"]] or DT[:, 'x':'y']

Exclude columns x and y

DT[ , .SD, .SDcols = !x:y]

DT[:, [name not in ("x","y")
for name in DT.names]] or
DT[:, f[:].remove(f['x':'y'])]

Select columns that start with x or v

DT[ , .SD, .SDcols = patterns('^[xv]')]

DT[:, [name.startswith(("x", "v"))
for name in DT.names]]

In data.table, you can select a column by using a variable name with the double dots prefix:

col = 'v'
DT[, ..col]

In datatable, you do not need the prefix:

col = 'v'
DT[:, col]
# or
DT[col]
v
float64
01
11.41421
21.73205
32
42.23607
52.44949
62.64575
72.82843
83

If the column names are stored in a character vector, the double dots prefix also works:

cols = c('v', 'y')
DT[, ..cols]

In datatable, you can store the list/tuple of column names in a variable:

cols = ['v', 'y']
DT[:, cols]
vy
float64float64
011
11.414211.73205
21.732052.44949
321
42.236071.73205
52.449492.44949
62.645751
72.828431.73205
832.44949

Subset rows and Select/Aggregate

Action

data.table

datatable

Sum column v over rows 2 and 3

DT[2:3, .(sum(v))]

DT[1:3, dt.sum(f.v)]

Same as above, new column name

DT[2:3, .(sv=sum(v))]

DT[1:3, {"sv": dt.sum(f.v)}]

Filter in i and aggregate in j

DT[x=="b", .(sum(v*y))]

DT[f.x=="b", dt.sum(f.v * f.y)]

Same as above, return as scalar

DT[x=="b", sum(v*y)]

DT[f.x=="b", dt.sum(f.v * f.y)][0, 0]

In R, indexing starts at 1, and when slicing, both the first and the last items are included. In Python, indexing starts at 0, and when slicing, the end bound is excluded.
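For example, to select what R would call rows 2 through 3:

# R's data.table (1-based, end-inclusive): DT[2:3]
# datatable (0-based, end bound excluded):
DT[1:3, :]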

Some .SD (Subset of Data) operations can be replicated in datatable:

Aggregate several columns

DT[, lapply(.SD, mean), .SDcols = c("y","v")]
          y v
1: 3.333333 5
DT[:, dt.mean([f.y,f.v])]
yv
float64float64
03.333335

Modify columns using a condition

DT[, .SD - 1, .SDcols = is.numeric]
   y v
1: 0 0
2: 2 1
3: 5 2
4: 0 3
5: 2 4
6: 5 5
7: 0 6
8: 2 7
9: 5 8
DT[:, f[int] - 1]
C0C1
int32int32
000
121
252
303
424
555
606
727
858

Modify several columns and keep others unchanged

DT[, c("y", "v") := lapply(.SD, sqrt), .SDcols = c("y", "v")] x y v 1: b 1.000000 1.000000 2: b 1.732051 1.414214 3: b 2.449490 1.732051 4: a 1.000000 2.000000 5: a 1.732051 2.236068 6: a 2.449490 2.449490 7: c 1.000000 2.645751 8: c 1.732051 2.828427 9: c 2.449490 3.000000
# there is a square root function in the datatable math module
DT[:, update(**{name: f[name] ** 0.5 for name in ("y", "v")})]
DT
xyv
str32float64float64
0b11
1b1.732051.41421
2b2.449491.73205
3a12
4a1.732052.23607
5a2.449492.44949
6c12.64575
7c1.732052.82843
8c2.449493

Grouping with by()

Action

data.table

datatable

Get the sum of column v grouped by column x

DT[, sum(v), by=x]

DT[:, dt.sum(f.v), by('x')]

Get sum of v where x != a

DT[x!="a", sum(v), by=x]

DT[f.x!="a", :][:, dt.sum(f.v), by("x")]

Number of rows per group

DT[, .N, by=x]

DT[:, dt.count(), by("x")]

Select first row of y and v for each group in x

DT[, .SD[1], by=x]

DT[0, :, by('x')]

Get row count and sum columns v and y by group

DT[, c(.N, lapply(.SD, sum)), by=x]

DT[:, [dt.count(), dt.sum(f[:])], by("x")]

Expressions in by()

DT[, sum(v), by=.(y%%2)]

DT[:, dt.sum(f.v), by(f.y%2)]

Get row per group where column v is minimum

DT[, .SD[which.min(v)], by=x]

DT[0, f[:], by("x"), dt.sort(f.v)]

First 2 rows of each group

DT[, head(.SD,2), by=x]

DT[:2, :, by("x")]

Last 2 rows of each group

DT[, tail(.SD,2), by=x]

DT[-2:, :, by("x")]

In R’s data.table, the order of the groupings is preserved; in datatable, the returned dataframe is sorted on the grouping column. DT[, sum(v), keyby=x] in data.table returns a dataframe ordered by column x.

In data.table, i is executed before the grouping, while in datatable, i is executed after the grouping.

Also, in datatable, f-expressions in the i section of a groupby are not yet implemented, hence the chaining method to get the sum of column v where x != "a".

Multiple aggregations within a group can be executed in R’s data.table with the syntax below:

DT[, list(MySum=sum(v), MyMin=min(v), MyMax=max(v)), by=.(x, y%%2)]

The same can be replicated in datatable by using a dictionary:

DT[:, {'MySum': dt.sum(f.v), 'MyMin': dt.min(f.v), 'MyMax': dt.max(f.v)}, by(f.x, f.y%2)]

Add/Update/Delete Columns

Action

data.table

datatable

Add new column

DT[, z:=42L]

DT[:, update(z=42)] or
DT['z'] = 42 or
DT[:, 'z'] = 42 or
DT = DT[:, f[:].extend({"z":42})]

Add multiple columns

DT[, c('sv','mv') := .(sum(v), "X")]

DT[:, update(sv = dt.sum(f.v), mv = "X")] or
DT[:, f[:].extend({"sv": dt.sum(f.v), "mv": "X"})]

Remove column

DT[, z:=NULL]

del DT['z'] or
del DT[:, 'z'] or
DT = DT[:, f[:].remove(f.z)]

Subassign to existing v column

DT["a", v:=42L, on="x"]

DT[f.x=="a", update(v=42)] or
DT[f.x=="a", 'v'] = 42

Subassign to new column (NA padded)

DT["b", v2:=84L, on="x"]

DT[f.x=="b", update(v2=84)] or
DT[f.x=='b', 'v2'] = 84

Add new column, assigning values group-wise

DT[, m:=mean(v), by=x]

DT[:, update(m=dt.mean(f.v)), by("x")]

In data.table, you can create a new column with a variable

col = 'rar'
DT[, ..col:=4242]

The similar operation in datatable:

col = 'rar'
DT[col] = 4242
# or, since a keyword argument would create a column literally named "col",
# unpack a dict so the variable's value is used as the column name:
DT[:, update(**{col: 4242})]

Note

The update() function, as well as the del operator operate in-place; there is no need for reassignment. Another advantage of the update() method is that the row order of the dataframe is not changed, even in a groupby; this comes in handy in a lot of transformation operations.

Joins

At the moment, only the left outer join is implemented in datatable. Another aspect is that the dataframe being joined must be keyed, the keyed column or columns must not have duplicates, and the joining column has to have the same name in both dataframes. You can read more about the join() API and have a look at the Tutorial on the join operator.

Left join in R’s data.table:

DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9) X = data.table(x=c("c","b"), v=8:7, foo=c(4,2)) X[DT, on="x"] x v foo y i.v 1: b 7 2 1 1 2: b 7 2 3 2 3: b 7 2 6 3 4: a NA NA 1 4 5: a NA NA 3 5 6: a NA NA 6 6 7: c 8 4 1 7 8: c 8 4 3 8 9: c 8 4 6 9

Join in datatable:

DT = dt.Frame(x = ["b"]*3 + ["a"]*3 + ["c"]*3, y = [1, 3, 6] * 3, v = range(1, 10)) X = dt.Frame({"x":('c','b'), "v":(8,7), "foo":(4,2)}) X.key = "x" DT[:, :, join(X)]
xyvv.0foo
str32int32int32int32int32
0b1172
1b3272
2b6372
3a14NANA
4a35NANA
5a66NANA
6c1784
7c3884
8c6984

An inner join could be simulated by removing the nulls. Again, a join() only works if the joining dataframe is keyed.

DT[X, on="x", nomatch=NULL] x y v i.v foo 1: c 1 7 8 4 2: c 3 8 8 4 3: c 6 9 8 4 4: b 1 1 7 2 5: b 3 2 7 2 6: b 6 3 7 2
# g refers to the joining dataframe X
DT[g[-1] != None, :, join(X)]
xyvv.0foo
str32int32int32int32int32
0b1172
1b3272
2b6372
3c1784
4c3884
5c6984

A not join can be simulated as well:

DT[!X, on="x"] x y v 1: a 1 4 2: a 3 5 3: a 6 6
DT[g[-1]==None, f[:], join(X)]
xyv
str32int32int32
0a14
1a35
2a66

Select the first row for each group:

DT[X, on="x", mult="first"] x y v i.v foo 1: c 1 7 8 4 2: b 1 1 7 2
# chaining comes in handy here
DT[g[-1] != None, :, join(X)][0, :, by('x')]
xyvv.0foo
str32int32int32int32int32
0b1172
1c1784

Select the last row for each group:

DT[X, on="x", mult="last"] x y v i.v foo 1: c 6 9 8 4 2: b 6 3 7 2
DT[g[-1]!=None, :, join(X)][-1, :, by('x')]
xyvv.0foo
str32int32int32int32int32
0b6372
1c6984

Join and evaluate j for each row in i:

DT[X, sum(v), by=.EACHI, on="x"]
   x V1
1: c 24
2: b  6
DT[g[-1]!=None, :, join(X)][:, dt.sum(f.v), by("x")]
xv
str32int64
0b6
1c24

Aggregate on columns from both dataframes in j:

DT[X, sum(v)*foo, by=.EACHI, on="x"]
   x V1
1: c 96
2: b 12
DT[:, dt.sum(f.v*g.foo), join(X), by(f.x)][f[-1]!=0, :]
xC0
str32int64
0b12
1c96

Aggregate on columns with same name from both dataframes in j:

DT[X, sum(v)*i.v, by=.EACHI, on="x"]
   x  V1
1: c 192
2: b  42
DT[:, dt.sum(f.v*g.v), join(X), by(f.x)][f[-1]!=0, :]
xC0
str32int64
0b42
1c192

Expect significant improvements in join functionality, with more concise syntax, as well as more features, in the future.

Functions in R/data.table not yet implemented

There are some functions in data.table that do not have an equivalent in datatable yet, but that we would likely implement.

Also, at the moment, custom aggregations in the j section are not supported in datatable – we intend to implement that at some point.

There are no datetime functions in datatable, and string operations are limited as well.

If there are any functions that you would like to see in datatable, please head over to github and raise a feature request.

Comparison with SQL

This page provides some examples of how various SQL operations can be performed in datatable. The datatable library is still growing; as such, not all functions in SQL can be replicated yet. If there is a feature you would love to have in datatable, please make a feature request on the github issues page.

Most of the examples will be based on the famous iris dataset. SQLite will be the flavour of SQL used in the comparison.

Let’s import datatable and read in the data using its fread() function:

from datatable import dt, f, g, by, join, sort, update, fread

iris = fread('https://raw.githubusercontent.com/h2oai/datatable/main/docs/_static/iris.csv')
iris
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
float64float64float64float64str32
05.13.51.40.2setosa
14.931.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
453.61.40.2setosa
55.43.91.70.4setosa
64.63.41.40.3setosa
753.41.50.2setosa
84.42.91.40.2setosa
94.93.11.50.1setosa
105.43.71.50.2setosa
114.83.41.60.2setosa
124.831.40.1setosa
134.331.10.1setosa
145.841.20.2setosa
1456.735.22.3virginica
1466.32.551.9virginica
1476.535.22virginica
1486.23.45.42.3virginica
1495.935.11.8virginica

Loading data into an SQL table is a bit more involved: you need to create the structure of the table (a schema) before importing the csv file. Have a look at the SQLite import tutorial for an example of loading data into a SQLite database.
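As a rough sketch (assuming iris.csv has been downloaded locally; the table schema below is illustrative), the import can be done with Python's standard-library sqlite3 and csv modules:

import csv
import sqlite3

con = sqlite3.connect("iris.db")
con.execute("""CREATE TABLE iris (
    sepal_length REAL, sepal_width REAL, petal_length REAL,
    petal_width REAL, species TEXT)""")
with open("iris.csv") as fh:
    rows = list(csv.reader(fh))[1:]   # skip the header row
con.executemany("INSERT INTO iris VALUES (?, ?, ?, ?, ?)", rows)
con.commit()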

SELECT

In SQL, you can select a subset of the columns with the SELECT clause:

SELECT sepal_length, sepal_width, petal_length FROM iris LIMIT 5;

In datatable, columns are selected in the j section:

iris[:5, ['sepal_length', 'sepal_width', 'petal_length']]
sepal_lengthsepal_widthpetal_length
float64float64float64
05.13.51.4
14.931.4
24.73.21.3
34.63.11.5
453.61.4

In SQL, you can select all columns with the * symbol:

SELECT * FROM iris LIMIT 5;

In datatable, all columns can be selected with a simple “select-all” slice :, or with f-expressions:

iris[:5, :]
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
float64float64float64float64str32
05.13.51.40.2setosa
14.931.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
453.61.40.2setosa

If you are selecting a single column, datatable allows you to access just the j section within the square brackets; you do not need to include the i section: DT[j]

SELECT sepal_length FROM iris LIMIT 5;
# datatable
iris['sepal_length'].head(5)
   sepal_length
        float64
0           5.1
1           4.9
2           4.7
3           4.6
4           5

How about adding new columns? In SQL, this is also done in the SELECT clause:

SELECT *, sepal_length*2 AS sepal_length_doubled FROM iris LIMIT 5;

In datatable, addition of new columns occurs in the j section:

iris[:5, f[:].extend({"sepal_length_doubled": f.sepal_length * 2})]
   sepal_length  sepal_width  petal_length  petal_width  species  sepal_length_doubled
        float64      float64       float64      float64  str32                 float64
0           5.1          3.5           1.4          0.2  setosa                   10.2
1           4.9          3             1.4          0.2  setosa                    9.8
2           4.7          3.2           1.3          0.2  setosa                    9.4
3           4.6          3.1           1.5          0.2  setosa                    9.2
4           5            3.6           1.4          0.2  setosa                   10

The update() function can also be used to add new columns. The operation occurs in-place; reassignment is not required:

iris[:, update(sepal_length_doubled = f.sepal_length * 2)]
iris[:5, :]
   sepal_length  sepal_width  petal_length  petal_width  species  sepal_length_doubled
        float64      float64       float64      float64  str32                 float64
0           5.1          3.5           1.4          0.2  setosa                   10.2
1           4.9          3             1.4          0.2  setosa                    9.8
2           4.7          3.2           1.3          0.2  setosa                    9.4
3           4.6          3.1           1.5          0.2  setosa                    9.2
4           5            3.6           1.4          0.2  setosa                   10

WHERE

Filtering in SQL is done via the WHERE clause.

SELECT * FROM iris WHERE species = 'virginica' LIMIT 5;

In datatable, filtration is done in the i section:

iris[f.species=="virginica", :].head(5)
   sepal_length  sepal_width  petal_length  petal_width  species    sepal_length_doubled
        float64      float64       float64      float64  str32                   float64
0           6.3          3.3           6            2.5  virginica                  12.6
1           5.8          2.7           5.1          1.9  virginica                  11.6
2           7.1          3             5.9          2.1  virginica                  14.2
3           6.3          2.9           5.6          1.8  virginica                  12.6
4           6.5          3             5.8          2.2  virginica                  13

Note that in SQL, equality comparison is done with the = symbol, whereas in python it uses the == operator. You can filter with multiple conditions too:

SELECT * FROM iris WHERE species = 'setosa' AND sepal_length = 5;

In datatable each condition is wrapped in parentheses; the & operator is the equivalent of AND, while | is the equivalent of OR:

iris[(f.species=="setosa") & (f.sepal_length==5), :]
   sepal_length  sepal_width  petal_length  petal_width  species  sepal_length_doubled
        float64      float64       float64      float64  str32                 float64
0             5          3.6           1.4          0.2  setosa                     10
1             5          3.4           1.5          0.2  setosa                     10
2             5          3             1.6          0.2  setosa                     10
3             5          3.4           1.6          0.4  setosa                     10
4             5          3.2           1.2          0.2  setosa                     10
5             5          3.5           1.3          0.3  setosa                     10
6             5          3.5           1.6          0.6  setosa                     10
7             5          3.3           1.4          0.2  setosa                     10

Now suppose you have a frame where some values are missing (NA):

null_data = dt.Frame("""
    a    b    c
    1    2    3
    1    NaN  4
    2    1    3
    1    2    2""")
null_data
       a        b      c
   int32  float64  int32
0      1        2      3
1      1       NA      4
2      2        1      3
3      1        2      2

In SQL you could filter out those values like this:

SELECT * FROM null_data WHERE b is NOT NULL;

In datatable, the NOT operator is replicated with the != symbol:

null_data[f.b!=None, :]
abc
int32float64int32
0123
1213
2122

You could also use the isna() function with the ~ operator, which inverts boolean expressions:

null_data[~dt.math.isna(f.b), :]
       a        b      c
   int32  float64  int32
0      1        2      3
1      2        1      3
2      1        2      2

Keeping the null rows is easily achievable; it is simply the inverse of the above code:

SELECT * FROM null_data WHERE b is NULL;
null_data[dt.isna(f.b), :]
       a        b      c
   int32  float64  int32
0      1       NA      4

Note

SQL has the IN operator, which does not have an equivalent in datatable yet.
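Until an equivalent exists, one possible workaround (a sketch, not an official feature) is to chain | comparisons, one per value of interest:

# emulating `WHERE species IN ('setosa', 'virginica')`
iris[(f.species == "setosa") | (f.species == "virginica"), :]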

ORDER BY

In SQL, sorting is executed with the ORDER BY clause, while in datatable it is handled by the sort() function.

SELECT * FROM iris ORDER BY sepal_length ASC limit 5;
iris[:5, :, sort('sepal_length')]
   sepal_length  sepal_width  petal_length  petal_width  species  sepal_length_doubled
        float64      float64       float64      float64  str32                 float64
0           4.3          3             1.1          0.1  setosa                    8.6
1           4.4          2.9           1.4          0.2  setosa                    8.8
2           4.4          3             1.3          0.2  setosa                    8.8
3           4.4          3.2           1.3          0.2  setosa                    8.8
4           4.5          2.3           1.3          0.3  setosa                    9

Sorting in descending order in SQL is done with the DESC keyword.

SELECT * FROM iris ORDER BY sepal_length DESC limit 5;

In datatable, this can be achieved in two ways:

iris[:5, :, sort('sepal_length', reverse=True)]
   sepal_length  sepal_width  petal_length  petal_width  species    sepal_length_doubled
        float64      float64       float64      float64  str32                   float64
0           7.9          3.8           6.4          2    virginica                  15.8
1           7.7          3.8           6.7          2.2  virginica                  15.4
2           7.7          2.6           6.9          2.3  virginica                  15.4
3           7.7          2.8           6.7          2    virginica                  15.4
4           7.7          3             6.1          2.3  virginica                  15.4

or you could negate the sorting column; datatable will correctly interpret the negation (-) as descending order:

iris[:5, :, sort(-f.sepal_length)]
   sepal_length  sepal_width  petal_length  petal_width  species    sepal_length_doubled
        float64      float64       float64      float64  str32                   float64
0           7.9          3.8           6.4          2    virginica                  15.8
1           7.7          3.8           6.7          2.2  virginica                  15.4
2           7.7          2.6           6.9          2.3  virginica                  15.4
3           7.7          2.8           6.7          2    virginica                  15.4
4           7.7          3             6.1          2.3  virginica                  15.4

GROUP BY

SQL’s GROUP BY operations can be performed in datatable with the by() function. Have a look at the by() API, as well as the Grouping with by() user guide.

Let’s look at some common grouping operations in SQL, and their equivalents in datatable.

Single aggregation per group

SELECT species, COUNT() AS N FROM iris GROUP BY species;
iris[:, dt.count(), by('species')]
   species     count
   str32       int64
0  setosa         50
1  versicolor     50
2  virginica      50

Multiple aggregations per group

SELECT species, COUNT() AS N, AVG(sepal_length) AS mean_sepal_length FROM iris GROUP BY species;
iris[:, {"mean_sepal_length": dt.mean(f.sepal_length), "N": dt.count()}, by('species')]
   species     mean_sepal_length      N
   str32                 float64  int64
0  setosa                  5.006     50
1  versicolor              5.936     50
2  virginica               6.588     50

Grouping on multiple columns

fruits_data
    Fruit    Date       Name   Number
    str32    str32      str32   int32
 0  Apples   10/6/2016  Bob         7
 1  Apples   10/6/2016  Bob         8
 2  Apples   10/6/2016  Mike        9
 3  Apples   10/7/2016  Steve      10
 4  Apples   10/7/2016  Bob         1
 5  Oranges  10/7/2016  Bob         2
 6  Oranges  10/6/2016  Tom        15
 7  Oranges  10/6/2016  Mike       57
 8  Oranges  10/6/2016  Bob        65
 9  Oranges  10/7/2016  Tony        1
10  Grapes   10/7/2016  Bob         1
11  Grapes   10/7/2016  Tom        87
12  Grapes   10/7/2016  Bob        22
13  Grapes   10/7/2016  Bob        12
14  Grapes   10/7/2016  Tony       15
SELECT fruit, name, SUM(number) AS sum_num FROM fruits_data GROUP BY fruit, name;
fruits_data[:, {"sum_num": dt.sum(f.Number)}, by('Fruit', 'Name')]
   Fruit    Name   sum_num
   str32    str32    int64
0  Apples   Bob         16
1  Apples   Mike         9
2  Apples   Steve       10
3  Grapes   Bob         35
4  Grapes   Tom         87
5  Grapes   Tony        15
6  Oranges  Bob         67
7  Oranges  Mike        57
8  Oranges  Tom         15
9  Oranges  Tony         1

WHERE with GROUP BY

SELECT species, AVG(sepal_length) AS avg_sepal_length FROM iris WHERE sepal_width >= 3 GROUP BY species;
iris[f.sepal_width >= 3, :][:, {"avg_sepal_length": dt.mean(f.sepal_length)}, by('species')]
   species     avg_sepal_length
   str32                float64
0  setosa               5.02917
1  versicolor           6.21875
2  virginica            6.76897

HAVING with GROUP BY

SELECT fruit, name, SUM(number) AS sum_num FROM fruits_data GROUP BY fruit, name HAVING sum_num > 50;
fruits_data[:, {'sum_num': dt.sum(f.Number)}, by('Fruit','Name')][f.sum_num > 50, :]
   Fruit    Name   sum_num
   str32    str32    int64
0  Grapes   Tom         87
1  Oranges  Bob         67
2  Oranges  Mike        57

Grouping on a condition

SELECT sepal_width >=3 AS width_larger_than_3, AVG(sepal_length) AS avg_sepal_length FROM iris GROUP BY sepal_width>=3;
iris[:, {"avg_sepal_length": dt.mean(f.sepal_length)}, by(f.sepal_width >= 3)]
      C0  avg_sepal_length
   bool8           float64
0      0           5.95263
1      1           5.77634

At the moment, names cannot be assigned in the by section.

LEFT OUTER JOIN

We will compare the left outer join, as that is the only join currently implemented in datatable. Note also that the frame being joined must be keyed, the column or columns used as the key must not have duplicates, and the joining column has to have the same name in both frames. You can read more about this in the join() API documentation.

Example data:

DT = dt.Frame(x = ["b"]*3 + ["a"]*3 + ["c"]*3,
              y = [1, 3, 6] * 3,
              v = range(1, 10))
X = dt.Frame({"x": ('c','b'), "v": (8,7), "foo": (4,2)})

A left outer join in SQL:

SELECT DT.x, DT.y, DT.v, X.foo FROM DT LEFT JOIN X ON DT.x = X.x

A left outer join in datatable:

X.key = 'x'
DT[:, [f.x, f.y, f.v, g.foo], join(X)]
   x      y      v    foo
   str32  int32  int32  int32
0  b      1      1      2
1  b      3      2      2
2  b      6      3      2
3  a      1      4     NA
4  a      3      5     NA
5  a      6      6     NA
6  c      1      7      4
7  c      3      8      4
8  c      6      9      4

UNION

The UNION ALL clause in SQL can be replicated in datatable with rbind().

SELECT x, v FROM DT UNION ALL SELECT x, v FROM X

In datatable, rbind() takes a list/tuple of frames and lumps them into one:

dt.rbind([DT[:, ('x','v')], X[:, ('x', 'v')]])
    x      v
    str32  int32
 0  b      1
 1  b      2
 2  b      3
 3  a      4
 4  a      5
 5  a      6
 6  c      7
 7  c      8
 8  c      9
 9  b      7
10  c      8

SQL’s UNION removes duplicate rows after combining the results of the individual queries; there is no built-in function in datatable yet that handles duplicates.
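If UNION-style deduplication is needed, one possible workaround (a sketch, not a built-in feature) is to group the combined frame by all of its columns and keep the first row of each group:

combined = dt.rbind([DT[:, ('x', 'v')], X[:, ('x', 'v')]])
combined[0, :, by(*combined.names)]    # first row per group == unique rows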

SQL’s WINDOW functions

Some SQL window functions can be replicated in datatable (rank is one of the window functions not currently implemented in datatable):

  • TOP n rows per group

SELECT * FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY species ORDER BY sepal_length DESC) AS row_num
    FROM iris)
WHERE row_num < 3;
iris[:3, :, by('species'), sort(-f.sepal_length)]
   species     sepal_length  sepal_width  petal_length  petal_width
   str32            float64      float64       float64      float64
0  setosa               5.8          4             1.2          0.2
1  setosa               5.7          4.4           1.5          0.4
2  setosa               5.7          3.8           1.7          0.3
3  versicolor           7            3.2           4.7          1.4
4  versicolor           6.9          3.1           4.9          1.5
5  versicolor           6.8          2.8           4.8          1.4
6  virginica            7.9          3.8           6.4          2
7  virginica            7.7          3.8           6.7          2.2
8  virginica            7.7          2.6           6.9          2.3

Filter for rows above the mean sepal length:

SELECT sepal_length, sepal_width, petal_length, petal_width, species
FROM (
    SELECT *,
           AVG(sepal_length) OVER (PARTITION BY species) AS avg_sepal_length
    FROM iris)
WHERE sepal_length > avg_sepal_length
LIMIT 5;
iris[:, update(temp = f.sepal_length > dt.mean(f.sepal_length)), by('species')]
iris[f.temp == 1, f[:-1]].head(5)
   sepal_length  sepal_width  petal_length  petal_width  species
        float64      float64       float64      float64  str32
0           5.1          3.5           1.4          0.2  setosa
1           5.4          3.9           1.7          0.4  setosa
2           5.4          3.7           1.5          0.2  setosa
3           5.8          4             1.2          0.2  setosa
4           5.7          4.4           1.5          0.4  setosa

Lead and lag

SELECT name, destination, dep_date,
       LEAD(dep_date) OVER (ORDER BY dep_date, name) AS lead1,
       LEAD(dep_date, 2) OVER (ORDER BY dep_date, name) AS lead2,
       LAG(dep_date) OVER (ORDER BY dep_date, name) AS lag1,
       LAG(dep_date, 3) OVER (ORDER BY dep_date, name) AS lag3
FROM source_data;
source_data = dt.Frame({'name': ['Ann', 'Ann', 'Ann', 'Bob', 'Bob'],
                        'destination': ['Japan', 'Korea', 'Switzerland', 'USA', 'Switzerland'],
                        'dep_date': ['2019-02-02', '2019-01-01', '2020-01-11', '2019-05-05', '2020-01-11'],
                        'duration': [7, 21, 14, 10, 14]})
source_data[:, f[:].extend({"lead1": dt.shift(f.dep_date, -1),
                            "lead2": dt.shift(f.dep_date, -2),
                            "lag1": dt.shift(f.dep_date),
                            "lag3": dt.shift(f.dep_date, 3)}),
            sort('dep_date', 'name')]
   name   destination  dep_date    duration  lead1       lead2       lag1        lag3
   str32  str32        str32          int32  str32       str32       str32       str32
0  Ann    Korea        2019-01-01        21  2019-02-02  2019-05-05  NA          NA
1  Ann    Japan        2019-02-02         7  2019-05-05  2020-01-11  2019-01-01  NA
2  Bob    USA          2019-05-05        10  2020-01-11  2020-01-11  2019-02-02  NA
3  Ann    Switzerland  2020-01-11        14  2020-01-11  NA          2019-05-05  2019-01-01
4  Bob    Switzerland  2020-01-11        14  NA          NA          2020-01-11  2019-02-02

The equivalent of SQL’s LAG is shift() with a positive number, while SQL’s LEAD is shift() with a negative number.

Note

datatable does not natively support datetimes yet.

Total sum and the proportions:

proportions = dt.Frame({"t": [1, 2, 3]})
proportions

       t
   int32
0      1
1      2
2      3
SELECT t, SUM(t) OVER () AS sum, CAST(t as FLOAT)/SUM(t) OVER () AS pct FROM proportions;
proportions[:, f[:].extend({"sum": dt.sum(f.t), "pct": f.t/dt.sum(f.t)})]
       t    sum       pct
   int32  int64   float64
0      1      6  0.166667
1      2      6  0.333333
2      3      6  0.5

Dates and time

Added in version 1.0.0

datatable has several builtin types to support working with date/time variables.

date32

The date32 type is used to represent a particular calendar date without a time component. Internally, this type is stored as a 32-bit integer containing the number of days since the epoch (Jan 1, 1970). Thus, this type accommodates dates within the range of approximately ±5.8 million years.

The calendar used for this type is proleptic Gregorian, meaning that it extends the modern-day Gregorian calendar into the past before this calendar was first adopted.

time64

The time64 type is used to represent a specific moment in time. This corresponds to datetime in Python, or timestamp in Arrow or pandas. Internally, this type is stored as a 64-bit integer containing the number of milliseconds since the epoch (Jan 1, 1970) in UTC.

This type is not leap-seconds aware, meaning that it assumes that each day has exactly 24×3600 seconds. In practice it means that calculating time difference between two time64 moments may be off by the number of leap seconds that have occurred between them.

A time64 column may also carry a time zone as meta information. This time zone is used to convert the timestamp from the absolute UTC time to the local calendar. For example, suppose you have two time64 columns: one is in UTC while the other is in America/Los_Angeles time zone. Assume both columns store the same value 1577836800000. Then these two columns represent the same moment in time, however their calendar representations are different: 2020-01-01T00:00:00Z and 2019-12-31T16:00:00-0800 respectively.
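For instance, here is a minimal sketch of how python date and datetime values map onto these types (assuming datatable 1.0+):

from datetime import date, datetime
from datatable import dt

DT = dt.Frame(d=[date(2020, 1, 1), date(2019, 12, 31)],
              t=[datetime(2020, 1, 1, 0, 0, 0), None])
DT.types    # [Type.date32, Type.time64]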

FTRL Model

This section provides a brief introduction to the FTRL (Follow the Regularized Leader) model as implemented in datatable. For detailed information on API, please refer to the Ftrl Python class documentation.

FTRL Model Information

The Follow the Regularized Leader (FTRL) model is a datatable implementation of the FTRL-Proximal online learning algorithm for binomial logistic regression. It uses a hashing trick for feature vectorization and the Hogwild approach for parallelization. FTRL for multinomial classification and continuous targets are implemented experimentally.

Create an FTRL Model

The FTRL model is implemented as the Ftrl Python class, which is a part of dt.models, so to use the model you should first do:

from datatable.models import Ftrl

and then create a model as:

ftrl_model = Ftrl()

FTRL Model Parameters

The FTRL model requires a list of parameters for training and making predictions, namely:

  • alpha – learning rate, defaults to 0.005.

  • beta – beta parameter, defaults to 1.0.

  • lambda1 – L1 regularization parameter, defaults to 0.0.

  • lambda2 – L2 regularization parameter, defaults to 1.0.

  • nbins – the number of bins for the hashing trick, defaults to 10**6.

  • mantissa_nbits – the number of bits from mantissa to be used for hashing, defaults to 10.

  • nepochs – the number of epochs to train the model for, defaults to 1.

  • negative_class – whether to create and train on a “negative” class in the case of multinomial classification, defaults to False.

  • interactions – a list or a tuple of interactions. In turn, each interaction should be a list or a tuple of feature names, where each feature name is a column name from the training frame. This setting defaults to None.

  • model_type – training mode that can be one of the following: "auto" to automatically set model type based on the target column data, "binomial" for binomial classification, "multinomial" for multinomial classification or "regression" for continuous targets. Defaults to "auto".

If some parameters need to be changed from their default values, this can be done either when creating the model, as

ftrl_model = Ftrl(alpha = 0.1, nbins = 100)

or, if the model already exists, as

ftrl_model.alpha = 0.1
ftrl_model.nbins = 100

If some parameters were not set explicitly, they will be assigned the default values.

Training a Model

Use the fit() method to train a model:

ftrl_model.fit(X_train, y_train)

where X_train is a frame of shape (nrows, ncols) to be trained on, and y_train is a target frame of shape (nrows, 1). The following datatable column types are supported for the X_train frame: bool, int, real and str.

The FTRL model can also do early stopping if the relative validation error does not improve. For this, the model should be fit as

res = ftrl_model.fit(X_train, y_train, X_validation, y_validation, nepochs_validation, validation_error, validation_average_niterations)

where X_train and y_train are the training and target frames, respectively, and X_validation and y_validation are the validation frames. nepochs_validation specifies how often, in epoch units, the validation error should be checked; validation_error is the relative improvement in validation error that the model should demonstrate within nepochs_validation epochs to continue training; and validation_average_niterations is the number of iterations to average when calculating the validation error. The returned res tuple contains the epoch at which training stopped and the corresponding loss.
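For illustration, here is a minimal end-to-end sketch; the toy frames below are made up for the example:

from datatable import dt
from datatable.models import Ftrl

X_train = dt.Frame(x=[1, 2, 3, 4, 5, 6, 7, 8])
y_train = dt.Frame(y=[False, True, False, True, False, True, False, True])

ftrl_model = Ftrl(nepochs=10)
res = ftrl_model.fit(X_train, y_train)   # res.epoch == 10; res.loss is NaN without validation
p = ftrl_model.predict(X_train)          # frame of shape (8, 1) with predicted probabilities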

Resetting a Model

Use the reset() method to reset a model:

ftrl_model.reset()

This will reset model weights, but it will not affect learning parameters. To reset parameters to default values, you can do

ftrl_model.params = Ftrl().params

Making Predictions

Use the predict() method to make predictions:

targets = ftrl_model.predict(X)

where X is a frame of shape (nrows, ncols) to make predictions for. X should have the same number of columns as the training frame. The predict() method returns a new frame of shape (nrows, 1) with the predicted probability for each row of frame X.

Feature Importances

To estimate feature importances, the overall weight contributions are calculated feature-wise during training and predicting. Feature importances can be accessed as

fi = ftrl_model.feature_importances

where fi will be a frame of shape (nfeatures, 2), containing the feature names and their importances normalized to the [0; 1] range.

Feature Interactions

By default, each column of the training dataset is considered a feature by the FTRL model. The user can provide additional features by specifying a list or a tuple of feature interactions, for instance as

ftrl_model.interactions = [["C0", "C1", "C3"], ["C2", "C5"]]

where C* are column names from a training dataset. In the above example two additional features, namely, C0:C1:C3 and C2:C5, are created.

interactions should be set before calling the fit() method, and cannot be changed once the model is trained.

datatable API

Symbols listed here are available for import from the root of the datatable module.

Submodules

exceptions.

datatable warnings and exceptions.

internal.

Access to some internal details of datatable module.

math.

Mathematical functions, similar to python’s math module.

models.

A small set of data analysis tools.

re.

Functions using regular expressions.

str.

Functions for working with string columns.

time.

Functions for working with date/time columns.

Classes

Frame

Main “table of data” class. This is the equivalent of pandas’ or Julia’s DataFrame, R’s data.table or tibble, SQL’s TABLE, etc.

FExpr

Helper class for computing formulas over a frame.

Namespace

Helper class for addressing columns in a frame.

Type

Column’s type, similar to numpy’s dtype.

stype

[DEPRECATED] Enum of column “storage” types.

ltype

[DEPRECATED] Enum of column “logical” types.

Functions

fread()

Read CSV/text/XLSX/Jay/other files

iread()

Same as fread(), but read multiple files at once

by()

Group-by clause for use in Frame’s square-bracket selector

join()

Join clause for use in Frame’s square-bracket selector

sort()

Sort clause for use in Frame’s square-bracket selector

update()

Create new or update existing columns within a frame

cbind()

Combine frames by columns

rbind()

Combine frames by rows

repeat()

Concatenate frame by rows

as_type()

Cast column into another type

ifelse()

Ternary if operator

shift()

Shift column by a given number of rows

cut()

Bin a column into equal-width intervals

qcut()

Bin a column into equal-population intervals

split_into_nhot()

[DEPRECATED] Split and nhot-encode a single-column frame

init_styles()

Inject datatable’s stylesheets into the Jupyter notebook

rowall()

Row-wise all() function

rowany()

Row-wise any() function

rowcount()

Calculate the number of non-missing values per row

rowfirst()

Find the first non-missing value row-wise

rowlast()

Find the last non-missing value row-wise

rowmax()

Find the largest element row-wise

rowmean()

Calculate the mean value row-wise

rowmin()

Find the smallest element row-wise

rowsd()

Calculate the standard deviation row-wise

rowsum()

Calculate the sum of all values row-wise

intersect()

Calculate the set intersection of values in the frames

setdiff()

Calculate the set difference between the frames

symdiff()

Calculate the symmetric difference between the sets of values in the frames

union()

Calculate the union of values in the frames

unique()

Find unique values in a frame

corr()

Calculate correlation between two columns

count()

Count non-missing values per column

cov()

Calculate covariance between two columns

max()

Find the largest element per column

mean()

Calculate the mean value per column

median()

Find the median element per column

min()

Find the smallest element per column

sd()

Calculate the standard deviation per column

sum()

Calculate the sum of all values per column

Other

build_info

Information about the build of the datatable module.

dt

The datatable module itself.

f

The primary namespace used during DT[...] call.

g

Secondary namespace used during DT[..., join()] call.

options

datatable options.
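For example, one common use of options is to limit the number of threads datatable spawns (a small sketch):

import datatable as dt
dt.options.nthreads = 4      # use at most four threads
print(dt.options.nthreads)   # 4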

datatable.exceptions

This module contains warnings and exceptions that datatable may throw during runtime.

Exceptions

Exceptions are thrown when an unexpected condition is encountered at runtime. All datatable exceptions are descendants of the DtException class, so that they can be easily caught. The following exceptions may be thrown:

ImportError

Equivalent to the built-in ImportError.

IndexError

Equivalent to the built-in IndexError.

InvalidOperationError

The operation requested is illegal for the given combination of parameters.

IOError

Equivalent to the built-in IOError.

KeyError

Equivalent to the built-in KeyError.

MemoryError

Equivalent to the built-in MemoryError.

NotImplementedError

Equivalent to the built-in NotImplementedError.

OverflowError

Equivalent to the built-in OverflowError.

TypeError

Equivalent to the built-in TypeError.

ValueError

Equivalent to the built-in ValueError.

Warnings

Warnings are issued when it is helpful to inform the user of some condition in a program that does not result in an exception or program termination. We may issue the following warnings:

FutureWarning

A built-in python warning about deprecated features.

DatatableWarning

A datatable generic warning.

IOWarning

A warning regarding the input/output operation.

datatable.exceptions.DtException
class
DtException

Base class for all exceptions raised by datatable.

datatable.exceptions.DatatableWarning
class
DatatableWarning

A generic warning from datatable.

datatable.exceptions.ImportError
class
ImportError

This exception may be raised when a datatable operation requires an external module or library, but that module is not available. Examples of such operations include: converting a Frame into a pandas DataFrame, or into an Arrow Table, or reading an Excel file.

Inherits from Python ImportError and dt.exceptions.DtException.
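For example, a conversion that requires pandas can be guarded as follows (a small sketch, assuming pandas may be absent):

import datatable as dt

DT = dt.Frame(A=[1, 2, 3])
try:
    pdf = DT.to_pandas()    # raises dt.exceptions.ImportError if pandas is missing
except dt.exceptions.ImportError as e:
    print("pandas is not available:", e)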

datatable.exceptions.InvalidOperationError
class
InvalidOperationError

Raised in multiple scenarios whenever the requested operation is logically invalid with the given combination of parameters.

For example, cbind-ing several frames with incompatible shapes.

Inherits from dt.exceptions.DtException.

datatable.exceptions.IndexError
class
IndexError

Raised when accessing an element of a frame by index, but the value of the index falls outside of the boundaries of the frame.

Inherits from Python IndexError and dt.exceptions.DtException.

datatable.exceptions.IOError
class
IOError

Raised during any IO operation, such as reading/writing CSV or Jay files. The most common cause for such an error is an invalid input file.

Inherits from Python IOError and dt.exceptions.DtException.

datatable.exceptions.IOWarning
class
IOWarning

This warning is raised whenever you read an input file and there are some irregularities in the input that we can recover from, but perhaps the user should be informed that something wasn’t quite right.

datatable.exceptions.KeyError
class
KeyError

Raised when accessing a column of a frame by name, but the name lookup fails to find such a column.

Inherits from Python KeyError and dt.exceptions.DtException.

datatable.exceptions.MemoryError
class
MemoryError

This exception is raised whenever any operation fails to allocate the required amount of memory.

Inherits from Python MemoryError and dt.exceptions.DtException.

datatable.exceptions.NotImplementedError
class
NotImplementedError

Raised whenever an operation with given parameter values or input types is in theory valid, but hasn’t been implemented yet.

Inherits from Python NotImplementedError and dt.exceptions.DtException.

datatable.exceptions.OverflowError
class
OverflowError

A rare error that may occur if you pass a parameter that is too large to fit into the C++ int64 type, or sometimes too large to be represented by a double.

Inherits from Python OverflowError and dt.exceptions.DtException.

datatable.exceptions.TypeError
class
TypeError

One of the most common exceptions raised by datatable, this occurs when either a function receives an argument of unexpected type, or incorrect number of arguments, or whenever an operation is requested on a column whose type is not suitable for that operation.

Inherits from Python TypeError and dt.exceptions.DtException.

datatable.exceptions.ValueError
class
ValueError

Very common exception that occurs whenever an argument is passed to a function and that argument has the correct type, yet the value is not valid.

Inherits from Python ValueError and dt.exceptions.DtException.

datatable.internal

Warning

The functions in this sub-module are considered to be “internal” and not useful for day-to-day work with the datatable module.

frame_column_data_r()

C pointer to column’s data

frame_columns_virtual()

Indicators of which columns in the frame are virtual.

frame_integrity_check()

Run checks on whether the frame’s state is corrupted.

get_thread_ids()

Get ids of threads spawned by datatable.

datatable.internal.frame_column_data_r()

frame_column_data_r(frame, i)

Return C pointer to the main data array of the column frame[i]. The column will be materialized if it was virtual.

Parameters
frame
Frame

The dt.Frame where to look up the column.

i
int

The index of a column, in the range [0; ncols).

return
ctypes.c_void_p

The pointer to the column’s internal data.

datatable.internal.frame_columns_virtual()

frame_columns_virtual(frame)

Deprecated since version 0.11.0

Return the list indicating which columns in the frame are virtual.

Parameters
return
List[bool]

Each element in the list indicates whether the corresponding column is virtual or not.

Notes

This function will be expanded and moved into the main dt.Frame class.

datatable.internal.frame_integrity_check()

frame_integrity_check(frame)

This function performs a range of tests on the frame to verify that its internal state is consistent. It returns None on success, or throws an AssertionError if any problems were found.

Parameters
frame
Frame

A dt.Frame object that needs to be checked for internal consistency.

return
None
except
AssertionError

An exception is raised if there were any issues with the frame.

datatable.internal.get_thread_ids()

Return system ids of all threads used internally by datatable.

Calling this function will cause the threads to spawn if they haven't already. (This behavior may change in the future.)

Parameters
return
List[str]

The list of thread ids used by datatable. The first element in the list is the id of the main thread.


datatable.math

Trigonometric functions

sin(x)

Compute \(\sin x\) (the trigonometric sine of x).

cos(x)

Compute \(\cos x\) (the trigonometric cosine of x).

tan(x)

Compute \(\tan x\) (the trigonometric tangent of x).

arcsin(x)

Compute \(\sin^{-1} x\) (the inverse sine of x).

arccos(x)

Compute \(\cos^{-1} x\) (the inverse cosine of x).

arctan(x)

Compute \(\tan^{-1} x\) (the inverse tangent of x).

atan2(x, y)

Compute \(\tan^{-1} (x/y)\).

hypot(x, y)

Compute \(\sqrt{x^2 + y^2}\).

deg2rad(x)

Convert an angle measured in degrees into radians.

rad2deg(x)

Convert an angle measured in radians into degrees.

Hyperbolic functions

sinh(x)

Compute \(\sinh x\) (the hyperbolic sine of x).

cosh(x)

Compute \(\cosh x\) (the hyperbolic cosine of x).

tanh(x)

Compute \(\tanh x\) (the hyperbolic tangent of x).

arsinh(x)

Compute \(\sinh^{-1} x\) (the inverse hyperbolic sine of x).

arcosh(x)

Compute \(\cosh^{-1} x\) (the inverse hyperbolic cosine of x).

artanh(x)

Compute \(\tanh^{-1} x\) (the inverse hyperbolic tangent of x).

Exponential/logarithmic functions

exp(x)

Compute \(e^x\) (the exponent of x).

exp2(x)

Compute \(2^x\).

expm1(x)

Compute \(e^x - 1\).

log(x)

Compute \(\ln x\) (the natural logarithm of x).

log10(x)

Compute \(\log_{10} x\) (the decimal logarithm of x).

log1p(x)

Compute \(\ln(1 + x)\).

log2(x)

Compute \(\log_{2} x\) (the binary logarithm of x).

logaddexp(x, y)

Compute \(\ln(e^x + e^y)\).

logaddexp2(x, y)

Compute \(\log_2(2^x + 2^y)\).

cbrt(x)

Compute \(\sqrt[3]{x}\) (the cubic root of x).

pow(x, a)

Compute \(x^a\).

sqrt(x)

Compute \(\sqrt{x}\) (the square root of x).

square(x)

Compute \(x^2\) (the square of x).

Special mathematical functions

erf(x)

The error function \(\operatorname{erf}(x)\).

erfc(x)

The complementary error function \(1 - \operatorname{erf}(x)\).

gamma(x)

Euler gamma function of x.

lgamma(x)

Natural logarithm of the absolute value of the Euler gamma function of x.

Floating-point functions

abs(x)

Absolute value of x.

ceil(x)

The smallest integer not less than x.

copysign(x, y)

Number with the magnitude of x and the sign of y.

fabs(x)

The absolute value of x, returned as a float.

floor(x)

The largest integer not greater than x.

fmod(x, y)

Remainder of a floating-point division x/y.

isclose(x, y)

Check whether x ≈ y (up to some tolerance level).

isfinite(x)

Check if x is finite.

isinf(x)

Check if x is a positive or negative infinity.

isna(x)

Check if x is NA (a missing value).

ldexp(x, y)

Compute \(x\cdot 2^y\).

rint(x)

Round x to the nearest integer.

sign(x)

The sign of x, as a floating-point value.

signbit(x)

The sign of x, as a boolean value.

trunc(x)

The value of x truncated towards zero.

Mathematical constants

e

Euler’s constant \(e\).

golden

Golden ratio \(\varphi\).

inf

Positive infinity.

nan

Not-a-number.

pi

Mathematical constant \(\pi\).

tau

Mathematical constant \(\tau\).

Comparison table

The set of functions provided by the dt.math module is very similar to the standard Python’s math module, or numpy math functions. Below is the comparison table showing which functions are available:

math                numpy               datatable
------------------  ------------------  ------------------
Trigonometric/hyperbolic functions
sin(x)              sin(x)              sin(x)
cos(x)              cos(x)              cos(x)
tan(x)              tan(x)              tan(x)
asin(x)             arcsin(x)           arcsin(x)
acos(x)             arccos(x)           arccos(x)
atan(x)             arctan(x)           arctan(x)
atan2(y, x)         arctan2(y, x)       atan2(y, x)
sinh(x)             sinh(x)             sinh(x)
cosh(x)             cosh(x)             cosh(x)
tanh(x)             tanh(x)             tanh(x)
asinh(x)            arcsinh(x)          arsinh(x)
acosh(x)            arccosh(x)          arcosh(x)
atanh(x)            arctanh(x)          artanh(x)
hypot(x, y)         hypot(x, y)         hypot(x, y)
radians(x)          deg2rad(x)          deg2rad(x)
degrees(x)          rad2deg(x)          rad2deg(x)
Exponential/logarithmic/power functions
exp(x)              exp(x)              exp(x)
                    exp2(x)             exp2(x)
expm1(x)            expm1(x)            expm1(x)
log(x)              log(x)              log(x)
log10(x)            log10(x)            log10(x)
log1p(x)            log1p(x)            log1p(x)
log2(x)             log2(x)             log2(x)
                    logaddexp(x, y)     logaddexp(x, y)
                    logaddexp2(x, y)    logaddexp2(x, y)
                    cbrt(x)             cbrt(x)
pow(x, a)           power(x, a)         pow(x, a)
sqrt(x)             sqrt(x)             sqrt(x)
                    square(x)           square(x)
Special mathematical functions
erf(x)                                  erf(x)
erfc(x)                                 erfc(x)
gamma(x)                                gamma(x)
                    heaviside(x)
                    i0(x)
lgamma(x)                               lgamma(x)
                    sinc(x)
Floating-point functions
abs(x)              abs(x)              abs(x)
ceil(x)             ceil(x)             ceil(x)
copysign(x, y)      copysign(x, y)      copysign(x, y)
fabs(x)             fabs(x)             fabs(x)
floor(x)            floor(x)            floor(x)
fmod(x, y)          fmod(x, y)          fmod(x, y)
frexp(x)            frexp(x)
isclose(x, y)       isclose(x, y)       isclose(x, y)
isfinite(x)         isfinite(x)         isfinite(x)
isinf(x)            isinf(x)            isinf(x)
isnan(x)            isnan(x)            isna(x)
ldexp(x, n)         ldexp(x, n)         ldexp(x, n)
modf(x)             modf(x)
                    nextafter(x, y)
                    rint(x)             rint(x)
round(x)            round(x)            round(x)
                    sign(x)             sign(x)
                    signbit(x)          signbit(x)
                    spacing(x)
trunc(x)            trunc(x)            trunc(x)
Miscellaneous
                    clip(x, a, b)
comb(n, k)
                    divmod(x, y)
factorial(n)
gcd(a, b)           gcd(a, b)
                    maximum(x, y)
                    minimum(x, y)
Mathematical constants
e                   e                   e
                                        golden
inf                 inf                 inf
nan                 nan                 nan
pi                  pi                  pi
tau                                     tau

datatable.math.abs()

Return the absolute value of x. This function can only be applied to numeric arguments (i.e. boolean, integer, or real).

This function upcasts columns of types bool8, int8 and int16 into int32; for columns of other types the stype is kept.

Parameters
x
FExpr

Column expression producing one or more numeric columns.

return
FExpr

The resulting FExpr evaluates to the absolute values of all elements in all columns of x.

Examples
DT = dt.Frame(A=[-3, 2, 4, -17, 0])
DT[:, abs(f.A)]
       A
   int32
0      3
1      2
2      4
3     17
4      0
datatable.math.arccos()

Inverse trigonometric cosine of x.

In mathematics, this may be written as \(\arccos x\) or \(\cos^{-1}x\).

The returned value is in the interval \([0, \frac12\tau]\), and NA for the values of x that lie outside the interval [-1, 1]. This function is the inverse of cos() in the sense that cos(arccos(x)) == x for all x in the interval [-1, 1].

See also
  • cos(x) – the trigonometric cosine function;

  • arcsin(x) – the inverse sine function.

datatable.math.arcosh()

The inverse hyperbolic cosine of x.

This function satisfies the property that cosh(arcosh(x)) == x. Alternatively, this function can also be computed as \(\cosh^{-1}(x) = \ln(x + \sqrt{x^2 - 1})\).

datatable.math.arcsin()

Inverse trigonometric sine of x.

In mathematics, this may be written as \(\arcsin x\) or \(\sin^{-1}x\).

The returned value is in the interval \([-\frac14 \tau, \frac14\tau]\), and NA for the values of x that lie outside the interval [-1, 1]. This function is the inverse of sin() in the sense that sin(arcsin(x)) == x for all x in the interval [-1, 1].

See also
  • sin(x) – the trigonometric sine function;

  • arccos(x) – the inverse cosine function.

datatable.math.arctan()

Inverse trigonometric tangent of x.

This function satisfies the property that tan(arctan(x)) == x.

See also
  • atan2(x, y) – two-argument inverse tangent function;

  • tan(x) – the trigonometric tangent function.

datatable.math.arsinh()

The inverse hyperbolic sine of x.

This function satisfies the property that sinh(arsinh(x)) == x. Alternatively, this function can also be computed as \(\sinh^{-1}(x) = \ln(x + \sqrt{x^2 + 1})\).

datatable.math.artanh()

The inverse hyperbolic tangent of x.

This function satisfies the property that tanh(artanh(x)) == x. Alternatively, this function can also be computed as \(\tanh^{-1}(x) = \frac12\ln\frac{1+x}{1-x}\).

See also
  • tanh() – hyperbolic tangent;

datatable.math.atan2()

The inverse trigonometric tangent of y/x, taking into account the signs of x and y to produce the correct result.

If (x,y) is a point in a Cartesian plane, then arctan2(y, x) returns the radian measure of an angle formed by two rays: one starting at the origin and passing through point (1,0), and the other starting at the origin and passing through point (x,y). The angle is assumed positive if the rotation from the first ray to the second occurs counter-clockwise, and negative otherwise.

As a special case, arctan2(0, 0) == 0, and arctan2(0, -1) == tau/2.

datatable.math.cbrt()

Cubic root of x.

datatable.math.ceil()

The smallest integer value not less than x, returned as float.

This function produces a float32 column if the input is of type float32, or float64 columns for inputs of all other numeric stypes.

Parameters
x
FExpr

One or more numeric columns.

return
FExpr

Expression that computes the ceil() function for each row and column in x.

datatable.math.copysign()

Return a float with the magnitude of x and the sign of y.

datatable.math.cos()

Compute the trigonometric cosine of angle x measured in radians.

This function can only be applied to numeric columns (real, integer, or boolean), and produces a float64 result, except when the argument x is float32, in which case the result is float32 as well.

See also
  • sin(x) – the trigonometric sine function;

  • arccos(x) – the inverse cosine function.

datatable.math.cosh()

The hyperbolic cosine of x, defined as \(\cosh x = \frac12(e^x + e^{-x})\).

datatable.math.deg2rad()

Convert angle measured in degrees into radians: \(\operatorname{deg2rad}(x) = x\cdot\frac{\tau}{360}\).

datatable.math.e

The base of the natural logarithm \(e\), also known as the Euler’s number. This number is defined as the limit \(e = \lim_{n\to\infty}(1 + 1/n)^n\).

The value is stored at float64 precision, and is equal to 2.718281828459045.

See Also
  • math.e – The Euler’s number in the Python math module;

datatable.math.erf()

Error function erf(x), which is defined as the integral

\[\operatorname{erf}(x) = \frac{2}{\sqrt{\tau}} \int^{x/\sqrt{2}}_0 e^{-\frac12 t^2}dt\]

This function is used in computing probabilities arising from the normal distribution.

See also
  • erfc(x) – complimentary error function.

datatable.math.erfc()

Complementary error function erfc(x) = 1 - erf(x).

The complementary error function is defined as the integral

\[\operatorname{erfc}(x) = \frac{2}{\sqrt{\tau}} \int^{\infty}_{x/\sqrt{2}} e^{-\frac12 t^2}dt\]

Although mathematically erfc(x) = 1-erf(x), in practice the RHS suffers catastrophic loss of precision at large values of x. This function, however, does not have such a drawback.

See also
  • erf(x) – the error function.

datatable.math.exp()

The exponent of x, that is \(e^x\).

See also
  • e – the Euler’s number;

  • expm1(x) – exponent function minus one;

  • exp2(x) – binary exponent;

datatable.math.exp2()

Binary exponent of x, same as \(2^x\).

See also
  • exp(x) – base-\(e\) exponent.

datatable.math.expm1()

The exponent of x minus 1, that is \(e^x - 1\). This function is more accurate for arguments x close to zero.

datatable.math.fabs()

The absolute value of x, returned as float.

datatable.math.floor()

The largest integer value not greater than x, returned as float.

This function produces a float32 column if the input is of type float32, or float64 columns for inputs of all other numeric stypes.

Parameters
x
FExpr

One or more numeric columns.

return
FExpr

Expression that computes the floor() function for each row and column in x.

datatable.math.fmod()

Floating-point remainder of the division x/y. The result is always a float, even if the arguments are integers. This function uses std::fmod() from the standard C++ library; its convention for handling negative numbers may differ from Python's.
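A small sketch illustrating the difference (assuming the usual from datatable import dt, f imports; expected results shown as comments):

DT = dt.Frame(x=[-7.0, 7.0])
DT[:, dt.math.fmod(f.x, 3)]   # -> -1.0 and 1.0 under the C++ convention;
                              #    compare with Python, where (-7) % 3 == 2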

datatable.math.gamma()

Euler Gamma function of x.

The gamma function is defined for all x except for the negative integers. For positive x it can be computed via the integral

\[\Gamma(x) = \int_0^\infty t^{x-1}e^{-t}dt\]

For negative x it can be computed as

\[\Gamma(x) = \frac{\Gamma(x + k)}{x(x+1)\cdot...\cdot(x+k-1)}\]

where \(k\) is any integer such that \(x+k\) is positive.

If x is a positive integer, then \(\Gamma(x) = (x - 1)!\).

datatable.math.golden

The golden ratio \(\varphi = (1 + \sqrt{5})/2\), also known as golden section. This is a number such that if \(a = \varphi b\), for some non-zero \(a\) and \(b\), then it must also be true that \(a + b = \varphi a\).

The constant is stored with float64 precision, and its value is 1.618033988749895.

datatable.math.hypot()

The length of the hypotenuse of a right triangle with sides x and y, or in math notation \(\operatorname{hypot}(x, y) = \sqrt{x^2 + y^2}\).

datatable.math.inf

Number representing positive infinity \(\infty\). Write -inf for negative infinity.

datatable.math.isclose()
isclose(x, y, *, rtol=1e-5, atol=1e-8)

Compare two numbers x and y, and return True if they are close within the requested relative/absolute tolerance. This function only returns True/False, never NA.

More specifically, isclose(x, y) is True if either of the following is true:

  • x == y (including the case when x and y are NAs),

  • abs(x - y) <= atol + rtol * abs(y) and neither x nor y are NA

The tolerance parameters rtol, atol must be positive floats, and cannot be expressions.
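A small sketch of the resulting behavior (assuming the usual from datatable import dt, f imports; values chosen for illustration):

DT = dt.Frame(a=[1.0, 2.0, None], b=[1.0 + 1e-9, 2.1, None])
DT[:, dt.math.isclose(f.a, f.b)]   # -> True, False, True (NA compared to NA counts as close)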

datatable.math.isfinite()

Returns True if x has a finite value, and False if x is infinity or NaN. This function is equivalent to !(isna(x) or isinf(x)).

datatable.math.isinf()

Returns True if the argument is +/- infinity, and False otherwise. Note that isinf(NA) == False.

datatable.math.isna()

Returns True if the argument is NA, and False otherwise.

datatable.math.ldexp()

Multiply x by 2 raised to the power y, i.e. compute x * 2**y. Column x is expected to be float, and y integer.

datatable.math.lgamma()

Natural logarithm of the absolute value of the Euler Gamma function of x.

datatable.math.log()

Natural logarithm of x, aka \(\ln x\). This function is the inverse of exp().

datatable.math.log10()

Decimal (base-10) logarithm of x, which is \(\lg(x)\) or \(\log_{10} x\). This function is the inverse of pow(10, x).

See also
  • log() – natural logarithm;

  • log2() – binary logarithm.

datatable.math.log1p()

Natural logarithm of 1 plus x, or \(\ln(1 + x)\). This function has improved numeric precision for small values of x.

datatable.math.log2()

Binary (base-2) logarithm of x, which in mathematics is \(\log_2 x\).

See also
  • log() – natural logarithm;

  • log10() – decimal logarithm.

datatable.math.logaddexp()

The logarithm of the sum of exponents of x and y. This function is equivalent to log(exp(x) + exp(y)), but does not suffer from catastrophic precision loss for small values of x and y.

datatable.math.logaddexp2()

Binary logarithm of the sum of binary exponents of x and y. This function is equivalent to log2(exp2(x) + exp2(y)), but does not suffer from catastrophic precision loss for small values of x and y.

datatable.math.nan

Not-a-number, a special floating-point constant that denotes a missing number. In most datatable functions you can use None instead of nan.

datatable.math.pi

Mathematical constant \(\pi = \frac12\tau\), also known as Archimedes’ constant, equal to the length of a semicircle with radius 1, or equivalently the arc-length of a \(180^\circ\) angle [1].

The constant is stored at float64 precision, and its value is 3.141592653589793.

See Also
  • tau – mathematical constant \(\tau = 2\pi\);

  • math.pi – The \(\pi\) constant in the Python math module;

datatable.math.pow()

Number x raised to the power y. The return value will be float, even if the arguments x and y are integers.

This function is equivalent to x ** y.

datatable.math.rad2deg()

Convert angle measured in radians into degrees: \(\operatorname{rad2deg}(x) = x\cdot\frac{360}{\tau}\).

datatable.math.rint()

Round the value x to the nearest integer.

datatable.math.round()
Added in version 0.11

Round the values in cols to the specified number of digits of precision ndigits. If the number of digits is omitted, rounds to the nearest integer.

Generally, this operation is equivalent to:

rint(col * 10**ndigits) / 10**ndigits

where function rint() rounds to the nearest integer.

Parameters
cols
FExpr

Input data for rounding. This could be an expression yielding either a single or multiple columns. The round() function will apply to each column independently and produce as many columns in the output as there were in the input.

Only numeric columns are allowed: boolean, integer or float. An exception will be raised if cols contains a non-numeric column.

ndigits
int | None

The number of precision digits to retain. This parameter could be either positive or negative (or None). If positive, then it gives the number of digits after the decimal point. If negative, then the value is rounded to the nearest multiple of the corresponding power of 10.

For example, 123.45 rounded to ndigits=1 is 123.4, whereas rounded to ndigits=-1 it becomes 120.0.

return
FExpr

f-expression that rounds the values in its first argument to the specified number of precision digits.

Each input column will produce the column of the same stype in the output; except for the case when ndigits is None and the input is either float32 or float64, in which case an int64 column is produced (similarly to python’s round()).

Notes

Values that are exactly half way in between their rounded neighbors are rounded towards the nearest even value. For example, both 7.5 and 8.5 round to 8, whereas 6.5 rounds to 6.

Rounding integer columns may produce unexpected results for values that are close to the min/max value of that column's storage type. For example, when an int8 value 127 is rounded to the nearest 10, it becomes 130. However, since 130 cannot be represented as int8, a wrap-around occurs and the result becomes -126.

Rounding an integer column to a positive ndigits is a noop: the column will be returned unchanged.

Rounding an integer column to a large negative ndigits will produce a constant all-0 column.
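A brief sketch of these rules in action (assuming the usual from datatable import dt, f imports; expected results shown as comments):

DT = dt.Frame(x=[6.5, 7.5, 8.5, 123.45])
DT[:, dt.math.round(f.x)]               # -> 6, 8, 8, 123 (ties go to the even neighbor)
DT[:, dt.math.round(f.x, ndigits=-1)]   # -> 10, 10, 10, 120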

datatable.math.sign()

The sign of x, returned as float.

This function returns 1.0 if x is positive (including positive infinity), -1.0 if x is negative, 0.0 if x is zero, and NA if x is NA.

datatable.math.signbit()

Returns True if x is negative (its sign bit is set), and False if x is positive. This function is able to distinguish between -0.0 and +0.0, returning True/False respectively. If x is an NA value, this function will also return NA.

datatable.math.sin()

Compute the trigonometric sine of angle x measured in radians.

This function can only be applied to numeric columns (real, integer, or boolean), and produces a float64 result, except when the argument x is float32, in which case the result is float32 as well.

See also
  • cos(x) – the trigonometric cosine function;

  • arcsin(x) – the inverse sine function.

datatable.math.sinh()

Hyperbolic sine of x, defined as \(\sinh x = \frac12(e^x - e^{-x})\).

datatable.math.sqrt()

The square root of x, same as x ** 0.5.

datatable.math.square()

The square of x, same as x ** 2.0. As with all other math functions, the result is floating-point, even if the argument x is integer.

datatable.math.tan()

Compute the trigonometric tangent of x, which is the ratio sin(x)/cos(x).

This function can only be applied to numeric columns (real, integer, or boolean), and produces a float64 result, except when the argument x is float32, in which case the result is float32 as well.

See also
  • arctan(x) – the inverse tangent function.

datatable.math.tanh()

Hyperbolic tangent of x, defined as \(\tanh x = \frac{\sinh x}{\cosh x} = \frac{e^x-e^{-x}}{e^x+e^{-x}}\).

See also
  • artanh() – inverse hyperbolic tangent.

datatable.math.tau

Mathematical constant \(\tau\), also known as a turn, equal to the circumference of a circle with a unit radius.

The constant is stored at float64 precision, and its value is 6.283185307179586.

See Also
  • pi – mathematical constant \(\pi = \frac12\tau\);

  • math.tau – The \(\tau\) constant in the Python math module;

  • Tau manifesto

datatable.math.trunc()

The nearest integer value not greater than x in magnitude.

If x is integer or boolean, then trunc() will return this value converted to float64. If x is floating-point, then trunc(x) acts as floor(x) for positive values of x, and as ceil(x) for negative values of x. This rounding mode is known as rounding towards zero.

datatable.models

Classes

Ftrl

FTRL-Proximal online learning model.

LinearModel

Linear model with stochastic gradient descent learning.

Functions

aggregate()

Aggregate a frame.

kfold()

Perform k-fold split.

kfold_random()

Perform randomized k-fold split.

datatable.models.Ftrl

This class implements the Follow the Regularized Leader (FTRL) model, that is based on the FTRL-Proximal online learning algorithm for binomial logistic regression. Multinomial classification and regression for continuous targets are also implemented, though these implementations are experimental. This model is fully parallel and is based on the Hogwild approach for parallelization.

The model supports numerical (boolean, integer and float types), temporal (date and time types) and string features. To vectorize features a hashing trick is employed, such that all the values are hashed with the 64-bit hashing function. This function is implemented as follows:

  • for booleans and integers the hashing function is essentially an identity function;

  • for floats the hashing function trims mantissa, taking into account mantissa_nbits, and interprets the resulting bit representation as a 64-bit unsigned integer;

  • for date and time types the hashing function is essentially an identity function that is based on their internal integer representations;

  • for strings the 64-bit Murmur2 hashing function is used.

To compute the final hash, the Murmur2-hashed feature name is added to the hashed feature value, and the result is taken modulo the number of requested bins, i.e. nbins.
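As a rough illustration of this scheme (a sketch, not datatable's actual implementation; hash64 below stands in for the 64-bit Murmur2 hash, which is not available in the Python standard library):

import struct
import hashlib

def hash64(data: bytes) -> int:
    # Stand-in for the 64-bit Murmur2 hash used by datatable.
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")

def trim_mantissa(x: float, mantissa_nbits: int = 10) -> int:
    # Reinterpret the float's bits as an unsigned 64-bit integer,
    # keeping only the top `mantissa_nbits` bits of the 52-bit mantissa.
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return (bits >> (52 - mantissa_nbits)) << (52 - mantissa_nbits)

def hash_feature(col_name: str, value, nbins: int = 10**6) -> int:
    h_name = hash64(col_name.encode())
    if isinstance(value, (bool, int)):
        h_value = value                        # identity hashing
    elif isinstance(value, float):
        h_value = trim_mantissa(value)         # mantissa trimming
    else:
        h_value = hash64(str(value).encode())  # strings are hashed
    return (h_name + h_value) % nbins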

For each hashed row of data, according to Ad Click Prediction: a View from the Trenches, the following FTRL-Proximal algorithm is employed:

Per-coordinate FTRL-Proximal online learning algorithm

When trained, the model can be used to make predictions, or it can be re-trained on new datasets as many times as needed improving model weights from run to run.

Construction

Ftrl()

Construct an Ftrl object.

Methods

fit()

Train model on the input samples and targets.

predict()

Predict for the input samples.

reset()

Reset the model.

Properties

alpha

\(\alpha\) in per-coordinate FTRL-Proximal algorithm.

beta

\(\beta\) in per-coordinate FTRL-Proximal algorithm.

colnames

Column names of the training frame, i.e. features.

colname_hashes

Hashes of the column names.

double_precision

An option to control precision of the internal computations.

feature_importances

Feature importances calculated during training.

interactions

Feature interactions.

labels

Classification labels.

lambda1

L1 regularization parameter, \(\lambda_1\) in per-coordinate FTRL-Proximal algorithm.

lambda2

L2 regularization parameter, \(\lambda_2\) in per-coordinate FTRL-Proximal algorithm.

mantissa_nbits

Number of mantissa bits for hashing floats.

model

The model’s z and n coefficients.

model_type

A model type Ftrl should build.

model_type_trained

A model type Ftrl has built.

nbins

Number of bins for the hashing trick.

negative_class

An option to indicate whether the “negative” class should be created in the case of multinomial classification.

nepochs

Number of training epochs.

params

All the input model parameters as a named tuple.

datatable.models.Ftrl.__init__()

Create a new Ftrl object.

Parameters
alpha
float

\(\alpha\) in per-coordinate FTRL-Proximal algorithm, should be positive.

beta
float

\(\beta\) in per-coordinate FTRL-Proximal algorithm, should be non-negative.

lambda1
float

L1 regularization parameter, \(\lambda_1\) in per-coordinate FTRL-Proximal algorithm. It should be non-negative.

lambda2
float

L2 regularization parameter, \(\lambda_2\) in per-coordinate FTRL-Proximal algorithm. It should be non-negative.

nbins
int

Number of bins to be used for the hashing trick, should be positive.

mantissa_nbits
int

Number of mantissa bits to take into account when hashing floats. It should be non-negative and less than or equal to 52, that is a number of mantissa bits allocated for a C++ 64-bit double.

nepochs
float

Number of training epochs, should be non-negative. When nepochs is an integer number, the model will train on all the data provided to the .fit() method nepochs times. If nepochs has a fractional part, the model will train on all the data floor(nepochs) times, i.e. the integer part of nepochs, and then perform one additional training iteration on the remaining fraction of the data; for example, with nepochs = 2.5 the model passes over the full data twice, and then once over half of it.

double_precision
bool

An option to indicate whether double precision, i.e. float64, or single precision, i.e. float32, arithmetic should be used for computations. It is not guaranteed that setting double_precision to True will automatically improve the model accuracy. It will, however, roughly double the memory footprint of the Ftrl object.

negative_class
bool

An option to indicate if a “negative” class should be created in the case of multinomial classification. For the “negative” class the model will train on all the negatives, and if a new label is encountered in the target column, its weights will be initialized to the current “negative” class weights. If negative_class is set to False, the initial weights become zeros.

interactions
List[List[str] | Tuple[str]] | Tuple[List[str] | Tuple[str]]

A list or a tuple of interactions. In turn, each interaction should be a list or a tuple of feature names, where each feature name is a column name from the training frame. Each interaction should have at least one feature.

model_type
"binomial" | "multinomial" | "regression" | "auto"

The model type to be built. When this option is "auto" then the model type will be automatically chosen based on the target column stype.

params
FtrlParams

Named tuple of the above parameters. One can pass either this tuple, or any combination of the individual parameters to the constructor, but not both at the same time.

except
ValueError

The exception is raised if both the params and one of the individual model parameters are passed at the same time.

datatable.models.Ftrl.alpha

\(\alpha\) in per-coordinate FTRL-Proximal algorithm.

Parameters
return
float

Current alpha value.

new_alpha
float

New alpha value, should be positive.

except
ValueError

The exception is raised when new_alpha is not positive.

datatable.models.Ftrl.beta

\(\beta\) in per-coordinate FTRL-Proximal algorithm.

Parameters
return
float

Current beta value.

new_beta
float

New beta value, should be non-negative.

except
ValueError

The exception is raised when new_beta is negative.

datatable.models.Ftrl.colnames

Column names of the training frame, i.e. the feature names.

Parameters
return
List[str]

A list of the column names.

datatable.models.Ftrl.colname_hashes

Hashes of the column names used for the hashing trick as described in the Ftrl class description.

Parameters
return
List[int]

A list of the column name hashes.

See also
  • .colnames – the column names of the training frame, i.e. the feature names.

datatable.models.Ftrl.double_precision

An option to indicate whether double precision, i.e. float64, or single precision, i.e. float32, arithmetic should be used for computations. This option is read-only and can only be set during the Ftrl object construction.

Parameters
return
bool

Current double_precision value.

datatable.models.Ftrl.feature_importances
feature_importances

Feature importances as calculated during the model training and normalized to [0; 1]. The normalization is done by dividing the accumulated feature importances by the maximum value.

Parameters
return
Frame

A frame with two columns: feature_name that has stype str32, and feature_importance that has stype float32 or float64 depending on whether the .double_precision option is False or True.

datatable.models.Ftrl.fit()

Train model on the input samples and targets.

Parameters
X_train
Frame

Training frame.

y_train
Frame

Target frame having as many rows as X_train and one column.

X_validation
Frame

Validation frame having the same number of columns as X_train.

y_validation
Frame

Validation target frame of shape (nrows, 1).

nepochs_validation
float

Parameter that specifies how often, in epoch units, validation error should be checked.

validation_error
float

The improvement of the relative validation error that should be demonstrated by the model within nepochs_validation epochs, otherwise the training will stop.

validation_average_niterations
int

Number of iterations that is used to average the validation error. Each iteration corresponds to nepochs_validation epochs.

return
FtrlFitOutput

FtrlFitOutput is a Tuple[float, float] with two fields: epoch and loss, representing the final fitting epoch and the final loss, respectively. If a validation dataset is not provided, the returned epoch equals nepochs and the loss is just float('nan').

datatable.models.Ftrl.interactions

The feature interactions to be used for model training. This option is read-only for a trained model.

Parameters
return
Tuple

Current interactions value.

new_interactions
List[List[str] | Tuple[str]] | Tuple[List[str] | Tuple[str]]

New interactions value. Each particular interaction should be a list or a tuple of feature names, where each feature name is a column name from the training frame.

except
ValueError

The exception is raised when

  • trying to change this option for a model that has already been trained;

  • one of the interactions has zero features.

datatable.models.Ftrl.labels

Classification labels the model was trained on.

Parameters
return
Frame

A one-column frame with the classification labels. In the case of numeric regression, the label is the target column name.

datatable.models.Ftrl.lambda1

L1 regularization parameter, \(\lambda_1\) in per-coordinate FTRL-Proximal algorithm.

Parameters
return
float

Current lambda1 value.

new_lambda1
float

New lambda1 value, should be non-negative.

except
ValueError

The exception is raised when new_lambda1 is negative.

datatable.models.Ftrl.lambda2

L2 regularization parameter, \(\lambda_2\) in per-coordinate FTRL-Proximal algorithm.

Parameters
return
float

Current lambda2 value.

new_lambda2
float

New lambda2 value, should be non-negative.

except
ValueError

The exception is raised when new_lambda2 is negative.

datatable.models.Ftrl.model

Trained model’s weights, i.e. z and n coefficients in the per-coordinate FTRL-Proximal algorithm.

Parameters
return
Frame

A frame of shape (nbins, 2 * nlabels), where nlabels is the total number of labels the model was trained on, and nbins is the number of bins used for the hashing trick. Odd and even columns represent the z and n model coefficients, respectively.

datatable.models.Ftrl.model_type

The type of model Ftrl should build:

  • "binomial" for binomial classification;

  • "multinomial" for multinomial classification;

  • "regression" for numeric regression;

  • "auto" for automatic model type detection based on the target column stype.

This option is read-only for a trained model.

Parameters
return
str

Current model_type value.

new_model_type
"binomial" | "multinomial" | "regression" | "auto"

New model_type value.

except
ValueError

The exception is raised when

  • trying to change this option for a model that has already been trained;

  • new_model_type value is not one of the following: "binomial", "multinomial", "regression" or "auto".

See also
  • .model_type_trained – the model type Ftrl has actually built.

datatable.models.Ftrl.model_type_trained

The model type Ftrl has built.

Parameters
return
str

Could be one of the following: "regression", "binomial", "multinomial" or "none" for an untrained model.

See also
  • .model_type – the model type Ftrl should build.

datatable.models.Ftrl.mantissa_nbits

Number of mantissa bits to take into account for hashing floats. This option is read-only for a trained model.

Parameters
return
int

Current mantissa_nbits value.

new_mantissa_nbits
int

New mantissa_nbits value, should be non-negative and less than or equal to 52, i.e. the number of mantissa bits in a C++ 64-bit double.

except
ValueError

The exception is raised when

  • trying to change this option for a model that has already been trained;

  • new_mantissa_nbits value is negative or larger than 52.

datatable.models.Ftrl.nbins

Number of bins to be used for the hashing trick. This option is read-only for a trained model.

Parameters
return
int

Current nbins value.

new_nbins
int

New nbins value, should be positive.

except
ValueError

The exception is raised when

  • trying to change this option for a model that has already been trained;

  • new_nbins value is not positive.

datatable.models.Ftrl.negative_class

An option to indicate if a “negative” class should be created in the case of multinomial classification. For the “negative” class the model will train on all the negatives, and if a new label is encountered in the target column, its weights are initialized to the current “negative” class weights. If negative_class is set to False, the initial weights become zeros.

This option is read-only for a trained model.

Parameters
return
bool

Current negative_class value.

new_negative_class
bool

New negative_class value.

except
ValueError

The exception is raised when trying to change this option for a model that has already been trained.

datatable.models.Ftrl.nepochs

Number of training epochs. When nepochs is an integer number, the model will train on all the data provided to the .fit() method nepochs times. If nepochs has a fractional part {nepochs}, the model will train on all the data [nepochs] times, i.e. the integer part of nepochs, and then perform one additional training iteration on the {nepochs} fraction of the data.

Parameters
return
float

Current nepochs value.

new_nepochs
float

New nepochs value, should be non-negative.

except
ValueError

The exception is raised when new_nepochs value is negative.

datatable.models.Ftrl.params

Ftrl model parameters as a named tuple FtrlParams, see .__init__() for more details. This option is read-only for a trained model.

Parameters
return
FtrlParams

Current params value.

new_params
FtrlParams

New params value.

except
ValueError

The exception is raised when

  • trying to change this option for a model that has already been trained;

  • individual parameter values are incompatible with the corresponding setters.

datatable.models.Ftrl.predict()

Predict for the input samples.

Parameters
X
Frame

A frame to make predictions for. It should have the same number of columns as the training frame.

return
Frame

A new frame of shape (X.nrows, nlabels) with the predicted probabilities for each row of frame X and each of nlabels labels the model was trained for.
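
Examples

A minimal sketch (toy data; the column name is illustrative):

import datatable as dt
from datatable.models import Ftrl

DT = dt.Frame(x=[1.0, 2.0, 3.0, 4.0], y=[True, False, True, False])
model = Ftrl(nepochs=5)
model.fit(DT[:, "x"], DT[:, "y"])
p = model.predict(DT[:, "x"])   # shape (DT.nrows, nlabels), per-label probabilities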

See also
  • .fit() – train model on the input samples and targets.

  • .reset() – reset the model.

datatable.models.Ftrl.reset()

Reset Ftrl model by resetting all the model weights, labels and feature importance information.

Parameters
return
None
See also
  • .fit() – train model on the input samples and targets.

  • .predict() – predict for the input samples.

datatable.models.LinearModel

This class implements a linear model with stochastic gradient descent learning. It supports linear regression, as well as binomial and multinomial classification. Both .fit() and .predict() methods are fully parallel.

Construction

LinearModel()

Construct a LinearModel object.

Methods

fit()

Train model on the input samples and targets.

is_fitted()

Report model status.

predict()

Predict for the input samples.

reset()

Reset the model.

Properties

eta0

Initial learning rate.

eta_decay

Decay for the "time-based" and "step-based" learning rate schedules.

eta_drop_rate

Drop rate for the "step-based" learning rate schedule.

eta_schedule

Learning rate schedule.

double_precision

An option to control precision of the internal computations.

labels

Classification labels.

lambda1

L1 regularization parameter.

lambda2

L2 regularization parameter.

model

Model coefficients.

model_type

Model type to be built.

negative_class

An option to indicate if the “negative” class should be created for multinomial classification.

nepochs

Number of training epochs.

params

All the input model parameters as a named tuple.

seed

Seed for the quasi-random data shuffling.

datatable.models.LinearModel.__init__()
LinearModel(eta0=0.005, eta_decay=0.0001, eta_schedule='constant', model_type='auto', seed=0, params=None)

Create a new LinearModel object.

Parameters
eta0
float

The initial learning rate, should be positive.

eta_decay
float

Decay for the "time-based" and "step-based" learning rate schedules, should be non-negative.

eta_drop_rate
float

Drop rate for the "step-based" learning rate schedule, should be positive.

eta_schedule
"constant" | "time-based" | "step-based" | "exponential"

Learning rate schedule. When it is "constant" the learning rate eta is constant and equals eta0. Otherwise, after each training iteration eta is updated as follows:

  • for "time-based" schedule as eta0 / (1 + eta_decay * epoch);

  • for "step-based" schedule as eta0 * eta_decay ^ floor((1 + epoch) / eta_drop_rate);

  • for "exponential" schedule as eta0 / exp(eta_decay * epoch).

By default, the size of the training iteration is one epoch; it becomes nepochs_validation when a validation dataset is specified.
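
The schedules above can be mirrored in plain Python. This is only a sketch of the formulas as stated, not library code:

import math

def eta_at(epoch, eta0, eta_decay, eta_drop_rate, schedule):
    # Mirrors the eta update rules listed above.
    if schedule == "constant":
        return eta0
    if schedule == "time-based":
        return eta0 / (1 + eta_decay * epoch)
    if schedule == "step-based":
        return eta0 * eta_decay ** math.floor((1 + epoch) / eta_drop_rate)
    if schedule == "exponential":
        return eta0 / math.exp(eta_decay * epoch)
    raise ValueError("unknown schedule: " + schedule)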

lambda1
float

L1 regularization parameter, should be non-negative.

lambda2
float

L2 regularization parameter, should be non-negative.

nepochs
float

Number of training epochs, should be non-negative. When nepochs is an integer number, the model will train on all the data provided to the .fit() method nepochs times. If nepochs has a fractional part {nepochs}, the model will train on all the data [nepochs] times, i.e. the integer part of nepochs, and then perform one additional training iteration on the {nepochs} fraction of the data.

double_precision
bool

An option to indicate whether double precision, i.e. float64, or single precision, i.e. float32, arithmetic should be used for computations. It is not guaranteed that setting double_precision to True will automatically improve the model accuracy. It will, however, roughly double the memory footprint of the LinearModel object.

negative_class
bool

An option to indicate if a “negative” class should be created in the case of multinomial classification. For the “negative” class the model will train on all the negatives, and if a new label is encountered in the target column, its coefficients will be initialized to the current “negative” class coefficients. If negative_class is set to False, the initial coefficients become zeros.

model_type
"binomial" | "multinomial" | "regression" | "auto"

The model type to be built. When this option is "auto" then the model type will be automatically chosen based on the target column stype.

seed
int

Seed for the quasi-random number generator that is used for data shuffling when fitting the model, should be non-negative. If seed is zero, no shuffling is performed.

params
LinearModelParams

Named tuple of the above parameters. One can pass either this tuple, or any combination of the individual parameters to the constructor, but not both at the same time.

except
ValueError

The exception is raised if both the params and one of the individual model parameters are passed at the same time.
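
Examples

A minimal end-to-end sketch (toy data; the parameter values and column names are illustrative):

import datatable as dt
from datatable.models import LinearModel

DT = dt.Frame(x=[0.1, 0.2, 0.3, 0.4], y=[False, False, True, True])
lm = LinearModel(eta0=0.01, nepochs=20, seed=1)
lm.fit(DT[:, "x"], DT[:, "y"])
assert lm.is_fitted()
p = lm.predict(DT[:, "x"])   # frame of predicted probabilities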

datatable.models.LinearModel.eta0

The initial learning rate.

Parameters
return
float

Current eta0 value.

new_eta0
float

New eta0 value, should be positive.

except
ValueError

The exception is raised when new_eta0 is not positive.

datatable.models.LinearModel.eta_decay

Decay for the "time-based" and "step-based" learning rate schedules.

Parameters
return
float

Current eta_decay value.

new_eta_decay
float

New eta_decay value, should be non-negative.

except
ValueError

The exception is raised when new_eta_decay is negative.

datatable.models.LinearModel.eta_drop_rate

Drop rate for the "step-based" learning rate schedule.

Parameters
return
float

Current eta_drop_rate value.

new_eta_drop_rate
float

New eta_drop_rate value, should be positive.

except
ValueError

The exception is raised when new_eta_drop_rate is not positive.

datatable.models.LinearModel.eta_schedule

Learning rate schedule

  • "constant" for constant eta;

  • "time-based" for time-based schedule;

  • "step-based" for step-based schedule;

  • "exponential" for exponential schedule.

Parameters
return
str

Current eta_schedule value.

new_eta_schedule
"constant" | "time-based" | "step-based" | "exponential"

New eta_schedule value.

datatable.models.LinearModel.double_precision

An option to indicate whether double precision, i.e. float64, or single precision, i.e. float32, arithmetic should be used for computations. This option is read-only and can only be set during the LinearModel object construction.

Parameters
return
bool

Current double_precision value.

datatable.models.LinearModel.fit()

Train model on the input samples and targets using the parallel stochastic gradient descent method.

Parameters
X_train
Frame

Training frame.

y_train
Frame

Target frame having as many rows as X_train and one column.

X_validation
Frame

Validation frame having the same number of columns as X_train.

y_validation
Frame

Validation target frame of shape (nrows, 1).

nepochs_validation
float

Parameter that specifies how often, in epoch units, validation error should be checked.

validation_error
float

The relative improvement in validation error that the model should demonstrate within nepochs_validation epochs; otherwise, the training will stop early.

validation_average_niterations
int

Number of iterations used to average the validation error. Each iteration corresponds to nepochs_validation epochs.

return
LinearModelFitOutput

LinearModelFitOutput is a Tuple[float, float] with two fields: epoch and loss, representing the final fitting epoch and the final loss, respectively. If a validation dataset is not provided, the returned epoch equals nepochs and the loss is float('nan').

See also
  • .predict() – predict for the input samples.

  • .reset() – reset the model.

datatable.models.LinearModel.is_fitted()

Report model status.

Parameters
return
bool

True if model is trained, False otherwise.

datatable.models.LinearModel.labels

Classification labels the model was trained on.

Parameters
return
Frame

A one-column frame with the classification labels. In the case of numeric regression, the label is the target column name.

datatable.models.LinearModel.lambda1

L1 regularization parameter.

Parameters
return
float

Current lambda1 value.

new_lambda1
float

New lambda1 value, should be non-negative.

except
ValueError

The exception is raised when new_lambda1 is negative.

datatable.models.LinearModel.lambda2

L2 regularization parameter.

Parameters
return
float

Current lambda2 value.

new_lambda2
float

New lambda2 value, should be non-negative.

except
ValueError

The exception is raised when new_lambda2 is negative.

datatable.models.LinearModel.model

Trained model’s coefficients.

Parameters
return
Frame

A frame of shape (nfeatures + 1, nlabels), where nlabels is the number of labels the model was trained on, and nfeatures is the number of features. Each column contains model coefficients for the corresponding label: starting with the intercept, followed by the coefficients for each of the nfeatures features.

datatable.models.LinearModel.model_type

The type of model LinearModel should build:

  • "binomial" for binomial classification;

  • "multinomial" for multinomial classification;

  • "regression" for numeric regression;

  • "auto" for automatic model type detection based on the target column stype.

This option is read-only for a trained model.

Parameters
return
str

Current model_type value.

new_model_type
"binomial" | "multinomial" | "regression" | "auto"

New model_type value.

except
ValueError

The exception is raised when trying to change this option for a model that has already been trained.

datatable.models.LinearModel.negative_class

An option to indicate if a “negative” class should be created in the case of multinomial classification. For the “negative” class the model will train on all the negatives, and if a new label is encountered in the target column, its coefficients are initialized to the current “negative” class coefficients. If negative_class is set to False, the initial coefficients become zeros.

This option is read-only for a trained model.

Parameters
return
bool

Current negative_class value.

new_negative_class
bool

New negative_class value.

except
ValueError

The exception is raised when trying to change this option for a model that has already been trained.

datatable.models.LinearModel.nepochs

Number of training epochs. When nepochs is an integer number, the model will train on all the data provided to the .fit() method nepochs times. If nepochs has a fractional part {nepochs}, the model will train on all the data [nepochs] times, i.e. the integer part of nepochs, and then perform one additional training iteration on the {nepochs} fraction of the data.

Parameters
return
float

Current nepochs value.

new_nepochs
float

New nepochs value, should be non-negative.

except
ValueError

The exception is raised when new_nepochs value is negative.

datatable.models.LinearModel.params

LinearModel model parameters as a named tuple LinearModelParams, see .__init__() for more details. This option is read-only for a trained model.

Parameters
return
LinearModelParams

Current params value.

new_params
LinearModelParams

New params value.

except
ValueError

The exception is raised when

  • trying to change this option for a model that has already been trained;

  • individual parameter values are incompatible with the corresponding setters.

datatable.models.LinearModel.predict()

Predict for the input samples.

Parameters
X
Frame

A frame to make predictions for. It should have the same number of columns as the training frame.

return
Frame

A new frame of shape (X.nrows, nlabels) with the predicted probabilities for each row of frame X and each of nlabels labels the model was trained for.

See also
  • .fit() – train model on the input samples and targets.

  • .reset() – reset the model.

datatable.models.LinearModel.reset()

Reset linear model by resetting all the model coefficients and labels.

Parameters
return
None
See also
  • .fit() – train model on the input samples and targets.

  • .predict() – predict for the input samples.

datatable.models.LinearModel.seed

Seed for the quasi-random number generator that is used for data shuffling when fitting the model. If seed is 0, no shuffling is performed.

Parameters
return
int

Current seed value.

new_seed
int

New seed value, should be non-negative.

datatable.models.aggregate()

Aggregate a frame into clusters. Each cluster consists of a set of members, i.e. a subset of the input frame, and is represented by an exemplar, i.e. one of the members.

For one- and two-column frames the aggregation is based on the standard equal-interval binning for numeric columns and grouping operation for string columns.

In the general case, a parallel one-pass ad hoc algorithm is employed. It starts with an empty exemplar list and does one pass through the data. If a particular observation falls into a bubble with a given radius centered at one of the exemplars, it is marked as a member of that exemplar’s cluster. If no appropriate exemplar is found, the observation is marked as a new exemplar.

If fixed_radius is None, the algorithm will start with delta, i.e. the radius squared, equal to the machine precision. When the number of gathered exemplars becomes larger than nd_max_bins, the following procedure is performed:

  • find the mean distance between all the gathered exemplars;

  • merge all the exemplars that are within the half of this distance;

  • adjust delta by taking into account the initial bubble radius;

  • save the exemplar’s merging information for the final processing.

If the fixed_radius is set to a valid numeric value, the algorithm will stick to that value and will not adjust delta.

Note: the general n-dimensional algorithm takes into account the numeric columns only, and all the other columns are ignored.

Parameters
frame
Frame

The input frame containing numeric or string columns.

min_rows
int

Minimum number of rows the input frame should have to be aggregated. If frame has fewer rows than min_rows, aggregation is bypassed, in the sense that all the input rows become exemplars.

n_bins
int

Number of bins for 1D aggregation.

nx_bins
int

Number of bins for the first column for 2D aggregation.

ny_bins
int

Number of bins for the second column for 2D aggregation.

nd_max_bins
int

Maximum number of exemplars for ND aggregation. It is guaranteed that the ND algorithm will return fewer than nd_max_bins exemplars, but the exact number may vary from run to run due to parallelization.

max_dimensions
int

Number of columns at which the projection method is used for ND aggregation.

seed
int

Seed to be used for the projection method.

double_precision
bool

An option to indicate whether double precision, i.e. float64, or single precision, i.e. float32, arithmetic should be used for computations.

fixed_radius
float

Fixed radius for ND aggregation; use it with caution. If set, nd_max_bins will have no effect and, in the worst case, the number of exemplars may be equal to the number of rows in the data. For big data this may result in extremely long execution times. Since all the columns are normalized to [0, 1), the fixed_radius value should be chosen accordingly.

return
Tuple[Frame, Frame]

The first element in the tuple is the aggregated frame, i.e. the frame containing exemplars, with the shape of (nexemplars, frame.ncols + 1), where nexemplars is the number of gathered exemplars. The first frame.ncols columns are the columns from the input frame, and the last column is members_count, which has stype int32 and contains the number of members per exemplar.

The second element in the tuple is the members frame with the shape of (frame.nrows, 1). Each row in this frame corresponds to the row with the same id in the input frame. The single column exemplar_id has an stype of int32 and contains the exemplar ids that a particular member belongs to. These ids are effectively the ids of the exemplar’s rows from the input frame.

except
TypeError

The exception is raised when one of the frame’s columns has an unsupported stype, i.e. there is a column that is both non-numeric and non-string.
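
Examples

A minimal sketch (toy data, with min_rows lowered so that the tiny frame is actually aggregated rather than bypassed):

import datatable as dt
from datatable.models import aggregate

DT = dt.Frame(A=[0.10, 0.12, 0.90, 0.95], B=[1.0, 1.1, 5.0, 5.2])
exemplars, members = aggregate(DT, min_rows=2)
# `exemplars` carries a members_count column; `members` maps every input row
# to the exemplar_id of its cluster.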

datatable.models.kfold()

Perform k-fold split of data with nrows rows into nsplits train/test subsets. The dataset itself is not passed to this function: it is sufficient to know only the number of rows in order to decide how the data should be split.

The range [0; nrows) is split into nsplits approximately equal parts, i.e. folds, and then each i-th split will use the i-th fold as the test part, and all the remaining rows as the train part. Thus, the i-th split comprises:

  • train rows: [0; i*nrows/nsplits) + [(i+1)*nrows/nsplits; nrows);

  • test rows: [i*nrows/nsplits; (i+1)*nrows/nsplits).

where integer division is assumed.

Parameters
nrows
int

The number of rows in the frame that is going to be split.

nsplits
int

Number of folds, must be at least 2, but not larger than nrows.

return
List[Tuple]

This function returns a list of nsplits tuples (train_rows, test_rows), where each component of the tuple is a rows selector that can be applied to any frame with nrows rows to select the desired folds. Some of these row selectors will be simple python ranges, others will be single-column Frame objects.
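
Examples

A minimal sketch of applying the selectors to a frame:

import datatable as dt
from datatable.models import kfold

DT = dt.Frame(A=range(10))
splits = kfold(nrows=DT.nrows, nsplits=5)
train_rows, test_rows = splits[0]
train, test = DT[train_rows, :], DT[test_rows, :]   # 8 and 2 rows respectively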

See Also

kfold_random() – Perform randomized k-fold split.

datatable.models.kfold_random()

Perform randomized k-fold split of data with nrows rows into nsplits train/test subsets. The dataset itself is not passed to this function: it is sufficient to know only the number of rows in order to decide how the data should be split.

The train/test subsets produced by this function will have the following properties:

  • all test folds will be of approximately the same size nrows/nsplits;

  • all observations have equal ex-ante chance of getting assigned into each fold;

  • the row indices in all train and test folds will be sorted.

The function uses a single-pass parallelized algorithm to construct the folds.

Parameters
nrows
int

The number of rows in the frame that you want to split.

nsplits
int

Number of folds, must be at least 2, but not larger than nrows.

seed
int

Seed value for the random number generator used by this function. Calling the function several times with the same seed value will produce the same results each time.

return
List[Tuple]

This function returns a list of nsplits tuples (train_rows, test_rows), where each component of the tuple is a rows selector that can be applied to any frame with nrows rows to select the desired folds.
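
Examples

A minimal sketch; with a fixed seed the assignment is reproducible across calls:

from datatable.models import kfold_random

splits = kfold_random(nrows=1000, nsplits=3, seed=42)
train_rows, test_rows = splits[0]   # row selectors for any frame with 1000 rows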

See Also

kfold() – Perform k-fold split.

datatable.options

class options

Repository of datatable configuration options. This namespace contains the following option groups:

.debug

Debug options.

.display

Display options.

.frame

Frame-related options.

.fread

fread()-related options.

.progress

Progress reporting options.

It also contains the following individual options:

.nthreads

Number of threads used by datatable for parallel computations.

datatable.options.debug

This namespace contains the following debug options:

.arg_max_size

The number of characters to display per function/method argument.

.enabled

Option that enables logging of the debug information.

.logger

The custom logger object.

.report_args

Option that enables logging of the function/method arguments.

datatable.options.debug.arg_max_size

This option limits the display size of each argument in order to prevent potentially huge outputs. It has no effect if debug.report_args is False.

Parameters
return
int

Current arg_max_size value. Initially, this option is set to 100.

new_arg_max_size
int

New arg_max_size value, should be non-negative. If new_arg_max_size < 10, then arg_max_size will be set to 10.

except
TypeError

The exception is raised when new_arg_max_size is negative.

datatable.options.debug.enabled

This option controls whether or not all the calls to the datatable core functions should be logged.

Parameters
return
bool

Current enabled value. Initially, this option is set to False.

new_enabled
bool

New enabled value. If set to True, all the calls to the datatable core functions will be logged along with their respective timings.

datatable.options.debug.logger

The logger object used for reporting calls to datatable core functions. This option has no effect if debug.enabled is False.

Parameters
return
object

Current logger value. Initially, this option is set to None, meaning that the built-in logger should be used.

new_logger
object

New logger value.

except
TypeError

The exception is raised when new_logger is not an object having a method .debug(self, msg).
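
Examples

A sketch of a custom logger; any object exposing a .debug(self, msg) method should do:

import datatable as dt

class PrintLogger:
    def debug(self, msg):
        print("[datatable]", msg)

dt.options.debug.logger = PrintLogger()
dt.options.debug.enabled = True   # logging also has to be enabled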

datatable.options.debug.report_args

This option controls whether log messages for the function and method calls should contain information about the arguments of those calls.

Parameters
return
bool

Current report_args value. Initially, this option is set to False.

new_report_args
object

New report_args value.

datatable.options.display

class display

This namespace contains the following display options:

.allow_unicode

Option that controls if the unicode characters are allowed.

.head_nrows

The number of top rows to display when the frame view is truncated.

.interactive

Option that controls if the interactive view is enabled or not.

.max_column_width

The threshold for the column’s width to be truncated.

.max_nrows

The threshold for the number of rows in a frame to be truncated.

.tail_nrows

The number of bottom rows to display when the frame view is truncated.

.use_colors

Option that controls if colors should be used in the console.

datatable.options.display.allow_unicode

This option controls whether or not unicode characters are allowed in the datatable output.

Parameters
return
bool

Current allow_unicode value. Initially, this option is set to True.

new_allow_unicode
bool

New allow_unicode value. If True, datatable will allow unicode characters (encoded as UTF-8) to be printed into the output. If False, then unicode characters will either be avoided, or hex-escaped as necessary.

datatable.options.display.head_nrows

This option controls the number of rows from the top of a frame to be displayed when the frame’s output is truncated due to the total number of rows exceeding display.max_nrows value.

Parameters
return
int

Current head_nrows value. Initially, this option is set to 15.

new_head_nrows
int

New head_nrows value, should be non-negative.

except
ValueError

The exception is raised when the new_head_nrows is negative.

datatable.options.display.interactive

Warning: This option is currently not working properly [#2669]

This option controls the behavior of a Frame when it is viewed in a text console. To enter the interactive mode manually, one can still call the Frame.view() method.

Parameters
return
bool

Current interactive value. Initially, this option is set to False.

new_interactive
bool

New interactive value. If True, frames will be shown in the interactive mode, allowing you to navigate the rows/columns with the keyboard. If False, frames will be shown in regular, non-interactive mode.

datatable.options.display.max_column_width

This option controls the threshold for the column’s width to be truncated. If a column’s name or its values exceed the max_column_width, the content of the column is truncated to max_column_width characters when printed.

This option applies to both the rendering of a frame in a terminal, and the rendering in a Jupyter notebook.

Parameters
return
int

Current max_column_width value. Initially, this option is set to 100.

new_max_column_width
int

New max_column_width value, cannot be less than 2. If new_max_column_width is None, the column’s content will never be truncated.

except
ValueError

The exception is raised when the new_max_column_width is less than 2.

datatable.options.display.max_nrows

This option controls the threshold for the number of rows in a frame to be truncated when printed to the console.

If a frame has more rows than max_nrows, it will be displayed truncated: only its first head_nrows and last tail_nrows rows will be printed. Otherwise, no truncation will occur. It is recommended to have head_nrows + tail_nrows <= max_nrows.

Parameters
return
int

Current max_nrows value. Initially, this option is set to 30.

new_max_nrows
int

New max_nrows value. If this option is set to None or to a negative value, no frame truncation will occur when printed, which may cause the console to become unresponsive for frames with a large number of rows.

datatable.options.display.tail_nrows

This option controls the number of rows from the bottom of a frame to be displayed when the frame’s output is truncated due to the total number of rows exceeding max_nrows value.

Parameters
return
int

Current tail_nrows value. Initially, this option is set to 5.

new_tail_nrows
int

New tail_nrows value, should be non-negative.

except
ValueError

The exception is raised when the new_tail_nrows is negative.

datatable.options.display.use_colors

This option controls whether or not to use colors when printing datatable messages into the console. Turn this off if your terminal is unable to display ANSI escape sequences, or if the colors make output not legible.

Parameters
return
bool

Current use_colors value. Initially, this option is set to True.

new_use_colors
bool

New use_colors value.

datatable.options.frame

This namespace contains the following Frame options:

.names_auto_index

Initial value of the default column name index.

.names_auto_prefix

Default column name prefix.

datatable.options.frame.names_auto_index

This option controls the starting index that is used for auto-naming the columns. By default, the names that datatable assigns to frame’s columns are C0, C1, C2, etc. Setting names_auto_index, for instance, to 1 will cause the columns to be named as C1, C2, C3, etc.

Parameters
return
int

Current names_auto_index value. Initially, this option is set to 0.

new_names_auto_index
int

New names_auto_index value.

See Also
  • .names_auto_prefix – the default column name prefix.

datatable.options.frame.names_auto_prefix

This option controls the prefix that is used for auto-naming the columns. By default, the names that datatable assigns to frame’s columns are C0, C1, C2, etc. Setting names_auto_prefix, for instance, to Z will cause the columns to be named as Z0, Z1, Z2, etc.

Parameters
return
str

Current names_auto_prefix value. Initially, this option is set to C.

new_names_auto_prefix
str

New names_auto_prefix value.

See Also
  • .names_auto_index – the initial value of the default column name index.

datatable.options.fread

This namespace contains the following fread option groups:

.log

Logging-related options.

datatable.options.fread.log

This property controls the following logging options:

.anonymize

Option that controls logs anonymization.

.escape_unicode

Option that controls escaping of the unicode characters.

datatable.options.fread.log.anonymize

This option controls logs anonymization that is useful in production systems, when reading sensitive data that must not accidentally leak into log files or be printed with the error messages.

Parameters
return
bool

Current anonymize value. Initially, this option is set to False.

new_anonymize
bool

New anonymize value. If True, any snippets of data being read that are printed in the log will be first anonymized by converting all non-zero digits to 1, all lowercase letters to a, all uppercase letters to A, and all unicode characters to U. If False, no data anonymization will be performed.

datatable.options.fread.log.escape_unicode

This option controls escaping of the unicode characters.

Use this option if your terminal cannot print unicode, or if the output gets somehow corrupted because of the unicode characters.

Parameters
return
bool

Current escape_unicode value. Initially, this option is set to False.

new_escape_unicode
bool

If True, all unicode characters in the verbose log will be written in hexadecimal notation. If False, no escaping of the unicode characters will be performed.

datatable.options.progress

class progress

This namespace contains the following progress reporting options:

.allow_interruption

Option that controls if the datatable tasks can be interrupted.

.callback

A custom progress-reporting function.

.clear_on_success

Option that controls if the progress bar is cleared on success.

.enabled

Option that controls if the progress reporting is enabled.

.min_duration

The minimum duration of a task to show the progress bar.

.updates_per_second

The progress bar update frequency.

datatable.options.progress.allow_interruption

This option controls if the datatable tasks can be interrupted.

Parameters
return
bool

Current allow_interruption value. Initially, this option is set to True.

new_allow_interruption
bool

New allow_interruption value. If True, datatable will be allowed to handle the SIGINT signal to interrupt long-running tasks. If False, it will not be possible to interrupt tasks with SIGINT.

datatable.options.progress.callback

This option controls the custom progress-reporting function.

Parameters
return
function

Current callback value. Initially, this option is set to None.

new_callback
function

New callback value. If None, then the built-in progress-reporting function will be used. Otherwise, the new_callback specifies a function to be called at each progress event. The function should take a single parameter p, which is a namedtuple with the following fields:

  • p.progress is a float in the range 0.0 .. 1.0;

  • p.status is a string, one of 'running', 'finished', 'error' or 'cancelled';

  • p.message is a custom string describing the operation currently being performed.
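
Examples

A sketch of a callback using these fields:

import datatable as dt

def report(p):
    # p.progress: 0.0 .. 1.0; p.status: 'running', 'finished', 'error' or 'cancelled'
    print("%s: %.0f%%" % (p.status, 100 * p.progress))

dt.options.progress.callback = report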

datatable.options.progress.clear_on_success

This option controls if the progress bar is cleared on success.

Parameters
return
bool

Current clear_on_success value. Initially, this option is set to False.

new_clear_on_success
bool

New clear_on_success value. If True, the progress bar is cleared when a job finishes successfully. If False, the progress bar remains visible even after the job has finished.

datatable.options.progress.enabled

This option controls if the progress reporting is enabled.

Parameters
return
bool

Current enabled value. Initially, this option is set to True if the stdout is connected to a terminal or a Jupyter Notebook, and False otherwise.

new_enabled
bool

New enabled value. If True, the progress reporting functionality will be turned on. If False, it is turned off.

datatable.options.progress.min_duration

This option controls the minimum duration of a task to show the progress bar.

Parameters
return
float

Current min_duration value. Initially, this option is set to 0.5.

new_min_duration
float

New min_duration value. The progress bar will not be shown if the duration of an operation is smaller than new_min_duration. If this value is non-zero, then the progress bar will only be shown for long-running operations, whose duration (estimated or actual) exceeds this threshold.

datatable.options.progress.updates_per_second

This option controls the progress bar update frequency.

Parameters
return
float

Current updates_per_second value. Initially, this option is set to 25.0.

new_updates_per_second
float

New updates_per_second value. This is the number of times per second the display of the progress bar should be updated.

datatable.options.nthreads

This option controls the number of threads used by datatable for parallel calculations.

Parameters
return
int

Current nthreads value. Initially, this option is set to the value returned by the C++ call std::thread::hardware_concurrency(), and usually equals the number of available cores.

new_nthreads
int

New nthreads value. It can be greater or smaller than the initial setting. For example, setting nthreads = 1 will force the library into a single-threaded mode. Setting nthreads to 0 will restore the initial value equal to the number of processor cores. Setting nthreads to a value less than 0 is equivalent to requesting that many threads fewer than the maximum.
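
Examples

import datatable as dt

dt.options.nthreads        # initially: the number of available cores
dt.options.nthreads = 1    # force single-threaded mode
dt.options.nthreads = -2   # two threads fewer than the maximum
dt.options.nthreads = 0    # restore the default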

datatable.re

match()

Search for a regular expression within a column.

datatable.re.match()

Test whether values in a string column match a regular expression.

Parameters
column
FExpr[str]

The column expression where you want to search for regular expression matches.

pattern
str

The regular expression that will be tested against each value in the column.

return
FExpr[bool8]

A boolean column that tells whether the value in each row of column matches the pattern or not.
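
Examples

A minimal sketch (toy data):

import datatable as dt
from datatable import f

DT = dt.Frame(A=["apple", "banana", "cherry"])
DT[:, dt.re.match(f.A, "a.*")]      # True, False, False
DT[dt.re.match(f.A, ".*err.*"), :]  # filter: keeps only the "cherry" row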

datatable.str

len()

Compute length of a string column.

slice()

Apply a slice to a string column.

split_into_nhot()

Split and nhot-encode a single-column frame.

datatable.str.len()

Compute lengths of values in a string column.

Parameters
column
FExpr[str]
return
FExpr[int64]
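
Examples

A minimal sketch (toy data):

import datatable as dt
from datatable import f

DT = dt.Frame(A=["one", "three", None])
DT[:, dt.str.len(f.A)]   # 3, 5, NA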

datatable.str.slice()

Apply slice [start:stop:step] to each value in a column of string type.

Instead of this function you can directly apply a slice expression to the column expression: f.A[1:-1] is equivalent to dt.str.slice(f.A, 1, -1).

Parameters
column
FExpr[str]

The column to which the slice should be applied.

return
FExpr[str]

A column containing sliced string values from the source column.

Examples
DT = dt.Frame(A=["apples", "bananas", "cherries", "dates",
                 "eggplants", "figs", "grapes", "kiwi"])
DT[:, dt.str.slice(f.A, None, 5)]

   | A
   | str32
-- + -----
 0 | apple
 1 | banan
 2 | cherr
 3 | dates
 4 | eggpl
 5 | figs
 6 | grape
 7 | kiwi

datatable.str.split_into_nhot()

split_into_nhot(frame, sep=",", sort=False)

Split and nhot-encode a single-column frame.

Each value in the frame’s single string column is split according to the provided separator sep, whitespace is trimmed, and the resulting pieces (labels) are converted into the individual columns of the output frame.

Parameters
frame
Frame

An input single-column frame. The column stype must be either str32 or str64.

sep
str

Single-character separator to be used for splitting.

sort
bool

An option to control whether the resulting column names, i.e. labels, should be sorted. If set to True, the column names are returned in alphabetical order, otherwise their order is not guaranteed due to the algorithm parallelization.

return
Frame

The output frame. It will have as many rows as the input frame, and as many boolean columns as there were unique labels found. The labels will also become the output column names.

except
ValueError | TypeError
dt.exceptions.ValueError

Raised if the input frame is missing or has more than one column. It is also raised if sep is not a single-character string.

dt.exceptions.TypeError

Raised if the single column of frame has non-string stype.

Examples
DT = dt.Frame(["cat,dog", "mouse", "cat,mouse", "dog,rooster", "mouse,dog,cat"])
DT

   | C0
   | str32
-- + -------------
 0 | cat,dog
 1 | mouse
 2 | cat,mouse
 3 | dog,rooster
 4 | mouse,dog,cat

dt.str.split_into_nhot(DT)

   | cat    dog    mouse  rooster
   | bool8  bool8  bool8  bool8
-- + -----  -----  -----  -------
 0 |     1      1      0        0
 1 |     0      0      1        0
 2 |     1      0      1        0
 3 |     0      1      0        1
 4 |     1      1      1        0

datatable.time

day()

Return day component of a date.

day_of_week()

Compute day of week for the given date.

hour()

Return hour component of a timestamp.

minute()

Return minute component of a timestamp.

month()

Return month component of a date.

nanosecond()

Return nanosecond component of a timestamp.

second()

Return the number of seconds in a timestamp.

year()

Return year component of a date.

ymd(y,m,d)

Create a date32 column from year, month, and day components.

ymdt(y,m,d,H,M,S)

Create a time64 column from year, month, day, hour, minute, and second components.

datatable.time.day()

Added in version 1.0.0

Retrieve the “day” component of a date32 or time64 column.

Parameters
date
FExpr[date32] | FExpr[time64]

A column for which you want to compute the day part.

return
FExpr[int32]

The day part of the source column.

Examples
DT = dt.Frame([1, 1000, 100000], stype='date32')
DT[:, {'date': f[0], 'day': dt.time.day(f[0])}]

   | date        day
   | date32      int32
-- + ----------  -----
 0 | 1970-01-02      2
 1 | 1972-09-27     27
 2 | 2243-10-17     17
See Also
  • year() – retrieve the “year” component of a date

  • month() – retrieve the “month” component of a date

datatable.time.day_of_week()

Added in version 1.0.0

For a given date column compute the corresponding days of week.

Days of week are returned as integers from 1 to 7, where 1 represents Monday, and 7 is Sunday. Thus, the return value of this function matches the ISO standard.

Parameters
date
FExpr[date32] | FExpr[time64]

The date32 (or time64) column for which you need to calculate days of week.

return
FExpr[int32]

An integer column, with values between 1 and 7 inclusive.

Examples
DT = dt.Frame([18000, 18600, 18700, 18800, None], stype='date32')
DT[:, {"date": f[0], "day-of-week": dt.time.day_of_week(f[0])}]

   | date        day-of-week
   | date32      int32
-- + ----------  -----------
 0 | 2019-04-14            7
 1 | 2020-12-04            5
 2 | 2021-03-14            7
 3 | 2021-06-22            2
 4 | NA                   NA

datatable.time.hour()

Added in version 1.0.0

Retrieve the “hour” component of a time64 column. The returned value will always be in the range [0; 23].

Parameters
time
FExpr[time64]

A column for which you want to compute the hour part.

return
FExpr[int32]

The hour part of the source column.

Examples
from datetime import datetime as d
DT = dt.Frame([d(2020, 5, 11, 12, 0, 0), d(2021, 6, 14, 16, 10, 59, 394873)])
DT[:, {'time': f[0], 'hour': dt.time.hour(f[0])}]

   | time                        hour
   | time64                      int32
-- + --------------------------  -----
 0 | 2020-05-11T12:00:00            12
 1 | 2021-06-14T16:10:59.394873     16
See Also
  • minute() – retrieve the “minute” component of a timestamp

  • second() – retrieve the “second” component of a timestamp

  • nanosecond() – retrieve the “nanosecond” component of a timestamp

datatable.time.minute()

Added in version 1.0.0

Retrieve the “minute” component of a time64 column. The produced column will have values in the range [0; 59].

Parameters
time
FExpr[time64]

A column for which you want to compute the minute part.

return
FExpr[int32]

The minute part of the source column.

Examples
from datetime import datetime as d
DT = dt.Frame([d(2020, 5, 11, 12, 0, 0), d(2021, 6, 14, 16, 10, 59, 394873)])
DT[:, {'time': f[0], 'minute': dt.time.minute(f[0])}]

   | time                        minute
   | time64                      int32
-- + --------------------------  ------
 0 | 2020-05-11T12:00:00              0
 1 | 2021-06-14T16:10:59.394873      10
See Also
  • hour() – retrieve the “hour” component of a timestamp

  • second() – retrieve the “second” component of a timestamp

  • nanosecond() – retrieve the “nanosecond” component of a timestamp

datatable.time.month()

Added in version 1.0.0

Retrieve the “month” component of a date32 or time64 column.

Parameters
date
FExpr[date32] | FExpr[time64]

A column for which you want to compute the month part.

return
FExpr[int32]

The month part of the source column.

Examples
DT = dt.Frame([1, 1000, 100000], stype='date32')
DT[:, {'date': f[0], 'month': dt.time.month(f[0])}]

   | date        month
   | date32      int32
-- + ----------  -----
 0 | 1970-01-02      1
 1 | 1972-09-27      9
 2 | 2243-10-17     10
See Also
  • year() – retrieve the “year” component of a date

  • day() – retrieve the “day” component of a date

datatable.time.nanosecond()

Added in version 1.0.0

Retrieve the “nanosecond” component of a time64 column. The produced column will have values in the range [0; 999999999].

Parameters
time
FExpr[time64]

A column for which you want to compute the nanosecond part.

return
FExpr[int32]

The “nanosecond” part of the source column.

Examples
from datetime import datetime as d
DT = dt.Frame([d(2020, 5, 11, 12, 0, 0), d(2021, 6, 14, 16, 10, 59, 394873)])
DT[:, {'time': f[0], 'ns': dt.time.nanosecond(f[0])}]

   | time                        ns
   | time64                      int32
-- + --------------------------  ---------
 0 | 2020-05-11T12:00:00                 0
 1 | 2021-06-14T16:10:59.394873  394873000
See Also
  • hour() – retrieve the “hour” component of a timestamp

  • minute() – retrieve the “minute” component of a timestamp

  • second() – retrieve the “second” component of a timestamp

datatable.time.second()

Added in version 1.0.0

Retrieve the “second” component of a time64 column. The produced column will have values in the range [0; 59].

Parameters
time
FExpr[time64]

A column for which you want to compute the second part.

return
FExpr[int32]

The “second” part of the source column.

Examples
from datetime import datetime as d
DT = dt.Frame([d(2020, 5, 11, 12, 0, 0), d(2021, 6, 14, 16, 10, 59, 394873)])
DT[:, {'time': f[0], 'second': dt.time.second(f[0])}]

   | time                        second
   | time64                      int32
-- + --------------------------  ------
 0 | 2020-05-11T12:00:00              0
 1 | 2021-06-14T16:10:59.394873      59
See Also
  • hour() – retrieve the “hour” component of a timestamp

  • minute() – retrieve the “minute” component of a timestamp

  • nanosecond() – retrieve the “nanosecond” component of a timestamp

datatable.time.year()

Added in version 1.0.0

Retrieve the “year” component of a date32 or time64 column.

Parameters
date
FExpr[date32] | FExpr[time64]

A column for which you want to compute the year part.

return
FExpr[int32]

The year part of the source column.

Examples
DT = dt.Frame([1, 1000, 100000], stype='date32')
DT[:, {'date': f[0], 'year': dt.time.year(f[0])}]

   | date        year
   | date32      int32
-- + ----------  -----
 0 | 1970-01-02   1970
 1 | 1972-09-27   1972
 2 | 2243-10-17   2243
See Also
  • month() – retrieve the “month” component of a date

  • day() – retrieve the “day” component of a date

datatable.time.ymd()

Added in version 1.0.0

Create a date32 column out of year, month and day components.

This function performs range checks on month and day columns: if a certain combination of year/month/day is not valid in the Gregorian calendar, then an NA value will be produced in that row.

Parameters
year
FExpr[int]

The year part of the resulting date32 column.

month
FExpr[int]

The month part of the resulting date32 column. Values in this column are expected to be in the 1 .. 12 range.

day
FExpr[int]

The day part of the resulting date32 column. Values in this column should be from 1 to last_day_of_month(year, month).

return
FExpr[date32]
Examples
DT = dt.Frame(y=[2005, 2010, 2015], m=[2, 3, 7])
DT[:, dt.time.ymd(f.y, f.m, 30)]

   | C0
   | date32
-- + ----------
 0 | NA
 1 | 2010-03-30
 2 | 2015-07-30

datatable.time.ymdt()

Added in version 1.0.0

Create a time64 column out of year, month, day, hour, minute, second and (optionally) nanosecond components. Alternatively, instead of the year-month-day triple you can pass a date argument of type date32.

This function performs range checks on month and day columns: if a certain combination of year/month/day is not valid in the Gregorian calendar, then an NA value will be produced in that row.

At the same time, there are no range checks for time components. Thus, you can, for example, pass second=3600 instead of hour=1.

Parameters
year
FExpr[int]

The year part of the resulting time64 column.

month
FExpr[int]

The month part of the resulting time64 column. Values in this column must be in the 1 .. 12 range.

day
FExpr[int]

The day part of the resulting time64 column. Values in this column should be from 1 to last_day_of_month(year, month).

hour
FExpr[int]

The hour part of the resulting time64 column.

minute
FExpr[int]

The minute part of the resulting time64 column.

second
FExpr[int]

The second part of the resulting time64 column.

nanosecond
FExpr[int]

The nanosecond part of the resulting time64 column. This parameter is optional.

date
FExpr[date32]

The date component of the resulting time64 column. This parameter, if given, replaces parameters year, month and day, and cannot be used together with them.

return
FExpr[time64]
Examples
DT = dt.Frame(Y=[2001, 2003, 2005, 2020, 1960], M=[1, 5, 4, 11, 8],
              D=[12, 18, 30, 1, 14], h=[7, 14, 22, 23, 12],
              m=[15, 30, 0, 59, 0], s=[12, 23, 0, 59, 27],
              ns=[0, 0, 0, 999999000, 123000])
DT[:, [f[:], dt.time.ymdt(f.Y, f.M, f.D, f.h, f.m, f.s, f.ns)]]

   | Y      M      D      h      m      s      ns         C0
   | int32  int32  int32  int32  int32  int32  int32      time64
-- + -----  -----  -----  -----  -----  -----  ---------  --------------------------
 0 |  2001      1     12      7     15     12          0  2001-01-12T07:15:12
 1 |  2003      5     18     14     30     23          0  2003-05-18T14:30:23
 2 |  2005      4     30     22      0      0          0  2005-04-30T22:00:00
 3 |  2020     11      1     23     59     59  999999000  2020-11-01T23:59:59.999999
 4 |  1960      8     14     12      0     27     123000  1960-08-14T12:00:27.000123

datatable.FExpr

FExpr is an object that encapsulates computations to be done on a frame.

FExpr objects are rarely constructed directly (though it is possible too); instead, they are more commonly created as inputs/outputs of various functions in datatable.

Consider the following example:

math.sin(2 * f.Angle)

Here accessing column “Angle” in namespace f creates an FExpr. Multiplying this FExpr by a python scalar 2 creates a new FExpr. And finally, applying the sine function creates yet another FExpr. The resulting expression can be applied to a frame via the DT[i,j] method, which will compute that expression using the data of that particular frame.

Thus, an FExpr is a stored computation, which can later be applied to a Frame, or to multiple frames.

Because of its delayed nature, an FExpr checks its correctness at the time when it is applied to a frame, not sooner. In particular, it is possible for the same expression to work with one frame, but fail with another. In the example above, the expression may raise an error if there is no column named “Angle” in the frame, or if the column exists but has non-numeric type.

Most functions in datatable that accept an FExpr as an input, return a new FExpr as an output, thus creating a tree of FExprs as the resulting evaluation graph.

Also, all functions that accept FExprs as arguments, will also accept certain other python types as an input, essentially converting them into FExprs. Thus, we will sometimes say that a function accepts FExpr-like objects as arguments.

All binary operators op(x, y) listed below work when either x or y, or both are FExprs.
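
To make this lifecycle concrete, here is a minimal sketch (toy data; the column name is illustrative):

import datatable as dt
from datatable import f

DT = dt.Frame(Angle=[0.0, 0.5, 1.0])
expr = dt.math.sin(2 * f.Angle)   # an FExpr: nothing is computed yet
DT[:, expr]                       # the stored computation is evaluated on DT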

Construction

.__init__(e)

Create an FExpr.

.extend()

Append another FExpr.

.remove()

Remove columns from the FExpr.

Arithmetic operators

__add__(x, y)

Addition x + y.

__sub__(x, y)

Subtraction x - y.

__mul__(x, y)

Multiplication x * y.

__truediv__(x, y)

Division x / y.

__floordiv__(x, y)

Integer division x // y.

__mod__(x, y)

Modulus x % y (the remainder after integer division).

__pow__(x, y)

Power x ** y.

__pos__(x)

Unary plus +x.

__neg__(x)

Unary minus -x.

Bitwise operators

__and__(x, y)

Bitwise AND x & y.

__or__(x, y)

Bitwise OR x | y.

__xor__(x, y)

Bitwise XOR x ^ y.

__invert__(x)

Bitwise NOT ~x.

__lshift__(x, y)

Left shift x << y.

__rshift__(x, y)

Right shift x >> y.

Relational operators

__eq__(x, y)

Equal x == y.

__ne__(x, y)

Not equal x != y.

__lt__(x, y)

Less than x < y.

__le__(x, y)

Less than or equal x <= y.

__gt__(x, y)

Greater than x > y.

__ge__(x, y)

Greater than or equal x >= y.

Equivalents of base datatable functions

Miscellaneous

.__bool__()

Implicitly convert FExpr into a boolean value.

.__getitem__()

Apply slice to a string column.

.__repr__()

Used by Python function repr().

.len()

String length.

.re_match(pattern)

Check whether the string column matches a pattern.

datatable.FExpr.__add__()

Add two FExprs together, which corresponds to python operator +.

If x or y are multi-column expressions, then they must have the same number of columns, and the + operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The result of adding two columns with different stypes will have the following stype:

  • max(x.stype, y.stype, int32) if both columns are numeric (i.e. bool, int or float);

  • str32/str64 if at least one of the columns is a string. In this case the + operator implements string concatenation, same as in Python.

Parameters
x
,
y
FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return
FExpr

An expression that evaluates x + y.
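
A minimal sketch illustrating both behaviours (toy data):

import datatable as dt
from datatable import f

DT = dt.Frame(A=[1, 2], B=[10, 20], S=["ab", "cd"])
DT[:, f.A + f.B]   # numeric addition: 11, 22
DT[:, f.S + "!"]   # string concatenation: "ab!", "cd!"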

datatable.FExpr.__and__()

Compute bitwise AND of x and y.

If x or y are multi-column expressions, then they must have the same number of columns, and the & operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The AND operator can only be applied to integer or boolean columns. The resulting column will have stype equal to the larger of the stypes of its arguments.

When both x and y are boolean, then the bitwise AND operator is equivalent to logical AND. This can be used to combine several logical conditions into a compound (since Python doesn’t allow overloading of operator and). Beware, however, that & has higher precedence than and, so it is advisable to always use parentheses:

DT[(f.x >= 0) & (f.x <= 1), :]
Parameters
x
,
y
FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return
FExpr

An expression that evaluates x & y.

Notes

Note

Use x & y in order to AND two boolean FExprs. Using standard Python keyword and will result in an error.

datatable.FExpr.__bool__()

Using this operator will result in a TypeError.

The boolean-cast operator is used by Python whenever it wants to know whether the object is equivalent to a single True or False value. This is not applicable for a dt.FExpr, which represents stored computation on a column or multiple columns. As such, an error is raised.

In order to convert a column into the boolean stype, you can use the type-cast operator dt.bool8(x).

datatable.FExpr.__eq__()

Compare whether values in columns x and y are equal.

Like all other FExpr operators, the equality operator is elementwise: it produces a column where each element is the result of comparison x[i] == y[i].

If x or y are multi-column expressions, then they must have the same number of columns, and the == operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The equality operator can be applied to columns of any type, and the types of x and y are allowed to be different. In the latter case the columns will be converted into a common stype before the comparison. In practice this means, for example, that 1 == "1" evaluates to True.

Lastly, the comparison x == None is exactly equivalent to the isna() function.

Parameters
x
,
y
FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return
FExpr

An expression that evaluates x == y. The produced column will have stype bool8.

datatable.FExpr.__floordiv__()

Perform integer division of two FExprs, i.e. x // y.

The modulus and integer division together satisfy the identity that x == (x // y) * y + (x % y) for all non-zero values of y.

If x or y are multi-column expressions, then they must have the same number of columns, and the // operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The integer division operation can only be applied to integer columns. The resulting column will have stype equal to the largest of the stypes of both columns, but at least int32.

Parameters
x
,
y
FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return
FExpr

An expression that evaluates x // y.

See also
  • x / y – regular division operator.

datatable.FExpr.__ge__()

Compare whether x >= y.

Like all other FExpr operators, the greater-than-or-equal operator is elementwise: it produces a column where each element is the result of comparison x[i] >= y[i].

If x or y are multi-column expressions, then they must have the same number of columns, and the >= operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The greater-than-or-equal operator can be applied to columns of any type, and the types of x and y are allowed to be different. In the latter case the columns will be converted into a common stype before the comparison.

Parameters
x
,
y
FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return
FExpr

An expression that evaluates x >= y. The produced column will have stype bool8.

datatable.FExpr.__getitem__()

Apply a slice to the string column represented by this FExpr.

Parameters
self
FExpr[str]
selector
slice

The slice will be applied to each value in the string column self.

return
FExpr[str]
Examples
DT = dt.Frame(season=["Winter", "Summer", "Autumn", "Spring"],
              i=[1, 2, 3, 4])
DT[:, {"start": f.season[:-f.i], "end": f.season[-f.i:]}]

    start  end
    str32  str32
0   Winte  r
1   Summ   er
2   Aut    umn
3   Sp     ring

datatable.FExpr.__gt__()

Compare whether x > y.

Like all other FExpr operators, the greater-than operator is elementwise: it produces a column where each element is the result of comparison x[i] > y[i].

If x or y are multi-column expressions, then they must have the same number of columns, and the > operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The greater-than operator can be applied to columns of any type, and the types of x and y are allowed to be different. In the latter case the columns will be converted into a common stype before the comparison.

Parameters
x, y
FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return
FExpr

An expression that evaluates x > y. The produced column will have stype bool8.

datatable.FExpr.__init__()

Create a new dt.FExpr object out of e.

The FExpr serves as a simple wrapper of the underlying object, allowing it to be combined with other FExprs.

This constructor almost never needs to be run manually by the user.

Parameters
e
None | bool | int | str | float | slice | list | tuple | dict | type | stype | ltype | Generator | FExpr | Frame | range | pd.DataFrame | pd.Series | np.array | np.ma.masked_array

The argument that will be converted into an FExpr.

datatable.FExpr.__invert__()

Compute bitwise NOT of x, which corresponds to python operation ~x.

If x is a multi-column expression, then the ~ operator will be applied to each column in turn.

Bitwise NOT can only be applied to integer or boolean columns. The resulting column will have the same stype as its argument.

When the argument x is a boolean column, then ~x is equivalent to logical NOT. This can be used to negate a condition, similar to python operator not (which is not overloadable).

Parameters
x
FExpr

Either an FExpr, or any object that can be converted into FExpr.

return
FExpr

An expression that evaluates ~x.

Notes

Note

Use ~x in order to negate a boolean FExpr. Using standard Python keyword not will result in an error.
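
For example, to negate a filter condition (a minimal sketch with a hypothetical column A):

from datatable import f

DT[~(f.A > 1), :]   # select rows where the condition A > 1 does not hold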

datatable.FExpr.__le__()

Compare whether x <= y.

Like all other FExpr operators, the less-than-or-equal operator is elementwise: it produces a column where each element is the result of comparison x[i] <= y[i].

If x or y are multi-column expressions, then they must have the same number of columns, and the <= operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The less-than-or-equal operator can be applied to columns of any type, and the types of x and y are allowed to be different. In the latter case the columns will be converted into a common stype before the comparison.

Parameters
x, y
FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return
FExpr

An expression that evaluates x <= y. The produced column will have stype bool8.

datatable.FExpr.__lshift__()

Shift x by y bits to the left, i.e. x << y. Mathematically this is equivalent to \(x\cdot 2^y\).

If x or y are multi-column expressions, then they must have the same number of columns, and the << operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The left-shift operator can only be applied to integer columns, and the resulting column will have the same stype as its argument.

Parameters
x, y
FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return
FExpr

An expression that evaluates x << y.
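
A quick sketch (the column x is hypothetical):

from datatable import f

DT[:, f.x << 1]   # each value multiplied by 2
DT[:, f.x << 3]   # each value multiplied by 2**3 == 8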

See also
  • x >> y – right-shift operator.

datatable.FExpr.__lt__()

Compare whether x < y.

Like all other FExpr operators, the less-than operator is elementwise: it produces a column where each element is the result of comparison x[i] < y[i].

If x or y are multi-column expressions, then they must have the same number of columns, and the < operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The less-than operator can be applied to columns of any type, and the types of x and y are allowed to be different. In the latter case the columns will be converted into a common stype before the comparison.

Parameters
x, y
FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return
FExpr

An expression that evaluates x < y. The produced column will have stype bool8.

datatable.FExpr.__mod__()

Compute the remainder of division of two FExprs, i.e. x % y.

The modulus and integer division together satisfy the identity that x == (x // y) * y + (x % y) for all non-zero values of y. In addition, the result of x % y is always in the range [0; y) for positive y, and in the range (y; 0] for negative y.

If x or y are multi-column expressions, then they must have the same number of columns, and the % operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The modulus operation can only be applied to integer columns. The resulting column will have stype equal to the largest of the stypes of both columns, but at least int32.

Parameters
x, y
FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return
FExpr

An expression that evaluates x % y.

See also
  • x // y – integer division operator.
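
The identity above can be checked directly; a minimal sketch with hypothetical data:

import datatable as dt
from datatable import f

DT = dt.Frame(x=[7, -7], y=[3, 3])
DT[:, {"q": f.x // f.y, "r": f.x % f.y}]
# q == [2, -3] and r == [1, 2]: in both rows x == q*y + r,
# and r lies in the range [0; 3) since y is positive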

datatable.FExpr.__mul__()

Multiply two FExprs together, which corresponds to python operator *.

If x or y are multi-column expressions, then they must have the same number of columns, and the * operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The multiplication operation can only be applied to numeric columns. The resulting column will have stype equal to the larger of the stypes of its arguments, but at least int32.

Parameters
x, y
FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return
FExpr

An expression that evaluates x * y.

datatable.FExpr.__ne__()

Compare whether values in columns x and y are not equal.

Like all other FExpr operators, the inequality operator is elementwise: it produces a column where each element is the result of comparison x[i] != y[i].

If x or y are multi-column expressions, then they must have the same number of columns, and the != operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The inequality operator can be applied to columns of any type, and the types of x and y are allowed to be different. In the latter case the columns will be converted into a common stype before the comparison.

Parameters
x, y
FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return
FExpr

An expression that evaluates x != y. The produced column will have stype bool8.

datatable.FExpr.__neg__()

Unary minus, which corresponds to python operation -x.

If x is a multi-column expression, then the - operator will be applied to each column in turn.

Unary minus can only be applied to numeric columns. The resulting column will have the same stype as its argument, but not less than int32.

Parameters
x
FExpr

Either an FExpr, or any object that can be converted into FExpr.

return
FExpr

An expression that evaluates -x.

datatable.FExpr.__or__()

Compute bitwise OR of x and y.

If x or y are multi-column expressions, then they must have the same number of columns, and the | operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The OR operator can only be applied to integer or boolean columns. The resulting column will have stype equal to the larger of the stypes of its arguments.

When both x and y are boolean, then the bitwise OR operator is equivalent to logical OR. This can be used to combine several logical conditions into a compound one (since Python doesn’t allow overloading of operator or). Beware, however, that | has higher precedence than the comparison operators, so it is advisable to always use parentheses:

DT[(f.x < -1) | (f.x > 1), :]
Parameters
x, y
FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return
FExpr

An expression that evaluates x | y.

Notes

Note

Use x | y in order to OR two boolean FExprs. Using standard Python keyword or will result in an error.

datatable.FExpr.__pos__()

Unary plus, which corresponds to python operation +x.

If x is a multi-column expression, then the + operator will be applied to each column in turn.

Unary plus can only be applied to numeric columns. The resulting column will have the same stype as its argument, but not less than int32.

Parameters
x
FExpr

Either an FExpr, or any object that can be converted into FExpr.

return
FExpr

An expression that evaluates +x.

datatable.FExpr.__pow__()

Raise x to the power y, or in math notation \(x^y\).

If x or y are multi-column expressions, then they must have the same number of columns, and the ** operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The power operator can only be applied to numeric columns, and the resulting column will have stype float64 in all cases except when both arguments are float32 (in which case the result is also float32).

Parameters
x, y
FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return
FExpr

An expression that evaluates x ** y.
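
For example (hypothetical column x):

from datatable import f

DT[:, f.x ** 2]     # square of each value
DT[:, f.x ** 0.5]   # square root of each value; the result has stype float64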

datatable.FExpr.__repr__()

Return string representation of this object. This method is used by Python’s built-in function repr().

The returned string has the following format:

"FExpr<...>"

where ... will attempt to match the expression used to construct this FExpr.

Examples
repr(3 + 2*(f.A + f["B"]))
"FExpr<3 + 2 * (f.A + f['B'])>"

datatable.FExpr.__rshift__()

Shift x by y bits to the right, i.e. x >> y. Mathematically this is equivalent to \(\lfloor x\cdot 2^{-y} \rfloor\).

If x or y are multi-column expressions, then they must have the same number of columns, and the >> operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The right-shift operator can only be applied to integer columns, and the resulting column will have the same stype as its argument.

Parameters
x, y
FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return
FExpr

An expression that evaluates x >> y.

See also
  • x << y – left-shift operator.

datatable.FExpr.__sub__()

Subtract two FExprs, which corresponds to python operation x - y.

If x or y are multi-column expressions, then they must have the same number of columns, and the - operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The subtraction operation can only be applied to numeric columns. The resulting column will have stype equal to the larger of the stypes of its arguments, but at least int32.

Parameters
x, y
FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return
FExpr

An expression that evaluates x - y.

datatable.FExpr.__truediv__()

Divide two FExprs, which corresponds to python operation x / y.

If x or y are multi-column expressions, then they must have the same number of columns, and the / operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The division operation can only be applied to numeric columns. The resulting column will have stype float64 in all cases except when both arguments have stype float32 (in which case the result is also float32).

Parameters
x, y
FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return
FExpr

An expression that evaluates x / y.

See also
  • x // y – integer division operator.

datatable.FExpr.__xor__()

Compute bitwise XOR of x and y.

If x or y are multi-column expressions, then they must have the same number of columns, and the ^ operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The XOR operator can only be applied to integer or boolean columns. The resulting column will have stype equal to the larger of the stypes of its arguments.

When both x and y are boolean, then the bitwise XOR operator is equivalent to logical XOR. This can be used to combine several logical conditions into a compound one (Python has no logical xor operator that could be used instead). Beware, however, that ^ has higher precedence than the comparison operators, so it is advisable to always use parentheses:

DT[(f.x == 0) ^ (f.y == 0), :]
Parameters
x, y
FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return
FExpr

An expression that evaluates x ^ y.

datatable.FExpr.count()

Equivalent to dt.count(self).

datatable.FExpr.extend()

Append FExpr arg to the current FExpr.

Each FExpr represents a collection of columns, or a columnset. This method takes two such columnsets and combines them into a single one, similar to cbind().

Parameters
arg
FExpr

The expression to append.

return
FExpr

New FExpr which is a combination of the current FExpr and arg.
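
A minimal sketch of how extend() may be used (the data and column names are hypothetical):

import datatable as dt
from datatable import f

DT = dt.Frame(A=[1, 2], B=[3, 4], C=["x", "y"])
DT[:, f[int].extend(f.C)]   # all integer columns, with column C appended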

See also
  • remove() – remove columns from a columnset.

datatable.FExpr.first()

Equivalent to dt.first(self).

datatable.FExpr.last()

Equivalent to dt.last(self).

datatable.FExpr.len()

Deprecated since version 0.11

This method is deprecated and will be removed in version 1.1.0. Please use dt.str.len() instead.

datatable.FExpr.max()

Equivalent to dt.max(self).

datatable.FExpr.mean()

Equivalent to dt.mean(self).

datatable.FExpr.median()

Equivalent to dt.median(self).

datatable.FExpr.min()

Equivalent to dt.min(self).

datatable.FExpr.re_match()

Deprecated since version 1.0.0

This method is deprecated and will be removed in version 1.1.0. Please use dt.re.match() instead.

datatable.FExpr.remove()

Remove columns arg from the current FExpr.

Each FExpr represents a collection of columns, or a columnset. Some of those columns are computed while others are specified “by reference”, for example f.A, f[:3] or f[int]. This method allows you to remove by-reference columns from an existing FExpr.

Parameters
arg
FExpr

The columns to remove. These must be “columns-by-reference”, i.e. they cannot be computed columns.

return
FExpr

New FExpr which is obtained from the current FExpr by removing the columns in arg.
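
A minimal sketch (hypothetical column names):

import datatable as dt
from datatable import f

DT = dt.Frame(A=[1, 2], B=[3, 4], C=[5, 6])
DT[:, f[:].remove(f.B)]   # all columns except B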

See also
  • extend() – append columns to a columnset.

datatable.FExpr.rowall()

Equivalent to dt.rowall(self).

datatable.FExpr.rowany()

Equivalent to dt.rowany(self).

datatable.FExpr.rowcount()

Equivalent to dt.rowcount(self).

datatable.FExpr.rowfirst()

Equivalent to dt.rowfirst(self).

datatable.FExpr.rowlast()

Equivalent to dt.rowlast(self).

datatable.FExpr.rowmax()

Equivalent to dt.rowmax(self).

datatable.FExpr.rowmean()

Equivalent to dt.rowmean(self).

datatable.FExpr.rowmin()

Equivalent to dt.rowmin(self).

datatable.FExpr.rowsd()

Equivalent to dt.rowsd(self).

datatable.FExpr.rowsum()

Equivalent to dt.rowsum(self).

datatable.FExpr.sd()

Equivalent to dt.sd(self).

datatable.FExpr.shift()

Equivalent to dt.shift(self).

datatable.FExpr.sum()

Equivalent to dt.sum(self).

datatable.Frame

class
Frame

Two-dimensional column-oriented container of data. This is the primary data structure in the datatable module.

A Frame is two-dimensional in the sense that it is comprised of rows and columns of data. Each data cell can be located via a pair of its coordinates: (irow, icol). We do not support frames with more or less than two dimensions.

A Frame is column-oriented in the sense that internally the data is stored separately for each column. Each column has its own name and type. Types may be different for different columns but cannot vary within each column.

Thus, the dimensions of a Frame are not symmetrical: a Frame is not a matrix. Internally the class is optimized for the use case when the number of rows significantly exceeds the number of columns.

A Frame can be viewed as a list of columns: standard Python function len() will return the number of columns in the Frame, and frame[j] will return the column at index j (each “column” will be a Frame with ncols == 1). Similarly, you can iterate over the columns of a Frame in a loop, or use it in a *-expansion:

for column in frame:
    ...  # do something

list_of_columns = [*frame]

A Frame can also be viewed as a dict of columns, where the key associated with each column is its name. Thus, frame[name] will return the column with the requested name. A Frame can also work with standard python **-expansion:

dict_of_columns = {**frame}

Construction

Frame(*args, **kws)

Construct the frame from various Python sources.

dt.fread(src)

Read an external file and convert into a Frame.

.copy()

Create a copy of the frame.

Properties

.key

The primary key for the Frame, if any.

.ltypes

Logical types (dt.ltypes) of all columns.

.meta

The frame’s meta information.

.names

The names of all columns in the frame.

.ncols

Number of columns in the frame.

.nrows

Number of rows in the frame.

.shape

A tuple (number of rows, number of columns).

.source

Where this frame was loaded from.

.stype

The common dt.stype for the entire frame.

.stypes

Storage types (dt.stypes) of all columns.

.type

The common type (dt.Type) for the entire frame.

.types

The types (dt.Type) of all columns.

Frame manipulation

frame[i, j, ...]

Primary method for extracting data from a frame.

frame[i, j, ...] = values

Update data within the frame.

del frame[i, j, ...]

Remove rows/columns/values from the frame.

.cbind(*frames)

Append columns of other frames to this frame.

.rbind(*frames)

Append other frames at the bottom of the current.

.replace(what, with)

Search and replace values in the frame.

.sort(cols)

Sort the frame by the specified columns.

Convert into other formats

.to_arrow()

Convert the frame into an Arrow table.

.to_csv(file)

Write the frame’s data into CSV format.

.to_dict()

Convert the frame into a Python dictionary, by columns.

.to_jay(file)

Store the frame’s data into a binary file in Jay format.

.to_list()

Return the frame’s data as a list of lists, by columns.

.to_numpy()

Convert the frame into a numpy array.

.to_pandas()

Convert the frame into a pandas DataFrame.

.to_tuples()

Return the frame’s data as a list of tuples, by rows.

Statistical methods

.countna()

Count missing values for each column in the frame.

.countna1()

Count missing values for a one-column frame and return it as a scalar.

.kurt()

Calculate excess kurtosis for each column in the frame.

.kurt1()

Calculate excess kurtosis for a one-column frame and return it as a scalar.

.max()

Find the largest element for each column in the frame.

.max1()

Find the largest element for a one-column frame and return it as a scalar.

.mean()

Calculate the mean value for each column in the frame.

.mean1()

Calculate the mean value for a one-column frame and return it as a scalar.

.min()

Find the smallest element for each column in the frame.

.min1()

Find the smallest element for a one-column frame and return it as a scalar.

.mode()

Find the mode value for each column in the frame.

.mode1()

Find the mode value for a one-column frame and return it as a scalar.

.nmodal()

Calculate the modal frequency for each column in the frame.

.nmodal1()

Calculate the modal frequency for a one-column frame and return it as a scalar.

.nunique()

Count the number of unique values for each column in the frame.

.nunique1()

Count the number of unique values for a one-column frame and return it as a scalar.

.sd()

Calculate the standard deviation for each column in the frame.

.sd1()

Calculate the standard deviation for a one-column frame and return it as a scalar.

.skew()

Calculate skewness for each column in the frame.

.skew1()

Calculate skewness for a one-column frame and return it as a scalar.

.sum()

Calculate the sum of all values for each column in the frame.

.sum1()

Calculate the sum of all values for a one-column frame and return it as a scalar.

Miscellaneous methods

.colindex(name)

Find the position of a column in the frame by its name.

.export_names()

Create python variables for each column of the frame.

.head()

Return the first few rows of the frame.

.materialize()

Make sure all frame’s data is physically written to memory.

.tail()

Return the last few rows of the frame.

Special methods

These methods are not intended to be called manually, instead they provide a way for datatable to interoperate with other Python modules or builtin functions.

.__copy__()

Used by Python module copy.

.__deepcopy__()

Used by Python module copy.

.__delitem__()

Method that implements the del DT[...] call.

.__getitem__()

Method that implements the DT[...] call.

.__getstate__()

Used by Python module pickle.

.__init__(...)

The constructor function.

.__iter__()

Used by Python function iter(), or when the frame is used as a target in a loop.

.__len__()

Used by Python function len().

.__repr__()

Used by Python function repr().

.__reversed__()

Used by Python function reversed().

.__setitem__()

Method that implements the DT[...] = expr call.

.__setstate__()

Used by Python module pickle.

.__sizeof__()

Used by sys.getsizeof.

.__str__()

Used by Python function str.

._repr_html_()

Used to display the frame in Jupyter Lab.

._repr_pretty_()

Used to display the frame in an IPython console.

datatable.Frame.__init__()

Frame(_data=None, *, names=None, types=None, type=None, **cols)

Create a new Frame from a single or multiple sources.

Argument _data (or **cols) contains the source data for Frame’s columns. Column names are either derived from the data, given explicitly via the names argument, or generated automatically. Either way, the constructor ensures that column names are unique, non-empty, and do not contain certain special characters (see Name mangling for details).

Parameters
_data
Any

The first argument to the constructor represents the source from which to construct the Frame. If this argument is given, then the varkwd arguments **cols should not be used.

This argument can accept a wide range of data types; see the “Details” section below.

**cols
Any

Sequence of varkwd column initializers. The keys become column names, and the values contain column data. Using varkwd arguments is equivalent to passing a dict as the _data argument.

When varkwd initializers are used, the names parameter may not be given.

names
List[str|None]

Explicit list (or tuple) of column names. The number of elements in the list must be the same as the number of columns being constructed.

This parameter should not be used when constructing the frame from **cols.

types
List[Type] | Dict[str, Type]

Explicit list (or dict) of column types. The number of elements in the list must be the same as the number of columns being constructed.

type
Type | type

Similar to types, but provide a single type that will be used for all columns. This option cannot be used together with types.

return
Frame

A Frame object is constructed and returned.

except
ValueError

The exception is raised if the lengths of names or types lists are different from the number of columns created, or when creating several columns and they have incompatible lengths.

Details

The shape of the constructed Frame depends on the type of the source argument _data (or **cols). The argument _data and varkwd arguments **cols are mutually exclusive: they cannot be used at the same time. However, it is possible to use neither and construct an empty frame:

dt.Frame()      # empty 0x0 frame
dt.Frame(None)  # same
dt.Frame([])    # same

The varkwd arguments **cols can be used to construct a Frame by columns. In this case the keys become column names, and the values are column initializers. This form is mostly used for convenience; it is equivalent to converting cols into a dict and passing as the first argument:

dt.Frame(A=range(7),
         B=[0.1, 0.3, 0.5, 0.7, None, 1.0, 1.5],
         C=["red", "orange", "yellow", "green", "blue", "indigo", "violet"])
# equivalent to
dt.Frame({"A": range(7),
          "B": [0.1, 0.3, 0.5, 0.7, None, 1.0, 1.5],
          "C": ["red", "orange", "yellow", "green", "blue", "indigo", "violet"]})

    A      B        C
    int32  float64  str32
0   0      0.1      red
1   1      0.3      orange
2   2      0.5      yellow
3   3      0.7      green
4   4      NA       blue
5   5      1        indigo
6   6      1.5      violet

The argument _data accepts a wide range of input types. The following list describes possible choices:

List[List | Frame | np.array | pd.DataFrame | pd.Series | range | typed_list]

When the source is a non-empty list containing other lists or compound objects, then each item will be interpreted as a column initializer, and the resulting frame will have as many columns as the number of items in the list.

Each element in the list must produce a single column. Thus, it is not allowed to use multi-column Frames, or multi-dimensional numpy arrays or pandas DataFrames.

dt.Frame([[1, 3, 5, 7, 11],
          [12.5, None, -1.1, 3.4, 9.17]])

    C0     C1
    int32  float64
0   1      12.5
1   3      NA
2   5      -1.1
3   7      3.4
4   11     9.17

Note that unlike pandas and numpy, we treat a list of lists as a list of columns, not a list of rows. If you need to create a Frame from a row-oriented store of data, you can use a list of dictionaries or a list of tuples as described below.

List[Dict]

If the source is a list of dict objects, then each element in this list is interpreted as a single row. The keys in each dictionary are column names, and the values contain contents of each individual cell.

The rows don’t have to have the same number or order of entries: all missing elements will be filled with NAs:

dt.Frame([{"A": 3, "B": 7}, {"A": 0, "B": 11, "C": -1}, {"C": 5}])
ABC
int32int32int32
037NA
1011-1
2NANA5

If the names parameter is given, then only the keys given in the list of names will be taken into account, all extra fields will be discarded.

List[Tuple]

If the source is a list of tuples, then each tuple represents a single row. The tuples must have the same size, otherwise an exception will be raised:

dt.Frame([(39, "Mary"), (17, "Jasmine"), (23, "Lily")], names=['age', 'name'])

    age    name
    int32  str32
0   39     Mary
1   17     Jasmine
2   23     Lily

If the tuples are in fact namedtuples, then the field names will be used for the column names in the resulting Frame. No check is made whether the named tuples in fact belong to the same class.
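
For instance, a small sketch of the namedtuple behavior:

from collections import namedtuple
import datatable as dt

Person = namedtuple("Person", ["age", "name"])
DT = dt.Frame([Person(39, "Mary"), Person(17, "Jasmine")])
# DT.names == ("age", "name")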

List[Any]

If the list’s first element does not match any of the cases above, then it is considered a “list of primitives”. Such a list will be parsed as a single column.

The entries are typically bools, ints, floats, strs, or Nones; numpy scalars are also allowed. If the list has elements of heterogeneous types, then we will attempt to convert them to the smallest common stype.

If the list contains only boolean values (or Nones), then it will create a column of type bool8.

If the list contains only integers (or Nones), then the resulting column will be int8 if all integers are 0 or 1; or int32 if all entries are less than \(2^{31}\) in magnitude; otherwise int64 if all entries are less than \(2^{63}\) in magnitude; or otherwise float64.

If the list contains floats, then the resulting column will have stype float64. Both None and math.nan can be used to input NA values.

Finally, if the list contains strings, then the column produced will have stype str32 if the total size of the character data is less than 2GB, or str64 otherwise.

typed_list

A typed list can be created by taking a regular list and dividing it by an stype. It behaves similarly to a simple list of primitives, except that it is parsed into the specific stype.

dt.Frame([1.5, 2.0, 3.87] / dt.float32).type
Type.float32
Dict[str, Any]

The keys are column names, and values can be any objects from which a single-column frame can be constructed: list, range, np.array, single-column Frame, pandas series, etc.

Constructing a frame from a dictionary d is exactly equivalent to calling dt.Frame(list(d.values()), names=list(d.keys())).

range

Same as if the range was expanded into a list of integers, except that the column created from a range is virtual and its creation time is nearly instant regardless of the range’s length.
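
For example, constructing a frame from a very long range remains cheap, since the column is virtual:

import datatable as dt

DT = dt.Frame(range(10**9))   # created almost instantly: the data is not materialized
DT.nrows                      # 1000000000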

Frame

If the argument is a Frame, then a shallow copy of that frame will be created, same as .copy().

str

If the source is a simple string, then the frame is created by fread-ing this string. In particular, if the string contains the name of a file, the data will be loaded from that file; if it is a URL, the data will be downloaded and parsed from that URL. Lastly, the string may simply contain a table of data.

DT1 = dt.Frame("train.csv")

dt.Frame("""
   Name    Age
   Mary     39
   Jasmine  17
   Lily     23
""")

    Name     Age
    str32    int32
0   Mary     39
1   Jasmine  17
2   Lily     23
pd.DataFrame | pd.Series

A pandas DataFrame (Series) will be converted into a datatable Frame. Column names will be preserved.

Column types will generally be the same, assuming they have a corresponding stype in datatable. If not, the column will be converted. For example, pandas date/time column will get converted into string, while float16 will be converted into float32.

If a pandas frame has an object column, we will attempt to refine it into a more specific stype. In particular, we can detect a string or boolean column stored as object in pandas.

np.array

A numpy array will get converted into a Frame of the same shape (provided that it is 2- or less- dimensional) and the same type.

If possible, we will create a Frame without copying the data (however, this is subject to numpy’s approval). The resulting frame will have copy-on-write semantics.

pyarrow.Table

An arrow table will be converted into a datatable Frame, preserving column names and types.

If the arrow table has columns of types not supported by datatable (for example lists or structs), an exception will be raised.

None

When the source is not given at all, then a 0x0 frame will be created; unless a names parameter is provided, in which case the resulting frame will have 0 rows but as many columns as given in the names list.

datatable.Frame.__copy__()

This method facilitates copying of a Frame via the python standard module copy. See .copy() for more details.

datatable.Frame.__delitem__()

del self[i, j, [by], [sort]]

This method deletes rows and columns that would have been selected from the frame if not for the del keyword.

All parameters have the same meaning as in the getter DT[i, j, ...], with the only restriction that j must select columns from the main frame only (i.e. not from the joined frame(s)), and it must select them by reference. Selecting by reference means it should be possible to tell where each column was in the original frame.

There are several modes of delete operation, depending on whether i or j are “slice-all” symbols:

  • del DT[:, :] removes everything from the frame, making it 0x0;

  • del DT[:, j] removes columns j from the frame;

  • del DT[i, :] removes rows i from the frame;

  • del DT[i, j] the shape of the frame remains the same, but the elements at [i, j] locations are replaced with NAs.
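
A minimal sketch of these modes (frame and column names are hypothetical):

import datatable as dt

DT = dt.Frame(A=[1, 2, 3], B=["a", "b", "c"])
del DT[:, "B"]   # drop column B
del DT[0, :]     # drop the first row
del DT[:, :]     # remove everything, leaving a 0x0 frame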

datatable.Frame.__getitem__()

The main method for accessing data and computing on the frame. Sometimes we also refer to it as the DT[i, j, ...] call.

Since Python does not support keyword arguments inside square brackets, all arguments are positional. The first is the row selector i, the second is the column selector j, and the rest are optional. Thus, DT[i, j] selects rows i and columns j from frame DT.

If an additional by argument is present, then the selectors i and j work within groups generated by the by() expression. The sort argument reorders the rows of the frame, and the join argument allows performing SQL joins between several frames.

The signature listed here is the most generic. But there are also special-case signatures DT[j] and DT[i, j] described below.

Parameters
i
int | slice | Frame | FExpr | List[bool] | List[Any]

The row selector.

If this is an integer or a slice, then the behavior is the same as in Python when working on a list with .nrows elements. In particular, the integer value must be within the range [-nrows; nrows). On the other hand when i is a slice, then either its start or end or both may be safely outside the row-range of the frame. The trivial slice : always selects all rows.

i may also be a single-column boolean Frame. It must have the same number of rows as the current frame, and it serves as a mask for which rows are to be selected: True indicates that the row should be included in the result, while False and None skips the row.

i may also be a single-column integer Frame. Such column specifies directly which row indices are to be selected. This is more flexible compared to a boolean column: the rows may be repeated, reordered, omitted, etc. All values in the column i must be in the range [0; nrows) or an error will be thrown. In particular, negative indices are not allowed. Also, if the column contains NA values, then it would produce an “invalid row”, i.e. a row filled with NAs.

i may also be an expression, which must evaluate into a single column, either boolean or integer. In this case the result is the same as described above for a single-column frame.

When i is a list of booleans, then it is equivalent to a single-column boolean frame. In particular, the length of the list must be equal to .nrows.

Finally, i can be a list of any of the above (integers, slices, frames, expressions, etc), in which case each element of the list is evaluated separately and then all selected rows are put together. The list may contain Nones, which will be simply skipped.

j
int | str | slice | list | dict | type | FExpr | update

This argument may either select columns, or perform computations with the columns.

int

Select a single column at the specified index. A dt.exceptions.IndexError is raised if j is not in the range [-ncols; ncols).

str

Select a single column by name. A dt.exceptions.KeyError is raised if the column with such a name does not exist.

:

This is a trivial slice, and it means “select everything”, and is roughly equivalent to SQL’s *. In the simple case of DT[i, j] call “selecting everything” means all columns from frame DT. However, when the by() clause is added, then : will now select all columns except those used in the groupby. And if the expression has a join(), then “selecting everything” will produce all columns from all frames, excluding those that were duplicate during a natural join.

slice[int]

An integer slice can be used to select a subset of columns. The behavior of a slice is exactly the same as in base Python.

slice[str]

A string slice is an expression like "colA":"colZ". In this case all columns from "colA" to "colZ" inclusive are selected. And if "colZ" appears before "colA" in the frame, then the returned columns will be in the reverse order.

Both endpoints of the slice must be valid columns (or omitted), or otherwise a dt.exceptions.KeyError will be raised.

type | stype | ltype

Select only columns of the matching type.

FExpr

An expression formula is computed within the current evaluation context (i.e. it takes into account the current frame, the filter i, the presence of groupby/join parameters, etc). The result of this evaluation is used as if that column existed in the frame.

List[bool]

If j is a list of boolean values, then it must have the length of .ncols, and it describes which columns are to be selected into the result.

List[Any]

The j can also be a list of elements of any other type listed above, with the only restriction that the items must be homogeneous. For example, you can mix ints and slice[int]s, but not ints and FExprs, or ints and strs.

Each item in the list will be evaluated separately (as if each was the sole element in j), and then all the results will be put together.

Dict[str, FExpr]

A dictionary can be used to select columns/expressions similarly to a list, but assigning them explicit names.

update

As a special case, the j argument may be the update() function, which turns the selection operation into an update. That is, instead of returning the chosen rows/columns, they will be updated instead with the user-supplied values.

by
by

When by() clause is present in the square brackets, the rest of the computations are carried out within the “context of a groupby”. This should generally be equivalent to (a) splitting the frame into separate sub-frames corresponding to each group, (b) applying DT[i, j] separately within each group, (c) row-binding the results for each group. In practice the following operations are affected:

  • all reduction operators such as dt.min() or dt.sum() now work separately within each group. Thus, instead of computing sum over the entire column, it is computed separately within each group in by(), and the resulting column will have as many rows as the number of groups.

  • certain i expressions are re-interpreted as being applied within each group. For example, if i is an integer or a slice, then it will now be selecting row(s) within each group.

  • certain functions (such as dt.shift()) are also “group-aware”, and produce results that take into account the groupby context. Check documentation for each individual function to find out whether it has special treatment for groupby contexts.

In addition, by() also affects the order of columns in the output frame. Specifically, all columns listed as the groupby keys will be automatically placed at the front of the resulting frame, and also excluded from : or f[:] within j.

sort
sort

This argument can be used to rearrange rows in the resulting frame. See sort() for details.

join
join

Performs a JOIN operation with another frame. The join() clause will calculate how the rows of the current frame match against the rows of the joined frame, and allow you to refer to the columns of the joined frame within i, j or by. In order to access columns of the joined frame, use the g. namespace.

This parameter may be listed multiple times if you need to join with several frames.

return
Frame | None

If j is an update() clause, then the current frame is modified in-place and nothing is returned.

In all other cases, the returned value is a Frame object constructed from the selected rows and columns (including the computed columns) of the current frame.

Details

The order of evaluation of expressions is that first the join clause(s) are computed, creating a mapping between the rows of the current frame and the joined frame(s). After that we evaluate by+sort. Next, the i filter is applied creating the final index of rows that will be selected. Lastly, we evaluate the j part, taking into account the current groupby and row index(es).

When evaluating j, it is essentially converted into a tree (DAG) of expressions, where each expression is evaluated from the bottom up. That is, we start evaluating from the leaf nodes (which are usually column selectors such as f[0]), and then at each node convert the incoming set of columns into a new set. Importantly, each subexpression node may produce columns of 3 types: “scalar”, “grouped”, and “full-size”. Whenever subexpressions of different levels are mixed together, they are upgraded to the highest level. Thus, a scalar may be reused for each group, and a grouped column can interoperate with a regular column by auto-expanding in such a way that it becomes constant within each group.

If, after the j is fully evaluated, it produces a column set of type “grouped”, then the resulting frame will have as many rows as there are groups. If, on the other hand, the column set is “full-size”, then the resulting frame will have as many rows as the original frame.
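
To make the above concrete, here is a small usage sketch combining a row filter, a computed column, and grouping (the data and names are hypothetical):

import datatable as dt
from datatable import f, by

DT = dt.Frame(g=["a", "a", "b", "b"], x=[1, 2, 3, 4])
DT[f.x > 1, :]                  # row filter via a boolean expression
DT[:, {"x2": f.x * 2}]          # computed column with an explicit name
DT[:, dt.sum(f.x), by(f.g)]     # one row per group: the sum of x within each g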


Extract a single column j from the frame.

The single-argument version of DT[i, j] works only for j being either an integer (indicating column index) or a string (column name). If you need any other way of addressing column(s) of the frame, use the more versatile DT[:, j] form.

Parameters
j
int | str

The index or name of a column to retrieve.

return
Frame

Single-column frame containing the column at the specified index or with the given name.

except
KeyError | IndexError

KeyError

raised if the column with the given name does not exist in the frame.

IndexError

raised if the column does not exist at the provided index j.

Extract a single value from the frame.

Parameters
i
int

The index of a row

j
int | str

The index or name of a column.

return
None | bool | int | float | str | object

A single value from the frame’s row i and column j.

datatable.Frame.__getstate__()

This method allows the frame to be pickle-able.

Pickling a Frame involves saving it into a bytes object in Jay format, but may be less efficient than saving into a file directly because Python creates a copy of the data for the bytes object.

See .to_jay() for more details and caveats about saving into Jay format.

datatable.Frame.__iter__()

Returns an iterator over the frame’s columns.

The iterator is a light-weight object of type frame_iterator, which yields consequent columns of the frame with each iteration.

Thus, the iterator produces the sequence frame[0], frame[1], frame[2], ... until the end of the frame. This works even if the user adds or deletes columns in the frame while iterating. Be careful when inserting/deleting columns at an index that was already iterated over, as it will cause some columns to be skipped or visited more than once.

This method is not intended for manual use. Instead, it is invoked by Python runtime either when you call iter(), or when you use the frame in a loop:

for column in frame:
    # column is a Frame of shape (frame.nrows, 1)
    ...

datatable.Frame.__len__()

Returns the number of columns in the Frame, same as .ncols property.

This special method is used by the python built-in function len(), and allows the dt.Frame class to satisfy python Iterable interface.

datatable.Frame.__repr__()

Returns a simple representation of the frame as a string. This method is used by Python’s built-in function repr().

The returned string has the following format:

f"<Frame#{ID} {nrows}x{ncols}>"

where {ID} is the value of id(frame) in hex format. Thus, each frame has its own unique id, though after one frame is deleted its id may be reused by another frame.


datatable.Frame.__reversed__()

Returns an iterator over the frame’s columns in reverse order.

This is similar to .__iter__(), except that the columns are returned in the reverse order, i.e. frame[-1], frame[-2], frame[-3], etc.

This function is not intended for manual use. Instead, it is invoked by Python builtin function reversed().

datatable.Frame.__setitem__()

This method updates values within the frame, or adds new columns to the frame.

All parameters have the same meaning as in the getter DT[i, j, ...], with the only restriction that j must select columns by reference (i.e. it may not contain computed columns). On the other hand, j may contain columns that do not exist in the frame yet: these columns will be created.

Parameters
i
...

Row selector.

j
...

Column selector. Computed columns are forbidden, but not-existing (new) columns are allowed.

by
by

Groupby condition.

join
join

Join criterion.

R
FExpr | List[FExpr] | Frame | type | None | bool | int | float | str

The replacement for the selection on the left-hand-side.

None | bool | int | float | str

A simple python scalar can be assigned to any-shape selection on the LHS. If i selects all rows (i.e. the assignment is of the form DT[:, j] = R), then each column in j will be replaced with a constant column containing the value R.

If, on the other hand, i selects only some rows, then the type of R must be consistent with the type of column(s) selected in j. In this case only cells in subset [i, j] will be updated with the value of R; the columns may be promoted within their ltype if the value of R is large in magnitude.

type | stype | ltype

Assigning a type to one or more columns will change the types of those columns. The row selector i must be “slice-all” :.

Frame | FExpr | List[FExpr]

When a frame or an expression is assigned, then the shape of the RHS must match the shape of the LHS. Similarly to the assignment of scalars, types must be compatible when assigning to a subset of rows.
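
A short sketch of typical assignments (hypothetical frame and column names):

import datatable as dt
from datatable import f

DT = dt.Frame(A=[1, -2, 3])
DT[:, "B"] = 0         # create a new constant column
DT[f.A < 0, "A"] = 0   # update only the rows where A is negative
DT[:, "C"] = f.A * 10  # assign a computed expression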

See Also
  • dt.update() – An alternative way to update values in the frame within DT[i, j] getter.

  • .replace() – Search and replace for certain values within the entire frame.

A simplified form of the setter, suitable for a single-column replacement. In this case j may only be an integer or a string.

datatable.Frame.__sizeof__()

Return the size of this Frame in memory.

The function attempts to compute the total memory size of the frame as precisely as possible. In particular, it takes into account not only the size of data in columns, but also sizes of all auxiliary internal structures.

Special cases: if frame is a view (say, d2 = DT[:1000, :]), then the reported size will not contain the size of the data, because that data “belongs” to the original datatable and is not copied. However if a frame selects only a subset of columns (say, d3 = DT[:, :5]), then a view is not created and instead the columns are copied by reference. Frame d3 will report the “full” size of its columns, even though they do not occupy any extra memory compared to DT. This behavior may be changed in the future.

This function is not intended for manual use. Instead, in order to get the size of a frame DT, call sys.getsizeof(DT).

datatable.Frame.__str__()

Returns a string with the Frame’s data formatted as a table, i.e. the same representation as displayed when trying to inspect the frame from Python console.

Different aspects of the stringification process can be controlled via dt.options.display options; but under the default settings the returned string will be sufficiently small to fit into a typical terminal window. If the frame has too many rows/columns, then only a small sample near the start+end of the frame will be rendered.


datatable.Frame.cbind()

Append columns of one or more frames to the current Frame.

For example, if the current frame has n columns, and you are appending another frame with k columns, then after this method succeeds, the current frame will have n + k columns. Thus, this method is roughly equivalent to pandas.concat(axis=1).

The frames being cbound must all either have the same number of rows, or some of them may have only a single row. Such single-row frames will be automatically expanded, replicating the value as needed. This makes it easy to create constant columns or to append reduction results (such as min/max/mean/etc) to the current Frame.

If some of the frames have an incompatible number of rows, then the operation will fail with a dt.exceptions.InvalidOperationError. However, if you set the flag force to True, then the error will no longer be raised - instead all frames that are shorter than the others will be padded with NAs.

If the frames being appended have the same column names as the current frame, then those names will be mangled to ensure that the column names in the current frame remain unique. A warning will also be issued in this case.

Parameters
frames
Frame | List[Frame] | None

The list/tuple/sequence/generator expression of Frames to append to the current frame. The list may also contain None values, which will be simply skipped.

force
bool

If True, allows Frames to be appended even if they have an unequal number of rows. The resulting Frame will have the number of rows equal to the largest among all Frames. Those Frames which have fewer than the largest number of rows will be padded with NAs (with the exception of Frames having just 1 row, which will be replicated instead of filling with NAs).

return
None

This method alters the current frame in-place, and doesn’t return anything.

except
InvalidOperationError

If trying to cbind frames with the number of rows different from the current frame’s, and the option force is not set.

Notes

Cbinding frames is a very cheap operation: the columns are copied by reference, which means the complexity of the operation depends only on the number of columns, not on the number of rows. Still, if you are planning to cbind a large number of frames, it will be beneficial to collect them in a list first and then call a single cbind() instead of cbinding them one-by-one.

It is possible to cbind frames using the standard DT[i,j] syntax:

df[:, update(**frame1, **frame2, ...)]

Or, if you need to append just a single column:

df["newcol"] = frame1
Examples
DT = dt.Frame(A=[1, 2, 3], B=[4, 7, 0])
frame1 = dt.Frame(N=[-1, -2, -5])
DT.cbind(frame1)
DT

    A      B      N
    int32  int32  int32
0   1      4      -1
1   2      7      -2
2   3      0      -5
See also
  • dt.cbind() – function for cbinding frames “out-of-place” instead of in-place;

  • .rbind() – method for row-binding frames.

datatable.Frame.colindex()

Return the position of the column in the Frame.

The index of the first column is 0, just as with regular python lists.

Parameters
column
str | int | FExpr

If string, then this is the name of the column whose index you want to find.

If integer, then this represents a column’s index. The return value is thus the same as the input argument column, provided that it is in the correct range. If the column argument is negative, then it is interpreted as counting from the end of the frame. In this case the positive value column + ncols is returned.

Lastly, column argument may also be an f-expression such as f.A or f[3]. This case is treated as if the argument was simply "A" or 3. More complicated f-expressions are not allowed and will result in a TypeError.

return
int

The numeric index of the provided column. This will be an integer between 0 and self.ncols - 1.

except
KeyError | IndexError

dt.exceptions.KeyError

raised if the column argument is a string, and the column with such name does not exist in the frame. When this exception is thrown, the error message may contain suggestions for up to 3 similarly looking column names that actually exist in the Frame.

dt.exceptions.IndexError

raised if the column argument is an integer that is either greater than or equal to .ncols or less than -ncols.

Examples
df = dt.Frame(A=[3, 14, 15], B=["enas", "duo", "treis"], C=[0, 0, 0])
df.colindex("B")
1
df.colindex(-1)
2
from datatable import f
df.colindex(f.A)
0

datatable.Frame.copy()

Make a copy of the frame.

The returned frame will be an identical copy of the original, including column names, types, and keys.

By default, copying is shallow with copy-on-write semantics. This means that only the minimal information about the frame is copied, while all the internal data buffers are shared between the copies. Nevertheless, due to the copy-on-write semantics, any changes made to one of the frames will not propagate to the other; instead, the data will be copied whenever the user attempts to modify it.

It is also possible to explicitly request a deep copy of the frame by setting the parameter deep to True. With this flag, the returned copy will be truly independent from the original. The returned frame will also be fully materialized in this case.

Parameters
deep
bool

Flag indicating whether to return a “shallow” (default), or a “deep” copy of the original frame.

return
Frame

A new Frame, which is the copy of the current frame.

Examples
DT1 = dt.Frame(range(5))
DT2 = DT1.copy()
DT2[0, 0] = -1
DT2

    C0
    int32
0   -1
1   1
2   2
3   3
4   4

DT1

    C0
    int32
0   0
1   1
2   2
3   3
4   4
Notes
  • Non-deep frame copy is a very low-cost operation: its speed depends on the number of columns only, not on the number of rows. On a regular laptop copying a 100-column frame takes about 30-50µs.

  • Deep copying is more expensive, since the data has to be physically written to new memory, and if the source columns are virtual, then they need to be computed too.

  • Another way to create a copy of the frame is using a DT[i, j] expression (however, this will not copy the key property):

    DT[:, :]
  • Frame class also supports copying via the standard Python library copy:

    import copy
    DT_shallow_copy = copy.copy(DT)
    DT_deep_copy = copy.deepcopy(DT)

datatable.Frame.countna()

Report the number of NA values in each column of the frame.

Parameters
return
Frame

The frame will have one row and the same number/names of columns as in the current frame. All columns will have stype int64.

Examples
import math
DT = dt.Frame(A=[1, 5, None], B=[math.nan]*3, C=[None, None, 'bah!'])
DT.countna()

    A      B      C
    int64  int64  int64
0   1      3      2

DT.countna().to_tuples()[0]
(1, 3, 2)
See Also
  • .countna1() – similar to this method, but operates on a single-column frame only, and returns a number instead of a Frame.

  • dt.count() – function for counting non-NA (“valid”) values in a column; can also be applied per-group.

datatable.Frame.countna1()

Return the number of NA values in a single-column Frame.

This function is a shortcut for:

DT.countna()[0, 0]
Parameters
except
ValueError

If called on a Frame that has more or less than one column.

return
int
See Also
  • .countna() – similar to this method, but can be applied to a Frame with an arbitrary number of columns.

  • dt.count() – function for counting non-NA (“valid”) values in a column; can also be applied per-group.

datatable.Frame.export_names()

Added in version 0.10

Return a tuple of f-expressions for all columns of the frame.

For example, if the frame has columns “A”, “B”, and “C”, then this method will return a tuple of expressions (f.A, f.B, f.C). If you assign these to, say, variables A, B, and C, then you will be able to write column expressions using the column names directly, without using the f symbol:

A, B, C = DT.export_names()
DT[A + B > C, :]

The variables that are “exported” refer to each column by name. This means that you can use the variables even after reordering the columns. In addition, the variables will work not only for the frame they were exported from, but also for any other frame that has columns with the same names.

Parameters
return
Tuple[Expr, ...]

The length of the tuple is equal to the number of columns in the frame. Each element of the tuple is a datatable expression, and can be used primarily with the DT[i,j] notation.

Notes
  • This method is effectively equivalent to:

    def export_names(self):
        return tuple(f[name] for name in self.names)
  • If you want to export only a subset of column names, then you can either subset the frame first, or use *-notation to ignore the names that you do not plan to use:

    A, B = DT[:, :2].export_names()   # export the first two columns
    A, B, *_ = DT.export_names()      # same
  • Variables that you use in code do not have to have the same names as the columns:

    Price, Quantity = DT[:, ["sale price", "quant"]].export_names()

datatable.Frame.head()

Return the first n rows of the frame.

If the number of rows in the frame is less than n, then all rows are returned.

This is a convenience function and it is equivalent to DT[:n, :].

Parameters
n
int

The maximum number of rows to return, 10 by default. This number cannot be negative.

return
Frame

A frame containing the first up to n rows from the original frame, and same columns.

Examples
DT = dt.Frame(A=["apples", "bananas", "cherries", "dates", "eggplants", "figs", "grapes", "kiwi"]) DT.head(4)
A
str32
0apples
1bananas
2cherries
3dates
See also
  • .tail() – return the last n rows of the Frame.

datatable.Frame.key

The tuple of column names that are the primary key for this frame.

If the frame has no primary key, this property returns an empty tuple.

The primary key columns are always located at the beginning of the frame, and therefore the following holds:

DT.key == DT.names[:len(DT.key)]

Assigning to this property will make the Frame keyed by the specified column(s). The key columns will be moved to the front, and the Frame will be sorted. The values in the key columns must be unique.

Parameters
return
Tuple[str, ...]

When used as a getter, returns the tuple of names of the primary key columns.

new_key
str | List[str] | Tuple[str, ...] | None

Specify a column or a list of columns that will become the new primary key of the Frame. Object columns cannot be used for a key. The values in the key column must be unique; if multiple columns are assigned as the key, then their combined (tuple-like) values must be unique.

If new_key is None, then this is equivalent to deleting the key. When the key is deleted, the key columns remain in the frame, they merely stop being marked as “key”.

except
ValueError

Raised when the values in the key column(s) are not unique.

except
KeyError

Raised when new_key contains a column name that doesn’t exist in the Frame.

Examples
DT = dt.Frame(A=range(5), B=['one', 'two', 'three', 'four', 'five'])
DT.key = 'B'
DT
B      A
str32  int32
five   4
four   3
one    0
three  2
two    1

datatable.Frame.keys()

Returns a tuple of column names, the same as the .names property.

This method is not intended for public use. It is needed in order for dt.Frame to satisfy Python’s Mapping interface.

datatable.Frame.kurt()

Calculate the excess kurtosis for each column in the frame.

Parameters
return
Frame

The frame will have one row and the same number/names of columns as in the current frame. All the columns will have float64 stype. For non-numeric columns this function returns NA values.

See Also
  • .kurt1() – similar to this method, but operates on a single-column frame only, and returns a scalar value instead of a Frame.

datatable.Frame.kurt1()

Calculate the excess kurtosis for a one-column frame and return it as a scalar.

This function is a shortcut for:

DT.kurt()[0, 0]
Parameters
return
None | float

None is returned for non-numeric columns.

except
ValueError

If called on a Frame that has more or less than one column.

See Also
  • .kurt() – similar to this method, but can be applied to a Frame with an arbitrary number of columns.

datatable.Frame.ltypes

Deprecated since version 1.0.0

This property is deprecated and will be removed in version 1.2.0. Please use .types instead.

The tuple of each column’s ltypes (“logical types”).

Parameters
return
Tuple[ltype, ...]

The length of the tuple is the same as the number of columns in the frame.

See also
  • .stypes – tuple of columns’ storage types

datatable.Frame.materialize()

Force all data in the Frame to be laid out physically.

In datatable, a Frame may contain “virtual” columns, i.e. columns whose data is computed on-the-fly. This allows us to have better performance for certain types of computations, while also reducing the total memory footprint. The use of virtual columns is generally transparent to the user, and datatable will materialize them as needed.

However, there could be situations where you might want to materialize your Frame explicitly. In particular, materialization will carry out all delayed computations and break internal references on other Frames’ data. Thus, for example if you subset a large frame to create a smaller subset, then the new frame will carry an internal reference to the original, preventing it from being garbage-collected. However, if you materialize the small frame, then the data will be physically copied, allowing the original frame’s memory to be freed.

Parameters
to_memory
bool

If True, then, in addition to de-virtualizing all columns, this method will also copy all memory-mapped columns into the RAM.

When you open a Jay file, the Frame that is created will contain memory-mapped columns whose data still resides on disk. Calling .materialize(to_memory=True) will force the data to be loaded into the main memory. This may be beneficial if you are concerned about the disk speed, or if the file is on a removable drive, or if you want to delete the source file.

return
None

This operation modifies the frame in-place.
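
Examples

A minimal sketch of the pattern described above (the frame sizes are illustrative):

DT = dt.Frame(A=range(1000000))   # a large frame
small = DT[:5, :]                 # may hold an internal reference to DT's data
small.materialize()               # physically copy the five rows
del DT                            # the large frame's memory can now be freed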

datatable.Frame.max()

Find the largest value in each column of the frame.

Parameters
return
Frame

The frame will have one row and the same number, names and stypes of columns as in the current frame. For string/object columns this function returns NA values.

See Also
  • .max1() – similar to this method, but operates on a single-column frame only, and returns a scalar value instead of a Frame.

  • dt.max() – function for finding largest values in a column or an expression; can also be applied per-group.

datatable.Frame.max1()

Return the largest value in a single-column Frame. The frame’s stype must be numeric.

This function is a shortcut for:

DT.max()[0, 0]
Parameters
return
bool | int | float

The returned value corresponds to the stype of the frame.

except
ValueError

If called on a Frame that has more or less than one column.

See Also
  • .max() – similar to this method, but can be applied to a Frame with an arbitrary number of columns.

  • dt.max() – function for finding the largest values in a column or an expression; can also be applied per-group.

datatable.Frame.mean()

Calculate the mean value for each column in the frame.

Parameters
return
Frame

The frame will have one row and the same number/names of columns as in the current frame. All columns will have float64 stype. For string/object columns this function returns NA values.

See Also
  • .mean1() – similar to this method, but operates on a single-column frame only, and returns a scalar value instead of a Frame.

  • dt.mean() – function for calculating mean values in a column or an expression; can also be applied per-group.

datatable.Frame.mean1()

Calculate the mean value for a single-column Frame.

This function is a shortcut for:

DT.mean()[0, 0]
Parameters
return
None | float

None is returned for string/object columns.

except
ValueError

If called on a Frame that has more or less than one column.

See Also
  • .mean() – similar to this method, but can be applied to a Frame with an arbitrary number of columns.

  • dt.mean() – function for calculating mean values in a column or an expression; can also be applied per-group.

datatable.Frame.meta

Added in version 1.0

Frame’s meta information.

This property contains meta information, if any, as set by datatable functions and methods. It is a settable property, so that users can also update it with any information relevant to a particular frame.

It is not guaranteed that the existing meta information will be preserved by the functions and methods called on the frame. In particular, it is not preserved when exporting data into a Jay file or pickling the data. This behavior may change in the future.

The default value for this property is None.

Parameters
return
dict | None

If the frame carries any meta information, the corresponding dictionary is returned; otherwise None is returned.

new_meta
dict | None

New meta information.
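
Examples

A minimal sketch of attaching custom meta information to a frame (the dictionary key used here is arbitrary):

DT = dt.Frame(A=[1, 2, 3])
DT.meta                            # None by default
DT.meta = {"note": "demo data"}    # any user-defined dictionary
DT.meta["note"]
'demo data'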

datatable.Frame.min()

Find the smallest value in each column of the frame.

Parameters
return
Frame

The frame will have one row and the same number, names and stypes of columns as in the current frame. For string/object columns this function returns NA values.

See Also
  • .min1() – similar to this method, but operates on a single-column frame only, and returns a scalar value instead of a Frame.

  • dt.min() – function for finding the smallest values in a column or an expression; can also be applied per-group.

datatable.Frame.min1()

Find the smallest value in a single-column Frame. The frame’s stype must be numeric.

This function is a shortcut for:

DT.min()[0, 0]
Parameters
return
bool | int | float

The returned value corresponds to the stype of the frame.

except
ValueError

If called on a Frame that has more or less than one column.

See Also
  • .min() – similar to this method, but can be applied to a Frame with an arbitrary number of columns.

  • dt.min() – function for finding the smallest values in a column or an expression; can also be applied per-group.

datatable.Frame.mode()

Find the mode for each column in the frame.

Parameters
return
Frame

The frame will have one row and the same number/names of columns as in the current frame.

See Also
  • .mode1() – similar to this method, but operates on a single-column frame only, and returns a scalar value instead of a Frame.

datatable.Frame.mode1()

Find the mode for a single-column Frame.

This function is a shortcut for:

DT.mode()[0, 0]
Parameters
return
bool | int | float | str | object

The returned value corresponds to the stype of the column.

except
ValueError

If called on a Frame that has more or less than one column.

See Also
  • .mode() – similar to this method, but can be applied to a Frame with an arbitrary number of columns.

datatable.Frame.names

The tuple of names of all columns in the frame.

Each name is a non-empty string not containing any ASCII control characters, and jointly the names are unique within the frame.

This property is also assignable: setting DT.names has the effect of renaming the frame’s columns without changing their order. When renaming, the length of the new list of names must be the same as the number of columns in the frame. It is also possible to rename just a few of the columns by assigning a dictionary {oldname: newname}. Any column not listed in the dictionary will keep its old name.

When setting new column names, we will verify whether they satisfy the requirements mentioned above. If not, a warning will be emitted and the names will be automatically mangled.

Parameters
return
Tuple[str, ...]

When used in getter form, this property returns the names of all frame’s columns, as a tuple. The length of the tuple is equal to the number of columns in the frame, .ncols.

new_names
List[str?] | Tuple[str?, ...] | Dict[str, str?] | None

The most common form is to assign a list or tuple of new column names. The length of the new list must be equal to the number of columns in the frame. Some (or all) elements in the list may be None's, indicating that the corresponding column should get an auto-generated name.

If new_names is a dictionary, then it provides a mapping from old to new column names. The dictionary may contain fewer entries than the number of columns in the frame: the columns not mentioned in the dictionary will retain their names.

Setting the .names to None is equivalent to using the del keyword: the names will be set to their default values, which are usually C0, C1, ....

except
ValueError | KeyError

dt.exceptions.ValueError

Raised if the length of the list/tuple new_names does not match the number of columns in the frame.

dt.exceptions.KeyError

Raised if new_names is a dictionary containing entries that do not match any of the existing columns in the frame.

Examples
DT = dt.Frame([[1], [2], [3]])
DT.names = ['A', 'B', 'C']
DT.names
('A', 'B', 'C')
DT.names = {'B': 'middle'}
DT.names
('A', 'middle', 'C')
del DT.names
DT.names
('C0', 'C1', 'C2')

datatable.Frame.ncols

Number of columns in the frame.

Parameters
return
int

The number of columns can be either zero or a positive integer.

Notes

The expression len(DT) also returns the number of columns in the frame DT. Such usage, however, is not recommended.

See also
  • .nrows: getter for the number of rows of the frame.

datatable.Frame.nmodal()

Calculate the modal frequency for each column in the frame.

Parameters
return
Frame

The frame will have one row and the same number/names of columns as in the current frame. All the columns will have int64 stype.

See Also
  • .nmodal1() – similar to this method, but operates on a single-column frame only, and returns a scalar value instead of a Frame.

datatable.Frame.nmodal1()

Calculate the modal frequency for a single-column Frame.

This function is a shortcut for:

DT.nmodal()[0, 0]
Parameters
return
int
except
ValueError

If called on a Frame that has more or less than one column.

See Also
  • .nmodal() – similar to this method, but can be applied to a Frame with an arbitrary number of columns.

datatable.Frame.nrows

Number of rows in the Frame.

Assigning to this property will change the height of the Frame, either by truncating if the new number of rows is smaller than the current, or filling with NAs if the new number of rows is greater.

Increasing the number of rows of a keyed Frame is not allowed.

Parameters
return
int

The number of rows can be either zero or a positive integer.

n
int

The new number of rows for the frame; this should be a non-negative integer.
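
Examples

A minimal sketch of resizing a frame through this property:

DT = dt.Frame(A=[1, 2, 3, 4, 5])
DT.nrows
5
DT.nrows = 3    # truncate to the first three rows
DT.nrows = 6    # grow back; the new rows are filled with NAs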

See also
  • .ncols: getter for the number of columns of the frame.

datatable.Frame.nunique()

Count the number of unique values for each column in the frame.

Parameters
return
Frame

The frame will have one row and the same number/names of columns as in the current frame. All the columns will have int64 stype.

See Also
  • .nunique1() – similar to this method, but operates on a single-column frame only, and returns a scalar value instead of a Frame.

datatable.Frame.nunique1()

Count the number of unique values for a one-column frame and return it as a scalar.

This function is a shortcut for:

DT.nunique()[0, 0]
Parameters
return
int
except
ValueError

If called on a Frame that has more or less than one column.

See Also
  • .nunique() – similar to this method, but can be applied to a Frame with an arbitrary number of columns.

datatable.Frame.rbind()

Append rows of frames to the current frame.

This is equivalent to list.extend() in Python: the frames are combined by rows, i.e. rbinding a frame of shape [n x k] to a Frame of shape [m x k] produces a frame of shape [(m + n) x k].

This method modifies the current frame in-place. If you do not want the current frame modified, then use the dt.rbind() function.

If the frame(s) being appended have columns of types different from those in the current frame, then such columns will be promoted according to the standard promotion rules. In particular, booleans can be promoted into integers, which in turn get promoted into floats. However, they are not promoted into strings or objects.

If frames have columns of incompatible types, a TypeError will be raised.

If you need to append multiple frames, then it is more efficient to collect them into a list first and then do a single rbind() call than to append them one at a time in a loop (see the sketch after the parameter list).

Appending data to a frame opened from disk will force loading the current frame into memory, which may fail with an OutOfMemory exception if the frame is sufficiently big.

Parameters
frames
Frame | List[Frame]

One or more frames to append. These frames should have the same columnar structure as the current frame (unless option force is used).

force
bool

If True, then the frames are allowed to have mismatching set of columns. Any gaps in the data will be filled with NAs.

bynames
bool

If True (default), the columns in frames are matched by their names. For example, if one frame has columns ["colA", "colB", "colC"] and the other ["colB", "colA", "colC"], then we will swap the order of the first two columns of the appended frame before performing the append. However, if bynames is False, then the column names will be ignored, and the columns will be matched according to their order, i.e. the i-th column in the current frame to the i-th column in each appended frame.

return
None
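
Examples

A minimal sketch of the single-rbind pattern recommended above (frame contents are illustrative):

DT = dt.Frame(A=[0])
parts = [dt.Frame(A=[i]) for i in range(1, 4)]
DT.rbind(parts)          # one call appends all three frames
# slower alternative: appending one at a time in a loop
# for part in parts:
#     DT.rbind(part)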

datatable.Frame.replace()

Replace given value(s) replace_what with replace_with in the entire Frame.

For each replace value, this method operates only on columns of types appropriate for that value. For example, if replace_what is a list [-1, math.inf, None, "??"], then the value -1 will be replaced in integer columns only, math.inf only in real columns, None in columns of all types, and finally "??" only in string columns.

The replacement value must match the type of the target being replaced, otherwise an exception will be thrown. That is, a bool must be replaced with a bool, an int with an int, a float with a float, and a string with a string. The None value (representing NA) matches any column type, and therefore can be used as either the replacement target or the replacement value for any column. In particular, the following is valid: DT.replace(None, [-1, -1.0, ""]). This will replace NA values in int columns with -1, in real columns with -1.0, and in string columns with an empty string.

The replace operation never causes a column to change its logical type. Thus, an integer column will remain integer, string column remain string, etc. However, replacing may cause a column to change its stype, provided that ltype remains constant. For example, replacing 0 with -999 within an int8 column will cause that column to be converted into the int32 stype.

Parameters
replace_what
None | bool | int | float | list | dict

Value(s) to search for and replace.

replace_with
single value | list

The replacement value(s). If replace_what is a single value, then this must be a single value too. If replace_what is a list, then this could be either a single value, or a list of the same length. If replace_what is a dict, then this value should not be passed.

return
None

Nothing is returned, the replacement is performed in-place.

Examples
df = dt.Frame([1, 2, 3] * 3)
df.replace(1, -1)
df
   C0
   int32
0  -1
1   2
2   3
3  -1
4   2
5   3
6  -1
7   2
8   3
df.replace({-1: 100, 2: 200, "foo": None})
df
   C0
   int32
0  100
1  200
2    3
3  100
4  200
5    3
6  100
7  200
8    3

datatable.Frame.sd()

Calculate the standard deviation for each column in the frame.

Parameters
return
Frame

The frame will have one row and the same number/names of columns as in the current frame. All the columns will have float64 stype. For non-numeric columns this function returns NA values.

See Also
  • .sd1() – similar to this method, but operates on a single-column frame only, and returns a scalar value instead of a Frame.

  • dt.sd() – function for calculating the standard deviation in a column or an expression; can also be applied per-group.

datatable.Frame.sd1()

Calculate the standard deviation for a one-column frame and return it as a scalar.

This function is a shortcut for:

DT.sd()[0, 0]
Parameters
return
None | float

None is returned for non-numeric columns.

except
ValueError

If called on a Frame that has more or less than one column.

See Also
  • .sd() – similar to this method, but can be applied to a Frame with an arbitrary number of columns.

  • dt.sd() – function for calculating the standard deviation in a column or an expression; can also be applied per-group.

datatable.Frame.skew()

Calculate the skewness for each column in the frame.

Parameters
return
Frame

The frame will have one row and the same number/names of columns as in the current frame. All the columns will have float64 stype. For non-numeric columns this function returns NA values.

See Also
  • .skew1() – similar to this method, but operates on a single-column frame only, and returns a scalar value instead of a Frame.

datatable.Frame.skew1()

Calculate the skewness for a one-column frame and return it as a scalar.

This function is a shortcut for:

DT.skew()[0, 0]
Parameters
return
None | float

None is returned for non-numeric columns.

except
ValueError

If called on a Frame that has more or less than one column.

See Also
  • .skew() – similar to this method, but can be applied to a Frame with an arbitrary number of columns.

datatable.Frame.shape

Tuple with (nrows, ncols) dimensions of the frame.

This property is read-only.

Parameters
return
Tuple[int, int]

Tuple with two integers: the first is the number of rows, the second is the number of columns.

See also
  • .nrows – getter for the number of rows;

  • .ncols – getter for the number of columns.

datatable.Frame.sort()

Sort frame by the specified column(s).

Parameters
cols
List[str | int]

Names or indices of the columns to sort by. If no columns are given, the Frame will be sorted on all columns.

return
Frame

New Frame sorted by the provided column(s). The current frame remains unmodified.
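
Examples

A minimal sketch (the data are illustrative):

DT = dt.Frame(A=[3, 1, 2], B=["c", "a", "b"])
DT_sorted = DT.sort("A")   # new frame ordered by column A; DT is unchanged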

datatable.Frame.source

Added in version 0.11

The name of the file where this frame was loaded from.

This is a read-only property that describes the origin of the frame. When a frame is loaded from a Jay or CSV file, this property will contain the name of that file. Similarly, if the frame was opened from a URL or from a shell command, the source will report the original URL or the command.

Certain sources may be converted into a Frame only partially, in such case the source property will attempt to reflect this fact. For example, when opening a multi-file zip archive, the source will contain the name of the file within the archive. Similarly, when opening an XLS file with several worksheets, the source property will contain the name of the XLS file, the name of the worksheet, and possibly even the range of cells that were read.

Parameters
return
str | None

If the frame was loaded from a file or similar resource, the name of that file is returned. If the frame was computed, or its data modified, the property will return None.
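
Examples

A minimal sketch (the file name is hypothetical):

DT = dt.fread("data.csv")
DT.source                     # the name of the source file, e.g. 'data.csv'
dt.Frame(A=[1, 2]).source     # None: the frame was computed, not loaded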

datatable.Frame.stype

Deprecated since version 1.0.0

This property is deprecated and will be removed in version 1.2.0. Please use .types instead.

Added in version 0.10.0

The common dt.stype for all columns.

This property is well-defined only for frames where all columns have the same stype.

Parameters
return
stype | None

For frames where all columns have the same stype, this common stype is returned. If a frame has 0 columns, None will be returned.

except
InvalidOperationError

This exception will be raised if the columns in the frame have different stypes.

See also
  • .stypes – tuple of stypes for all columns.

datatable.Frame.stypes

Deprecated since version 1.0.0

This property is deprecated and will be removed in version 1.2.0. Please use .types instead.

The tuple of each column’s stypes (“storage types”).

Parameters
return
Tuple[stype, ...]

The length of the tuple is the same as the number of columns in the frame.

See also
  • .stype – common stype for all columns

  • .ltypes – tuple of columns’ logical types

datatable.Frame.sum()

Calculate the sum of all values for each column in the frame.

Parameters
return
Frame

The frame will have one row and the same number/names of columns as in the current frame. All the columns will have float64 stype. For non-numeric columns this function returns NA values.

See Also
  • .sum1() – similar to this method, but operates on a single-column frame only, and returns a scalar value instead of a Frame.

  • dt.sum() – function for calculating the sum of all the values in a column or an expression; can also be applied per-group.

datatable.Frame.sum1()

Calculate the sum of all values for a one-column frame and return it as a scalar.

This function is a shortcut for:

DT.sum()[0, 0]
Parameters
return
None | float

None is returned for non-numeric columns.

except
ValueError

If called on a Frame that has more or less than one column.

See Also
  • .sum() – similar to this method, but can be applied to a Frame with an arbitrary number of columns.

  • dt.sum() – function for calculating the sum of all the values in a column or an expression; can also be applied per-group.

datatable.Frame.tail()

Return the last n rows of the frame.

If the number of rows in the frame is less than n, then all rows are returned.

This is a convenience function and it is equivalent to DT[-n:, :] (except when n is 0).

Parameters
n
int

The maximum number of rows to return, 10 by default. This number cannot be negative.

return
Frame

A frame containing the last up to n rows from the original frame, and same columns.

Examples
DT = dt.Frame(A=["apples", "bananas", "cherries", "dates", "eggplants", "figs", "grapes", "kiwi"]) DT.tail(3)
A
str32
0figs
1grapes
2kiwi
See also
  • .head() – return the first n rows of the Frame.

datatable.Frame.to_arrow()

Convert this frame into a pyarrow.Table object. The pyarrow module must be installed.

The conversion is multi-threaded and done in C++, but it does involve creating a copy of the data, except when the data was originally imported from Arrow. This is caused by differences in the data storage formats of datatable and Arrow.

Parameters
return
pyarrow.Table

A Table object is always returned, even if the source is a single-column datatable Frame.

except
ImportError

If the pyarrow module is not installed.
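
Examples

A minimal sketch, assuming pyarrow is installed:

DT = dt.Frame(A=[1, 2, 3])
tbl = DT.to_arrow()     # a pyarrow.Table with a single column "A"
tbl.num_rows            # 3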

datatable.Frame.to_csv()

to_csv(path=None, *, quoting="minimal", append=False, header="auto",
       bom=False, hex=False, compression="auto", verbose=False,
       method="auto")

Write the contents of the Frame into a CSV file.

This method uses multiple threads to serialize the Frame’s data. The number of threads can be configured using the global option dt.options.nthreads.

The method supports plain writing to a file, appending to an existing file, or creating a Python string if no file name was provided. Optionally, the output can be gzip-compressed.

Parameters
path
str

Path to the output CSV file that will be created. If the file already exists, it will be overwritten. If no path is given, then the Frame will be serialized into a string, and that string will be returned.

quoting
csv.QUOTE_* | "minimal" | "all" | "nonnumeric" | "none"
"minimal" | csv.QUOTE_MINIMAL

quote the string fields only as necessary, i.e. if the string starts or ends with whitespace, or contains quote characters, the separator, or any of the C0 control characters (including newlines, etc).

"all" | csv.QUOTE_ALL

all fields will be quoted: string, numeric, and boolean alike.

"nonnumeric" | csv.QUOTE_NONNUMERIC

all string fields will be quoted.

"none" | csv.QUOTE_NONE

none of the fields will be quoted. This option must be used at the user’s own risk: the file produced may not be valid CSV.

append
bool

If True, the file given in the path parameter will be opened for appending (i.e. mode=”a”), or created if it doesn’t exist. If False (default), the file given in the path will be overwritten if it already exists.

bom
bool

If True, then insert the byte-order mark into the output file (the option is False by default). Even if the option is True, the BOM will not be written when appending data to an existing file.

According to Unicode standard, including BOM into text files is “neither required nor recommended”. However, some programs (e.g. Excel) may not be able to recognize file encoding without this mark.

hex
bool

If True, then all floating-point values will be printed in hex format (equivalent to %a format in C printf). This format is around 3 times faster to write/read compared to usual decimal representation, so its use is recommended if you need maximum speed.

compression
None | "gzip" | "auto"

Which compression method to use for the output stream. The default is “auto”, which tries to infer the compression method from the output file’s name. The only compression format currently supported is “gzip”. Compression may not be used when append is True.

verbose
bool

If True, some extra information will be printed to the console, which may help to debug the inner workings of the algorithm.

method
"mmap" | "write" | "auto"

Which method to use for writing to disk. On certain systems ‘mmap’ gives better performance; on other OSes ‘mmap’ may not work at all.

return
None | str | bytes

None if path is non-empty. This is the most common case: the output is written to the file provided.

A string containing the CSV text, as if it had been written to a file, if the path is empty or None. If compression is turned on, a bytes object will be returned instead.
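
Examples

A minimal sketch of the three modes described above (the file name is hypothetical):

DT = dt.Frame(A=[1, 2, 3], B=["a", "b", "c"])
DT.to_csv("out.csv")                 # write to a file, returns None
DT.to_csv("out.csv", append=True)    # append to the same file
text = DT.to_csv()                   # no path: returns the CSV as a string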

datatable.Frame.to_dict()

Convert the frame into a dictionary of lists, by columns.

In Python 3.6+ the order of records in the dictionary will be the same as the order of columns in the frame.

Parameters
return
Dict[str, List]

Dictionary with .ncols records. Each record represents a single column: the key is the column’s name, and the value is the list with the column’s data.

Examples
DT = dt.Frame(A=[1, 2, 3], B=["aye", "nay", "tain"])
DT.to_dict()
{"A": [1, 2, 3], "B": ["aye", "nay", "tain"]}
See also
  • .to_list(): convert the frame into a list of lists

  • .to_tuples(): convert the frame into a list of tuples by rows

datatable.Frame.to_jay()

to_jay(path=None, method='auto')

Save this frame to a binary file on disk, in .jay format.

Parameters
path
str | None

The destination file name. Although not necessary, we recommend using extension “.jay” for the file. If the file exists, it will be overwritten. If this argument is omitted, the file will be created in memory instead, and returned as a bytes object.

method
'mmap' | 'write' | 'auto'

Which method to use for writing the file to disk. The “write” method is more portable across different operating systems, but may be slower. This parameter has no effect when path is omitted.

return
None | bytes

If the path parameter is given, this method returns nothing. However, if path was omitted, the return value is a bytes object containing encoded frame’s data.
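
Examples

A minimal sketch (the file name is hypothetical):

DT = dt.Frame(A=[1, 2, 3])
DT.to_jay("data.jay")    # write to disk, returns None
blob = DT.to_jay()       # no path: returns the encoded frame as bytes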

datatable.Frame.to_list()

Convert the frame into a list of lists, by columns.

Parameters
return
List[List]

A list of .ncols lists, each inner list representing one column of the frame.

Examples
DT = dt.Frame(A=[1, 2, 3], B=["aye", "nay", "tain"])
DT.to_list()
[[1, 2, 3], ["aye", "nay", "tain"]]
dt.Frame(id=range(10)).to_list()
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]

datatable.Frame.to_numpy()

Convert frame into a 2D numpy array, optionally forcing it into the specified type.

In a limited set of circumstances the returned numpy array will be created as a data view, avoiding copying the data. This happens if all of these conditions are met:

  • the frame has only 1 column, which is not virtual;

  • the column’s type is not string;

  • the type argument was not used.

In all other cases the returned numpy array will have a copy of the frame’s data. If the frame has multiple columns of different stypes, then the values will be upcast into the smallest common stype.

If the frame has any NA values, then the returned numpy array will be an instance of numpy.ma.masked_array.

Parameters
type
Type | <type-like>

Cast frame into this type before converting it into a numpy array. Here “type-like” can be any value that is acceptable to the dt.Type constructor.

column
int

Convert a single column instead of the whole frame. This column index can be negative, indicating columns counted from the end of the frame.

return
numpy.ndarray | numpy.ma.core.MaskedArray

The returned array will be 2-dimensional with the same .shape as the original frame. However, if the option column was used, then the returned array will be 1-dimensional with the length of .nrows.

A masked array is returned if the frame contains NA values but the corresponding numpy array does not support NAs.

except
ImportError

If the numpy module is not installed.
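
Examples

A minimal sketch, assuming numpy is installed:

DT = dt.Frame(A=[1, 2, None], B=[0.5, 0.6, 0.7])
arr = DT.to_numpy()           # 2D array; masked, because of the NA value
col = DT.to_numpy(column=0)   # 1D array with the data of column A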

datatable.Frame.to_pandas()

Convert this frame into a pandas DataFrame.

If the frame being converted has one or more key columns, those columns will become the index in the pandas DataFrame.

Parameters
return
pandas.DataFrame

Pandas dataframe of shape (nrows, ncols-nkeys).

except
ImportError

If the pandas module is not installed.
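
Examples

A minimal sketch, assuming pandas is installed:

DT = dt.Frame(A=[1, 2, 3], B=["x", "y", "z"])
pdf = DT.to_pandas()     # a pandas DataFrame with columns "A" and "B"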

datatable.Frame.to_tuples()

Convert the frame into a list of tuples, by rows.

Parameters
return
List[Tuple]

Returns a list having .nrows tuples, where each tuple has length .ncols and contains data from each respective row of the frame.

Examples
DT = dt.Frame(A=[1, 2, 3], B=["aye", "nay", "tain"])
DT.to_tuples()
[(1, "aye"), (2, "nay"), (3, "tain")]

datatable.Frame.type

Added in version 1.0.0

The common dt.Type for all columns.

This property is well-defined only for frames where all columns have the same type.

Parameters
return
Type | None

For frames where all columns have the same type, this common type is returned. If a frame has 0 columns, None will be returned.

except
InvalidOperationError

This exception will be raised if the columns in the frame have different types.

See also
  • .types – list of types for all columns.

datatable.Frame.types

Added in version 1.0.0

The list of Types for each column of the frame.

Parameters
return
List[Type]

The length of the list is the same as the number of columns in the frame.

See also
  • .type – common type for all columns

datatable.Frame.view()

Warning

This function is currently not working properly. [#2669]

datatable.ltype

class
ltype
Deprecated since version 1.0.0

This class is deprecated and will be removed in version 1.2.0. Please use dt.Type instead.

Enumeration of possible “logical” types of a column.

Logical type is the type stripped away from the details of its physical storage. For example, ltype.int represents an integer. Under the hood, this integer can be stored in several “physical” formats: from stype.int8 to stype.int64. Thus, there is a one-to-many relationship between ltypes and stypes.

Values

The following ltype values are currently available:

  • ltype.bool

  • ltype.int

  • ltype.real

  • ltype.str

  • ltype.time

  • ltype.obj

Methods

ltype(x)

Find ltype corresponding to value x.

.stypes()

The list of dt.stypes that correspond to this ltype.

Examples

dt.ltype.bool
ltype.bool
dt.ltype("int32")
ltype.int

For each ltype, you can find the set of stypes that correspond to it:

dt.ltype.real.stypes
[stype.float32, stype.float64]
dt.ltype.time.stypes
[]

datatable.ltype.__new__()

Find an ltype corresponding to value.

This method is similar to dt.stype.__new__(), except that it returns an ltype instead of an stype.

datatable.ltype.stypes()

List of stypes that represent this ltype.

Parameters
return
List[stype]

datatable.Namespace

class
Namespace

A namespace is an environment that provides lazy access to columns of a frame when performing computations within DT[i,j,...].

This class should not be instantiated directly, instead use the singleton instances f and g exported from the datatable module.

Special methods

.__getattribute__(attr)

Access columns as attributes.

.__getitem__(item)

Access columns by their names / indices.

datatable.Namespace.__getitem__()

Retrieve column(s) by their indices/names/types.

By “retrieve” we actually mean that an expression is created such that when that expression is used within the DT[i,j] call, it would locate and return the specified column(s).

Parameters
items
int | str | slice | None | type | stype | ltype | list | tuple

The column selector:

int

Retrieve the column at the specified index. For example, f[0] denotes the first column, while f[-1] is the last.

str

Retrieve a column by name.

slice

Retrieve a slice of columns from the namespace. Both integer and string slices are supported.

Note that for string slicing, both the start and stop column names are included, unlike integer slicing, where the stop value is not included. Have a look at the examples below for more clarity.

None

Retrieve no columns (an empty columnset).

type | stype | ltype

Retrieve columns matching the specified type.

list/tuple

Retrieve columns matching the column names/column positions/column types within the list/tuple.

For example, f[0, -1] will return the first and last columns. Have a look at the examples below for more clarity.

return
FExpr

An expression that selects the specified column from a frame.

Notes
Changed in version 1.0.0

f-expressions containing a list/tuple of column names/column positions/column types are accepted within the j selector.

Examples
from datatable import dt, f, by
df = dt.Frame({'A': [1, 2, 3, 4],
               'B': ["tolu", "sammy", "ogor", "boondocks"],
               'C': [9.0, 10.0, 11.0, 12.0]})
df
   A      B          C
   int32  str32      float64
0  1      tolu       9
1  2      sammy      10
2  3      ogor       11
3  4      boondocks  12

Select by column position:

df[:, f[0]]
   A
   int32
0  1
1  2
2  3
3  4

Select by column name:

df[:, f["A"]]
A
int32
01
12
23
34

Select a slice:

df[:, f[0 : 2]]
   A      B
   int32  str32
0  1      tolu
1  2      sammy
2  3      ogor
3  4      boondocks

Slicing with column names:

df[:, f["A" : "C"]]
ABC
int32str32float64
01tolu9
12sammy10
23ogor11
34boondocks12

Note

For string slicing, both the start and stop are included; for integer slicing the stop is not included.

Select by data type:

df[:, f[dt.str32]]
   B
   str32
0  tolu
1  sammy
2  ogor
3  boondocks
df[:, f[float]]
   C
   float64
0  9
1  10
2  11
3  12

Select a list/tuple of columns by position:

df[:, f[0, 1]]
   A      B
   int32  str32
0  1      tolu
1  2      sammy
2  3      ogor
3  4      boondocks

Or by column names:

df[:, f[("A", "B")]]
AB
int32str32
01tolu
12sammy
23ogor
34boondocks

Note that in the code above, the parentheses are unnecessary, since tuples in python are defined by the presence of a comma. So the below code works as well:

df[:, f["A", "B"]]
AB
int32str32
01tolu
12sammy
23ogor
34boondocks

Select a list/tuple of data types:

df[:, f[int, float]]
   A      C
   int32  float64
0  1      9
1  2      10
2  3      11
3  4      12

Passing None within an f-expression returns an empty columnset:

df[:, f[None]]
0
1
2
3

datatable.Namespace.__getattribute__()

Retrieve a column from the namespace by name.

This is a convenience form that can be used to access simply-named columns. For example: f.Age denotes a column called "Age", and is exactly equivalent to f['Age'].

Parameters
name
str

Name of the column to select.

return
FExpr

An expression that selects the specified column from a frame.


datatable.stype

class
stype
Deprecated since version 1.0.0

This class is deprecated and will be removed in version 1.2.0. Please use dt.Type instead.

Enumeration of possible “storage” types of columns in a Frame.

Each column in a Frame is a vector of values of the same type. We call this column’s type the “stype”. Most stypes correspond to primitive C types, such as int32_t or double. However some stypes (corresponding to strings and categoricals) have a more complicated underlying structure.

Notably, datatable does not support arbitrary structures as elements of a Column, so the set of stypes is small.

Values

The following stype values are currently available:

  • stype.bool8

  • stype.int8

  • stype.int16

  • stype.int32

  • stype.int64

  • stype.float32

  • stype.float64

  • stype.str32

  • stype.str64

  • stype.obj64

They are available either as properties of the dt.stype class, or directly as constants in the dt. namespace. For example:

dt.stype.int32
stype.int32
dt.int64
stype.int64

Methods

stype(x)

Find stype corresponding to value x.

<stype>(col)

Cast a column into the specific stype.

.ctype

ctypes type corresponding to this stype.

.dtype

numpy dtype corresponding to this stype.

.ltype

dt.ltype corresponding to this stype.

.struct

struct string corresponding to this stype.

.min

The smallest numeric value for this stype.

.max

The largest numeric value for this stype.

datatable.stype.__call__()

Cast column col into the new stype.

An stype can be used as a function that converts columns into that specific stype. In the same way as you could write int(3.14) in Python to convert a float value into integer, you can likewise write dt.int32(f.A) to convert column A into stype int32.

Parameters
col
FExpr

A single- or multi-column expression. All columns will be converted into the desired stype.

return
FExpr

Expression that converts its inputs into the current stype.

Examples
from datatable import dt, f
df = dt.Frame({'A': ['1', '1', '2', '1', '2'],
               'B': [None, '2', '3', '4', '5'],
               'C': [1, 2, 1, 1, 2]})
df
   A      B      C
   str32  str32  int32
0  1      NA     1
1  1      2      2
2  2      3      1
3  1      4      1
4  2      5      2

Convert column A from string stype to integer stype:

df[:, dt.int32(f.A)]
   A
   int32
0  1
1  1
2  2
3  1
4  2

Convert multiple columns to different stypes:

df[:, [dt.int32(f.A), dt.str32(f.C)]]
   A      C
   int32  str32
0  1      1
1  1      2
2  2      1
3  1      1
4  2      2
See Also
  • dt.as_type() – equivalent method of casting a column into another stype.

datatable.stype.__new__()

Find an stype corresponding to value.

This method is called when you attempt to construct a new dt.stype object, for example dt.stype(int). Instead of actually creating any new stypes, we return one of the existing stype values.

Parameters
value
str | type | np.dtype

An object that will be converted into an stype. This could be a string such as "integer" or "int" or "int8", a python type such as bool or float, or a numpy dtype.

return
stype

A dt.stype that corresponds to the input value.

except
ValueError

Raised if value does not correspond to any stype.

Examples
dt.stype(str)
stype.str64
dt.stype("double")
stype.float64
dt.stype(numpy.dtype("object"))
stype.obj64
dt.stype("int64")
stype.int64

datatable.stype.ctype

ctypes class that describes the C-level type of each element in a column with this stype.

For non-fixed-width columns (such as str32) this will return the ctype of only the fixed-width component of that column. Thus, stype.str32.ctype == ctypes.c_int32.

datatable.stype.dtype

numpy.dtype object that corresponds to this stype.

datatable.stype.ltype

dt.ltype corresponding to this stype. Several stypes may map to the same ltype, whereas each stype is described by exactly one ltype.

datatable.stype.max

The largest finite value that this stype can represent.

datatable.stype.min

The smallest finite value that this stype can represent.

datatable.stype.struct

struct format string corresponding to this stype.

For non-fixed-width columns (such as str32) this will return the format string of only the fixed-width component of that column. Thus, stype.str32.struct == '=i'.

datatable.Type

Added in version 1.0.0

Type of data stored in a single column of a Frame.

The type describes both the logical meaning of the data (i.e. an integer, a floating point number, a string, etc.), as well as storage requirement of that data (the number of bits per element). Some types may carry additional properties, such as a timezone or precision.

Note

This class replaces the previous dt.stype and dt.ltype.

Properties

.max

The maximum value for this type

.min

The minimum value for this type

.name

The name of this type

datatable.Type.bool8

The type of a column with boolean data.

In a column of this type each data element is stored as 1 byte. NA values are also supported.

The boolean type is considered numeric, where True is 1 and False is 0.

Examples
DT = dt.Frame([True, False, None])
DT.type
Type.bool8
DT
   C0
   bool8
0  1
1  0
2  NA

datatable.Type.date32

Added in version 1.0.0

The date32 type represents a particular calendar date without a time component. Internally, this type is stored as a 32-bit signed integer counting the number of days since 1970-01-01 (“the epoch”). Thus, this type accommodates dates within the range of approximately ±5.8 million years.

The calendar used for this type is proleptic Gregorian, meaning that it extends the modern-day Gregorian calendar into the past before this calendar was first adopted, and into the future, long after it will have been replaced.

This type corresponds to datetime.date in Python, pa.date32() in pyarrow, and np.dtype('<M8[D]') in numpy.

Note

Python’s datetime.date object can accommodate dates from year 1 to year 9999, which is much smaller than what our date32 type allows. As a consequence, when date32 values that are outside of year range 1-9999 are converted to python, they become integers instead of datetime.date objects.

For the same reason the .min and .max properties of this type also return integers.

Examples
from datetime import date
DT = dt.Frame([date(2020, 1, 30), date(2020, 3, 11), None, date(2021, 6, 15)])
DT.type
Type.date32
DT
   C0
   date32
0  2020-01-30
1  2020-03-11
2  NA
3  2021-06-15
dt.Type.date32.min
-2147483647
dt.Type.date32.max
2146764179
dt.Frame([dt.Type.date32.min, date(2021, 6, 15), dt.Type.date32.max],
         stype='date32')
   C0
   date32
0  -5877641-06-24
1  2021-06-15
2  5879610-09-09

datatable.Type.float32

Single-precision floating point type. This corresponds to C type float. Each element of this type is 4 bytes long.

datatable.Type.float64

Double-precision IEEE-754 floating point type. This corresponds to C type double. Each element of this type is 8 bytes long.

datatable.Type.int8

Integer type that uses 1 byte per data element and can store values in the range -127 .. 127.

This type corresponds to int8_t in C/C++, int in python, np.dtype('int8') in numpy, and pa.int8() in pyarrow.

Most arithmetic operations involving this type will produce a result of type int32, which follows the convention of the C language.

Examples
dt.Type.int8
Type.int8
dt.Type('int8')
Type.int8
dt.Frame([1, 0, 1, 1, 0]).types
[Type.int8]

datatable.Type.int16

Integer type, corresponding to int16_t in C. This type uses 2 bytes per data element, and can store values in the range -32767 .. 32767.

Most arithmetic operations involving this type will produce a result of type int32, which follows the convention of the C language.

datatable.Type.int32

Integer type, corresponding to int32_t in C. This type uses 4 bytes per data element, and can store values in the range -2,147,483,647 .. 2,147,483,647.

This is the most common type for handling integer data. When a python list of integers is converted into a Frame, a column of this type will usually be created.

Examples
DT = dt.Frame([None, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55])
DT
    C0
    int32
 0  NA
 1  1
 2  1
 3  2
 4  3
 5  5
 6  8
 7  13
 8  21
 9  34
10  55

datatable.Type.int64

Integer type which corresponds to int64_t in C. This type uses 8 bytes per data element, and can store values in the range -(2**63-1) .. (2**63-1).

datatable.Type.obj64

This type can be used to store arbitrary Python objects.

datatable.Type.str32

The type of a column with string data.

Internally, this column stores data using 2 buffers: a character buffer, where all values in the column are stored together as a single large concatenated string in UTF8 encoding, and an int32 array of offsets into the character buffer. Consequently, this type can only store up to 2Gb of total character data per column.

Whenever any operation on a string column exceeds the 2Gb limit, this type will be silently replaced with dt.Type.str64.

A virtual column that produces string data may have either str32 or str64 type regardless of how it stores its data.

This column converts to str type in Python, pa.string() in pyarrow, and dtype('object') in numpy and pandas.

Examples
DT = dt.Frame({"to persist": ["one teaspoon", "at a time,", "the rain turns", "mountains", "into valleys"]}) DT
to persist
str32
0one teaspoon
1at a time,
2the rain turns
3mountains
4into valleys

datatable.Type.str64

String type, where the offsets buffer uses 64-bit integers.

datatable.Type.time64

Added in version 1.0.0

The time64 type is used to represent a specific moment in time. This corresponds to datetime in Python, or timestamp in Arrow or pandas. Internally, this type is stored as a 64-bit integer containing the number of nanoseconds since the epoch (Jan 1, 1970) in UTC.

This type is not leap-seconds aware, meaning that it assumes that each day has exactly 24×3600 seconds. In practice it means that calculating time difference between two time64 moments may be off by the number of leap seconds that have occurred between them.

Currently, the time64 type is not timezone-aware; support for time zones is planned for the next release.

A time64 column converts into datetime.datetime objects in python, a pa.timestamp('ns') type in pyarrow and dtype('datetime64[ns]') in numpy and pandas.

Examples
DT = dt.Frame(["2018-01-31 03:16:57", "2021-06-15 15:44:23.951", None, "1965-11-25 19:29:00"]) DT[0] = dt.Type.time64 DT
C0
time64
02018-01-31T03:16:57
12021-06-15T15:44:23.951
2NA
31965-11-25T19:29:00
dt.Type.time64.min
datetime.datetime(1677, 9, 22, 0, 12, 43, 145225)
dt.Type.time64.max
datetime.datetime(2262, 4, 11, 23, 47, 16, 854775)

datatable.Type.void

The type of a column where all values are NAs.

In datatable, any column can have NA values in it. There is, however, a special type that can be assigned to a column where all values are NAs: void. This type’s special property is that it can be used in any place where another type would be expected.

A column of this type does not occupy any storage space. Unlike other types, it does not use the validity buffer either: all values are known to be invalid.

It converts into pyarrow’s pa.null() type, or '|V0' dtype in numpy.

Examples
DT = dt.Frame([None, None, None])
DT.type
Type.void
DT
   C0
   void
0  NA
1  NA
2  NA

datatable.Type.max

The largest finite value that this type can represent, if applicable.

Parameters
return
Any

The type of the returned value corresponds to the Type object: an int for integer types, a float for floating-point types, etc. If the type has no well-defined max value then None is returned.

Examples
dt.Type.int32.max
2147483647
dt.Type.float64.max
1.7976931348623157e+308
dt.Type.date32.max
2146764179
See also

.min – the smallest value for the type.

datatable.Type.min

The smallest finite value that this type can represent, if applicable.

Parameters
return
Any

The type of the returned value corresponds to the Type object: an int for integer types, a float for floating-point types, etc. If the type has no well-defined min value then None is returned.

Examples
dt.Type.int8.min
-127
dt.Type.float32.min
-3.4028234663852886e+38
dt.Type.date32.min
-2147483647
See also

.max – the largest value for the type.

datatable.Type.name

Return the canonical name of this type, as a string.

Examples
dt.Type.int64.name
'int64'
dt.Type(np.bool_).name
'bool8'

datatable.as_type()

Added in version 1.0

Convert columns cols into the prescribed stype.

This function does not modify the data in the original column. Instead it returns a new column which converts the values into the new type on the fly.

Parameters

cols
FExpr

Single or multiple columns that need to be converted.

new_type
Type | stype

Target type.

return
FExpr

The output will have the same number of rows and columns as the input; column names will be preserved too.

Examples

from datatable import dt, f, as_type
df = dt.Frame({'A': ['1', '1', '2', '1', '2'],
               'B': [None, '2', '3', '4', '5'],
               'C': [1, 2, 1, 1, 2]})
df
   A      B      C
   str32  str32  int32
0  1      NA     1
1  1      2      2
2  2      3      1
3  1      4      1
4  2      5      2

Convert column A from string to integer type:

df[:, as_type(f.A, int)]
   A
   int64
0  1
1  1
2  2
3  1
4  2

The exact type can be specified:

df[:, as_type(f.A, dt.Type.int32)]
   A
   int32
0  1
1  1
2  2
3  1
4  2

Convert multiple columns to different types:

df[:, [as_type(f.A, int), as_type(f.C, dt.str32)]]
   A      C
   int64  str32
0  1      1
1  1      2
2  2      1
3  1      1
4  2      2

datatable.build_info

build_info

This is a python struct that contains information about the installed datatable module. The following fields are available:

.version
str

The version string of the current build. Several formats of the version string are possible:

  • {MAJOR}.{MINOR}.{MICRO} – the release version string, such as "0.11.0".

  • {RELEASE}a{DEVNUM} – version string for the development build of datatable, where {RELEASE} is the normal release string and {DEVNUM} is an integer that is incremented with each build. For example: "0.11.0a1776".

  • {RELEASE}a0+{SUFFIX} – version string for a PR build of datatable, where the {SUFFIX} is formed from the PR number and the build sequence number. For example, "0.11.0a0+pr2602.13".

  • {RELEASE}a0+{FLAVOR}.{TIMESTAMP}.{USER} – version string used for local builds. This contains the “flavor” of the build, such as normal build, or debug, or coverage, etc; the unix timestamp of the build; and lastly the system user name of the user who made the build.

.build_date
str

UTC timestamp (date + time) of the build.

.build_mode
str

The type of datatable build. Usually this will be "release", but may also be "debug" if datatable was built in debug mode. Other build modes exist, or may be added in the future.

.compiler
str

The version of the compiler used to build the C++ datatable extension. This will include both the name and the version of the compiler.

.git_revision
str

Git-hash of the revision from which the build was made, as obtained from git rev-parse HEAD.

.git_branch
str

Name of the git branch from where the build was made. This will be obtained from environment variable CHANGE_BRANCH if defined, or from command git rev-parse --abbrev-ref HEAD otherwise.

.git_date
str
Added in version 0.11

Timestamp of the git commit from which the build was made.

.git_diff
str
Added in version 0.11

If the source tree contains any uncommitted changes (compared to the checked out git revision), then the summary of these changes will be in this field, as reported by git diff HEAD --stat --no-color. Otherwise, this field is an empty string.
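
Examples

A minimal sketch of inspecting the build information:

import datatable as dt
dt.build_info.version      # e.g. '1.0.0'
dt.build_info.build_mode   # usually 'release'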

datatable.by()

Group-by clause for use in Frame’s square-bracket selector.

Whenever a by() object is present inside a DT[i, j, ...] expression, it causes all other expressions to be evaluated in group-by mode. This mode makes the following changes to the evaluation semantics:

  • A “Groupby” object will be computed for the frame DT, grouping it by columns specified as the arguments to the by() call. This object keeps track of which rows of the frame belong to which group.

  • If an i expression is present (row filter), it will be interpreted within each group. For example, if i is a slice, then the slice will be applied separately to each group. Similarly, if i expression contains a formula with reduce functions, then those functions will be evaluated for each group. For example:

    DT[f.A == max(f.A), :, by(f.group_id)]

    will select those rows where column A reaches its peak value within each group (there could be multiple such rows within each group).

  • Before j is evaluated, the by() clause adds all its columns at the start of j (unless add_columns argument is False). If j is a “select-all” slice (i.e. :), then those columns will also be excluded from the list of all columns so that they will be present in the output only once.

  • During evaluation of j, the reducer functions, such as min(), sum(), etc, will be evaluated by-group, that is they will find the minimal value in each group, the sum of values in each group, and so on. If a reducer expression is combined with a regular column expression, then the reduced column will be auto-expanded into a column that is constant within each group.

  • Note that if both i and j contain reducer functions, then those functions will have a slightly different notion of groups: the reducers in i will see each group “in full”, whereas the reducers in j will see each group after it was filtered by the expression in i (and possibly not even see some of the groups at all, if they were filtered out completely).

  • If j contains only reducer expressions, then the final result will be a Frame containing just a single row for each group. This resulting frame will also be keyed by the grouped-by columns.

The by() function expects a single column or a sequence of columns as the argument(s). It accepts either a column name, or an f-expression. In particular, you can perform a group-by on a dynamically computed expression:

DT[:, :, by(dt.math.floor(f.A/100))]

The default behavior of groupby is to sort the groups in the ascending order, with NA values appearing before any other values. As a special case, if you group by an expression -f.A, then it will be treated as if you requested to group by the column “A” sorting it in the descending order. This will work even with column types that are not arithmetic, for example “A” could be a string column here.

Examples

from datatable import dt, f, by
df = dt.Frame({"group1": ["A", "A", "B", "B", "A"],
               "group2": [1, 0, 1, 1, 1],
               "var1": [343, 345, 567, 345, 212]})
df
   group1  group2  var1
   str32   int8    int32
0  A       1       343
1  A       0       345
2  B       1       567
3  B       1       345
4  A       1       212

Group by a single column:

df[:, dt.count(), by("group1")]
group1count
str32int64
0A3
1B2

Group by multiple columns:

df[:, dt.sum(f.var1), by("group1", "group2")]
group1group2var1
str32int8int64
0A0345
1A1555
2B1912

Return grouping result without the grouping column(s) by setting the add_columns parameter to False:

df[:, dt.sum(f.var1), by("group1", "group2", add_columns=False)]
var1
int64
0345
1555
2912

f-expressions can be passed to by():

df[:, dt.count(), by(f.var1 < 400)]
C0count
bool8int64
001
114

By default, the groups are sorted in ascending order. The inverse is possible by negating the f-expressions in by():

df[:, dt.count(), by(-f.group1)]
   group1  count
   str32   int64
0  B       2
1  A       3

An integer can be passed to the i section:

df[0, :, by("group1")]
group1group2var1
str32int8int32
0A1343
1B1567

A slice is also acceptable within the i section:

df[-1:, :, by("group1")]
group1group2var1
str32int8int32
0A1212
1B1345

Note

f-expressions are not yet supported in the i section of a groupby. Also, a sequence cannot be passed to the i section in the presence of by().

See Also

datatable.cbind()

Create a new Frame by appending columns from several frames.

This function is exactly equivalent to:

dt.Frame().cbind(*frames, force=force)

Parameters

frames
Frame | List[Frame] | None
force
bool
return
Frame

See also

  • rbind() – function for row-binding several frames.

  • dt.Frame.cbind() – Frame method for cbinding some frames to another.

Examples

from datatable import dt, f

DT = dt.Frame(A=[1, 2, 3], B=[4, 7, 0])
DT

    A      B
    int32  int32
 0  1      4
 1  2      7
 2  3      0

frame1 = dt.Frame(N=[-1, -2, -5])
frame1

    N
    int32
 0  -1
 1  -2
 2  -5

dt.cbind([DT, frame1])

    A      B      N
    int32  int32  int32
 0  1      4      -1
 1  2      7      -2
 2  3      0      -5

If the numbers of rows are not equal, you can force the binding by setting the force parameter to True:

frame2 = dt.Frame(N=[-1, -2, -5, -20])
frame2

    N
    int32
 0  -1
 1  -2
 2  -5
 3  -20

dt.cbind([DT, frame2], force=True)

    A      B      N
    int32  int32  int32
 0  1      4      -1
 1  2      7      -2
 2  3      0      -5
 3  NA     NA     -20

datatable.corr()

Calculate the Pearson correlation between col1 and col2.

Parameters

col1, col2
Expr

Input columns.

return
Expr

f-expression having one row, one column and the correlation coefficient as the value. If one of the columns is non-numeric, the value is NA. The column stype is float32 if both col1 and col2 are float32, and float64 in all the other cases.

Examples

from datatable import dt, f

DT = dt.Frame(A=[0, 1, 2, 3], B=[0, 2, 4, 6])
DT

    A      B
    int32  int32
 0  0      0
 1  1      2
 2  2      4
 3  3      6

DT[:, dt.corr(f.A, f.B)]

    C0
    float64
 0  1

See Also

  • cov() – function to calculate covariance between two columns.

datatable.count()

Calculate the number of non-missing values for each column from cols.

Parameters

cols
Expr

Input columns.

return
Expr

f-expression having one row, and the same names and number of columns as in cols. All the returned column stypes are int64.

except
TypeError

The exception is raised when one of the columns from cols has a non-numeric and non-string type.

See Also

  • sum() – function to calculate the sum of values.

Examples

from datatable import dt, f

df = dt.Frame({'A': [1, 1, 2, 1, 2],
               'B': [None, 2, 3, 4, 5],
               'C': [1, 2, 1, 1, 2]})
df

    A      B      C
    int32  int32  int32
 0  1      NA     1
 1  1      2      2
 2  2      3      1
 3  1      4      1
 4  2      5      2

Get the count of all rows:

df[:, dt.count()]

    count
    int32
 0  5

Get the count of column B (note how the null row is excluded from the count result):

df[:, dt.count(f.B)]

    B
    int64
 0  4

datatable.cov()

Calculate covariance between col1 and col2.

Parameters

col1, col2
Expr

Input columns.

return
Expr

f-expression having one row, one column and the covariance between col1 and col2 as the value. If one of the input columns is non-numeric, the value is NA. The output column stype is float32 if both col1 and col2 are float32, and float64 in all the other cases.

Examples

from datatable import dt, f

DT = dt.Frame(A=[0, 1, 2, 3], B=[0, 2, 4, 6])
DT

    A      B
    int32  int32
 0  0      0
 1  1      2
 2  2      4
 3  3      6

DT[:, dt.cov(f.A, f.B)]

    C0
    float64
 0  3.33333

See Also

  • corr() – function to calculate correlation between two columns.

datatable.cut()

Added in version 0.11

For each column from cols, bin its values into equal-width intervals when nbins is specified, or into arbitrary-width intervals when interval edges are provided as bins.

Parameters

cols
FExpr

Input data for equal-width interval binning.

nbins
int | List[int]

When a single number is specified, this number of bins will be used to bin each column from cols. When a list or a tuple is provided, each column will be binned by using its own number of bins. In the latter case, the list/tuple length must be equal to the number of columns in cols.

bins
List[Frame]

List/tuple of single-column frames containing interval edges in strictly increasing order, that will be used for binning of the corresponding columns from cols. The list/tuple length must be equal to the number of columns in cols.

right_closed
bool

Each binning interval is half-open. This flag indicates whether the right edge of the interval is closed, or not.

return
FExpr

f-expression that converts input columns into the columns filled with the respective bin ids.
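
Examples

A minimal sketch of equal-width binning (the frame here is invented for illustration):

from datatable import dt, f

DT = dt.Frame(A=[1, 3, 5, 7, 9])
# split the range of A into two equal-width intervals; every value
# is replaced with the id of the bin it falls into (0 or 1)
DT[:, dt.cut(f.A, nbins=2)]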

See also

qcut() – function for equal-population binning.

datatable.dt

This is the datatable module itself.

The purpose of exporting this symbol is so that you can easily import all the things you need from the datatable module in one go:

from datatable import dt, f, g, by, join, mean

Note: while it is possible to write

test = dt.dt.dt.dt.dt.dt.dt.dt.dt.fread('test.jay')
train = dt.dt.dt.dt.dt.dt.dt.dt.dt.dt.dt.dt.dt.fread('train.jay')

we do not in fact recommend doing so (except possibly on April 1st).

datatable.f

The main Namespace object.

The function of this object is that during the evaluation of a DT[i,j] call, the variable f represents the columns of frame DT.

Specifically, within expression DT[i, j] the following is true:

  • f.A means “column A” of frame DT;

  • f[2] means “3rd column” of frame DT;

  • f[int] means “all integer columns” of DT;

  • f[:] means “all columns” of DT.
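
For example, assuming a small ad-hoc frame:

from datatable import dt, f

DT = dt.Frame(A=[1, 2], B=[0.5, 1.5], C=["x", "y"])
DT[:, f.A]      # select column A
DT[:, f[2]]     # select the 3rd column, i.e. C
DT[:, f[int]]   # select all integer columns, i.e. A
DT[:, f[:]]     # select all columns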

See also

  • g – namespace for joined frames.

datatable.first()

Return the first row for each column from cols.

Parameters

cols
Expr

Input columns.

return
Expr

f-expression having one row, and the same names, stypes and number of columns as in cols.

Examples

first() returns the first column in a frame:

from datatable import dt, f, by, sort, first

df = dt.Frame({"A": [1, 1, 2, 1, 2], "B": [None, 2, 3, 4, 5]})
df

    A  B
 0  1  NA
 1  1  2
 2  2  3
 3  1  4
 4  2  5

dt.first(df)

    A
 0  1
 1  1
 2  2
 3  1
 4  2

Within a frame, it returns the first row:

df[:, first(f[:])]

    A  B
 0  1  NA

Of course, you can replicate this by passing 0 to the i section instead:

df[0, :]

    A  B
 0  1  NA

first() comes in handy if you wish to get the first non-null value in a column:

df[f.B != None, first(f.B)]

    B
 0  2

first() returns the first row per group in a by() operation:

df[:, first(f[:]), by("A")]

    A  B
 0  1  NA
 1  2  3

To get the first non-null value per group in a by() operation, you can use the sort() function and set the na_position argument to "last":

df[:, first(f[:]), by("A"), sort("B", na_position="last")]

    A  B
 0  1  2
 1  2  3

See Also

  • last() – function that returns the last row.

datatable.fread()

fread(anysource=None, *, file=None, text=None, cmd=None, url=None,
      columns=None, sep=None, dec=".", max_nrows=None, header=None,
      na_strings=None, verbose=False, fill=False, encoding=None,
      quotechar='"', tempdir=None, nthreads=None, logger=None)

This function is capable of reading data from a variety of input formats, producing a Frame as the result. The recognized formats are: CSV, Jay, XLSX, and plain text. In addition, the data may be inside an archive such as .tar, .gz, .zip, .bz2, or .tgz.

Parameters

anysource
str | bytes | file | Pathlike | List

The first (unnamed) argument to fread is the input source. Multiple types of sources are supported, and they can be named explicitly: file, text, cmd, and url. When the source is not named, fread will attempt to guess its type. The most common type is file, but sometimes the argument is resolved as text (if the string contains newlines) or url (if the string starts with https:// or similar).

Only one argument out of anysource, file, text, cmd or url can be specified at once.

file
str | file | Pathlike

A file source can be either the name of the file on disk, or a python “file-like” object – i.e. any object having method .read().

Generally, specifying a file name should be preferred, since reading from a Python file can only be done in single-threaded mode.

This argument also supports addressing files inside an archive, or sheets inside an Excel workbook. Simply write the name of the file as if the archive was a folder: "data.zip/train.csv".

text
str | bytes

Instead of reading data from file, this argument provides the data as a simple in-memory blob.

cmd
str

A command that will be executed in the shell and its output then read as text.

url
str

This parameter can be used to specify the URL of the input file. The data will first be downloaded into a temporary directory and then read from there. In the end the temporary files will be removed.

We use the standard urllib.request module to download the data. Changing the settings of that module, for example installing proxy, password, or cookie managers will allow you to customize the download process.

columns
...

Limit which columns to read from the input file.

sep
str | None

Field separator in the input file. If this value is None (default) then the separator will be auto-detected. Otherwise it must be a single-character string. When sep='\n', then the data will be read in single-column mode. Characters ["'`0-9a-zA-Z] are not allowed as the separator, as well as any non-ASCII characters.

dec
"." | ","

Decimal point symbol for floating-point numbers.

max_nrows
int

The maximum number of rows to read from the file. Setting this parameter to any negative number is equivalent to having no limit at all. Currently this parameter doesn’t always work correctly.

na_strings
List[str]

The list of strings that were used in the input file to represent NA values.

fill
bool

If True then the lines of the CSV file are allowed to have an uneven number of fields. All missing fields will be filled with NAs in the resulting frame.

encoding
str | None

If this parameter is provided, then the input will be recoded from this encoding into UTF-8 before reading. Any encoding registered with the python codec module can be used.

skip_to_string
str | None

Start reading the file from the line containing this string. All previous lines will be skipped and discarded. This parameter cannot be used together with skip_to_line.

skip_to_line
int

If this setting is given, then this many lines in the file will be skipped before we start parsing it. This can be used, for example, when the first several lines in the file contain non-CSV data and therefore must be skipped. This parameter cannot be used together with skip_to_string.

skip_blank_lines
bool

If True, then any empty lines in the input will be skipped. If this parameter is False then: (a) in single-column mode empty lines are kept as empty lines; otherwise (b) if fill=True then empty lines produce a single line filled with NAs in the output; otherwise (c) a dt.exceptions.IOError is raised.

strip_whitespace
bool

If True, then the leading/trailing whitespace will be stripped from unquoted string fields. Whitespace is always skipped from numeric fields.

quotechar
'"' | "'" | "`"

The character that was used to quote fields in the CSV file. By default the double-quote mark '"' is assumed.

tempdir
str | None

Use this directory for storing temporary files as needed. If not provided then the system temporary directory will be used, as determined via the tempfile Python module.

nthreads
int | None

Number of threads to use when reading the file. This number cannot exceed the number of threads in the pool dt.options.nthreads. If 0 or negative number of threads is requested, then it will be treated as that many threads less than the maximum. By default all threads in the thread pool are used.

verbose
bool

If True, then print detailed information about the internal workings of fread to stdout (or to logger if provided).

logger
object

Logger object that will receive verbose information about fread’s progress. When this parameter is specified, verbose mode will be turned on automatically.

multiple_sources
"warn" | "error" | "ignore"

Action that should be taken when the input resolves to multiple distinct sources. By default ("warn"), a warning will be issued and only the first source will be read and returned as a Frame. The "ignore" action is similar, except that the extra sources will be discarded without a warning. Lastly, a dt.exceptions.IOError can be raised if the value of this parameter is "error".

If you want all sources to be read instead of only the first one then consider using iread().

memory_limit
int

Try not to exceed this amount of memory allocation (in bytes) when reading the data. This limit is advisory and not enforced very strictly.

This setting is useful when reading data from a file that is substantially larger than the amount of RAM available on your machine.

When this parameter is specified and fread sees that it needs more RAM than the limit in order to read the input file, then it will dump the data that was read so far into a temporary file in binary format. In the end the returned Frame will be partially composed of data located on disk, and partially of data in memory. It is advised to either store this data as a Jay file, or to filter and materialize the frame (otherwise the performance may be slow).

return
Frame

A single Frame object is always returned.

Changed in version 0.11.0

Previously, a dict of Frames was returned when multiple input sources were provided.

except
dt.exceptions.IOError
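
Examples

A minimal usage sketch, parsing a small CSV blob passed via the text parameter:

from datatable import dt

DT = dt.fread(text="A,B\n1,x\n2,y")
# DT is now a 2x2 Frame with an integer column A and a string column B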

See Also

  • iread() – same as fread(), but reads multiple input sources at once, returning an iterator of Frames.

datatable.g

Secondary Namespace object.

The function of this object is that during the evaluation of a DT[..., join(X)] call, the variable g represents the columns of the joined frame X. In SQL this would have been equivalent to ... JOIN tableX AS g ....
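
For example (a minimal sketch; the frames here are invented for illustration):

from datatable import dt, f, g, join

DT = dt.Frame(id=[1, 2, 1], x=[10, 20, 30])
X = dt.Frame(id=[1, 2], y=["a", "b"])
X.key = "id"
# f.x comes from DT, while g.y refers to column y of the joined frame X
DT[:, [f.x, g.y], join(X)]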

See also

  • f – main column namespace.

datatable.ifelse()

Added in version 0.11.0

An expression that chooses its value based on one or more conditions.

This is roughly equivalent to the following Python code:

result = value1 if condition1 else \
         value2 if condition2 else \
         ... else \
         default

For every row this function evaluates the smallest number of expressions necessary to get the result. Thus, it evaluates condition1, condition2, and so on until it finds the first condition that evaluates to True. It then computes and returns the corresponding value. If all conditions evaluate to False, then the default value is computed and returned.

Also, if any of the conditions produces NA then the result of the expression also becomes NA without evaluating any further conditions or values.

Parameters

condition1, condition2, ...
FExpr[bool]

Expressions each producing a single boolean column. These conditions will be evaluated in order until we find the one equal to True.

value1, value2, ...
FExpr

Values that will be used when the corresponding condition evaluates to True. These must be single columns.

default
FExpr

Value that will be used when all conditions evaluate to False. This must be a single column.

return
FExpr

The resulting expression is a single column whose stype is the common stype for all value1, …, default columns.

Notes

Changed in version 1.0.0

Earlier this function accepted a single condition only.

Examples

Single condition

Task: Create a new column Colour, where if Set is 'Z' then the value should be 'Green', else 'Red':

from datatable import dt, f, by, ifelse, update

df = dt.Frame("""Type  Set
                 A     Z
                 B     Z
                 B     X
                 C     Y""")
df[:, update(Colour = ifelse(f.Set == "Z",  # condition
                             "Green",       # if condition is True
                             "Red"))]       # if condition is False
df

    Type   Set    Colour
    str32  str32  str32
 0  A      Z      Green
 1  B      Z      Green
 2  B      X      Red
 3  C      Y      Red

Multiple conditions

Task: Create a new column value whose value is taken from columns a, b, or c – whichever is nonzero first:

df = dt.Frame({"a": [0, 0, 1, 2], "b": [0, 3, 4, 5], "c": [6, 7, 8, 9]})
df

    a      b      c
    int32  int32  int32
 0  0      0      6
 1  0      3      7
 2  1      4      8
 3  2      5      9

df['value'] = ifelse(f.a > 0, f.a,  # first condition and result
                     f.b > 0, f.b,  # second condition and result
                     f.c)           # default if no condition is True
df

    a      b      c      value
    int32  int32  int32  int32
 0  0      0      6      6
 1  0      3      7      3
 2  1      4      8      1
 3  2      5      9      2

datatable.init_styles()

init_styles()

Inject datatable’s stylesheets into the Jupyter notebook. This function does nothing when it runs in a normal Python environment outside of Jupyter.

When datatable runs in a Jupyter notebook, it renders its Frames as HTML tables. The appearance of these tables is enhanced using a custom stylesheet, which must be injected into the notebook at any point on the page. This is exactly what this function does.

Normally, this function is called automatically when datatable is imported. However, in some circumstances Jupyter erases these stylesheets (for example, if you run the import datatable cell twice). In such cases you may need to call this function manually.
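
For example, re-running the following in a notebook cell restores the styles:

import datatable as dt

dt.init_styles()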

datatable.intersect()

Find the intersection of sets of values in the frames.

Each frame should have only a single column or be empty. The values in each frame will be treated as a set, and this function will perform the intersection operation on these sets, returning those values that are present in each of the provided frames.

Parameters

*frames
Frame | Frame | ...

Input single-column frames.

return
Frame

A single-column frame. The column stype is the smallest common stype of columns in the frames.

except
ValueError | NotImplementedError

dt.exceptions.ValueError

raised when one of the input frames has more than one column.

dt.exceptions.NotImplementedError

raised when one of the columns has stype obj64.

Examples

from datatable import dt

s1 = dt.Frame([4, 5, 6, 20, 42])
s2 = dt.Frame([1, 2, 3, 5, 42])
s1

    C0
    int32
 0  4
 1  5
 2  6
 3  20
 4  42

s2

    C0
    int32
 0  1
 1  2
 2  3
 3  5
 4  42

Intersection of the two frames:

dt.intersect([s1, s2])

    C0
    int32
 0  5
 1  42

See Also

  • setdiff() – calculate the set difference between the frames.

  • symdiff() – calculate the symmetric difference between the sets of values in the frames.

  • union() – calculate the union of values in the frames.

  • unique() – find unique values in a frame.

datatable.iread()

iread(anysource=None, *, file=None, text=None, cmd=None, url=None,
      columns=None, sep=None, dec=".", max_nrows=None, header=None,
      na_strings=None, verbose=False, fill=False, encoding=None,
      quotechar='"', tempdir=None, nthreads=None, logger=None,
      errors="warn")

This function is similar to fread(), but allows reading multiple sources at once. For example, this can be used when the input is a list of files, or a glob pattern, or a multi-file archive, or multi-sheet XLSX file, etc.

Parameters

...
...

Most parameters are the same as in fread(). All parse parameters will be applied to all input files.

errors
"warn" | "raise" | "ignore" | "store"

What action to take when one of the input sources produces an error. Possible actions are: "warn" – each error is converted into a warning and emitted to the user, and the source that produced the error is skipped; "raise" – the error is raised immediately and the iteration stops; "ignore" – the erroneous sources are silently ignored; "store" – when an error is raised, it is captured and returned to the user, and the iterator continues reading the subsequent sources.

return
Iterator[Frame] | Iterator[Frame|Exception]

The returned object is an iterator that produces Frames. The iterator is lazy: each frame is read only as needed, after the previous frame was “consumed” by the user. Thus, the user can interrupt the iterator without having to read all the frames.

Each Frame produced by the iterator has a .source attribute that describes the source of each frame as best as possible. Each source depends on the type of the input: either a file name, or a URL, or the name of the file in an archive, etc.

If the errors parameter is "store" then the iterator may produce either Frames or exception objects.
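
Examples

A short sketch (the file names here are hypothetical):

from datatable import dt

# lazily read several CSV files; each iteration yields one Frame
for DT in dt.iread(["data_2019.csv", "data_2020.csv"], errors="warn"):
    print(DT.source, DT.shape)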

See Also

  • fread() – read a single input source, returning a Frame.

datatable.join()

Join clause for use in Frame’s square-bracket selector.

This clause is equivalent to the SQL JOIN, though for the moment datatable only supports left outer joins. In order to join, the frame must be keyed first, and then joined to another frame DT as:

DT[:, :, join(X)]

provided that DT has column(s) with the same name(s) as the key of frame.

Parameters

frame
Frame

An input keyed frame to be joined to the current one.

return
Join Object

In most cases the returned object is used directly in the Frame’s square-bracket selector.

except
ValueError

The exception is raised if frame is not keyed.

Examples

from datatable import dt, f, g, join, update

df1 = dt.Frame("""date        X1  X2
                  01-01-2020  H   10
                  01-02-2020  H   30
                  01-03-2020  Y   15
                  01-04-2020  Y   20""")

df2 = dt.Frame("""X1  X3
                  H   5
                  Y   10""")

First, create a key on the right frame (df2). Note that the join key (X1) has unique values and has the same name in the left frame (df1):

df2.key = "X1"

Join is now possible:

df1[:, :, join(df2)]

    date        X1     X2     X3
    str32       str32  int32  int32
 0  01-01-2020  H      10     5
 1  01-02-2020  H      30     5
 2  01-03-2020  Y      15     10
 3  01-04-2020  Y      20     10

You can refer to columns of the joined frame using the prefix g., similar to how columns of the left frame are accessed using the prefix f.:

df1[:, update(X2=f.X2 * g.X3), join(df2)]
df1

    date        X1     X2
    str32       str32  int32
 0  01-01-2020  H      50
 1  01-02-2020  H      150
 2  01-03-2020  Y      150
 3  01-04-2020  Y      200

datatable.last()

Return the last row for each column from cols.

Parameters

cols
Expr

Input columns.

return
Expr

f-expression having one row, and the same names, stypes and number of columns as in cols.

Examples

last() returns the last column in a frame:

from datatable import dt, f, by, sort, last

df = dt.Frame({"A": [1, 1, 2, 1, 2], "B": [None, 2, 3, 4, None]})
df

    A      B
    int32  int32
 0  1      NA
 1  1      2
 2  2      3
 3  1      4
 4  2      NA

dt.last(df)

    B
    int32
 0  NA
 1  2
 2  3
 3  4
 4  NA

Within a frame, it returns the last row:

df[:, last(f[:])]

    A      B
    int32  int32
 0  2      NA

The above code can be replicated by passing -1 to the i section instead:

df[-1, :]

    A      B
    int32  int32
 0  2      NA

Like first(), last() can be handy if you wish to get the last non-null value in a column:

df[f.B != None, dt.last(f.B)]

    B
    int32
 0  4

last() returns the last row per group in a by() operation:

df[:, last(f[:]), by("A")]

    A      B
    int32  int32
 0  1      4
 1  2      NA

To get the last non-null value per group in a by() operation, you can use the sort() function and set the na_position argument to "first" (this will move the NAs to the top of the column):

df[:, last(f[:]), by("A"), sort("B", na_position="first")]

    A      B
    int32  int32
 0  1      4
 1  2      3

See Also

  • first() – function that returns the first row.

datatable.max()

Calculate the maximum value for each column from cols. It is recommended to use it as dt.max() to prevent conflict with the Python built-in max() function.

Parameters

cols
Expr

Input columns.

return
Expr

f-expression having one row and the same names, stypes and number of columns as in cols.

except
TypeError

The exception is raised when one of the columns from cols has a non-numeric type.

Examples

from datatable import dt, f, by

df = dt.Frame({'A': [1, 1, 1, 2, 2, 2, 3, 3, 3],
               'B': [3, 2, 20, 1, 6, 2, 3, 22, 1]})
df

    A      B
    int32  int32
 0  1      3
 1  1      2
 2  1      20
 3  2      1
 4  2      6
 5  2      2
 6  3      3
 7  3      22
 8  3      1

Get the maximum from column B:

df[:, dt.max(f.B)]

    B
    int32
 0  22

Get the maximum of all columns:

df[:, [dt.max(f.A), dt.max(f.B)]]

    A      B
    int32  int32
 0  3      22

Same as above, but more convenient:

df[:, dt.max(f[:])]

    A      B
    int32  int32
 0  3      22

In the presence of by(), it returns the maximum value per group:

df[:, dt.max(f.B), by("A")]

    A      B
    int32  int32
 0  1      20
 1  2      6
 2  3      22

See Also

  • min() – function to calculate minimum values.

datatable.mean()

Calculate the mean value for each column from cols.

Parameters

cols
Expr

Input columns.

return
Expr

f-expression having one row, and the same names and number of columns as in cols. The column stypes are float32 for float32 columns, and float64 for all the other numeric types.

except
TypeError

The exception is raised when one of the columns from cols has a non-numeric type.

See Also

  • median() – function to calculate median values.

  • sd() – function to calculate standard deviation.

Examples

from datatable import dt, f, by

df = dt.Frame({'A': [1, 1, 2, 1, 2],
               'B': [None, 2, 3, 4, 5],
               'C': [1, 2, 1, 1, 2]})
df

    A      B      C
    int32  int32  int32
 0  1      NA     1
 1  1      2      2
 2  2      3      1
 3  1      4      1
 4  2      5      2

Get the mean from column A:

df[:, dt.mean(f.A)]

    A
    float64
 0  1.4

Get the mean of multiple columns:

df[:, dt.mean([f.A, f.B])]

    A        B
    float64  float64
 0  1.4      3.5

Same as above, but applying to a column slice:

df[:, dt.mean(f[:2])]

    A        B
    float64  float64
 0  1.4      3.5

You can pass in a dictionary with new column names:

df[:, dt.mean({"A_mean": f.A, "C_avg": f.C})]

    A_mean   C_avg
    float64  float64
 0  1.4      1.4

In the presence of by(), it returns the average of each column per group:

df[:, dt.mean({"A_mean": f.A, "B_mean": f.B}), by("C")]

    C      A_mean   B_mean
    int32  float64  float64
 0  1      1.33333  3.5
 1  2      1.5      3.5

datatable.median()

Calculate the median value for each column from cols.

Parameters

cols
Expr

Input columns.

return
Expr

f-expression having one row, and the same names, stypes and number of columns as in cols.

except
TypeError

The exception is raised when one of the columns from cols has a non-numeric type.

See Also

  • mean() – function to calculate mean values.

  • sd() – function to calculate standard deviation.

Examples

from datatable import dt, f, by

df = dt.Frame({'A': [1, 1, 2, 1, 2],
               'B': [None, 2, 3, 4, 5],
               'C': [1, 2, 1, 1, 2]})
df

    A      B      C
    int32  int32  int32
 0  1      NA     1
 1  1      2      2
 2  2      3      1
 3  1      4      1
 4  2      5      2

Get the median from column A:

df[:, dt.median(f.A)]

    A
    float64
 0  1

Get the median of multiple columns:

df[:, dt.median([f.A, f.B])]

    A        B
    float64  float64
 0  1        3.5

Same as above, but more convenient:

df[:, dt.median(f[:2])]

    A        B
    float64  float64
 0  1        3.5

You can pass in a dictionary with new column names:

df[:, dt.median({"A_median": f.A, "C_mid": f.C})]

    A_median  C_mid
    float64   float64
 0  1         1

In the presence of by(), it returns the median of each column per group:

df[:, dt.median({"A_median": f.A, "B_median": f.B}), by("C")]

    C      A_median  B_median
    int32  float64   float64
 0  1      1         3.5
 1  2      1.5       3.5

datatable.min()

Calculate the minimum value for each column from cols. It is recommended to use it as dt.min() to prevent conflict with the Python built-in min() function.

Parameters

cols
Expr

Input columns.

return
Expr

f-expression having one row and the same names, stypes and number of columns as in cols.

except
TypeError

The exception is raised when one of the columns from cols has a non-numeric type.

Examples

from datatable import dt, f, by

df = dt.Frame({'A': [1, 1, 1, 2, 2, 2, 3, 3, 3],
               'B': [3, 2, 20, 1, 6, 2, 3, 22, 1]})
df

    A      B
    int32  int32
 0  1      3
 1  1      2
 2  1      20
 3  2      1
 4  2      6
 5  2      2
 6  3      3
 7  3      22
 8  3      1

Get the minimum from column B:

df[:, dt.min(f.B)]

    B
    int32
 0  1

Get the minimum of all columns:

df[:, [dt.min(f.A), dt.min(f.B)]]

    A      B
    int32  int32
 0  1      1

Same as above, but using the slice notation:

df[:, dt.min(f[:])]

    A      B
    int32  int32
 0  1      1

In the presence of by(), it returns the minimum value per group:

df[:, dt.min(f.B), by("A")]

    A      B
    int32  int32
 0  1      2
 1  2      1
 2  3      1

See Also

  • max() – function to calculate maximum values.

datatable.qcut()

Added in version 0.11

Bin all the columns from cols into intervals with approximately equal populations. Thus, the intervals are chosen according to the sample quantiles of the data.

If there are duplicate values in the data, they will all be placed into the same bin. In extreme cases this may cause the bins to be highly unbalanced.

Parameters

cols
FExpr

Input data for quantile binning.

nquantiles
int | List[int]

When a single number is specified, this number of quantiles will be used to bin each column from cols.

When a list or a tuple is provided, each column will be binned by using its own number of quantiles. In the latter case, the list/tuple length must be equal to the number of columns in cols.

return
FExpr

f-expression that converts input columns into the columns filled with the respective quantile ids.
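
Examples

A minimal sketch of equal-population binning (the frame here is invented for illustration):

from datatable import dt, f

DT = dt.Frame(A=[3, 1, 4, 1, 5, 9, 2, 6])
# bin column A into 4 quantiles; every value is replaced with its
# quantile id, an integer from 0 to 3
DT[:, dt.qcut(f.A, nquantiles=4)]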

See also

cut() – function for equal-width interval binning.

datatable.rbind()

Produce a new frame by appending rows of several frames.

This function is equivalent to:

dt.Frame().rbind(*frames, force=force, bynames=bynames)

Parameters

frames
Frame | List[Frame] | None
force
bool
bynames
bool
return
Frame

Examples

from datatable import dt

DT1 = dt.Frame({"Weight": [5, 4, 6], "Height": [170, 172, 180]})
DT1

    Weight  Height
    int32   int32
 0  5       170
 1  4       172
 2  6       180

DT2 = dt.Frame({"Height": [180, 181, 169], "Weight": [4, 4, 5]})
DT2

    Height  Weight
    int32   int32
 0  180     4
 1  181     4
 2  169     5

dt.rbind(DT1, DT2)

    Weight  Height
    int32   int32
 0  5       170
 1  4       172
 2  6       180
 3  4       180
 4  4       181
 5  5       169

rbind() by default combines frames by names. The frames can also be bound by column position by setting the bynames parameter to False:

dt.rbind(DT1, DT2, bynames=False)

    Weight  Height
    int32   int32
 0  5       170
 1  4       172
 2  6       180
 3  180     4
 4  181     4
 5  169     5

If the numbers of columns are not equal, or the column names differ, you can force the row binding by setting the force parameter to True:

DT2["Age"] = dt.Frame([25, 50, 67])
DT2

    Height  Weight  Age
    int32   int32   int32
 0  180     4       25
 1  181     4       50
 2  169     5       67

dt.rbind(DT1, DT2, force=True)

    Weight  Height  Age
    int32   int32   int32
 0  5       170     NA
 1  4       172     NA
 2  6       180     NA
 3  4       180     25
 4  4       181     50
 5  5       169     67

See also

  • cbind() – function for col-binding several frames.

  • dt.Frame.rbind() – Frame method for rbinding some frames to another.

datatable.repeat()

Concatenate n copies of the frame by rows and return the result.

This is equivalent to dt.rbind([frame] * n).

Example

from datatable import dt

DT = dt.Frame({"A": [1, 1, 2, 1, 2], "B": [None, 2, 3, 4, 5]})
DT

    A      B
    int32  int32
 0  1      NA
 1  1      2
 2  2      3
 3  1      4
 4  2      5

dt.repeat(DT, 2)

    A      B
    int32  int32
 0  1      NA
 1  1      2
 2  2      3
 3  1      4
 4  2      5
 5  1      NA
 6  1      2
 7  2      3
 8  1      4
 9  2      5

datatable.rowall()

For each row in cols return True if all values in that row are True, or otherwise return False.

Parameters

cols
FExpr[bool]

Input boolean columns.

return
FExpr[bool]

f-expression consisting of one boolean column that has the same number of rows as in cols.

except
TypeError

The exception is raised when one of the columns from cols has a non-boolean type.

Examples

from datatable import dt, f

DT = dt.Frame({"A": [True, True],
               "B": [True, False],
               "C": [True, True]})
DT

    A      B      C
    bool8  bool8  bool8
 0  1      1      1
 1  1      0      1

DT[:, dt.rowall(f[:])]

    C0
    bool8
 0  1
 1  0

See Also

  • rowany() – return True if any value in a row is True.

datatable.rowany()

For each row in cols return True if any of the values in that row are True, or otherwise return False. The function uses shortcut evaluation: if a True value is found in one of the columns, the subsequent columns are skipped.

Parameters

cols
FExpr[bool]

Input boolean columns.

return
FExpr[bool]

f-expression consisting of one boolean column that has the same number of rows as in cols.

except
TypeError

The exception is raised when one of the columns from cols has a non-boolean type.

Examples

from datatable import dt, f

DT = dt.Frame({"A": [True, True],
               "B": [True, False],
               "C": [True, True]})
DT

    A      B      C
    bool8  bool8  bool8
 0  1      1      1
 1  1      0      1

DT[:, dt.rowany(f[:])]

    C0
    bool8
 0  1
 1  1

See Also

  • rowall() – return True only if all values in a row are True.

datatable.rowcount()

For each row, count the number of non-missing values in cols.

Parameters

cols
FExpr

Input columns.

return
FExpr

f-expression consisting of one int32 column and the same number of rows as in cols.

Examples

from datatable import dt, f

DT = dt.Frame({"A": [1, 1, 2, 1, 2],
               "B": [None, 2, 3, 4, None],
               "C": [True, False, False, True, True]})
DT

    A      B      C
    int32  int32  bool8
 0  1      NA     1
 1  1      2      0
 2  2      3      0
 3  1      4      1
 4  2      NA     1

Note the exclusion of null values from the count:

DT[:, dt.rowcount(f[:])]

    C0
    int32
 0  2
 1  3
 2  3
 3  3
 4  2

See Also

  • rowsum() – sum of all values row-wise.

datatable.rowfirst()

For each row, find the first non-missing value in cols. If all values in a row are missing, then this function will also produce a missing value.

Parameters

cols
FExpr

Input columns.

return
FExpr

f-expression consisting of one column and the same number of rows as in cols.

except
TypeError

The exception is raised when input columns have incompatible types.

Examples

from datatable import dt, f

DT = dt.Frame({"A": [1, 1, 2, 1, 2],
               "B": [None, 2, 3, 4, None],
               "C": [True, False, False, True, True]})
DT

    A      B      C
    int32  int32  bool8
 0  1      NA     1
 1  1      2      0
 2  2      3      0
 3  1      4      1
 4  2      NA     1

DT[:, dt.rowfirst(f[:])]

    C0
    int32
 0  1
 1  1
 2  2
 3  1
 4  2

DT[:, dt.rowfirst(f['B', 'C'])]

    C0
    int32
 0  1
 1  2
 2  3
 3  4
 4  1

See Also

  • rowlast() – find the last non-missing value row-wise.

datatable.rowlast()

For each row, find the last non-missing value in cols. If all values in a row are missing, then this function will also produce a missing value.

Parameters

cols
Expr

Input columns.

return
Expr

f-expression consisting of one column and the same number of rows as in cols.

except
TypeError

The exception is raised when input columns have incompatible types.

Examples

from datatable import dt, f

DT = dt.Frame({"A": [1, 1, 2, 1, 2],
               "B": [None, 2, 3, 4, None],
               "C": [True, False, False, True, True]})
DT

    A      B      C
    int32  int32  bool8
 0  1      NA     1
 1  1      2      0
 2  2      3      0
 3  1      4      1
 4  2      NA     1

DT[:, dt.rowlast(f[:])]

    C0
    int32
 0  1
 1  0
 2  0
 3  1
 4  1

DT[[1, 3], 'C'] = None
DT

    A      B      C
    int32  int32  bool8
 0  1      NA     1
 1  1      2      NA
 2  2      3      0
 3  1      4      NA
 4  2      NA     1

DT[:, dt.rowlast(f[:])]

    C0
    int32
 0  1
 1  2
 2  0
 3  4
 4  1

See Also

  • rowfirst() – find the first non-missing value row-wise.

datatable.rowmax()

For each row, find the largest value among the columns from cols.

Parameters

cols
FExpr

Input columns.

return
FExpr

f-expression consisting of one column that has the same number of rows as in cols. The column stype is the smallest common stype for cols, but not less than int32.

except
TypeError

The exception is raised when cols has non-numeric columns.

Examples

from datatable import dt, f

DT = dt.Frame({"A": [1, 1, 2, 1, 2],
               "B": [None, 2, 3, 4, None],
               "C": [True, False, False, True, True]})
DT

    A      B      C
    int32  int32  bool8
 0  1      NA     1
 1  1      2      0
 2  2      3      0
 3  1      4      1
 4  2      NA     1

DT[:, dt.rowmax(f[:])]

    C0
    int32
 0  1
 1  2
 2  3
 3  4
 4  2

See Also

  • rowmin() – find the smallest element row-wise.

datatable.rowmean()

For each row, find the mean value among the columns from cols, skipping missing values. If a row contains only missing values, this function produces a missing value too.

Parameters

cols
FExpr

Input columns.

return
FExpr

f-expression consisting of one column that has the same number of rows as in cols. The column stype is float32 when all the cols are float32, and float64 in all the other cases.

except
TypeError

The exception is raised when cols has non-numeric columns.

Examples

from datatable import dt, f, rowmean

DT = dt.Frame({'a': [None, True, True, True],
               'b': [2, 2, 1, 0],
               'c': [3, 3, 1, 0],
               'd': [0, 4, 6, 0],
               'q': [5, 5, 1, 0]})
DT

    a      b      c      d      q
    bool8  int32  int32  int32  int32
 0  NA     2      3      0      5
 1  1      2      3      4      5
 2  1      1      1      6      1
 3  1      0      0      0      0

Get the row mean of all columns:

DT[:, rowmean(f[:])]

    C0
    float64
 0  2.5
 1  3
 2  2
 3  0.2

Get the row mean of specific columns:

DT[:, rowmean(f['a', 'b', 'd'])]

    C0
    float64
 0  1
 1  2.33333
 2  2.66667
 3  0.333333

See Also

  • rowsd() – calculate the standard deviation row-wise.

datatable.rowmin()

For each row, find the smallest value among the columns from cols, excluding missing values.

Parameters

cols
FExpr

Input columns.

return
FExpr

f-expression consisting of one column that has the same number of rows as in cols. The column stype is the smallest common stype for cols, but not less than int32.

except
TypeError

The exception is raised when cols has non-numeric columns.

Examples

from datatable import dt, f

DT = dt.Frame({"A": [1, 1, 2, 1, 2],
               "B": [None, 2, 3, 4, None],
               "C": [True, False, False, True, True]})
DT

    A      B      C
    int32  int32  bool8
 0  1      NA     1
 1  1      2      0
 2  2      3      0
 3  1      4      1
 4  2      NA     1

DT[:, dt.rowmin(f[:])]

    C0
    int32
 0  1
 1  0
 2  0
 3  1
 4  1

See Also

  • rowmax() – find the largest element row-wise.

datatable.rowsd()

For each row, find the standard deviation among the columns from cols, skipping missing values. If a row contains only missing values, this function produces a missing value too.

Parameters

cols
FExpr

Input columns.

return
FExpr

f-expression consisting of one column that has the same number of rows as in cols. The column stype is float32 when all the cols are float32, and float64 in all the other cases.

except
TypeError

The exception is raised when cols has non-numeric columns.

Examples

from datatable import dt, f, rowsd

DT = dt.Frame({'name': ['A', 'B', 'C', 'D', 'E'],
               'group': ['mn', 'mn', 'kl', 'kl', 'fh'],
               'S1': [1, 4, 5, 6, 7],
               'S2': [2, 3, 8, 5, 1],
               'S3': [8, 5, 2, 5, 3]})
DT

    name   group  S1     S2     S3
    str32  str32  int32  int32  int32
 0  A      mn     1      2      8
 1  B      mn     4      3      5
 2  C      kl     5      8      2
 3  D      kl     6      5      5
 4  E      fh     7      1      3

Get the row standard deviation for all integer columns:

DT[:, rowsd(f[int])]

    C0
    float64
 0  3.78594
 1  1
 2  3
 3  0.57735
 4  3.05505

Get the row standard deviation for some columns:

DT[:, rowsd(f[2, 3])]

    C0
    float64
 0  0.707107
 1  0.707107
 2  2.12132
 3  0.707107
 4  4.24264

See Also

  • rowmean() – calculate the mean value row-wise.

datatable.rowsum()

For each row, calculate the sum of all values in cols. Missing values are skipped during the calculation, i.e. treated as zeros.

Parameters

cols
FExpr

Input columns.

return
FExpr

f-expression consisting of one column and the same number of rows as in cols. The stype of the resulting column will be the smallest common stype calculated for cols, but not less than int32.

except
TypeError

The exception is raised when one of the columns from cols has a non-numeric type.

Examples

from datatable import dt, f, rowsum

DT = dt.Frame({'a': [1, 2, 3],
               'b': [2, 3, 4],
               'c': ['dd', 'ee', 'ff'],
               'd': [5, 9, 1]})
DT

    a      b      c      d
    int32  int32  str32  int32
 0  1      2      dd     5
 1  2      3      ee     9
 2  3      4      ff     1

DT[:, rowsum(f[int])]

    C0
    int32
 0  8
 1  14
 2  8

DT[:, rowsum(f.a, f.b)]

    C0
    int32
 0  3
 1  5
 2  7

The above code could also be written as

DT[:, f.a + f.b]

    C0
    int32
 0  3
 1  5
 2  7

See Also

  • rowcount() – count non-missing values row-wise.

datatable.sd()

Calculate the standard deviation for each column from cols.

Parameters

cols
Expr

Input columns.

return
Expr

f-expression having one row, and the same names and number of columns as in cols. The column stypes are float32 for float32 columns, and float64 for all the other numeric types.

except
TypeError

The exception is raised when one of the columns from cols has a non-numeric type.

Examples

from datatable import dt, f

DT = dt.Frame(A=[0, 1, 2, 3], B=[0, 2, 4, 6])
DT

    A      B
    int32  int32
 0  0      0
 1  1      2
 2  2      4
 3  3      6

Get the standard deviation of column A:

DT[:, dt.sd(f.A)]

    A
    float64
 0  1.29099

Get the standard deviation of columns A and B:

DT[:, dt.sd([f.A, f.B])]

    A        B
    float64  float64
 0  1.29099  2.58199

See Also

  • mean() – function to calculate mean values.

  • median() – function to calculate median values.

datatable.setdiff()

Find the set difference between frame0 and the other frames.

Each frame should have only a single column or be empty. The values in each frame will be treated as a set, and this function will compute the set difference between frame0 and the union of the other frames, returning those values that are present in frame0 but absent from all of the other frames.

Parameters

frame0
Frame

Input single-column frame.

*frames
Frame | Frame | ...

Input single-column frames.

return
Frame

A single-column frame. The column stype is the smallest common stype of columns from the frames.

except
ValueError | NotImplementedError

dt.exceptions.ValueError

raised when one of the input frames, i.e. frame0 or any one from the frames, has more than one column.

dt.exceptions.NotImplementedError

raised when one of the columns has stype obj64.

Examples

from datatable import dt

s1 = dt.Frame([4, 5, 6, 20, 42])
s2 = dt.Frame([1, 2, 3, 5, 42])
s1

    C0
    int32
 0  4
 1  5
 2  6
 3  20
 4  42

s2

    C0
    int32
 0  1
 1  2
 2  3
 3  5
 4  42

Set difference of the two frames:

dt.setdiff(s1, s2)

    C0
    int32
 0  4
 1  6
 2  20

See Also

  • intersect() – calculate the set intersection of values in the frames.

  • symdiff() – calculate the symmetric difference between the sets of values in the frames.

  • union() – calculate the union of values in the frames.

  • unique() – find unique values in a frame.

datatable.shift()

Produce a column obtained from col by shifting it n rows forward.

The shift amount, n, can be both positive and negative. If positive, a “lag” column is created, if negative it will be a “lead” column.

The shifted column will have the same number of rows as the original column, with n observations in the beginning becoming missing, and n observations at the end discarded.

This function is group-aware, i.e. in the presence of a groupby it will perform the shift separately within each group.

Examples

from datatable import dt, f, by

DT = dt.Frame({"object": [1, 1, 1, 2, 2],
               "period": [1, 2, 4, 4, 23],
               "value": [24, 67, 89, 5, 23]})
DT

    object  period  value
    int32   int32   int32
 0  1       1       24
 1  1       2       67
 2  1       4       89
 3  2       4       5
 4  2       23      23

Shift forward – create a “lag” column:

DT[:, dt.shift(f.period, n=3)]

    period
    int32
 0  NA
 1  NA
 2  NA
 3  1
 4  2

Shift backwards – create “lead” columns:

DT[:, dt.shift(f[:], n=-3)]

    object  period  value
    int32   int32   int32
 0  2       4       5
 1  2       23      23
 2  NA      NA      NA
 3  NA      NA      NA
 4  NA      NA      NA

Shift in the presence of by():

DT[:, f[:].extend({"prev_value": dt.shift(f.value)}), by("object")]

    object  period  value  prev_value
    int32   int32   int32  int32
 0  1       1       24     NA
 1  1       2       67     24
 2  1       4       89     67
 3  2       4       5      NA
 4  2       23      23     5

datatable.sort()

Sort clause for use in Frame’s square-bracket selector.

When a sort() object is present inside a DT[i, j, ...] expression, it will sort the rows of the resulting Frame according to the columns cols passed as the arguments to sort().

When used together with by(), the sort clause applies after the group-by, i.e. we sort elements within each group. Note, however, that because we use stable sorting, the operations of grouping and sorting are commutative: the result of applying groupby and then sort is the same as the result of sorting first and then doing groupby.

When used together with i (row filter), the i filter is applied after the sorting. For example:

DT[:10, :, sort(f.Highscore, reverse=True)]

will select the first 10 records from the frame DT ordered by the Highscore column.

Examples

from datatable import dt, f, by

DT = dt.Frame({"col1": ["A", "A", "B", None, "D", "C"],
               "col2": [2, 1, 9, 8, 7, 4],
               "col3": [0, 1, 9, 4, 2, 3],
               "col4": [1, 2, 3, 3, 2, 1]})
DT

    col1   col2   col3   col4
    str32  int32  int32  int32
 0  A      2      0      1
 1  A      1      1      2
 2  B      9      9      3
 3  NA     8      4      3
 4  D      7      2      2
 5  C      4      3      1

Sort by a single column:

DT[:, :, dt.sort("col1")]

    col1   col2   col3   col4
    str32  int32  int32  int32
 0  NA     8      4      3
 1  A      2      0      1
 2  A      1      1      2
 3  B      9      9      3
 4  C      4      3      1
 5  D      7      2      2

Sort by multiple columns:

DT[:, :, dt.sort("col2", "col3")]

    col1   col2   col3   col4
    str32  int32  int32  int32
 0  A      1      1      2
 1  A      2      0      1
 2  C      4      3      1
 3  D      7      2      2
 4  NA     8      4      3
 5  B      9      9      3

Sort in descending order:

DT[:, :, dt.sort(-f.col1)]

    col1   col2   col3   col4
    str32  int32  int32  int32
 0  NA     8      4      3
 1  D      7      2      2
 2  C      4      3      1
 3  B      9      9      3
 4  A      2      0      1
 5  A      1      1      2

The frame can also be sorted in descending order by setting the reverse parameter to True:

DT[:, :, dt.sort("col1", reverse=True)]

    col1   col2   col3   col4
    str32  int32  int32  int32
 0  NA     8      4      3
 1  D      7      2      2
 2  C      4      3      1
 3  B      9      9      3
 4  A      2      0      1
 5  A      1      1      2

By default, when sorting, null values are placed at the top; to relocate null values to the bottom, pass last to the na_position parameter:

DT[:, :, dt.sort("col1", na_position="last")]

    col1   col2   col3   col4
    str32  int32  int32  int32
 0  A      2      0      1
 1  A      1      1      2
 2  B      9      9      3
 3  C      4      3      1
 4  D      7      2      2
 5  NA     8      4      3

Passing remove to na_position completely excludes any row with null values from the sorted output:

DT[:, :, dt.sort("col1", na_position="remove")]

    col1   col2   col3   col4
    str32  int32  int32  int32
 0  A      2      0      1
 1  A      1      1      2
 2  B      9      9      3
 3  C      4      3      1
 4  D      7      2      2

Sort by multiple columns, in descending and ascending order:

DT[:, :, dt.sort(-f.col2, f.col3)]

    col1   col2   col3   col4
    str32  int32  int32  int32
 0  B      9      9      3
 1  NA     8      4      3
 2  D      7      2      2
 3  C      4      3      1
 4  A      2      0      1
 5  A      1      1      2

The same result can be obtained by passing a list of booleans to reverse:

DT[:, :, dt.sort("col2", "col3", reverse=[True, False])]

    col1   col2   col3   col4
    str32  int32  int32  int32
 0  B      9      9      3
 1  NA     8      4      3
 2  D      7      2      2
 3  C      4      3      1
 4  A      2      0      1
 5  A      1      1      2

In the presence of by(), sort() sorts within each group:

DT[:, :, by("col4"), dt.sort(f.col2)]

    col4   col1   col2   col3
    int32  str32  int32  int32
 0  1      A      2      0
 1  1      C      4      3
 2  2      A      1      1
 3  2      D      7      2
 4  3      NA     8      4
 5  3      B      9      9

datatable.split_into_nhot()

split_into_nhot(sep=",", sort=False)
Deprecated since version 1.0.0

This function is deprecated and will be removed in version 1.1.0. Please use dt.str.split_into_nhot() instead.

datatable.sum()

Calculate the sum of values for each column from cols.

Parameters

cols
Expr

Input columns.

return
Expr

f-expression having one row, and the same names and number of columns as in cols. The column stypes are int64 for boolean and integer columns, float32 for float32 columns and float64 for float64 columns.

except
TypeError

The exception is raised when one of the columns from cols has a non-numeric type.

Examples

from datatable import dt, f, by

df = dt.Frame({'A': [1, 1, 2, 1, 2],
               'B': [None, 2, 3, 4, 5],
               'C': [1, 2, 1, 1, 2]})
df

    A      B      C
    int32  int32  int32
 0  1      NA     1
 1  1      2      2
 2  2      3      1
 3  1      4      1
 4  2      5      2

Get the sum of column A:

df[:, dt.sum(f.A)]

    A
    int64
 0  7

Get the sum of multiple columns:

df[:, [dt.sum(f.A), dt.sum(f.B)]]

    A      B
    int64  int64
 0  7      14

Same as above, but more convenient:

df[:, dt.sum(f[:2])]

    A      B
    int64  int64
 0  7      14

In the presence of by(), it returns the sum of the specified columns per group:

df[:, [dt.sum(f.A), dt.sum(f.B)], by(f.C)]

    C      A      B
    int32  int64  int64
 0  1      4      7
 1  2      3      7

See Also

  • count() – function to calculate a number of non-missing values.

datatable.symdiff()

Find the symmetric difference between the sets of values in all frames.

Each frame should have only a single column or be empty. The values in each frame will be treated as a set, and this function will perform the symmetric difference operation on these sets.

The symmetric difference of two frames consists of those values that are present in either of the frames, but not in both. The symmetric difference of more than two frames consists of those values that are present in an odd number of frames.

Parameters

*frames
Frame | Frame | ...

Input single-column frames.

return
Frame

A single-column frame. The column stype is the smallest common stype of columns from the frames.

except
ValueError | NotImplementedError

dt.exceptions.ValueError

raised when one of the input frames has more than one column.

dt.exceptions.NotImplementedError

raised when one of the columns has stype obj64.

Examples

from datatable import dt

df = dt.Frame({'A': [1, 1, 2, 1, 2],
               'B': [None, 2, 3, 4, 5],
               'C': [1, 2, 1, 1, 2]})
df

    A      B      C
    int32  int32  int32
 0  1      NA     1
 1  1      2      2
 2  2      3      1
 3  1      4      1
 4  2      5      2

Symmetric difference of all the columns in the entire frame. Note that each column is treated as a separate frame:

dt.symdiff(*df)

    A
    int32
 0  NA
 1  2
 2  3
 3  4
 4  5

Symmetric difference between two frames:

dt.symdiff(df["A"], df["B"])

    A
    int32
 0  NA
 1  1
 2  3
 3  4
 4  5

See Also

  • intersect() – calculate the set intersection of values in the frames.

  • setdiff() – calculate the set difference between the frames.

  • union() – calculate the union of values in the frames.

  • unique() – find unique values in a frame.

datatable.union()

Find the union of values in all frames.

Each frame should have only a single column or be empty. The values in each frame will be treated as a set, and this function will perform the union operation on these sets.

The dt.union(*frames) operation is equivalent to dt.unique(dt.rbind(*frames)).

Parameters

*frames
Frame | Frame | ...

Input single-column frames.

return
Frame

A single-column frame. The column stype is the smallest common stype of columns in the frames.

except
ValueError | NotImplementedError

dt.exceptions.ValueError

raised when one of the input frames has more than one column.

dt.exceptions.NotImplementedError

raised when one of the columns has stype obj64.

Examples

from datatable import dt

df = dt.Frame({'A': [1, 1, 2, 1, 2],
               'B': [None, 2, 3, 4, 5],
               'C': [1, 2, 1, 1, 2]})
df

    A      B      C
    int32  int32  int32
 0  1      NA     1
 1  1      2      2
 2  2      3      1
 3  1      4      1
 4  2      5      2

Union of all the columns in a frame:

dt.union(*df)

    A
    int32
 0  NA
 1  1
 2  2
 3  3
 4  4
 5  5

Union of two frames:

dt.union(df["A"], df["C"])

    A
    int32
 0  1
 1  2

See Also

  • intersect() – calculate the set intersection of values in the frames.

  • setdiff() – calculate the set difference between the frames.

  • symdiff() – calculate the symmetric difference between the sets of values in the frames.

  • unique() – find unique values in a frame.

datatable.unique()

Find the unique values in all the columns of the frame.

This function sorts the values in order to find the uniques, so the return values will be ordered. However, this should be considered an implementation detail: in the future datatable may switch to a different algorithm, such as hash-based, which may return the results in a different order.

Parameters

frame
Frame

Input frame.

return
Frame

A single-column frame consisting of unique values found in frame. The column stype is the smallest common stype for all the frame columns.

except
NotImplementedError

The exception is raised when one of the frame columns has stype obj64.

Examples

from datatable import dt

df = dt.Frame({'A': [1, 1, 2, 1, 2],
               'B': [None, 2, 3, 4, 5],
               'C': [1, 2, 1, 1, 2]})
df

    A      B      C
    int32  int32  int32
 0  1      NA     1
 1  1      2      2
 2  2      3      1
 3  1      4      1
 4  2      5      2

Unique values in the entire frame:

dt.unique(df)

    C0
    int32
 0  NA
 1  1
 2  2
 3  3
 4  4
 5  5

Unique values in a frame with a single column:

dt.unique(df["A"])

    A
    int32
 0  1
 1  2

See Also

  • intersect() – calculate the set intersection of values in the frames.

  • setdiff() – calculate the set difference between the frames.

  • symdiff() – calculate the symmetric difference between the sets of values in the frames.

  • union() – calculate the union of values in the frames.

datatable.update()

Create new or update existing columns within a frame.

This expression is intended to be used in the “j” place of a DT[i, j] call. It takes an arbitrary number of key/value pairs, each describing a column name and the expression for how that column is to be created or updated.

Examples

from datatable import dt, f, by, update

DT = dt.Frame([range(5), [4, 3, 9, 11, -1]], names=("A", "B"))
DT

    A      B
    int32  int32
 0  0      4
 1  1      3
 2  2      9
 3  3      11
 4  4      -1

Create new columns and update existing columns:

DT[:, update(C = f.A * 2,
             D = f.B // 3,
             A = f.A * 4,
             B = f.B + 1)]
DT

    A      B      C      D
    int32  int32  int32  int32
 0  0      5      0      1
 1  4      4      2      1
 2  8      10     4      3
 3  12     12     6      3
 4  16     0      8      -1

Add a new column with unpacking; this can be handy for dynamically adding columns with dictionary comprehensions, or when the names are not valid Python identifiers:

DT[:, update(**{"extra column": f.A + f.B + f.C + f.D})]
DT

    A      B      C      D      extra column
    int32  int32  int32  int32  int32
 0  0      5      0      1      6
 1  4      4      2      1      11
 2  8      10     4      3      25
 3  12     12     6      3      33
 4  16     0      8      -1     23

You can update a subset of data:

DT[f.A > 10, update(A = f.A * 5)]
DT

    A      B      C      D      extra column
    int32  int32  int32  int32  int32
 0  0      5      0      1      6
 1  4      4      2      1      11
 2  8      10     4      3      25
 3  60     12     6      3      33
 4  80     0      8      -1     23

You can also add a new column or update an existing column in a groupby operation, similar to SQL’s window functions or pandas’ transform():

df = dt.Frame("""exporter  assets  liabilities
                 False     5       1
                 True      10      8
                 False     3       1
                 False     24      20
                 False     40      2
                 True      12      11""")

# Get the ratio for each row per group
df[:, update(ratio = dt.sum(f.liabilities) * 100 / dt.sum(f.assets)),
   by(f.exporter)]
df

    exporter  assets  liabilities  ratio
    bool8     int32   int32        float64
 0  0         5       1            33.3333
 1  1         10      8            86.3636
 2  0         3       1            33.3333
 3  0         24      20           33.3333
 4  0         40      2            33.3333
 5  1         12      11           86.3636

Development

Contributing

datatable is an open-source project released under the Mozilla Public License v2. Open source projects live by their user and developer communities. We welcome and encourage your contributions of any kind!

No matter what your skill set or level of engagement is with datatable, you can help others by improving the ecosystem of documentation, bug report and feature request tickets, and code.

We invite anyone who is interested to contribute, whether through pull requests, tests, GitHub issues, feature suggestions, or even generic discussion.

If you have questions about using datatable, post them on Stack Overflow using the [py-datatable] tag.

Preparing local copy of datatable repository

If this is the first time you’re contributing to datatable, then follow these steps in order to set up your local development environment:

  1. Make sure you have command-line tools git and make installed. You should also have a text editor or an IDE of your choice.

  2. Go to https://github.com/h2oai/datatable and click the “fork” button in the top right corner. You may need to create a GitHub account if you don’t have one already.

  3. Clone the repository on your local computer:

    $ git clone https://github.com/your_user_name/datatable
  4. Lastly, add the original datatable repository as the upstream:

    $ cd datatable
    $ git remote add upstream https://github.com/h2oai/datatable
    $ git fetch upstream
    $ git config branch.main.remote upstream
    $ git config branch.main.merge refs/heads/main

This completes the setup of your local datatable fork. Make sure to note the location of the datatable/ directory that you created in step 3. You will need to return there when issuing any subsequent git commands detailed further.

Creating a contribution

Start by fetching any changes that might have occurred since the last time you were working with the repository:

$ git checkout main
$ git pull

Then create a new local branch where you will be working on your changes. The name of the branch should be a short identifier that will help you recognize what this branch is about. It’s a good idea to prefix the branch name with your initials so that it doesn’t conflict with branches from other developers:

$ git checkout -b your_branch_name

After this it is time to make the desired changes to the project. There are separate guides on how to work with documentation and how to work with core code changes. It is also a good idea to commit the code frequently, using git add and git commit.

Note: While many projects ask for detailed and informative commit messages, we don’t. Our policy is to squash all commits when merging a pull request, and therefore the only detailed message that is needed is the PR description.

When you think your proposed change is ready, verify that everything is in order by running git status – it should say “nothing to commit, working tree clean”. At this point the changes need to be pushed into the “origin”, which is your repository fork:

$ git push origin your_branch_name

Then go back to the GitHub website to your fork of the datatable repository https://github.com/your_user_name/datatable. There you should see a pop-up notifying you about the changes pushed to your_branch_name, together with a green button “Compare & pull request”. Pressing that button will open an “Open a pull request” form.

When opening a pull request, make sure to provide an informative title and a detailed description of the proposed changes. If the pull request directly addresses one of the issues, make sure to note that in the text of the PR description.

Make sure the checkbox “Allow edits by maintainers” is turned on, and then press “Create pull request”.

At this point your Pull Request will be scheduled for review at the main datatable repository. Once reviewed, you may be asked to change something, in which case you can make the necessary modifications locally, then commit and push them.

Contributing documentation

The documentation for datatable project is written entirely in the ReStructured Text (RST) format and rendered using the Sphinx engine. These technologies are standard for Python.

The basic workflow for developing documentation, after setting up a local datatable repository, is to go into the docs/ directory and run

$ make html

After that, if there were no errors, the documentation can be viewed locally by opening the file docs/_build/html/index.html in a browser.

The make html command needs to be re-run after every change you make. Occasionally you may also need to make clean if something doesn’t seem to work properly.

Basic formatting

At the most basic level, an RST document is plain text, where paragraphs are separated with empty lines:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.

The line breaks within each paragraph are ignored; on the rendered page the lines will be as wide as necessary to fill the page. With that in mind, we ask that you avoid lines longer than 80 characters where possible. This makes it much easier to work with the source text, including on small-screen devices.

Page headings are a simple line of underlined text:

Heading Level 1
===============

Heading Level 2
---------------

Heading Level 3
~~~~~~~~~~~~~~~

Each document must have exactly one level-1 heading; otherwise the page would not render properly.

Basic bold text, italic text, and literal text are written as follows. (Note that literals use double backticks, which is a frequent cause of formatting errors.):

**bold text**
*italic text*
``literal text``

Bulleted and ordered lists are done similarly to Markdown:

- list item 1;
- list item 2;
- a longer list item, that might need to be
  carried over to the next line.

1. ordered list item 1
2. ordered list item 2

   This is the next paragraph of list item 2.

The content of each list item can be arbitrarily complex, as long as it is properly indented.

Code blocks

There are two main ways to format a block of code. The simplest way is to finish a paragraph with a double-colon :: and then start the next paragraph (code) indented with 4 spaces:

Here is a code example::

    >>> print("Hello, world!", flush=True)

In this case the code will be highlighted assuming it is a python sample. If the code corresponds to some other language, you’ll need to use an explicit code-block directive:

.. code-block:: shell

    $ pip install datatable

This directive allows you to explicitly select the language of your code snippet, which will affect how it is highlighted. The code inside code-block must be indented, and there has to be an empty line between the .. code-block:: declaration and the actual code.

When writing python code examples, the best practice is to use python console format, i.e. prepend all input lines with >>> (or ... for continuation lines), and keep all output lines without a prefix. When documenting an error, remove all traceback information and leave only the error message:

>>> import datatable as dt
>>> DT = dt.Frame(A=[5], B=[17], D=['zelo'])
>>> DT
   |     A      B  D
   | int32  int32  str32
-- + -----  -----  -----
 0 |     5     17  zelo
[1 row x 3 columns]
>>> DT.hello()
AttributeError: 'datatable.Frame' object has no attribute 'hello'

This code snippet will be rendered as follows:

import datatable as dt
DT = dt.Frame(A=[5], B=[17], D=['zelo'])
DT

       A      B  D
   int32  int32  str32
0      5     17  zelo

DT.hello()

AttributeError: 'datatable.Frame' object has no attribute 'hello'

Advanced directives

All RST documents are arranged into a tree. All non-leaf nodes of this tree must include a .. toctree:: directive, which may also be declared hidden:

.. toctree::
    :hidden:

    child_doc_1
    Explicit name <child_doc_2>

The .. image:: directive can be used to insert an image, which may also be a link:

.. image:: <image URL>
    :target: <target URL if the image is a link>

In order to note that some functionality was added or changed in a specific version, use:

.. x-version-added:: 0.10.0

.. x-version-deprecated:: 1.0.0

.. x-version-changed:: 0.11.0

    Here's what changed: blah-blah-blah

The .. seealso:: directive adds a Wikipedia-style “see also:” entry at the beginning of a section. The argument of this directive should contain a link to the content that you want the user to see. This directive is best included immediately after a heading:

.. seealso:: :ref:`columnsets`

The .. x-comparison-table:: directive creates a two-column table specifically designed for comparing two entities across multiple comparison points. It is primarily used for the “compare datatable with another library” manual pages. The content of this directive consists of multiple “sections” separated with ====, and each section has 2 or 3 parts (separated with ----): an optional common header, then the content of the first column, and then of the second:

.. x-comparison-table::
    :header1: datatable
    :header2: another-library

    Section 1 header
    ----
    Column 1
    ----
    Column 2
    ====
    Section 2 header
    ----
    Column 1
    ----
    Column 2

Changelog support

RST is a language that supports extensions. One of the custom extensions that we use supports maintaining a changelog. The .. changelog:: directive, used in the releases/vN.N.N.rst files, declares that each of those files describes a particular release of datatable. The format is as follows:

.. changelog::
    :version: <version number>
    :released: <release date>
    :wheels: URL1
             URL2
             etc.

    changelog content...

    .. contributors::

        N @username <full name>
        --
        N @username <full name>

The effect of this declaration is the following:

  • The title of the page is automatically inserted, together with an anchor that can be used to refer to this page;

  • A Wikipedia-style infobox is added on the right side of the page. This infobox contains the release date, links to the previous/next releases, and links to all wheels that were released with that version. The wheels are grouped by python version / operating system. An sdist link may also be included as one of the “wheels”.

  • Within the .. changelog:: directive, a special form of list items is supported:

    -[new] New feature that was added
    -[enh] Improvement of an existing feature or function
    -[fix] Bug fix
    -[api] API change

    In addition, if any such item ends with the text of the form [#333], then this will be automatically converted into a link to the github issue/PR with that number (see the example after this list).

  • The .. contributors:: directive can only be used inside a changelog, and it should list the contributors who participated in the creation of this particular release. The list of contributors is prepared using the script ci/gh.py.
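For example, a complete changelog entry (with a made-up description and a hypothetical issue number) could look like this:

-[fix] Frames with zero rows now display correctly [#1234]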

Documenting API

When it comes to documenting specific functions/classes/methods of the datatable module, we use another extension: .. xfunction:: (or .. xclass::, .. xmethod::, etc). This is because this part of the documentation is declared within the C++ code, so that it can be available from within a regular python session.

Inside the documentation tree, each function/method/etc that has to be documented is declared as follows:

.. xfunction:: datatable.rbind
    :src: src/core/frame/rbind.cc py_rbind
    :doc: src/core/frame/rbind.cc doc_py_rbind
    :tests: tests/munging/test-rbind.py

Here we declare the function dt.rbind(), whose source code is located in the file src/core/frame/rbind.cc, in the function py_rbind(). The docstring of this function is located in the same file, in the variable static const char* doc_py_rbind. The content of that variable will be pre-processed and then rendered as RST. The :doc: parameter is optional; if omitted, the directive will attempt to find the docstring automatically.

The optional :tests: parameter should point to a file where the tests for this function are located. This will be included as a link in the rendered output.

In order to document a getter/setter property of a class, use the following:

.. xdata:: datatable.Frame.key
    :src: src/core/frame/key.cc Frame::get_key Frame::set_key
    :doc: src/core/frame/key.cc doc_key
    :tests: tests/test-keys.py
    :settable: new_key
    :deletable:

The :src: parameter can now accept two function names: the getter and the setter. In addition, the :settable: parameter gives the name of the setter value as it will be displayed in the docs. Lastly, :deletable: marks this class property as deletable.

The docstring of the function/method/etc is preprocessed before it is rendered into the RST document. This processing includes the following steps:

  • The “Parameters” section is parsed and the definitions of all function parameters are extracted.

  • The contents of the “Examples” section are parsed as if it were a literal block, converted from the python-console format into jupyter-style code blocks. In addition, if the output of any command contains a datatable Frame, it will also be converted into a Jupyter-style table.

  • All other sections are displayed as-is.

Here’s an example of a docstring:

static const char* doc_rbind =
R"(rbind(self, *frames, force=False, bynames=True)
--

Append rows of `frames` to the current frame.

This method modifies the current frame in-place. If you do not want
the current frame modified, then use the :func:`dt.rbind()` function.

Parameters
----------
frames: Frame | List[Frame]
    One or more frames to append.

force: bool
    If True, then the frames are allowed to have mismatching set of
    columns. Any gaps in the data will be filled with NAs.

bynames: bool
    If True (default), the columns in frames are matched by their
    names, otherwise by their order.

Examples
--------
>>> DT = dt.Frame(A=[1, 2, 3], B=[4, 7, 0])
>>> frame1 = dt.Frame(A=[-1], B=[None])
>>> DT.rbind(frame1)
>>> DT
   |  A   B
-- + --  --
 0 |  1   4
 1 |  2   7
 2 |  3   0
 3 | -1  NA
--
[4 rows x 2 columns]
)";

Creating a new FExpr

The majority of functions available from the datatable module are implemented via the FExpr mechanism. These functions have the same common API: they accept one or more FExprs (or fexpr-like objects) as arguments and produce an FExpr as the output. The resulting FExprs can then be used inside the DT[...] call to apply these expressions to a particular frame.

In this document we describe how to create such an FExpr-based function, using as an example the gcd(a, b) function for computing the greatest common divisor of two integers.

C++ “backend” class

The core of the functionality will reside within a class derived from the class dt::expr::FExpr. So let’s create the file expr/fexpr_gcd.cc and declare the skeleton of our class:

#include "expr/fexpr_func.h" #include "expr/eval_context.h" #include "expr/workframe.h" namespace dt { namespace expr { class FExpr_Gcd : public FExpr_Func { private: ptrExpr a_; ptrExpr b_; public: FExpr_Gcd(ptrExpr&& a, ptrExpr&& b) : a_(std::move(a)), b_(std::move(b)) {} std::string repr() const override; Workframe evaluate_n(EvalContext& ctx) const override; }; }}

In this example we are inheriting from FExpr_Func, which is a slightly more specialized version of FExpr.

You can also see that the two arguments in gcd(a, b) are stored within the class as ptrExpr a_, b_. This ptrExpr is actually a typedef for std::shared_ptr<FExpr>, which means that arguments to our FExpr are also FExprs.

The first method that needs to be implemented is repr(), which is more-or-less equivalent to python’s __repr__. The returned string should not have the name of the class in it; instead, it must be ready to be combined with the reprs of other expressions:

std::string repr() const override {
  std::string out = "gcd(";
  out += a_->repr();
  out += ", ";
  out += b_->repr();
  out += ')';
  return out;
}

We construct our repr out of reprs of a_ and b_. They are joined with a comma, which has the lowest precedence in python. For some other FExprs we may need to take into account the precedence of the arguments as well, in order to properly set up parentheses around subexpressions.

The second method to implement is evaluate_n(). The _n suffix here stands for “normal”. If you look into the source of the FExpr class, you’ll see that there are other evaluation methods too: evaluate_i(), evaluate_j(), etc. However, none of those are needed when implementing a simple function.

The method evaluate_n() takes an EvalContext object as the argument. This object contains information about the current evaluation environment. The output from evaluate_n() should be a Workframe object. A workframe can be thought of as a “work-in-progress” frame. In our case it is sufficient to treat it as a simple vector of columns.

We begin implementing evaluate_n() by evaluating the arguments a_ and b_ and then making sure that those frames are compatible with each other (i.e. have the same number of columns and rows). After that we compute the result by iterating through the columns of both frames and calling a simple method evaluate1(Column&&, Column&&) (that we still need to implement):

Workframe evaluate_n(EvalContext& ctx) const override {
  Workframe awf = a_->evaluate_n(ctx);
  Workframe bwf = b_->evaluate_n(ctx);
  if (awf.ncols() == 1) awf.repeat_column(bwf.ncols());
  if (bwf.ncols() == 1) bwf.repeat_column(awf.ncols());
  if (awf.ncols() != bwf.ncols()) {
    throw TypeError() << "Incompatible number of columns in " << repr()
        << ": the first argument has " << awf.ncols() << ", while the "
        << "second has " << bwf.ncols();
  }
  awf.sync_grouping_mode(bwf);

  auto gmode = awf.get_grouping_mode();
  Workframe outputs(ctx);
  for (size_t i = 0; i < awf.ncols(); ++i) {
    Column rescol = evaluate1(awf.retrieve_column(i),
                              bwf.retrieve_column(i));
    outputs.add_column(std::move(rescol), std::string(), gmode);
  }
  return outputs;
}

The method evaluate1() will take a pair of two columns and produce the output column containing the result of gcd(a, b) calculation. We must take into account the stypes of both columns, and decide which stypes are acceptable for our function:

Column evaluate1(Column&& a, Column&& b) const {
  SType stype1 = a.stype();
  SType stype2 = b.stype();
  SType stype0 = common_stype(stype1, stype2);
  switch (stype0) {
    case SType::BOOL:
    case SType::INT8:
    case SType::INT16:
    case SType::INT32:
      return make<int32_t>(std::move(a), std::move(b), SType::INT32);
    case SType::INT64:
      return make<int64_t>(std::move(a), std::move(b), SType::INT64);
    default:
      throw TypeError() << "Invalid columns of types " << stype1
          << " and " << stype2 << " in " << repr();
  }
}

template <typename T>
Column make(Column&& a, Column&& b, SType stype0) const {
  a.cast_inplace(stype0);
  b.cast_inplace(stype0);
  return Column(new Column_Gcd<T>(std::move(a), std::move(b)));
}

As you can see, the job of the FExpr_Gcd class is to produce a workframe containing one or more Column_Gcd virtual columns. This is where the actual calculation of GCD values will take place, and we shall declare this class too. It can be done either in a separate file in the core/column/ folder, or inside the current file expr/fexpr_gcd.cc.

#include "column/virtual.h" template <typename T> class Column_Gcd : public Virtual_ColumnImpl { private: Column acol_; Column bcol_; public: Column_Gcd(Column&& a, Column&& b) : Virtual_ColumnImpl(a.nrows(), a.stype()), acol_(std::move(a)), bcol_(std::move(b)) { xassert(acol_.nrows() == bcol_.nrows()); xassert(acol_.stype() == bcol_.stype()); xassert(acol_.can_be_read_as<T>()); } ColumnImpl* clone() const override { return new Column_Gcd(Column(acol_), Column(bcol_)); } size_t n_children() const noexcept { return 2; } const Column& child(size_t i) { return i==0? acol_ : bcol_; } bool get_element(size_t i, T* out) { T a, b; bool avalid = acol_.get_element(i, &a); bool bvalid = bcol_.get_element(i, &b); if (avalid && bvalid) { while (b) { T tmp = b; b = a % b; a = tmp; } *out = a; return true; } return false; } };

Python-facing gcd() function

Now that we have created the FExpr_Gcd class, we also need to have a python function responsible for creating these objects. This is done in 4 steps:

First, declare a function with the signature py::oobj(const py::XArgs&). The py::XArgs object here encapsulates all the parameters that were passed to the function, while the returned py::oobj is a simple wrapper around python’s PyObject*.

static py::oobj py_gcd(const py::XArgs& args) {
  auto a = args[0].to_oobj();
  auto b = args[1].to_oobj();
  return PyFExpr::make(new FExpr_Gcd(as_fexpr(a), as_fexpr(b)));
}

This function takes the python arguments, validates and converts them into C++ objects if necessary, then creates a new FExpr_Gcd object, and returns it wrapped into a PyFExpr (which is the python equivalent of the generic FExpr class).

In the second step, we declare the signature and the docstring of this python function:

DECLARE_PYFN(&py_gcd)
    ->name("gcd")
    ->docs(dt::doc_gcd)
    ->arg_names({"a", "b"})
    ->n_positional_args(2)
    ->n_required_args(2);

The variable doc_gcd must be declared in the common “documentation.h” file:

extern const char* doc_gcd;

The actual documentation should be written in a separate .rst file (more on this later), and then it will be added into the code during the compilation stage via the auto-generated file “documentation.cc”.

At this point the method will be visible from python in the _datatable module. So the next step is to import it into the main datatable module. To do this, go to src/datatable/__init__.py and write

from .lib._datatable import (
    ...
    gcd,
    ...
)
...
__all__ = (
    ...
    "gcd",
    ...
)
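Once imported, the new function can be used like any other FExpr-based function. This is a sketch; the output below assumes the implementation above (gcd(12, 8) == 4, gcd(18, 24) == 6):

>>> from datatable import dt, f, gcd
>>> DT = dt.Frame(A=[12, 18], B=[8, 24])
>>> DT[:, gcd(f.A, f.B)]
   |    C0
   | int32
-- + -----
 0 |     4
 1 |     6
[2 rows x 1 column]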

Tests

Any functionality must be properly tested. We recommend creating a dedicated test file for each new function. Thus, create the file tests/expr/test-gcd.py and add some tests to it. We use the pytest framework for testing. In this framework, each test is a single function (whose name starts with test_) which performs some actions and then asserts the validity of the results.

import pytest
import random
from datatable import dt, f, gcd
from tests import assert_equals  # checks equality of Frames
from math import gcd as math_gcd


def test_equal_columns():
    DT = dt.Frame(A=[1, 2, 3, 4, 5])
    RES = DT[:, gcd(f.A, f.A)]
    assert_equals(RES, dt.Frame([1, 2, 3, 4, 5]/dt.int32))


@pytest.mark.parametrize("seed", [random.getrandbits(63)])
def test_random(seed):
    random.seed(seed)
    n = 100
    src1 = [random.randint(1, 1000) for i in range(n)]
    src2 = [random.randint(1, 100) for i in range(n)]
    DT = dt.Frame(A=src1, B=src2)
    RES = DT[:, gcd(f.A, f.B)]
    assert_equals(RES, dt.Frame([math_gcd(src1[i], src2[i])
                                 for i in range(n)]))

When writing tests try to test any corner cases that you can think of. For example, what if one of the numbers is 0? Negative? Add tests for various column types, including invalid ones.
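For instance, a test for zeros could look like this (a sketch, assuming the desired behavior is gcd(0, x) == x and gcd(0, 0) == 0, consistent with math.gcd):

def test_gcd_with_zeros():
    # gcd(0, x) is expected to equal x, and gcd(0, 0) to be 0.
    DT = dt.Frame(A=[0, 0, 12], B=[0, 5, 0])
    RES = DT[:, gcd(f.A, f.B)]
    assert_equals(RES, dt.Frame([0, 5, 12]/dt.int32))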

Documentation

The final piece of the puzzle is the documentation. We’ve already created the variable doc_gcd earlier, which ensures that the documentation is visible from python when you run help(gcd). However, the primary place where people look for documentation is the dedicated readthedocs website, and this is where we will be adding the actual content.

So, create file docs/api/dt/gcd.rst. The content of the file could be something like this:

.. xfunction:: datatable.gcd
    :src: src/core/fexpr/fexpr_gcd.cc py_gcd
    :tests: tests/expr/test-gcd.py
    :cvar: doc_gcd
    :signature: gcd(a, b)

    Compute the greatest common divisor of `a` and `b`.

    Parameters
    ----------
    a, b: FExpr
        Only integer columns are supported.

    return: FExpr
        The returned column will have stype int64 if either `a` or `b` are
        of type int64, or otherwise it will be int32.

In these lines we declare:

  • the name of the function which provides the gcd functionality (this is presented to the user as the “src” link in the generated docs);

  • the name of the file dedicated to testing this functionality; this will also become a link in the generated documentation;

  • the name of the C variable declared in “documentation.h” which should be given a copy of the documentation, so that it can be embedded into python;

  • the main signature of the function: its name and parameters (with defaults if necessary).

This RST file now needs to be added to the toctree: open the file docs/api/index-api.rst and add it into the .. toctree:: list at the bottom, and also add it to the table of all functions.
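The toctree entry can use the “Explicit name <document>” form described earlier; a sketch (the surrounding entries will differ):

.. toctree::
    :hidden:

    gcd()    <dt/gcd>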

Lastly, open docs/releases/v{LATEST}.rst (this is our changelog) and write a brief paragraph about the new function:

Frame
-----
...
-[new] Added new function :func:`gcd()` to compute the greatest common
  divisor of two columns. [#NNNN]

The [#NNNN] is a link to the GitHub issue where the gcd() function was requested.

Submodules

Some functions are declared within submodules of the datatable module. For example, math-related functions can be found in dt.math, string functions in dt.str, etc. Declaring such functions is not much different from what is described above. For example, if we wanted our gcd() function to be in the dt.math submodule, we’d make the following changes:

  • Create file expr/math/fexpr_gcd.cc instead of expr/fexpr_gcd.cc;

  • Instead of importing the function in src/datatable/__init__.py, we’d import it in src/datatable/math.py (see the sketch after this list);

  • The test file name can be tests/math/test-gcd.py instead of tests/expr/test-gcd.py;

  • The doc file name can be docs/api/math/gcd.rst instead of docs/api/dt/gcd.rst, and it should be added to the toctree in docs/api/math.rst.
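For example, the import in src/datatable/math.py would mirror the one shown earlier for __init__.py (a sketch):

from .lib._datatable import (
    ...
    gcd,
    ...
)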

Test page

This is a test page; it has no useful content. Its sole purpose is to collect various elements of markup to ensure that they render properly in the current theme. We recommend that developers visually check this page after any tweak to the accompanying CSS files.

If you notice any visual irregularities somewhere else within the documentation, please add those examples to this file as a kind of “manual test”.

Inline markup

  • Bold text is not actually courageous, it merely looks thicker.

  • Partially bold text;

  • Italic text is still English, except the letters are slanted.

  • Literal text is no more literal than any other text, but uses the monospace font.

  • ABCs, or ABC’s?

  • Ctrl+Alt+Del is a keyboard shortcut (:kbd: role)

  • subscript text can be used if you need to go low (:sub:)

  • superscript text but if they go low, we go high! (:sup:)

  • label may come in handy too (:guilabel:)

The smartquotes Sphinx plugin is responsible for converting “straight” quotes ("") into “typographic” quotes (“”). Similarly for the ‘single’ quotes ('' into ‘’). Don’t forget about single quotes in the middle of a word, or at the end’ of a word. Lastly, double-dash (--) should be rendered as an n-dash – like this, and triple-dash (---) as an m-dash — like this.

Hyperlinks may come in a variety of different flavors. In particular, links leading outside of this website must have a clear indicator that they are external. The internal links should not have such an indicator.

Headers

This section is dedicated to headers of various levels. Note that there can be only one level-1 header (at the top of the page). All other headers are therefore level-2 or smaller. At the same time, headers below level 4 are not supported.

Sub-header A

Paragraph within a subheader. Please check that the spacing between the headers and the text looks reasonable.

Sub-sub header A.1

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

Sub-sub header A.2

Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.

Sub-header B

Nothing to see here, move along citizen.

Lists

Embedding lists into text may be somewhat tricky, for example here’s the list that contains a short enumeration of items. It is supposed to be rendered in a “compact” style (.simple in CSS):

  • one

  • two

  • three

Same list, but ordered:

  1. one

  2. two

  3. three

Finally, a more complicated list that is still considered “simple” by docutils (see SimpleListChecker: a list is simple if every list item contains either a single paragraph, or a paragraph followed by a simple list). Here we exhibit four variants of the same list, altering ordered/unordered property:

  • Lorem ipsum dolor sit

    • amet

    • consectetur

    • adipiscing

    • elit,

  • sed do eiusmod

  • tempor

  • incididnut ut labore

    • et dolore

    • magna aliqua.

  • Ut enim ad minim

  1. Lorem ipsum dolor sit

    • amet

    • consectetur

    • adipiscing

    • elit,

  2. sed do eiusmod

  3. tempor

  4. incididnut ut labore

    • et dolore

    • magna aliqua.

  5. Ut enim ad minim

  • Lorem ipsum dolor sit

    1. amet

    2. consectetur

    3. adipiscing

    4. elit,

  • sed do eiusmod

  • tempor

  • incididnut ut labore

    1. et dolore

    2. magna aliqua.

  • Ut enim ad minim

  1. Lorem ipsum dolor sit

    1. amet

    2. consectetur

    3. adipiscing

    4. elit,

  2. sed do eiusmod

  3. tempor

  4. incididnut ut labore

    1. et dolore

    2. magna aliqua.

  5. Ut enim ad minim

Compare this to the following list, which is supposed to be rendered with more spacing between the elements:

  • Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

  • Vitae purus faucibus ornare suspendisse sed. Sit amet mauris commodo quis imperdiet. Id velit ut tortor pretium viverra suspendisse potenti nullam ac. Enim eu turpis egestas pretium aenean.

    Neque laoreet suspendisse interdum consectetur libero. Tellus elementum sagittis vitae et leo duis ut. Vel pretium lectus quam id leo. Eget nunc scelerisque viverra mauris in. Integer enim neque volutpat ac tincidunt vitae semper quis lectus. Urna molestie at elementum eu facilisis sed.

  • Molestie at elementum eu facilisis sed. Nisi vitae suscipit tellus mauris a diam maecenas sed enim. Morbi tincidunt ornare massa eget egestas. Condimentum lacinia quis vel eros. Viverra accumsan in nisl nisi scelerisque. Lorem sed risus ultricies tristique. Phasellus egestas tellus rutrum tellus pellentesque eu tincidunt tortor aliquam. Semper feugiat nibh sed pulvinar. Quis hendrerit dolor magna eget est lorem ipsum dolor.

    Amet commodo nulla facilisi nullam vehicula ipsum a arcu cursus. Pellentesque elit eget gravida cum sociis natoque. Sit amet risus nullam eget felis eget nunc lobortis mattis.

    Tellus rutrum tellus pellentesque eu tincidunt tortor. Eget arcu dictum varius duis. Eleifend mi in nulla posuere sollicitudin aliquam ultrices sagittis orci.

  • Ut ornare lectus sit amet est placerat in. Leo urna molestie at elementum. At auctor urna nunc id. Risus at ultrices mi tempus imperdiet nulla malesuada.

The next section demonstrates how different kinds of lists nest within each other.

Bill of Rights

The Conventions of a number of the States having at the time of their adopting the Constitution, expressed a desire, in order to prevent misconstruction or abuse of its powers, that further declaratory and restrictive clauses should be added: And as extending the ground of public confidence in the Government, will best insure the beneficent ends of its institution

Resolved by the Senate and House of Representatives of the United States of America, in Congress assembled, two thirds of both Houses concurring, that the following Articles be proposed to the Legislatures of the several States, as Amendments to the Constitution of the United States, all or any of which Articles, when ratified by three fourths of the said Legislatures, to be valid to all intents and purposes, as part of the said Constitution; viz.:

Articles in addition to, and Amendment of the Constitution of the United States of America, proposed by Congress, and ratified by the Legislatures of the several States, pursuant to the fifth Article of the original Constitution.

  1. Congress shall make no law respecting

    • an establishment of religion, or prohibiting the free exercise thereof; or

    • abridging the freedom of speech, or of the press; or

    • the right of the people peaceably to assemble, and to petition the Government for a redress of grievances.

  2. A well regulated Militia, being necessary to the security of a free State, the right of the people to keep and bear Arms, shall not be infringed.

  3. No Soldier shall, in time of peace be quartered in any house, without the consent of the Owner, nor in time of war, but in a manner to be prescribed by law.

  4. The right of the people to be secure in their persons, houses, papers, and effects, against unreasonable searches and seizures, shall not be violated, and no Warrants shall issue, but upon probable cause, supported by Oath or affirmation, and particularly describing the place to be searched, and the persons or things to be seized.

  5. No person shall be

    • held to answer for a capital, or otherwise infamous crime, unless on a presentment or indictment of a Grand Jury, except in cases arising in the land or naval forces, or in the Militia, when in actual service in time of War or public danger; nor shall any person be

    • subject for the same offence to be twice put in jeopardy of life or limb; nor shall be

    • compelled in any criminal case to be a witness against himself, nor be

    • deprived of

      • life,

      • liberty, or

      • property,

      without due process of law;

    • nor shall private property be taken for public use, without just compensation.

  6. In all criminal prosecutions, the accused shall enjoy the right to a speedy and public trial, by an impartial jury of the State and district wherein the crime shall have been committed, which district shall have been previously ascertained by law, and to be informed of the nature and cause of the accusation; to be confronted with the witnesses against him; to have compulsory process for obtaining witnesses in his favor, and to have the Assistance of Counsel for his defence.

  7. In Suits at common law, where the value in controversy shall exceed twenty dollars, the right of trial by jury shall be preserved, and no fact tried by a jury, shall be otherwise re-examined in any Court of the United States, than according to the rules of the common law.

  8. Excessive bail shall not be required, nor excessive fines imposed, nor cruel and unusual punishments inflicted.

  9. The enumeration in the Constitution, of certain rights, shall not be construed to deny or disparage others retained by the people.

  10. The powers not delegated to the United States by the Constitution, nor prohibited by it to the States, are reserved to the States respectively, or to the people.

Code samples

Literal block after a paragraph. The spacing between this text and the code block below should be small, similar to regular spacing between lines:

import datatable as dt
DT = dt.Frame(A = [3, 1, 4, 1, 5])
DT.shape
(5, 1)
repr(DT)
'<Frame#7fe06e063ca8 5x1>'
# This is how a simple frame would be rendered:
DT

   A
0  3
1  1
2  4
3  1
4  5

DT + DT
TypeError: unsupported operand type(s) for +: 'datatable.Frame' and 'datatable.Frame'

This is a paragraph after the code block. The spacing should be roughly the same as between regular paragraphs.

And here’s an example with a keyed frame:

DT = dt.Frame({"A": [1, 2, 3, 4, 5], "B": [4, 5, 6, 7, 8], "C": [7, 8, 9, 10, 11], "D": [5, 7, 2, 9, -1], "E": ['a','b','c','d','e']}) DT.key = ['E', 'D'] DT
EDABC
str32int32int32int32int32
a5147
b7258
c2369
d94710
e-15811

The following is a test for multi-line output from code samples:

for i in range(5):
    print(1/(4 - i))
0.25
0.3333333333333333
0.5
1.0
ZeroDivisionError: division by zero

The following is a plain piece of python code (i.e. without input/output sections):

#!/usr/bin/python
import everything as nothing

class MyClass(object):
    r"""
    Just some sample code
    """

    def __init__(self, param1, param2):
        assert isinstance(param1, str)
        self.x = param1.lower() + "!!!"
        self.y = param2

    @classmethod
    def enjoy(cls, item):
        print(str(cls) + " likes " + item)


if __name__ == "__main__":
    data = [MyClass('abc', 2)]
    data += [1, 123, -14, +297, 2_300_000]
    data += [True, False, None]
    data += [0x123, 0o123, 0b111]
    data += [2.71, 1.23e+45, -1.0001e-11, -math.inf]
    data += ['abc', "def", """ghijk""", b"lmnop"]
    data += [f"This is an f-string {len(data)}.\n"]
    data += [r"\w+\n?\x0280\\ [abc123]+$",
             "\w+\n?\x0280\\ [abc123]+$",]
    data += [..., Ellipsis]
    # This cannot happen:
    if data and not data:
        assert AssertionError

Languages other than python are also supported. For example, the following is a shell code sample (console “language”):

$ # list all files in a directory
$ ls -l
total 48
-rw-r--r--   1 pasha  staff  804B Dec 10 09:14 Makefile
drwxr-xr-x  24 pasha  staff  768B Dec 10 09:14 api/
-rw-r--r--   1 pasha  staff  7.1K Dec 11 14:10 conf.py
drwxr-xr-x   7 pasha  staff  224B Dec 10 09:14 develop/
-rw-r--r--   1 pasha  staff   62B Jul 29 14:02 docutils.conf
-rw-r--r--   1 pasha  staff  4.2K Dec 10 09:14 index.rst
drwxr-xr-x  12 pasha  staff  384B Dec 10 09:14 manual/
drwxr-xr-x  19 pasha  staff  608B Dec 10 13:19 releases/
drwxr-xr-x   6 pasha  staff  192B Dec 10 13:19 start/
$ # Here are some more advanced syntax elements:
$ echo "PYTHONHOME = $PYTHONHOME"
$ export TMP=/tmp
$ docker run -it --init -v `pwd`:/cwd ${DOCKER_CONTAINER}

RST code sample:

Heading
+++++++

Da-da-da-daaaaa, **BOOM**. Da-da-da-da-da-dA-da-da-da-da-da-da-da-da,
DA-DA-DA-DA, DA-DA DA Da, BOOM, DAA-DA-DA-*DAAAAAAAAAAAA*, ty-dum

- item 1 (:func:`foo() <dt.foo>`);
- item 2;
- ``item 3`` is a :ref:`reference`.

.. custom-directive-here:: not ok
    :option1: value1
    :option2: value2

    Here be dragons

.. _`plain target`:

There could be other |markup| elements too, such as `links`_, and various
inline roles, eg :sup:`what's up!`. Plain URLs probably won't be
highlighted: http://datatable.com/.

.. just a comment |here|
.. yep:: still a comment

.. _`example website`: https://example.com/

C++ code sample:

#include <cstring>

int main() {
    return 0;
}

SQL code sample:

SELECT * FROM students WHERE name='Robert'; DROP TABLE students; --'

Special care must be taken in case the code in the samples has very long line lengths. Generally, the code should not overflow its container block, or make the page wider than normal. Instead, the code block should get a horizontal scroll bar:

days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday", "Extra Sunday", "Fourth Weekend"] print(days * 2)
['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday', 'Extra Sunday', 'Fourth Weekend', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday', 'Extra Sunday', 'Fourth Weekend']

Same, but for a non-python code:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Admonitions

First, the .. note:: block, which should display in a prominent box:

Note

Please pay attention!

There’s usually some other content coming after the note. It should be properly spaced from the admonition.

Note

Here’s a note with more content. Why does it have so much content? Nobody knows. In theory admonitions should be short and to the point, but this one is not playing by the rules. It just goes on and on and on and on, and it seems like it would never end. Even as you think that maybe at last the end is near, a new paragraph suddenly appears:

And that new paragraph just repeats the same nonsense all over again. Really, there is no good reason for it to keep going, but it does nevertheless, as if trying to stretch the limits of how many words can be spilled without any of them making any sense.

Note

  • First, this is a note with a list

  • Second, it may contain several list items

  • Third is here just to make things look more impressive

  • Fourth is the last one (or is it?)

X-versions

Added in version 0.8.0

The .. x-version-added:: directive usually comes as the first item after a header, so it has reduced margins at the top, and has to have adequate margins at the bottom to compensate.

Deprecated since version 0.10.0

The .. x-version-changed:: directive is paragraph-level, and it usually has some additional content describing what exactly has changed.

Changed in version 0.9.0

Nobody knows what exactly changed, but most observers agree that something did.

While we’re trying to figure out what it was, please visit the release notes (linked above) and see if it makes sense to you.

Release History

Version 0.2.1

Release date: 2017-09-11

General

  • Created the CHANGELOG file.

  • sys.getsizeof(DT) can now be used to query the size of the datatable in memory.

  • Added a framework for computing and storing per-column summary statistics.

  • Implemented statistics min, max, mean, stdev, countna for numeric and boolean columns.

  • Getter df.internal.rowindex allows access to the RowIndex on the DataTable (for inspection/reuse).

  • In addition to the LLVM4 environment variable, datatable will now also look for the llvm4 folder within the package’s directory.

  • If d0 is a DataTable, then d1 = DataTable(d0) will create its shallow copy.

  • The environment variable DTNOOPENMP will cause datatable to be built without OpenMP support.

  • A filter function applied to a view DataTable now produces the correct result.

Contributors

This page lists all people who have contributed to the development of datatable. We take into account both code and documentation contributions, as well as contributions in the form of bug reports and feature requests.

More specifically, a code contribution is considered any PR (pull request) that was merged into the codebase. The “complexity” of the PR is not taken into account as it is highly subjective. Next, an issue contribution is any closed issue except for those that are tagged as “question”, “wont-fix” or “cannot-reproduce”. Issues are attributed according to their closing date, not their creation date.

In the table, the contributors are sorted according to their total contribution score, which is the weighted sum of the counts of each user’s code and issue contributions. Code contributions have more weight than issue contributions, and more recent contributions carry more weight than older ones.

(Table columns: 1.0, 0.11, 0.10, 0.9, 0.8, 0.7, past)
Pasha Stetsenko
Oleksiy Kononenko
Samuel Oranyeli
Michal Malohlava
Nishant Kalonia
Pradeep Krishnamurthy
Michal Raška
Arno Candel
Anmol Bal
Jan Gorecki
Jonathan McKinney
Siddhesh Poyarekar
Liu Chi
Viktor Demin
Hannah Tillman
Mallesham Yamulla
Bryce Boe
Juliano Faccioni
Wes Morgan
Angela Bartz
Achraf Merzouki
Bijan Pourhamzeh
Michael Frasco
Junghoo Cho
Corey Levinson
Tom Kraljevic
Jan Gamec
Suman Khanal
Navdeep Gill
Ben Gorman
Stephen Boesch
Patrick Rice
Jose Luis Avilez
Yu Zhu
NachiGithub
Hawk Berry
Olivier Grellier
Brannon King
Christopher Eeles
Darel13712
RaffaeleMorganti
coolyaolei
Patrick Shechet
reach4bawer
tbraun84
wjensheng
Ying Zhang
Michael Moroz
Ashrith Barthur
Chrinide
Lucas Jamar
Suren Mohanathas
Toby Dylan Hocking
Timothy Salazar
Megan Kurka
Andy Troiano
Martin Dvorak
Igor Šušić
Koray AL
Todd
Nick Kim
Zmnako Awrahman
Sinan
Andres Torrubia
Matt Dancho
Mateusz Dymczyk
sentieonycdev
Mathias Müller
Joseph Granados
Qiang Kou (KK)
Achille M.
Govind Mohan
Hemen Kapadia
Leland Wilkinson
Sri Ambati
Mark Chan

Developer’s note: This table is auto-generated based on the contributor lists in each of the version files, specified via the .. contributors:: directive. In turn, the list of contributors for each version has to be generated via the script ci/gh.py at the time of each release. The issues/PRs are filtered according to their milestone; thus, issues/PRs that are not tagged with any milestone will not be taken into account.