https://img.shields.io/pypi/v/datatable.svg https://travis-ci.org/h2oai/datatable.svg?branch=master
Python library for efficient multi-threaded data processing, with support for out-of-memory datasets.

Introduction

Data is everywhere. From the smallest photon interactions to galaxy collisions, from mouse movements on a screen to economic developments of countries, we are surrounded by the sea of information. The human mind cannot comprehend this data in all its complexity; since ancient times people found it much easier to reduce the dimensionality, to impose a strict order, to arrange the data points neatly on a rectangular grid: to make a data table.

But once the data has been collected into a table, it has been tamed. It may still need some grooming and exercise, but it is no longer scary. Even if it is really Big Data, with the right tools you can approach it, play with it, bend it to your will, master it.

The Python datatable module is the right tool for the task. It is a library that implements a wide (and growing) range of operators for manipulating two-dimensional data frames. It focuses on big data support, high performance, in-memory and out-of-memory datasets, and multi-threaded algorithms. In addition, datatable strives for a good user experience, helpful error messages, and a powerful API similar to that of R data.table.

Getting Started

Using datatable

This section describes common functionality and commands that you can run in datatable.

Create Frame

You can create a Frame from a variety of sources, including numpy arrays, pandas DataFrames, raw Python objects, etc:

import datatable as dt
import numpy as np
np.random.seed(1)
dt.Frame(np.random.randn(1000000))
              C0
      0    1.62435
      1   −0.611756
      2   −0.528172
      3   −1.07297
      4    0.865408
      5   −2.30154
      6    1.74481
      7   −0.761207
      8    0.319039
      9   −0.24937
       ⋮
999,995    0.0595784
999,996    0.140349
999,997   −0.596161
999,998    1.18604
999,999    0.313398
import pandas as pd
pf = pd.DataFrame({"A": range(1000)})
dt.Frame(pf)
       A
  0    0
  1    1
  2    2
  3    3
  4    4
  5    5
  6    6
  7    7
  8    8
  9    9
   ⋮
995  995
996  996
997  997
998  998
999  999
dt.Frame({"n": [1, 3], "s": ["foo", "bar"]})
    n  s
0   1  foo
1   3  bar

Convert a Frame

Convert an existing Frame into a numpy array, a pandas DataFrame, or a pure Python object:

nparr = DT.to_numpy()
pddfr = DT.to_pandas()
pyobj = DT.to_list()

Parse Text (csv) Files

datatable provides fast and convenient parsing of text (csv) files:

DT = dt.fread("train.csv")

The datatable parser

  • Automatically detects separators, headers, column types, quoting rules, etc.

  • Reads from file, URL, shell, raw text, archives, glob

  • Provides multi-threaded file reading for maximum speed

  • Includes a progress indicator when reading large files

  • Reads both RFC4180-compliant and non-compliant files

Write the Frame

Write the Frame’s content into a csv file (also multi-threaded):

DT.to_csv("out.csv")

Save a Frame

Save a Frame into a binary format on disk, then open it later instantly, regardless of the data size:

DT.to_jay("out.jay")
DT2 = dt.open("out.jay")

Basic Frame Properties

Basic Frame properties include:

print(DT.shape)   # (nrows, ncols)
print(DT.names)   # column names
print(DT.stypes)  # column types

Compute Per-Column Summary Stats

Compute per-column summary stats using:

DT.sum()
DT.max()
DT.min()
DT.mean()
DT.sd()
DT.mode()
DT.nmodal()
DT.nunique()

Select Subsets of Rows/Columns

Select subsets of rows and/or columns using:

DT[:, "A"]         # select 1 column
DT[:10, :]         # first 10 rows
DT[::-1, "A":"D"]  # reverse rows order, columns from A to D
DT[27, 3]          # single element in row 27, column 3 (0-based)

Delete Rows/Columns

Delete rows and/or columns using:

del DT[:, "D"]     # delete column D
del DT[f.A < 0, :] # delete rows where column A has negative values

Filter Rows

Filter rows via an expression, as in the following example, where mean, sd, and f are all symbols imported from datatable:

DT[(f.x > mean(f.y) + 2.5 * sd(f.y)) | (f.x < -mean(f.y) - sd(f.y)), :]

Compute Columnar Expressions

Compute columnar expressions using:

DT[:, {"x": f.x, "y": f.y, "x+y": f.x + f.y, "x-y": f.x - f.y}]

Sort Columns

Sort columns using:

DT.sort("A")
DT[:, :, sort(f.A)]

Perform Groupby Calculations

Perform groupby calculations using:

DT[:, mean(f.x), by("y")]

Append Rows/Columns

Append columns or rows to a Frame using Frame.cbind() and Frame.rbind():

DT1.cbind(DT2, DT3)
DT1.rbind(DT4, force=True)

User Guide

f-expressions

The datatable module exports a special symbol f, which can be used to refer to the columns of a frame currently being operated on. If this sounds cryptic, consider that the most common way to operate on a frame is via the square-bracket call DT[i, j, by, ...]. And it is often the case that within this expression you would want to refer to individual columns of the frame: either to create a filter, or a transform, or specify a grouping variable, etc. In all such cases the f symbol is used, and it is considered to be evaluated within the context of the frame DT.

For example, consider the expression:

f.price

By itself, it just means a column named “price”, in an unspecified frame. This expression becomes concrete, however, when used with a particular frame. For example:

train_dt[f.price > 0, :]

selects all rows in train_dt where the price is positive. Thus, within the call to train_dt[...], the symbol f refers to the frame train_dt.

The standalone f-expression may occasionally be useful too: it can be saved in a variable and then re-applied to several different frames. Each time f will refer to the frame to which it is being applied:

price_filter = (f.price > 0)
train_filtered = train_dt[price_filter, :]
test_filtered = test_dt[price_filter, :]

The simple expression f.price can be saved in a variable too. In fact, there is a Frame helper method .export_names() which does exactly that: it returns a tuple of variables, one for each column in the frame, allowing you to omit the f. prefix:

Id, Price, Quantity = DT.export_names()
DT[:, [Id, Price, Quantity, Price * Quantity]]

Single-column selector

As you have seen, the expression f.NAME refers to a column called “NAME”. This notation is handy, but not universal. What do you do if the column’s name contains spaces or Unicode characters? Or if the column’s name is not known, only its index? Or if the name is stored in a variable? For these purposes f supports the square-bracket selectors:

f[-1]           # select the last column
f["Price ($)"]  # select the column named "Price ($)"

Generally, f[i] means either the column at index i if i is an integer, or the column with name i if i is a string.

Using an integer index follows the standard Python rule for list subscripts: negative indices are interpreted as counting from the end of the frame, and requesting a column with an index outside of [-ncols; ncols) will raise an error.

This square-bracket form is also useful when you want to access a column dynamically, i.e. if its name is not known in advance. For example, suppose there is a frame with columns "2017_01", "2017_02", …, "2019_12". Then all these columns can be addressed as:

[f["%d_%02d" % (year, month)]
 for month in range(1, 13)
 for year in [2017, 2018, 2019]]

Multi-column selector

In the previous section you saw that f[i] refers to a single column when i is either an integer or a string. However, i may also be a slice or a type:

f[:]          # select all columns
f[::-1]       # select all columns in reverse order
f[:5]         # select the first 5 columns
f[3:4]        # select the fourth column
f["B":"H"]    # select columns from B to H, inclusive
f[int]        # select all integer columns
f[float]      # select all floating-point columns
f[dt.str32]   # select all columns with stype `str32`
f[None]       # select no columns (empty columnset)

In all these cases a columnset is returned. This columnset may contain a variable number of columns or even no columns at all, depending on the frame to which this f-expression is applied.

Applying a slice to the symbol f follows the same semantics as if f were a list of columns. Thus f[:10] means the first 10 columns of a frame, or all of its columns if the frame has fewer than 10. Similarly, f[9:10] selects the 10th column of a frame if it exists, or nothing if the frame has fewer than 10 columns. Compare this to the selector f[9], which also selects the 10th column of a frame if it exists, but throws an exception if it doesn’t.

Besides the usual numeric ranges, you can also use name ranges. These ranges include the first named column, the last named column, and all columns in between. It is not possible to mix positional and named columns in a range, and it is not possible to specify a step. If the range is x:y, yet column x comes after y in the frame, then the columns will be selected in the reverse order: first x, then the column preceding x, and so on, until column y is selected last:

f["C1":"C9"]   # Select columns from C1 up to C9
f["C9":"C1"]   # Select columns C9, C8, C7, ..., C2, C1
f[:"C3"]       # Select all columns up to C3
f["C5":]       # Select all columns after C5

Finally, you can select all columns of a particular type by using that type as an f-selector. You can pass either one of the common Python types bool, int, float, str; or an stype such as dt.int32; or an ltype such as dt.ltype.obj. You can also pass None to select no columns. By itself this may not be very useful, but occasionally you may need it as a fallback in conditional expressions:

f[int if select_types == "integer" else
  float if select_types == "floating" else
  None]  # otherwise don't select any columns

A columnset can be used in situations where a sequence of columns is expected, such as:

  • the j node of DT[i,j,...];

  • within by() and sort() functions;

  • with certain functions that operate on sequences of columns: rowsum(), rowmean(), rowmin(), etc.;

  • many other functions that normally operate on a single column will automatically map over all columns in a columnset:

    sum(f[:])       # equivalent to [sum(f[i]) for i in range(DT.ncols)]
    f[:3] + f[-3:]  # same as [f[0]+f[-3], f[1]+f[-2], f[2]+f[-1]]
    

New in version 0.10.0.

Modifying a columnset

Columnsets support operations that either add or remove elements from the set. This is done using methods .extend() and .remove().

The .extend() method takes a columnset as an argument (or a list, dict, or other sequence of columns) and produces a new columnset containing both the original and the new columns. The columns need not be unique: the same column may appear multiple times in a columnset. This method also allows you to add transformed columns into the columnset:

f[int].extend(f[float])          # integer and floating-point columns
f[:3].extend(f[-3:])             # the first and the last 3 columns
f.A.extend(f.B)                  # columns "A" and "B"
f[str].extend(dt.str32(f[int]))  # string columns, and also all integer
                                 # columns converted to strings
# All columns, and then one additional column named 'cost', which contains
# column `price` multiplied by `quantity`:
f[:].extend({"cost": f.price * f.quantity})

When a columnset is extended, the order of the elements is preserved. Thus, a columnset is closer in functionality to a python list than to a set. In addition, some of the elements in a columnset can have names, if the columnset is created from a dictionary. The names may be non-unique too.

The .remove() method is the opposite of .extend(): it takes an existing columnset and then removes all columns that are passed as the argument:

f[:].remove(f[str])    # all columns except columns of type string
f[:10].remove(f.A)     # the first 10 columns without column "A"
f[:].remove(f[3:-3])   # same as `f[:3].extend(f[-3:])`, at least in the
                       # context of a frame with 6+ columns

Removing a column that is not in the columnset is not considered an error, similar to how set-difference operates. Thus, f[:].remove(f.A) may be safely applied to a frame that doesn’t have column “A”: the columns that cannot be removed are simply ignored.

If a columnset includes some column several times, and you then request to remove that column, only the first occurrence in the sequence will be removed. Generally, the multiplicity of some column “A” in the columnset cs1.remove(cs2) will be equal to the multiplicity of “A” in cs1 minus the multiplicity of “A” in cs2, or 0 if that difference would be negative. Thus:

f[:].extend(f[int]).remove(f[int])

will have the effect of moving all integer columns to the end of the columnset (since .remove() removes the first occurrence of a column it finds).

It is not possible to remove a transformed column from a columnset. An error will be thrown if the argument of .remove() contains any transformed columns.

New in version 0.10.0.

Fread Examples

This function is capable of reading data from a variety of input formats (text files, plain text, files embedded in archives, excel files, …), producing a Frame as the result. You can even read in data from the command line.

See fread() for all the available parameters.

Note: If you wish to read in multiple files, use iread(); it returns an iterator of Frames.

Read Data

  • Read from text file:

    from datatable import dt, fread
    
    result = fread('iris.csv')
    result.head(5)
    
        sepal_length    sepal_width     petal_length    petal_width     species
    0   5.1                 3.5             1.4             0.2         setosa
    1   4.9                 3               1.4             0.2         setosa
    2   4.7                 3.2             1.3             0.2         setosa
    3   4.6                 3.1             1.5             0.2         setosa
    4   5                   3.6             1.4             0.2         setosa
    
  • Read text data directly:

    data = ('col1,col2,col3\n'
            'a,b,1\n'
            'a,b,2\n'
            'c,d,3')
    
    fread(data)
    
    col1  col2  col3
     a     b     1
     a     b     2
     c     d     3
    
  • Read from a url:

    url = "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"
    fread(url)
    
  • Read from an archive:

If there are multiple files, only the first will be read; you can specify the path to the specific file you are interested in:

fread("data.zip/mtcars.csv")

Note: Use iread() if you wish to read in multiple files in an archive; an iterator of Frames is returned.

  • Read from .xls or .xlsx files

    fread("excel.xlsx")
    

For Excel files, you can specify the sheet to be read:

fread("excel.xlsx/Sheet1")

Note:
  • xlrd must be installed to read Excel files.

  • Use iread() if you wish to read in multiple sheets; an iterator of Frames is returned.

  • Read in data from the command line. Simply pass the command line statement to the cmd parameter:

    #https://blog.jpalardy.com/posts/awk-tutorial-part-2/
    #You specify the `cmd` parameter
    #Here we filter data for the year 2015
    fread(cmd = """cat netflix.tsv | awk 'NR==1; /^2015-/'""")
    

The command line can be very handy with large data; you can do some of the preprocessing before reading in the data to datatable.

Detect Thousand Separator

Fread handles thousands separators, assuming the separator is a ,:

data = """Name|Salary|Position
         James|256,000|evangelist
        Ragnar|1,000,000|conqueror
          Loki|250360|trickster"""

fread(data)

    Name    Salary  Position
0   James   256000  evangelist
1   Ragnar  1000000 conqueror
2   Loki    250360  trickster

Specify the Delimiter

You can specify the delimiter via the sep parameter. Note that the separator must be a single-character string; non-ASCII characters are not allowed as the separator, nor are any of the characters in ["'`0-9a-zA-Z]:

data = """
       1:2:3:4
       5:6:7:8
       9:10:11:12
       """

fread(data, sep=":")

    C0      C1      C2      C3
0   1       2       3       4
1   5       6       7       8
2   9       10      11      12

Dealing with Null Values and Blank Rows

You can pass a list of values to be treated as null, via the na_strings parameter:

data = """
       ID|Charges|Payment_Method
       634-VHG|28|Cheque
       365-DQC|33.5|Credit card
       264-PPR|631|--
       845-AJO|42.3|
       789-KPO|56.9|Bank Transfer
       """

fread(data, na_strings=['--', ''])

    ID          Charges  Payment_Method
0   634-VHG     28       Cheque
1   365-DQC     33.5     Credit card
2   264-PPR     631      NA
3   845-AJO     42.3     NA
4   789-KPO     56.9     Bank Transfer

For rows with fewer values than the other rows, you can set fill=True; fread will fill the missing cells with NA:

data = ('a,b,c,d\n'
        '1,2,3,4\n'
        '5,6,7,8\n'
        '9,10,11')

fread(data, fill=True)

    a       b       c       d
0   1       2       3       4
1   5       6       7       8
2   9       10      11      NA

You can skip empty lines:

data = ('a,b,c,d\n'
        '\n'
        '1,2,3,4\n'
        '5,6,7,8\n'
        '\n'
        '9,10,11,12')

fread(data, skip_blank_lines=True)

    a       b       c       d
0   1       2       3       4
1   5       6       7       8
2   9       10      11      12

Dealing with Column Names

If the data has no headers, fread will assign default column names:

data = ('1,2\n'
        '3,4\n')

fread(data)

    C0      C1
0   1       2
1   3       4

You can pass in column names via the columns parameter:

fread(data, columns=['A','B'])

    A       B
0   1       2
1   3       4

You can change column names:

data = ('a,b,c,d\n'
        '1,2,3,4\n'
        '5,6,7,8\n'
        '9,10,11,12')

fread(data, columns=["A","B","C","D"])

    A       B       C       D
0   1       2       3       4
1   5       6       7       8
2   9       10      11      12

You can change some of the column names via a dictionary:

fread(data, columns={"a":"A", "b":"B"})

    A       B       c       d
0   1       2       3       4
1   5       6       7       8
2   9       10      11      12

Fread uses heuristics to determine whether the first row is data or not; occasionally it may guess incorrectly, in which case you can set the header parameter to False:

fread(data, header=False)


    C0      C1      C2      C3
0   a       b       c       d
1   1       2       3       4
2   5       6       7       8
3   9       10      11      12

You can pass a new list of column names as well:

fread(data, header=False, columns=["A","B","C","D"])

    A       B       C       D
0   a       b       c       d
1   1       2       3       4
2   5       6       7       8
3   9       10      11      12

Row Selection

Fread has a skip_to_line parameter, where you can specify what line to read the data from:

data = ('skip this line\n'
        'a,b,c,d\n'
        '1,2,3,4\n'
        '5,6,7,8\n'
        '9,10,11,12')

fread(data, skip_to_line=2)

    a       b       c       d
0   1       2       3       4
1   5       6       7       8
2   9       10      11      12

You can also skip to a line containing a particular string with the skip_to_string parameter, and start reading data from that line. Note that skip_to_string and skip_to_line cannot be combined; you can only use one:

data = ('skip this line\n'
        'a,b,c,d\n'
        'first, second, third, last\n'
        '1,2,3,4\n'
        '5,6,7,8\n'
        '9,10,11,12')

fread(data, skip_to_string='first')


    first   second  third   last
0   1       2       3       4
1   5       6       7       8
2   9       10      11      12

You can set the maximum number of rows to read with the max_nrows parameter:

data = ('a,b,c,d\n'
        '1,2,3,4\n'
        '5,6,7,8\n'
        '9,10,11,12')

fread(data, max_nrows=2)


    a       b       c       d
0   1       2       3       4
1   5       6       7       8

data = ('skip this line\n'
        'a,b,c,d\n'
        '1,2,3,4\n'
        '5,6,7,8\n'
        '9,10,11,12')

fread(data, skip_to_line=2, max_nrows=2)

    a       b       c       d
0   1       2       3       4
1   5       6       7       8

Setting Column Type

You can specify the column data types via the columns parameter:

data = ('a,b,c,d\n'
        '1,2,3,4\n'
        '5,6,7,8\n'
        '9,10,11,12')

#this is useful when you are interested in only a subset of the columns
fread(data, columns={"a":dt.float32, "b":dt.str32})

You can also pass in the data types by position:

fread(data, columns=[dt.int32, dt.str32, dt.float32, dt.float64])

You can also change all the column data types with a single assignment:

fread(data, columns = dt.float32)

You can change the data type for a slice of the columns:

#this changes the data type to float for the first three columns
fread(data, columns={float:slice(3)})

Note that datatable supports only a small number of stypes: int8, int16, int32, int64, float32, float64, str32, str64, obj64, and bool8.

Selecting Columns

There are various ways to select columns in fread:

  • Select with a dictionary:

        data = ('a,b,c,d\n'
                '1,2,3,4\n'
                '5,6,7,8\n'
                '9,10,11,12')
    
        #pass ``Ellipsis : None`` or ``... : None``,
        #to discard any columns that are not needed
        fread(data, columns={"a":"a", ... : None})
    
        a
    0   1
    1   5
    2   9
    

Selecting via a dictionary makes more sense when selecting and renaming columns at the same time.

  • Select columns with a set:

    fread(data, columns={"a","b"})
    
        a       b
    0   1       2
    1   5       6
    2   9       10
    
  • Select range of columns with slice:

    #select the second and third column
    fread(data, columns=slice(1,3))
    
        b       c
    0   2       3
    1   6       7
    2   10      11
    
    #select the first and the third column
    #(start at the first column, step by 2,
    #stop before the fourth column)
    fread(data, columns = slice(None,3,2))
    
        a       c
    0   1       3
    1   5       7
    2   9       11
    
  • Select range of columns with range:

    fread(data, columns = range(1,3))
    
        b       c
    0   2       3
    1   6       7
    2   10      11
    
  • Boolean Selection:

    fread(data, columns=[False, False, True, True])
    
        c       d
    0   3       4
    1   7       8
    2   11      12
    
  • Select with a list comprehension:

    fread(data, columns=lambda cols:[col.name in ("a","c") for col in cols])
    
        a       c
    0   1       3
    1   5       7
    2   9       11
    
  • Exclude columns with None:

    fread(data, columns = ['a',None,None,'d'])
    
        a       d
    0   1       4
    1   5       8
    2   9       12
    
  • Exclude columns with list comprehension:

    fread(data, columns=lambda cols:[col.name not in ("a","c") for col in cols])
    
    
        b       d
    0   2       4
    1   6       8
    2   10      12
    
  • Drop columns by assigning None to the columns via a dictionary:

    data = ("A,B,C,D\n"
            "1,3,5,7\n"
            "2,4,6,8\n")
    
    fread(data, columns={"B":None,"D":None})
    
        A       C
    0   1       5
    1   2       6
    
  • Drop a column and change data type:

    fread(data, columns={"B":None, "C":str})
    
        A       C       D
    0   1       5       7
    1   2       6       8
    
  • Change column name and type, and drop a column:

     #pass a tuple, where the first item in the tuple is the new column name,
     #and the other item is the new data type.
    fread(data, columns={"A":("first", float), "B":None,"D":None})
    
        first   C
    0   1       5
    1   2       6
    

With list comprehensions, you can dynamically select columns:

#select columns that have length, and species column
fread('iris.csv',
  #use a boolean list comprehension to get the required columns
  columns = lambda cols : [(col.name=='species')
                           or ("length" in col.name)
                           for col in cols],
  max_nrows=5)

  sepal_length      petal_length    species
0   5.1                 1.4         setosa
1   4.9                 1.4         setosa
2   4.7                 1.3         setosa
3   4.6                 1.5         setosa
4   5                   1.4         setosa


#select columns by position
fread('iris.csv',
       columns = lambda cols : [ind in (1,4) for ind, col in enumerate(cols)],
       max_nrows=5)

    sepal_length    petal_length    petal_width
0   5.1                     1.4         0.2
1   4.9                     1.4         0.2
2   4.7                     1.3         0.2
3   4.6                     1.5         0.2
4   5                       1.4         0.2

Grouping with by

The by() modifier splits a dataframe into groups, either via the provided column(s) or f-expressions, and then applies i and j within each group. This split-apply-combine strategy allows for a number of operations:

  • Aggregations per group,

  • Transformation of a column or columns, where the shape of the dataframe is maintained,

  • Filtration, where some data are kept and the others discarded, based on a condition or conditions.

Aggregation

The aggregate function is applied in the j section.

  • Group by one column

from datatable import (dt, f, by, ifelse, update, sort,
                      count, min, max, mean, sum, rowsum)


df =  dt.Frame("""Fruit   Date       Name  Number
                  Apples  10/6/2016  Bob     7
                  Apples  10/6/2016  Bob     8
                  Apples  10/6/2016  Mike    9
                  Apples  10/7/2016  Steve  10
                  Apples  10/7/2016  Bob     1
                  Oranges 10/7/2016  Bob     2
                  Oranges 10/6/2016  Tom    15
                  Oranges 10/6/2016  Mike   57
                  Oranges 10/6/2016  Bob    65
                  Oranges 10/7/2016  Tony    1
                  Grapes  10/7/2016  Bob     1
                  Grapes  10/7/2016  Tom    87
                  Grapes  10/7/2016  Bob    22
                  Grapes  10/7/2016  Bob    12
                  Grapes  10/7/2016  Tony   15""")

df[:, sum(f.Number), by('Fruit')]

    Fruit   Number
0   Apples  35
1   Grapes  137
2   Oranges 140
  • Group by multiple columns

df[:, sum(f.Number), by('Fruit', 'Name')]

    Fruit    Name   Number
0   Apples   Bob    16
1   Apples   Mike   9
2   Apples   Steve  10
3   Grapes   Bob    35
4   Grapes   Tom    87
5   Grapes   Tony   15
6   Oranges  Bob    67
7   Oranges  Mike   57
8   Oranges  Tom    15
9   Oranges  Tony   1

  • By column position

df[:, sum(f.Number), by(f[0])]

    Fruit   Number
0   Apples  35
1   Grapes  137
2   Oranges 140
  • By boolean expression

df[:, sum(f.Number), by(f.Fruit == "Apples")]

    C0  Number
0   0   277
1   1   35

  • Combination of column and boolean expression

df[:, sum(f.Number), by(f.Name, f.Fruit == "Apples")]

    Name   C0  Number
0   Bob    0   102
1   Bob    1   16
2   Mike   0   57
3   Mike   1   9
4   Steve  1   10
5   Tom    0   102
6   Tony   0   16

  • The grouping column can be excluded from the final output

df[:, sum(f.Number), by('Fruit', add_columns=False)]

    Number
0   35
1   137
2   140

Note:
  • The resulting dataframe has the grouping column(s) as the first column(s).

  • The grouping columns are excluded from j, unless explicitly included.

  • The grouping columns are sorted in ascending order.

  • Apply multiple aggregate functions to a column in the j section

df[:, {"min": min(f.Number),
       "max": max(f.Number)},
  by('Fruit','Date')]

    Fruit    Date       min  max
0   Apples   10/6/2016  7    9
1   Apples   10/7/2016  1    10
2   Grapes   10/7/2016  1    87
3   Oranges  10/6/2016  15   65
4   Oranges  10/7/2016  1    2

  • Functions can be applied across a columnset

    • Task : Get sum of col3 and col4, grouped by col1 and col2

df = dt.Frame(""" col1   col2   col3   col4   col5
                  a      c      1      2      f
                  a      c      1      2      f
                  a      d      1      2      f
                  b      d      1      2      g
                  b      e      1      2      g
                  b      e      1      2      g""")

df[:, sum(f["col3":"col4"]), by('col1', 'col2')]

    col1  col2  col3  col4
0   a     c     2     4
1   a     d     1     2
2   b     d     1     2
3   b     e     2     4

  • Apply different aggregate functions to different columns

df[:, [max(f.col3), min(f.col4)], by('col1', 'col2')]

    col1  col2  col3  col4
0   a     c     1     2
1   a     d     1     2
2   b     d     1     2
3   b     e     1     2

  • Nested aggregations in j

    • Task : Group by column idx and get the row sum of A and B, C and D

df = dt.Frame(""" idx  A   B   C   D   cat
                   J   1   2   3   1   x
                   K   4   5   6   2   x
                   L   7   8   9   3   y
                   M   1   2   3   4   y
                   N   4   5   6   5   z
                   O   7   8   9   6   z""")

df[:,
    {"AB" : sum(rowsum(f['A':'B'])),
     "CD" : sum(rowsum(f['C':'D']))},
   by('cat')
   ]

    cat  AB  CD
0   x    12  12
1   y    18  19
2   z    24  26

  • Computation between aggregated columns

    • Task : Get the difference between the largest and smallest value within each group

df = dt.Frame("""GROUP VALUE
                  1     5
                  2     2
                  1     10
                  2     20
                  1     7""")

df[:, max(f.VALUE) - min(f.VALUE), by('GROUP')]

    GROUP  C0
0   1      5
1   2      18

  • Null values are not excluded from the grouping column

df = dt.Frame("""  a    b    c
                   1    2.0  3
                   1    NaN  4
                   2    1.0  3
                   1    2.0  2""")

df[:, sum(f[:]), by('b')]

    b   a   c
0   NA  1   4
1   1   2   3
2   2   2   5

If you wish to ignore null values, first filter them out:

df[f.b != None, :][:, sum(f[:]), by('b')]

    b   a   c
0   1   2   3
1   2   2   5

Filtration

This occurs in the i section of the groupby, where only a subset of the data per group is needed; selection is limited to integers or slicing.

Note:
  • i is applied after the grouping, not before.

  • f-expressions in the i section are not yet implemented for groupby.

  • Select the first row per group

df = dt.Frame("""A   B
                 1  10
                 1  20
                 2  30
                 2  40
                 3  10""")

# passing 0 as index gets the first row after the grouping
# note that python's index starts from 0, not 1

df[0, :, by('A')]

    A   B
0   1   10
1   2   30
2   3   10
  • Select the last row per group

df[-1, :, by('A')]

    A   B
0   1   20
1   2   40
2   3   10
  • Select the nth row per group

    • Task : select the second row per group

df[1, :, by('A')]

    A   B
0   1   20
1   2   40

Note:
  • Filtering this way can be used to drop duplicates; you can decide to keep the first or last non-duplicate.

  • Select the latest entry per group

df   =  dt.Frame("""id    product   date
                    220    6647     2014-09-01
                    220    6647     2014-09-03
                    220    6647     2014-10-16
                    826    3380     2014-11-11
                    826    3380     2014-12-09
                    826    3380     2015-05-19
                    901    4555     2014-09-01
                    901    4555     2014-10-05
                    901    4555     2014-11-01""")

df[-1, :, by('id'), sort('date')]

    id  product date
0   220 6647    2014-10-16
1   826 3380    2015-05-19
2   901 4555    2014-11-01
Note:
  • If the sort and by modifiers are both present, the sorting occurs after the grouping, within each group.

  • Replicate SQL’s HAVING clause

    • Task: Filter for groups where the length/count is greater than 1

df = dt.Frame([[1, 1, 5], [2, 3, 6]], names=['A', 'B'])

df
    A   B
0   1   2
1   1   3
2   5   6

# Get the count of each group,
# and assign to a new column, using the update method
# note that the update operation is in-place;
# there is no need to assign back to the dataframe

df[:, update(filter_col = count()), by('A')]

# The new column will be added to the end
# We use an f-expression to return rows
# in each group where the count is greater than 1

df[f.filter_col > 1, f[:-1]]

    A   B
0   1   2
1   1   3
  • Keep only rows per group where diff is the minimum

df = dt.Frame(""" item    diff   otherstuff
                    1       2            1
                    1       1            2
                    1       3            7
                    2      -1            0
                    2       1            3
                    2       4            9
                    2      -6            2
                    3       0            0
                    3       2            9""")

df[:,
   #get boolean for rows where diff column is minimum for each group
   update(filter_col = f.diff == min(f.diff)),
   by('item')]

df[f.filter_col == 1, :-1]

    item   diff   otherstuff
0   1      1      2
1   2     -6      2
2   3      0      0
  • Keep only entries where make has both 0 and 1 in sales

df  =  dt.Frame(""" make    country  other_columns   sale
                    honda    tokyo       data          1
                    honda    hirosima    data          0
                    toyota   tokyo       data          1
                    toyota   hirosima    data          0
                    suzuki   tokyo       data          0
                    suzuki   hirosima    data          0
                    ferrari  tokyo       data          1
                    ferrari  hirosima    data          0
                    nissan   tokyo       data          1
                    nissan   hirosima    data          0""")

df[:,
   update(filter_col = sum(f.sale)),
   by('make')]

df[f.filter_col == 1, :-1]

    make     country   other_columns   sale
0   honda    tokyo     data            1
1   honda    hirosima  data            0
2   toyota   tokyo     data            1
3   toyota   hirosima  data            0
4   ferrari  tokyo     data            1
5   ferrari  hirosima  data            0
6   nissan   tokyo     data            1
7   nissan   hirosima  data            0

Transformation

This is when a function is applied to a column after a groupby and the resulting column is appended back to the dataframe. The number of rows of the dataframe is unchanged. The update() method makes this possible and easy. Let’s look at a couple of examples:

  • Get the minimum and maximum of column c per group, and append to dataframe

df  =  dt.Frame(""" c     y
                    9     0
                    8     0
                    3     1
                    6     2
                    1     3
                    2     3
                    5     3
                    4     4
                    0     4
                    7     4""")

# Assign the new columns via the update method

df[:,
   update(min_col = min(f.c),
          max_col = max(f.c)),
  by('y')]

df
    c   y   min_col   max_col
0   9   0   8         9
1   8   0   8         9
2   3   1   3         3
3   6   2   6         6
4   1   3   1         5
5   2   3   1         5
6   5   3   1         5
7   4   4   0         7
8   0   4   0         7
9   7   4   0         7
  • Fill missing values by group mean

df = dt.Frame({'value' : [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
               'name' : ['A','A', 'B','B','B','B', 'C','C','C']})

df
    value   name
0   1   A
1   NA  A
2   NA  B
3   2   B
4   3   B
5   1   B
6   3   C
7   NA  C
8   3   C

# This uses a combination of update and ifelse methods:

df[:,
   update(value = ifelse(f.value == None,
                         mean(f.value),
                         f.value)),
   by('name')]

df
    value   name
0   1   A
1   1   A
2   2   B
3   2   B
4   3   B
5   1   B
6   3   C
7   3   C
8   3   C
  • Transform and Aggregate on Multiple Columns

    • Task: Get the sum of columns a and b per group, grouped by c and d, add the two sums, and append the result to the dataframe.

df = dt.Frame({'a' : [1,2,3,4,5,6],
               'b' : [1,2,3,4,5,6],
               'c' : ['q', 'q', 'q', 'q', 'w', 'w'],
               'd' : ['z','z','z','o','o','o']})
df

      a   b   c   d
0     1   1   q   z
1     2   2   q   z
2     3   3   q   z
3     4   4   q   o
4     5   5   w   o
5     6   6   w   o


df[:,
   update(e = sum(f.a) + sum(f.b)),
   by('c', 'd')
   ]

df

      a   b   c   d   e
0     1   1   q   z   12
1     2   2   q   z   12
2     3   3   q   z   12
3     4   4   q   o   8
4     5   5   w   o   22
5     6   6   w   o   22
  • Replicate R’s groupby mutate

    • Task: Get a ratio by dividing column c by the group sum of the product of columns c and d, grouped by a and b

df = dt.Frame(dict(a = (1,1,0,1,0),
                   b = (1,0,0,1,0),
                   c = (10,5,1,5,10),
                   d = (3,1,2,1,2))
              )

df
    a   b   c    d
0   1   1   10   3
1   1   0   5    1
2   0   0   1    2
3   1   1   5    1
4   0   0   10   2

df[:,
   update(ratio = f.c / sum(f.c * f.d)),
   by('a', 'b')
   ]

df

      a   b   c   d   ratio
  0   1   1   10  3   0.285714
  1   1   0   5   1   1
  2   0   0   1   2   0.0454545
  3   1   1   5   1   0.142857
  4   0   0   10  2   0.454545

Groupby on Boolean Expressions

  • Conditional Sum with groupby

    • Task: Sum the data1 column, grouped by key1, for rows where key2 == "one"

df = dt.Frame("""data1        data2     key1  key2
                 0.361601    0.375297     a    one
                 0.069889    0.809772     a    two
                 1.468194    0.272929     b    one
                -1.138458    0.865060     b    two
                -0.268210    1.250340     a    one""")


df[:,
   sum(f.data1),
   by(f.key2 == "one", f.key1)][f.C0 == 1, 1:]

    key1  data1
0    a    0.093391
1    b    1.46819
  • Conditional Sums based on various Criteria

df = dt.Frame(""" A_id       B       C
                    a1      "up"     100
                    a2     "down"    102
                    a3      "up"     100
                    a3      "up"     250
                    a4     "left"    100
                    a5     "right"   102""")

df[:,
   {"sum_up": sum(f.B == "up"),
    "sum_down" : sum(f.B == "down"),
    "over_200_up" : sum((f.B == "up") & (f.C > 200))
    },
   by('A_id')]

   A_id sum_up  sum_down  over_200_up
0   a1    1      0          0
1   a2    0      1          0
2   a3    2      0          1
3   a4    0      0          0
4   a5    0      0          0

More Examples

  • Aggregation on Values in a Column

    • Task: group by Day; find the minimum Data_Value for TMIN and the maximum Data_Value for TMAX

df = dt.Frame("""  Day    Element  Data_Value
                  01-01   TMAX    112
                  01-01   TMAX    101
                  01-01   TMIN    60
                  01-01   TMIN    0
                  01-01   TMIN    25
                  01-01   TMAX    113
                  01-01   TMAX    115
                  01-01   TMAX    105
                  01-01   TMAX    111
                  01-01   TMIN    44
                  01-01   TMIN    83
                  01-02   TMAX    70
                  01-02   TMAX    79
                  01-02   TMIN    0
                  01-02   TMIN    60
                  01-02   TMAX    73
                  01-02   TMIN    31
                  01-02   TMIN    26
                  01-02   TMAX    71
                  01-02   TMIN    26""")

df[:,
   {"TMAX": max(ifelse(f.Element == "TMAX",
                       f.Data_Value, None)),

    "TMIN": min(ifelse(f.Element == "TMIN",
                       f.Data_Value, None))},
   by('Day')]

    Day     TMAX  TMIN
0   01-01   115   0
1   01-02   79    0
  • Group By and Conditional Sum and add Back to Data Frame

    • Task: Sum the Count value for each ID, when Num is (17 or 12) and Letter is ‘D’ and also add the calculation back to the original data frame as ‘Total’

df =   dt.Frame(""" ID  Num  Letter  Count
                     1   17   D       1
                     1   12   D       2
                     1   13   D       3
                     2   17   D       4
                     2   12   A       5
                     2   16   D       1
                     3   16   D       1""")

expression = ((f.Num==17) | (f.Num==12)) & (f.Letter == "D")

df[:,
   update(Total = sum(ifelse(expression, f.Count, 0))),
   by('ID')]

df

    ID   Num   Letter   Count   Total
0   1    17    D        1       3
1   1    12    D        2       3
2   1    13    D        3       3
3   2    17    D        4       4
4   2    12    A        5       4
5   2    16    D        1       4
6   3    16    D        1       0
  • Multiple indexing with multiple min and max in one aggregate

    • Task: per group, find col1 where col2 is at its maximum, col2 where col3 is at its minimum, and col1 where col3 is at its maximum

df = dt.Frame({
               "id" : [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
               "col1" : [1, 3, 5, 2, 5, 3, 6, 3, 67, 7],
               "col2" : [4, 6, 8, 3, 65, 3, 5, 4, 4, 7],
               "col3" : [34, 64, 53, 5, 6, 2, 4, 6, 4, 67],
               })

df

    id   col1   col2   col3
0   1    1      4      34
1   1    3      6      64
2   1    5      8      53
3   2    2      3      5
4   2    5      65     6
5   2    3      3      2
6   2    6      5      4
7   3    3      4      6
8   3    67     4      4
9   3    7      7      67

df[:,
   {'col1' : max(ifelse(f.col2 == max(f.col2),
                        f.col1, None)),

    'col2' : max(ifelse(f.col3 == min(f.col3),
                        f.col2, None)),

    'col3' : max(ifelse(f.col3 == max(f.col3),
                        f.col1, None))
    },
   by('id')]

      id  col1    col2    col3
0     1   5   4   3
1     2   5   3   5
2     3   7   4   7
  • Filter row based on aggregate value

    • Task: Find, for every word, the tag that has the highest count

df = dt.Frame("""word  tag count
                  a     S    30
                  the   S    20
                  a     T    60
                  an    T    5
                  the   T    10""")

# The solution builds on the knowledge that sorting
# while grouping sorts within each group.
df[0, :, by('word'), sort(-f.count)]

  word      tag     count
0   a       T       60
1   an      T       5
2   the     S       20
  • Get the rows where the value column is minimum, and rename columns

df = dt.Frame({"category": ["A"]*3 + ["B"]*3,
               "date": ["9/6/2016", "10/6/2016",
                        "11/6/2016", "9/7/2016",
                        "10/7/2016", "11/7/2016"],
               "value": [7,8,9,10,1,2]})

df
    category     date         value
0   A           9/6/2016        7
1   A           10/6/2016       8
2   A           11/6/2016       9
3   B           9/7/2016        10
4   B           10/7/2016       1
5   B           11/7/2016       2

df[0,
   {"value_date": f.date,
    "value_min":  f.value},
  by("category"),
  sort('value')]

  category  value_date   value_min
0   A        9/6/2016        7
1   B        10/7/2016       1
  • Using the same data in the last example, get the rows where the value column is maximum, and rename columns

df[0,
   {"value_date": f.date,
    "value_max":  f.value},
  by("category"),
  sort(-f.value)]

  category  value_date   value_max
0   A       11/6/2016       9
1   B       9/7/2016        10
  • Get the average of the last three instances per group

import random
random.seed(3)

df = dt.Frame({"Student": ["Bob", "Bill",
                           "Bob", "Bob",
                           "Bill","Joe",
                           "Joe", "Bill",
                           "Bob", "Joe",],
               "Score": random.sample(range(10,30), 10)})

df

    Student Score
0   Bob     17
1   Bill    28
2   Bob     27
3   Bob     14
4   Bill    21
5   Joe     24
6   Joe     19
7   Bill    29
8   Bob     20
9   Joe     23

df[-3:, mean(f[:]), by(f.Student)]

  Student   Score
0   Bill    26
1   Bob     20.3333
2   Joe     22
  • Group by on a condition

    • Get the sum of Amount for Number in range (1 to 4) and (5 and above)

df = dt.Frame("""Number, Amount
                    1,     5
                    2,     10
                    3,     11
                    4,     3
                    5,     5
                    6,     8
                    7,     9
                    8,     6""")

df[:, sum(f.Amount), by(ifelse(f.Number>=5, "B","A"))]

    C0      Amount
0   A       29
1   B       28

Row Functions

The functions rowall, rowany, rowcount, rowfirst, rowlast, rowmax, rowmean, rowmin, rowsd and rowsum aggregate across rows instead of columns and return a single column. They are equivalent to Pandas aggregation functions with axis=1.
These functions make it easy to compute rowwise aggregations. For instance, to get the sum of columns A, B, C and D, you could write f.A + f.B + f.C + f.D; rowsum makes it simpler: dt.rowsum(f['A':'D']).

Rowall, Rowany

These work only on Boolean expressions: rowall checks whether all the values in a row are True, while rowany checks whether any value in a row is True. They are similar to Pandas’ all or any with axis=1. A single Boolean column is returned.

from datatable import dt, f, by

df = dt.Frame({'A': [True, True], 'B': [True, False]})
df

    A       B
0   1       1
1   1       0

# rowall :
df[:, dt.rowall(f[:])]

    C0
0   1
1   0

# rowany :
df[:, dt.rowany(f[:])]

    C0
0   1
1   1

The single boolean column that is returned can be very handy when filtering in the i section.

  • Filter for rows where at least one cell is greater than 0

    df = dt.Frame({'a': [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
                   'b': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                   'c': [0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0],
                   'd': [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
                   'e': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                   'f': [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]})
    
    df
    
        a       b       c       d       e       f
    0   0       0       0       0       0       0
    1   0       0       0       0       0       1
    2   0       0       0       0       0       0
    3   0       0       0       0       0       0
    4   0       0       0       0       0       0
    5   0       0       5       0       0       0
    6   1       0       0       0       0       0
    7   0       0       0       0       0       0
    8   0       0       0       1       0       0
    9   1       0       0       0       0       0
    10  0       0       0       0       0       0
    
    df[dt.rowany(f[:] > 0), :]
    
    
        a       b       c       d       e       f
    0   0       0       0       0       0       1
    1   0       0       5       0       0       0
    2   1       0       0       0       0       0
    3   0       0       0       1       0       0
    4   1       0       0       0       0       0
    
  • Filter for rows where all the cells are 0

    df[dt.rowall(f[:] == 0), :]
    
    
        a       b       c       d       e       f
    0   0       0       0       0       0       0
    1   0       0       0       0       0       0
    2   0       0       0       0       0       0
    3   0       0       0       0       0       0
    4   0       0       0       0       0       0
    5   0       0       0       0       0       0
    
  • Filter for rows where all the columns’ values are the same

    df = dt.Frame("""Name  A1   A2  A3  A4
                     deff  0    0   0   0
                     def1  0    1   0   0
                     def2  0    0   0   0
                     def3  1    0   0   0
                     def4  0    0   0   0""")
    
    # compare the first integer column with the rest,
    # use rowall to find rows where all is True
    # and filter with the resulting boolean
    df[dt.rowall(f[1]==f[1:]), :]
    
        Name    A1      A2      A3      A4
    0   deff    0       0       0       0
    1   def2    0       0       0       0
    2   def4    0       0       0       0
    
  • Filter for rows where the values are increasing

    df = dt.Frame({"A": [1, 2, 6, 4],
                   "B": [2, 4, 5, 6],
                   "C": [3, 5, 4, 7],
                   "D": [4, -3, 3, 8],
                   "E": [5, 1, 2, 9]})
    
    df
    
        A       B       C       D       E
    0   1       2       3       4       5
    1   2       4       5       −3      1
    2   6       5       4       3       2
    3   4       6       7       8       9
    
    df[dt.rowall(f[1:] >= f[:-1]), :]
    
        A       B       C       D       E
    0   1       2       3       4       5
    1   4       6       7       8       9
    

Rowfirst, Rowlast

These look for the first and last non-missing value in a row respectively.

df = dt.Frame({'A':[1, None, None, None],
               'B':[None, 3, 4, None],
               'C':[2, None, 5, None]})
df

    A       B       C
0   1       NA      2
1   NA      3       NA
2   NA      4       5
3   NA      NA      NA

# rowfirst :
df[:, dt.rowfirst(f[:])]

    C0
0   1
1   3
2   4
3   NA

# rowlast :
df[:, dt.rowlast(f[:])]

    C0
0   2
1   3
2   5
3   NA
  • Get rows where the last value in the row is greater than the first value in the row

    df = dt.Frame({'a': [50, 40, 30, 20, 10],
                   'b': [60, 10, 40, 0, 5],
                   'c': [40, 30, 20, 30, 40]})
    
    df
    
        a       b       c
    0   50      60      40
    1   40      10      30
    2   30      40      20
    3   20      0       30
    4   10      5       40
    
    df[dt.rowlast(f[:]) > dt.rowfirst(f[:]), :]
    
        a       b       c
    0   20      0       30
    1   10      5       40
    

Rowmax, Rowmin

These get the maximum and minimum values per row, respectively.

df = dt.Frame({"C": [2, 5, 30, 20, 10],
               "D": [10, 8, 20, 20, 1]})

df

    C       D
0   2       10
1   5       8
2   30      20
3   20      20
4   10      1

# rowmax
df[:, dt.rowmax(f[:])]

    C0
0   10
1   8
2   30
3   20
4   10

# rowmin
df[:, dt.rowmin(f[:])]

    C0
0   2
1   5
2   20
3   20
4   1
  • Find the difference between the maximum and minimum of each row

    df = dt.Frame("""Value1  Value2  Value3  Value4
                        5       4      3        2
                        4       3      2        1
                        3       3      5        1""")
    
    df[:, dt.update(max_min = dt.rowmax(f[:]) - dt.rowmin(f[:]))]
    df
    
        Value1  Value2  Value3  Value4  max_min
    0   5       4       3       2       3
    1   4       3       2       1       3
    2   3       3       5       1       4
    

Rowsum, Rowmean, Rowcount, Rowsd

rowsum and rowmean get the sum and mean of the values in a row, respectively; rowcount counts the number of non-missing values in a row, while rowsd computes the standard deviation of the values in a row.

  • Get the count, sum, mean and standard deviation for each row

    df = dt.Frame("""ORD  A   B   C    D
                    198  23  45  NaN  12
                    138  25  NaN NaN  62
                    625  52  36  49   35
                    457  NaN NaN NaN  82
                    626  52  32  39   45""")
    
    df[:, dt.update(rowcount = dt.rowcount(f[:]),
                    rowsum = dt.rowsum(f[:]),
                    rowmean = dt.rowmean(f[:]),
                    rowsd = dt.rowsd(f[:])
                    )]
    
    df
    
        ORD     A       B       C       D       rowcount  rowsum  rowmean  rowsd
    0   198     23      45      NA      12          4       278     69.5   86.7583
    1   138     25      NA      NA      62          3       225     75     57.6108
    2   625     52      36      49      35          5       797     159.4  260.389
    3   457     NA      NA      NA      82          2       539     269.5  265.165
    4   626     52      32      39      45          5       794     158.8  261.277
    
  • Find rows where the number of nulls is greater than 3

    df = dt.Frame({'city': ["city1", "city2", "city3", "city4"],
                   'state': ["state1", "state2", "state3", "state4"],
                   '2005': [144, 205, 123, None],
                   '2006': [173, 211, 123, 124],
                   '2007': [None, None, None, None],
                   '2008': [None, 206, None, None],
                   '2009': [None, None, 124, 123],
                   '2010': [128, 273, None, None]})
    
    df
    
        city    state   2005    2006    2007    2008    2009    2010
    0   city1   state1  144     173     NA      NA      NA      128
    1   city2   state2  205     211     NA      206     NA      273
    2   city3   state3  123     123     NA      NA      124     NA
    3   city4   state4  NA      124     NA      NA      123     NA
    
    # get columns that are null, then sum on the rows
    # and finally filter where the sum is greater than 3
    df[dt.rowsum(dt.isna(f[:])) > 3, :]
    
        city    state   2005    2006    2007    2008    2009    2010
    0   city4   state4  NA      124     NA      NA      123     NA
    
  • Rowwise sum of the float columns

    df = dt.Frame("""ID   W_1       W_2     W_3
                     1    0.1       0.2     0.3
                     1    0.2       0.4     0.5
                     2    0.3       0.3     0.2
                     2    0.1       0.3     0.4
                     2    0.2       0.0     0.5
                     1    0.5       0.3     0.2
                     1    0.4       0.2     0.1""")
    
    df[:, dt.update(sum_floats = dt.rowsum(f[float]))]
    
        ID      W_1     W_2     W_3     sum_floats
    0   1       0.1     0.2     0.3     0.6
    1   1       0.2     0.4     0.5     1.1
    2   2       0.3     0.3     0.2     0.8
    3   2       0.1     0.3     0.4     0.8
    4   2       0.2     0       0.5     0.7
    5   1       0.5     0.3     0.2     1
    6   1       0.4     0.2     0.1     0.7
    

More Examples

  • Divide columns A, B, C, D by the total column, square it and sum rowwise

    df = dt.Frame({'A': [2, 3],
                   'B': [1, 2],
                   'C': [0, 1],
                   'D': [1, 0],
                   'total': [4, 6]})
    df
    
        A       B       C       D       total
    0   2       1       0       1       4
    1   3       2       1       0       6
    
    df[:, update(result = dt.rowsum((f[:-1]/f[-1])**2))]
    df
    
        A       B       C       D       total   result
    0   2       1       0       1       4       0.375
    1   3       2       1       0       6       0.388889
    
  • Get the row sum of the COUNT columns

    df = dt.Frame("""USER OBSERVATION COUNT.1 COUNT.2 COUNT.3
                        A    1           0       1       1
                        A    2           1       1       2
                        A    3           3       0       0""")
    
    columns = [f[column] for column in df.names if column.startswith("COUNT")]
    df[:, update(total = dt.rowsum(columns))]
    df
        USER    OBSERVATION     COUNT.1 COUNT.2 COUNT.3 total
    0     A         1             0         1       1     2
    1     A         2             1         1       2     4
    2     A         3             3         0       0     3
    
  • Sum selected columns rowwise

    df = dt.Frame({'location' : ("a","b","c","d"),
                   'v1' : (3,4,3,3),
                   'v2' : (4,56,3,88),
                   'v3' : (7,6,2,9),
                   'v4':  (7,6,1,9),
                   'v5' : (4,4,7,9),
                   'v6' : (2,8,4,6)})
    
    df
        location        v1      v2      v3      v4      v5      v6
    0       a           3       4       7       7       4       2
    1       b           4       56      6       6       4       8
    2       c           3       3       2       1       7       4
    3       d           3       88      9       9       9       6
    
    df[:, {"x1": dt.rowsum(f[1:4]), "x2": dt.rowsum(f[4:])}]
    
        x1      x2
    0   14      13
    1   66      18
    2   8       12
    3   100     24
    

Comparison with R’s data.table

datatable is closely related to R’s data.table and attempts to mimic its core algorithms and API; however, there are differences due to language constraints.

This page shows how to perform similar basic operations in R’s data.table versus datatable.

Subsetting Rows

The examples used here are from the examples data in R’s data.table.

data.table:

library(data.table)

DT = data.table(x=rep(c("b","a","c"),each=3),
                y=c(1,3,6), v=1:9)

datatable:

from datatable import dt, f, g, by, update, join, sort

DT = dt.Frame(x = ["b"]*3 + ["a"]*3 + ["c"]*3,
          y = [1, 3, 6] * 3,
          v = range(1, 10))

Action

data.table

datatable

Select 2nd row

DT[2]

DT[1, :]

Select 2nd and 3rd row

DT[2:3]

DT[1:3, :]

Select 3rd and 2nd row

DT[3:2]

DT[[2,1], :]

Select 2nd and 5th rows

DT[c(2,5)]

DT[[1,4], :]

Select all rows from 2nd to 5th

DT[2:5]

DT[1:5, :]

Select rows in reverse from 5th to the 1st

DT[5:1]

DT[4::-1, :]

Select the last row

DT[.N]

DT[-1, :]

All rows where y > 2

DT[y>2]

DT[f.y>2, :]

Compound logical expressions

DT[y>2 & v>5]

DT[(f.y>2) & (f.v>5), :]

All rows other than rows 2,3,4

DT[!2:4] or DT[-(2:4)]

DT[[0, slice(4, None)], :]

Sort by column x, ascending

DT[order(x), ]

DT.sort("x") or
DT[:, :, sort("x")]

Sort by column x, descending

DT[order(-x)]

DT.sort(-f.x) or
DT[:, :, sort(-f.x)]

Sort by column x ascending, y descending

DT[order(x, -y)]

DT.sort(f.x, -f.y) or
DT[:, :, sort(f.x, -f.y)]

Note the use of the f symbol when performing computations or sorting in descending order. You can read more about f-expressions.

Note: In data.table, DT[2] would mean 2nd row, whereas in datatable, DT[2] would select the 3rd column.

Selecting Columns

Action

data.table

datatable

Select column v

DT[, .(v)]

DT[:, 'v'] or DT['v']

Select multiple columns

DT[, .(x,v)]

DT[:, ['x', 'v']]

Rename and select column

DT[, .(m = x)]

DT[:, {"m" : f.x}]

Sum column v and rename as sv

DT[, .(sv=sum(v))]

DT[:, {"sv": dt.sum(f.v)}]

Return two columns, v and v doubled

DT[, .(v, v*2)]

DT[:, [f.v, f.v*2]]

Select the second column

DT[, 2]

DT[:, 1] or DT[1]

Select last column

DT[, ncol(DT), with=FALSE]

DT[:, -1]

Select columns x through y

DT[, .SD, .SDcols=x:y]

DT[:, f["x":"y"]] or DT[:, 'x':'y']

Exclude columns x and y

DT[ , .SD, .SDcols = !x:y]

DT[:, [name not in ("x","y")
for name in DT.names]] or
DT[:, f[:].remove(f['x':'y'])]

Select columns that start with x or v

DT[ , .SD, .SDcols = patterns('^[xv]')]

DT[:, [name.startswith(("x", "v"))
for name in DT.names]]

In data.table, you can select a column by using a variable name with the double dots prefix

cols = 'v'
DT[, ..cols]

In datatable, you do not need the prefix

cols = 'v'
DT[cols] # or  DT[:, cols]

If the column names are stored in a character vector, the double dots prefix also works

cols = c('v', 'y')
DT[, ..cols]

In datatable, you can store the list/tuple of column names in a variable

cols = ('v', 'y')
DT[:, cols]

Subset rows and Select/Aggregate

Action

data.table

datatable

Sum column v over rows 2 and 3

DT[2:3, .(sum(v))]

DT[1:3, dt.sum(f.v)]

Same as above, new column name

DT[2:3, .(sv=sum(v))]

DT[1:3, {"sv": dt.sum(f.v)}]

Filter in i and aggregate in j

DT[x=="b", .(sum(v*y))]

DT[f.x=="b", dt.sum(f.v * f.y)]

Same as above, return as scalar

DT[x=="b", sum(v*y)]

DT[f.x=="b", dt.sum(f.v * f.y)][0, 0]

In R, indexing starts at 1 and slicing includes both the first and last items. In Python, indexing starts at 0 and slicing excludes the last item.

Some .SD (Subset of Data) operations can be replicated in datatable

  • Aggregate several columns

# data.table
DT[, lapply(.SD, mean),
   .SDcols = c("y","v")]

          y v
1: 3.333333 5
# datatable
DT[:, dt.mean([f.y,f.v])]

        y   v
0   3.33333 5
  • Modify columns using a condition

# data.table
DT[, .SD - 1,
   .SDcols = is.numeric]

   y v
1: 0 0
2: 2 1
3: 5 2
4: 0 3
5: 2 4
6: 5 5
7: 0 6
8: 2 7
9: 5 8
# datatable
DT[:, f[int]-1]

    C0      C1
0   0       0
1   2       1
2   5       2
3   0       3
4   2       4
5   5       5
6   0       6
7   2       7
8   5       8
  • Modify several columns and keep others unchanged

#data.table
DT[, c("y", "v") := lapply(.SD, sqrt),
   .SDcols = c("y", "v")]

   x        y        v
1: b 1.000000 1.000000
2: b 1.732051 1.414214
3: b 2.449490 1.732051
4: a 1.000000 2.000000
5: a 1.732051 2.236068
6: a 2.449490 2.449490
7: c 1.000000 2.645751
8: c 1.732051 2.828427
9: c 2.449490 3.000000
#datatable
# there is also a square root function in the datatable math module
DT[:, update(**{name:f[name]**0.5 for name in ("y","v")})]

    x       y       v
0   b       1       1
1   b       1.73205 1.41421
2   b       2.44949 1.73205
3   a       1       2
4   a       1.73205 2.23607
5   a       2.44949 2.44949
6   c       1       2.64575
7   c       1.73205 2.82843
8   c       2.44949 3

Grouping with by()

Action

data.table

datatable

Get the sum of column v grouped by column x

DT[, sum(v), by=x]

DT[:, dt.sum(f.v), by('x')]

Get sum of v where x != a

DT[x!="a", sum(v), by=x]

DT[f.x!="a", :][:, dt.sum(f.v), by("x")]

Number of rows per group

DT[, .N, by=x]

DT[:, dt.count(), by("x")]

Select first row of y and v for each group in x

DT[, .SD[1], by=x]

DT[0, :, by('x')]

Get row count and sum columns v and y by group

DT[, c(.N, lapply(.SD, sum)), by=x]

DT[:, [dt.count(), dt.sum(f[:])], by("x")]

Expressions in by()

DT[, sum(v), by=.(y%%2)]

DT[:, dt.sum(f.v), by(f.y%2)]

Get row per group where column v is minimum

DT[, .SD[which.min(v)], by=x]

DT[0, f[:], by("x"), dt.sort(f.v)]

First 2 rows of each group

DT[, head(.SD,2), by=x]

DT[:2, :, by("x")]

Last 2 rows of each group

DT[, tail(.SD,2), by=x]

DT[-2:, :, by("x")]

In R’s data.table, the order of the groupings is preserved; in datatable, the returned dataframe is sorted on the grouping column. DT[, sum(v), keyby=x] in data.table returns a dataframe ordered by column x.

In data.table, i is executed before the grouping, while in datatable, i is executed after the grouping.

Also, in datatable, f-expressions in the i section of a groupby are not yet implemented; hence the chaining used above to get the sum of column v where x != "a".

Multiple aggregations within a group can be executed in R’s data.table with the syntax below

DT[, list(MySum=sum(v),
          MyMin=min(v),
          MyMax=max(v)),
   by=.(x, y%%2)]

The same can be replicated in datatable by using a dictionary

DT[:, {'MySum': dt.sum(f.v),
       'MyMin': dt.min(f.v),
       'MyMax': dt.max(f.v)},
   by(f.x, f.y%2)]

Add/Update/Delete Columns

Action

data.table

datatable

Add new column

DT[, z:=42L]

DT[:, update(z=42)] or
DT['z'] = 42 or
DT[:, 'z'] = 42 or
DT = DT[:, f[:].extend({"z":42})]

Add multiple columns

DT[, c('sv','mv') := .(sum(v), "X")]

DT[:, update(sv = dt.sum(f.v), mv = "X")] or
DT[:, f[:].extend({"sv": dt.sum(f.v), "mv": "X"})]

Remove column

DT[, z:=NULL]

del DT['z'] or
del DT[:, 'z'] or
DT = DT[:, f[:].remove(f.z)]

Subassign to existing v column

DT["a", v:=42L, on="x"]

DT[f.x=="a", update(v=42)] or
DT[f.x=="a", 'v'] = 42

Subassign to new column (NA padded)

DT["b", v2:=84L, on="x"]

DT[f.x=="b", update(v2=84)] or
DT[f.x=='b', 'v2'] = 84

Add new column, assigning values group-wise

DT[, m:=mean(v), by=x]

DT[:, update(m=dt.mean(f.v)), by("x")]

In data.table, you can create a new column with a variable

cols = 'rar'
DT[, ..cols:=4242]

Similar operation for the above in datatable

cols = 'rar'
DT[cols] = 4242
# or  DT[:, update(cols=4242)]

Note that the update() function, as well as the del keyword, operates in-place; there is no need for reassignment. Another advantage of update() is that the row order of the dataframe is not changed, even in a groupby; this comes in handy in many transformation operations.

Joins

At the moment, only the left outer join is implemented in datatable. In addition, the dataframe being joined must be keyed, the key column or columns must not contain duplicates, and the joining column has to have the same name in both dataframes. You can read more about the join() API and have a look at the tutorial on the join operator.
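
The left-outer-join semantics (every row of the left frame is kept; missing matches are padded with NAs) can be sketched in plain Python, with the keyed frame X represented as a dict. The column names here are illustrative (X's second column is called v2 to sidestep the duplicate-name renaming that datatable performs):

```python
# The keyed frame X, keyed on "x": key value -> remaining columns
X = {"c": {"v2": 8, "foo": 4},
     "b": {"v2": 7, "foo": 2}}

DT = [{"x": "b", "y": 1, "v": 1},
      {"x": "a", "y": 1, "v": 4},
      {"x": "c", "y": 1, "v": 7}]

# Left outer join: keep every DT row, pad with None when the key is absent
na_row = {"v2": None, "foo": None}
joined = [{**row, **X.get(row["x"], na_row)} for row in DT]

print(joined[1])  # {'x': 'a', 'y': 1, 'v': 4, 'v2': None, 'foo': None}
```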

Left join in R’s data.table:

DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
X = data.table(x=c("c","b"), v=8:7, foo=c(4,2))

X[DT, on="x"]

   x  v foo y i.v
1: b  7   2 1   1
2: b  7   2 3   2
3: b  7   2 6   3
4: a NA  NA 1   4
5: a NA  NA 3   5
6: a NA  NA 6   6
7: c  8   4 1   7
8: c  8   4 3   8
9: c  8   4 6   9

Join in datatable:

DT = dt.Frame(x = ["b"]*3 + ["a"]*3 + ["c"]*3,
              y = [1, 3, 6] * 3,
              v = range(1, 10))

X = dt.Frame({"x":('c','b'),
              "v":(8,7),
              "foo":(4,2)})

X.key = "x"  # key the "x" column

DT[:, :, join(X)]

    x       y       v       v.0     foo
0   b       1       1       7       2
1   b       3       2       7       2
2   b       6       3       7       2
3   a       1       4       NA      NA
4   a       3       5       NA      NA
5   a       6       6       NA      NA
6   c       1       7       8       4
7   c       3       8       8       4
8   c       6       9       8       4
  • An inner join could be simulated by removing the nulls. Again, a join() only works if the joining dataframe is keyed.

# data.table
DT[X, on="x", nomatch=NULL]

   x y v i.v foo
1: c 1 7   8   4
2: c 3 8   8   4
3: c 6 9   8   4
4: b 1 1   7   2
5: b 3 2   7   2
6: b 6 3   7   2
# datatable
DT[g[-1]!=None, :, join(X)] # g refers to the joining dataframe X

    x       y       v       v.0     foo
0   b       1       1       7       2
1   b       3       2       7       2
2   b       6       3       7       2
3   c       1       7       8       4
4   c       3       8       8       4
5   c       6       9       8       4
  • A not join can be simulated as well.

# data.table
DT[!X, on="x"]

   x y v
1: a 1 4
2: a 3 5
3: a 6 6
# datatable
DT[g[-1]==None, f[:], join(X)]

    x       y       v
0   a       1       4
1   a       3       5
2   a       6       6
  • Select the first row for each group

# data.table
DT[X, on="x", mult="first"]

   x y v i.v foo
1: c 1 7   8   4
2: b 1 1   7   2
# datatable
DT[g[-1]!=None, :, join(X)][0, :, by('x')] # chaining comes in handy here

    x       y       v       v.0     foo
0   b       1       1       7       2
1   c       1       7       8       4
  • Select the last row for each group

# data.table
DT[X, on="x", mult="last"]

   x y v i.v foo
1: c 6 9   8   4
2: b 6 3   7   2
# datatable
DT[g[-1]!=None, :, join(X)][-1, :, by('x')]

    x       y       v       v.0     foo
0   b       6       3       7       2
1   c       6       9       8       4
  • Join and evaluate j for each row in i

# data.table
DT[X, sum(v), by=.EACHI, on="x"]

   x V1
1: c 24
2: b  6
# datatable
DT[g[-1]!=None, :, join(X)][:, dt.sum(f.v), by("x")]

    x       v
0   b       6
1   c       24
  • Aggregate on columns from both dataframes in j

# data.table
DT[X, sum(v)*foo, by=.EACHI, on="x"]

   x V1
1: c 96
2: b 12
# datatable
DT[:, dt.sum(f.v*g.foo), join(X), by(f.x)][f[-1]!=0, :]

    x       C0
0   b       12
1   c       96
  • Aggregate on columns with same name from both dataframes in j

# data.table
DT[X, sum(v)*i.v, by=.EACHI, on="x"]

   x  V1
1: c 192
2: b  42
# datatable
DT[:, dt.sum(f.v*g.v), join(X), by(f.x)][f[-1]!=0, :]

    x       C0
0   b       42
1   c       192

Expect significant improvements in join functionality, with more concise syntax and more features, as datatable matures.

Functions in R/data.table not yet implemented

There are a number of functions in data.table that do not yet have an equivalent in datatable; we would likely implement them as the library matures.

Also, at the moment, custom aggregations in the j section are not supported in datatable; we intend to implement them at some point.

There are no datetime functions in datatable, and string operations are limited as well.

If there are any functions that you would like to see in datatable, please head over to github and raise a feature request.

FTRL Model

This section describes the FTRL (Follow the Regularized Leader) model as implemented in datatable.

FTRL Model Information

The Follow the Regularized Leader (FTRL) model is a datatable implementation of the FTRL-Proximal online learning algorithm for binomial logistic regression. It uses the hashing trick for feature vectorization and the Hogwild approach for parallelization. FTRL support for multinomial classification and continuous targets is implemented experimentally.
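
The hashing trick maps arbitrary feature values into a fixed-size weight table. A minimal sketch of the idea (using zlib.crc32 as a stand-in hash; the actual implementation differs in details such as the hash function and the mantissa_nbits handling):

```python
import zlib

def hash_feature(column, value, nbins=10**6):
    """Map a (column, value) pair to one of `nbins` buckets."""
    key = f"{column}={value}".encode()
    return zlib.crc32(key) % nbins

# Any value, numeric or string, lands in a fixed-size table of weights
idx = hash_feature("city", "Mountain View")
assert 0 <= idx < 10**6
```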

Create an FTRL Model

The FTRL model is implemented as the Ftrl Python class, which is a part of datatable.models, so to use the model you should first do

from datatable.models import Ftrl

and then create a model as

ftrl_model = Ftrl()

FTRL Model Parameters

The FTRL model accepts a number of parameters for training and making predictions, namely:

  • alpha – learning rate, defaults to 0.005.

  • beta – beta parameter, defaults to 1.0.

  • lambda1 – L1 regularization parameter, defaults to 0.0.

  • lambda2 – L2 regularization parameter, defaults to 1.0.

  • nbins – the number of bins for the hashing trick, defaults to 10**6.

  • mantissa_nbits – the number of bits from the mantissa to be used for hashing, defaults to 10.

  • nepochs – the number of epochs to train the model for, defaults to 1.

  • negative_class – whether to create and train on a “negative” class in the case of multinomial classification, defaults to False.

  • interactions — a list or a tuple of interactions. In turn, each interaction should be a list or a tuple of feature names, where each feature name is a column name from the training frame. This setting defaults to None.

  • model_type — training mode that can be one of the following: “auto” to automatically set model type based on the target column data, “binomial” for binomial classification, “multinomial” for multinomial classification or “regression” for continuous targets. Defaults to "auto".

If some parameters need to be changed from their default values, this can be done either when creating the model, as

ftrl_model = Ftrl(alpha = 0.1, nbins = 100)

or, if the model already exists, as

ftrl_model.alpha = 0.1
ftrl_model.nbins = 100

If some parameters are not set explicitly, they will be assigned their default values.

Training a Model

Use the fit() method to train a model:

ftrl_model.fit(X_train, y_train)

where X_train is a frame of shape (nrows, ncols) to be trained on, and y_train is a target frame of shape (nrows, 1). The following datatable column types are supported for the X_train frame: bool, int, real and str.

The FTRL model can also perform early stopping if the relative validation error does not improve. For this, the model should be fit as

res = ftrl_model.fit(X_train, y_train, X_validation, y_validation,
                     nepochs_validation, validation_error,
                     validation_average_niterations)

where X_train and y_train are the training and target frames, X_validation and y_validation are the corresponding validation frames, nepochs_validation specifies how often, in epoch units, the validation error should be checked, validation_error is the relative validation error improvement that the model should demonstrate within nepochs_validation to continue training, and validation_average_niterations is the number of iterations to average over when calculating the validation error. The returned res tuple contains the epoch at which training stopped and the corresponding loss.
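
The early-stopping logic can be sketched generically: check the (averaged) validation loss at each validation interval and stop once the relative improvement drops below validation_error. This is a simplified pure-Python sketch with the per-interval losses stubbed out, not the actual implementation:

```python
def fit_with_early_stopping(losses, validation_error=0.01):
    """`losses` stands in for the validation loss measured after each
    validation interval; stop when relative improvement stalls."""
    prev = losses[0]
    for epoch, loss in enumerate(losses[1:], start=1):
        improvement = (prev - loss) / prev
        if improvement < validation_error:
            return epoch, loss  # epoch at which training stopped, and loss
        prev = loss
    return len(losses) - 1, losses[-1]

# Loss improves quickly, then stalls: training stops at the stall
print(fit_with_early_stopping([1.0, 0.5, 0.3, 0.299]))  # (3, 0.299)
```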

Resetting a Model

Use the reset() method to reset a model:

ftrl_model.reset()

This will reset model weights, but it will not affect learning parameters. To reset parameters to default values, you can do

ftrl_model.params = Ftrl().params

Making Predictions

Use the predict() method to make predictions:

targets = ftrl_model.predict(X)

where X is a frame of shape (nrows, ncols) to make predictions for. X should have the same number of columns as the training frame. The predict() method returns a new frame of shape (nrows, 1) with the predicted probability for each row of frame X.

Feature Importances

To estimate feature importances, the overall weight contributions are calculated feature-wise during training and predicting. Feature importances can be accessed as

fi = ftrl_model.feature_importances

where fi will be a frame of shape (nfeatures, 2) containing feature names and their importances, that are normalized to [0; 1] range.

Feature Interactions

By default, each column of the training dataset is treated as a feature by the FTRL model. The user can provide additional features by specifying a list or a tuple of feature interactions, for instance as

ftrl_model.interactions = [["C0", "C1", "C3"], ["C2", "C5"]]

where C* are column names from the training dataset. In the above example two additional features, namely C0:C1:C3 and C2:C5, are created.
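
Conceptually, an interaction feature is built from the combined values of the participating columns, and the result is hashed like any ordinary feature. A rough illustration (not datatable's internal code):

```python
def interaction_value(row, columns):
    """Combine several column values into one composite feature value."""
    return ":".join(str(row[c]) for c in columns)

row = {"C0": "a", "C1": 5, "C2": "x", "C3": 1.5, "C5": "y"}
print(interaction_value(row, ["C0", "C1", "C3"]))  # a:5:1.5
print(interaction_value(row, ["C2", "C5"]))        # x:y
```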

interactions should be set before calling the fit() method, and cannot be changed once the model is trained.

Further Reading

For detailed help, please also refer to help(Ftrl).

datatable API

Symbols listed here are available for import from the datatable module.

Submodules

math.

Mathematical functions, similar to python’s math module.

models.

A small set of data analysis tools.

internal.

Access to some internal details of datatable module.

Classes

Frame

Main “table of data” class. This is the equivalent of pandas’ or Julia’s DataFrame, R’s data.table or tibble, SQL’s TABLE, etc.

FExpr

Helper class for computing formulas over a frame.

Namespace

Helper class for addressing columns in a frame.

stype

Enum of column “storage” types, analogous to numpy’s dtype.

ltype

Enum of column “logical” types, similar to standard Python notion of a type.

Functions

fread()

Read CSV/text/XLSX/Jay/other files

iread()

Same as fread(), but read multiple files at once

by()

Group-by clause for use in Frame’s square-bracket selector

join()

Join clause for use in Frame’s square-bracket selector

sort()

Sort clause for use in Frame’s square-bracket selector

update()

Create new or update existing columns within a frame

cbind()

Combine frames by columns

rbind()

Combine frames by rows

repeat()

Concatenate frame by rows

ifelse()

Ternary if operator

shift()

Shift column by a given number of rows

cut()

Bin a column into equal-width intervals

qcut()

Bin a column into equal-population intervals

split_into_nhot()

Split and nhot-encode a single-column frame

init_styles()

Inject datatable’s stylesheets into the Jupyter notebook

rowall()

Row-wise all() function

rowany()

Row-wise any() function

rowcount()

Calculate the number of non-missing values per row

rowfirst()

Find the first non-missing value row-wise

rowlast()

Find the last non-missing value row-wise

rowmax()

Find the largest element row-wise

rowmean()

Calculate the mean value row-wise

rowmin()

Find the smallest element row-wise

rowsd()

Calculate the standard deviation row-wise

rowsum()

Calculate the sum of all values row-wise

intersect()

Calculate the set intersection of values in the frames

setdiff()

Calculate the set difference between the frames

symdiff()

Calculate the symmetric difference between the sets of values in the frames

union()

Calculate the union of values in the frames

unique()

Find unique values in a frame

corr()

Calculate correlation between two columns

count()

Count non-missing values per column

cov()

Calculate covariance between two columns

max()

Find the largest element per column

mean()

Calculate the mean value per column

median()

Find the median element per column

min()

Find the smallest element per column

sd()

Calculate the standard deviation per column

sum()

Calculate the sum of all values per column

Other

build_info

Information about the build of the datatable module.

dt

The datatable module.

f

The primary namespace used during DT[...] call.

g

Secondary namespace used during DT[..., join()] call.

options

datatable options.

datatable.internal

The functions in this sub-module are considered “internal” and are not generally useful for day-to-day work with the datatable module.

compiler_version()

Compiler used when building datatable.

frame_column_data_r()

C pointer to column’s data

frame_columns_virtual()

Indicators of which columns in the frame are virtual.

frame_integrity_check()

Run checks on whether the frame’s state is corrupted.

get_thread_ids()

Get ids of threads spawned by datatable.

in_debug_mode()

Was datatable built in debug mode?

regex_supported()

Was datatable built with support for regular expressions?

datatable.internal.compiler_version()

Return the version of the C++ compiler used to compile this module.

Deprecated since version 0.11.0.

datatable.internal.frame_column_data_r()

Return C pointer to the main data array of the column frame[i]. The column will be materialized if it was virtual.

Parameters
frame
Frame

The Frame in which to look up the column.

i
int

The index of a column, in the range [0; ncols).

return
ctypes.c_void_p

The pointer to the column’s internal data.

datatable.internal.frame_columns_virtual()

Return the list indicating which columns in the frame are virtual.

Parameters
return
List[bool]

Each element in the list indicates whether the corresponding column is virtual or not.

Notes

Deprecated since version 0.11.0.

This function will be expanded and moved into the main Frame class.

datatable.internal.frame_integrity_check()

This function performs a range of tests on the frame to verify that its internal state is consistent. It returns None on success, or throws an AssertionError if any problems were found.

Parameters
frame
Frame

A Frame object that needs to be checked for internal consistency.

return
None
except
AssertionError

An exception is raised if there were any issues with the frame.

datatable.internal.get_thread_ids()

Return system ids of all threads used internally by datatable.

Calling this function will cause the threads to spawn if they haven’t been spawned already. (This behavior may change in the future.)

Parameters
return
List[str]

The list of thread ids used by datatable. The first element in the list is the id of the main thread.

datatable.internal.in_debug_mode()

Return True if datatable was compiled in debug mode.

Deprecated since version 0.11.0.

datatable.internal.regex_supported()

Was the datatable built with regular expression support?

Deprecated since version 0.11.0.

datatable.math

Trigonometric functions

sin(x)

Compute \(\sin x\) (the trigonometric sine of x).

cos(x)

Compute \(\cos x\) (the trigonometric cosine of x).

tan(x)

Compute \(\tan x\) (the trigonometric tangent of x).

arcsin(x)

Compute \(\sin^{-1} x\) (the inverse sine of x).

arccos(x)

Compute \(\cos^{-1} x\) (the inverse cosine of x).

arctan(x)

Compute \(\tan^{-1} x\) (the inverse tangent of x).

atan2(x, y)

Compute \(\tan^{-1} (x/y)\).

hypot(x, y)

Compute \(\sqrt{x^2 + y^2}\).

deg2rad(x)

Convert an angle measured in degrees into radians.

rad2deg(x)

Convert an angle measured in radians into degrees.

Hyperbolic functions

sinh(x)

Compute \(\sinh x\) (the hyperbolic sine of x).

cosh(x)

Compute \(\cosh x\) (the hyperbolic cosine of x).

tanh(x)

Compute \(\tanh x\) (the hyperbolic tangent of x).

arsinh(x)

Compute \(\sinh^{-1} x\) (the inverse hyperbolic sine of x).

arcosh(x)

Compute \(\cosh^{-1} x\) (the inverse hyperbolic cosine of x).

artanh(x)

Compute \(\tanh^{-1} x\) (the inverse hyperbolic tangent of x).

Exponential/logarithmic functions

exp(x)

Compute \(e^x\) (the exponent of x).

exp2(x)

Compute \(2^x\).

expm1(x)

Compute \(e^x - 1\).

log(x)

Compute \(\ln x\) (the natural logarithm of x).

log10(x)

Compute \(\log_{10} x\) (the decimal logarithm of x).

log1p(x)

Compute \(\ln(1 + x)\).

log2(x)

Compute \(\log_{2} x\) (the binary logarithm of x).

logaddexp(x)

Compute \(\ln(e^x + e^y)\).

logaddexp2(x)

Compute \(\log_2(2^x + 2^y)\).

cbrt(x)

Compute \(\sqrt[3]{x}\) (the cubic root of x).

pow(x, a)

Compute \(x^a\).

sqrt(x)

Compute \(\sqrt{x}\) (the square root of x).

square(x)

Compute \(x^2\) (the square of x).

Special mathematical functions

erf(x)

The error function \(\operatorname{erf}(x)\).

erfc(x)

The complementary error function \(1 - \operatorname{erf}(x)\).

gamma(x)

Euler gamma function of x.

lgamma(x)

Natural logarithm of the absolute value of the Euler gamma function of x.

Floating-point functions

abs(x)

Absolute value of x.

ceil(x)

The smallest integer not less than x.

copysign(x, y)

Number with the magnitude of x and the sign of y.

fabs(x)

The absolute value of x, returned as a float.

floor(x)

The largest integer not greater than x.

fmod(x, y)

Remainder of a floating-point division x/y.

isclose(x, y)

Check whether x ≈ y (up to some tolerance level).

isfinite(x)

Check if x is finite.

isinf(x)

Check if x is a positive or negative infinity.

isna(x)

Check if x is an NA (missing) value.

ldexp(x, y)

Compute \(x\cdot 2^y\).

rint(x)

Round x to the nearest integer.

sign(x)

The sign of x, as a floating-point value.

signbit(x)

The sign of x, as a boolean value.

trunc(x)

The value of x truncated towards zero.

Mathematical constants

e

Euler’s constant \(e\).

golden

Golden ratio \(\varphi\).

inf

Positive infinity.

nan

Not-a-number.

pi

Mathematical constant \(\pi\).

tau

Mathematical constant \(\tau\).

Comparison table

The set of functions provided by the datatable.math module is very similar to the standard Python math module and numpy’s math functions. The comparison table below shows which functions are available:

math

numpy

datatable

Trigonometric/hyperbolic functions

sin(x)

sin(x)

sin(x)

cos(x)

cos(x)

cos(x)

tan(x)

tan(x)

tan(x)

asin(x)

arcsin(x)

arcsin(x)

acos(x)

arccos(x)

arccos(x)

atan(x)

arctan(x)

arctan(x)

atan2(y, x)

arctan2(y, x)

atan2(y, x)

sinh(x)

sinh(x)

sinh(x)

cosh(x)

cosh(x)

cosh(x)

tanh(x)

tanh(x)

tanh(x)

asinh(x)

arcsinh(x)

arsinh(x)

acosh(x)

arccosh(x)

arcosh(x)

atanh(x)

arctanh(x)

artanh(x)

hypot(x, y)

hypot(x, y)

hypot(x, y)

radians(x)

deg2rad(x)

deg2rad(x)

degrees(x)

rad2deg(x)

rad2deg(x)

Exponential/logarithmic/power functions

exp(x)

exp(x)

exp(x)

exp2(x)

exp2(x)

expm1(x)

expm1(x)

expm1(x)

log(x)

log(x)

log(x)

log10(x)

log10(x)

log10(x)

log1p(x)

log1p(x)

log1p(x)

log2(x)

log2(x)

log2(x)

logaddexp(x, y)

logaddexp(x, y)

logaddexp2(x, y)

logaddexp2(x, y)

cbrt(x)

cbrt(x)

pow(x, a)

power(x, a)

pow(x, a)

sqrt(x)

sqrt(x)

sqrt(x)

square(x)

square(x)

Special mathematical functions

erf(x)

erf(x)

erfc(x)

erfc(x)

gamma(x)

gamma(x)

heaviside(x)

i0(x)

lgamma(x)

lgamma(x)

sinc(x)

Floating-point functions

abs(x)

abs(x)

abs(x)

ceil(x)

ceil(x)

ceil(x)

copysign(x, y)

copysign(x, y)

copysign(x, y)

fabs(x)

fabs(x)

fabs(x)

floor(x)

floor(x)

floor(x)

fmod(x, y)

fmod(x, y)

fmod(x, y)

frexp(x)

frexp(x)

isclose(x, y)

isclose(x, y)

isclose(x, y)

isfinite(x)

isfinite(x)

isfinite(x)

isinf(x)

isinf(x)

isinf(x)

isnan(x)

isnan(x)

isna(x)

ldexp(x, n)

ldexp(x, n)

ldexp(x, n)

modf(x)

modf(x)

nextafter(x, y)

rint(x)

rint(x)

round(x)

round(x)

round(x)

sign(x)

sign(x)

signbit(x)

signbit(x)

spacing(x)

trunc(x)

trunc(x)

trunc(x)

Miscellaneous

clip(x, a, b)

comb(n, k)

divmod(x, y)

factorial(n)

gcd(a, b)

gcd(a, b)

maximum(x, y)

minimum(x, y)

Mathematical constants

e

e

e

golden

inf

inf

inf

nan

nan

nan

pi

pi

pi

tau

tau

datatable.math.abs()

Return the absolute value of x. This function can only be applied to numeric arguments (i.e. boolean, integer, or real).

This function upcasts columns of types bool8, int8 and int16 into int32; for columns of other types the stype is kept.

Parameters
x
FExpr

Column expression producing one or more numeric columns.

return
FExpr

The resulting FExpr evaluates absolute values in all elements in all columns of x.

Examples
DT = dt.Frame(A=[-3, 2, 4, -17, 0])
DT[:, abs(f.A)]
    C0
0    3
1    2
2    4
3   17
4    0

datatable.math.arccos()

Inverse trigonometric cosine of x.

In mathematics, this may be written as \(\arccos x\) or \(\cos^{-1}x\).

The returned value is in the interval \([0, \frac12\tau]\), and NA for the values of x that lie outside the interval [-1, 1]. This function is the inverse of cos() in the sense that cos(arccos(x)) == x for all x in the interval [-1, 1].

See also
  • cos(x) – the trigonometric cosine function;

  • arcsin(x) – the inverse sine function.

datatable.math.arcosh()

The inverse hyperbolic cosine of x.

This function satisfies the property that cosh(arcosh(x)) == x. Alternatively, this function can also be computed as \(\cosh^{-1}(x) = \ln(x + \sqrt{x^2 - 1})\).
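
The logarithmic form can be checked against the Python math module, which provides the same function under the name acosh:

```python
import math

x = 3.0
assert math.isclose(math.acosh(x), math.log(x + math.sqrt(x * x - 1)))
assert math.isclose(math.cosh(math.acosh(x)), x)  # cosh(arcosh(x)) == x
```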

See also
  • cosh – hyperbolic cosine;

  • arsinh – inverse hyperbolic sine.

datatable.math.arcsin()

Inverse trigonometric sine of x.

In mathematics, this may be written as \(\arcsin x\) or \(\sin^{-1}x\).

The returned value is in the interval \([-\frac14 \tau, \frac14\tau]\), and NA for the values of x that lie outside the interval [-1, 1]. This function is the inverse of sin() in the sense that sin(arcsin(x)) == x for all x in the interval [-1, 1].

See also
  • sin(x) – the trigonometric sine function;

  • arccos(x) – the inverse cosine function.

datatable.math.arctan()

Inverse trigonometric tangent of x.

This function satisfies the property that tan(arctan(x)) == x.

See also
  • atan2(x, y) – two-argument inverse tangent function;

  • tan(x) – the trigonometric tangent function.

datatable.math.arsinh()

The inverse hyperbolic sine of x.

This function satisfies the property that sinh(arsinh(x)) == x. Alternatively, this function can also be computed as \(\sinh^{-1}(x) = \ln(x + \sqrt{x^2 + 1})\).

See also
  • sinh – hyperbolic sine;

  • arcosh – inverse hyperbolic cosine.

datatable.math.artanh()

The inverse hyperbolic tangent of x.

This function satisfies the property that tanh(artanh(x)) == x. Alternatively, this function can also be computed as \(\tanh^{-1}(x) = \frac12\ln\frac{1+x}{1-x}\).

See also
  • tanh – hyperbolic tangent;

datatable.math.atan2()

The inverse trigonometric tangent of y/x, taking into account the signs of x and y to produce the correct result.

If (x,y) is a point in a Cartesian plane, then arctan2(y, x) returns the radian measure of an angle formed by two rays: one starting at the origin and passing through point (1,0), and the other starting at the origin and passing through point (x,y). The angle is assumed positive if the rotation from the first ray to the second occurs counter-clockwise, and negative otherwise.

As a special case, arctan2(0, 0) == 0, and arctan2(0, -1) == tau/2.
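
These special cases can be verified with Python's math.atan2, which follows the same convention:

```python
import math

assert math.atan2(0.0, 0.0) == 0.0                      # arctan2(0, 0) == 0
assert math.isclose(math.atan2(0.0, -1.0), math.pi)     # arctan2(0, -1) == tau/2
assert math.isclose(math.atan2(1.0, 1.0), math.pi / 4)  # a 45-degree angle
```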

datatable.math.cbrt()

Cubic root of x.

datatable.math.ceil()

The smallest integer value not less than x, returned as float.

This function produces a float32 column if the input is of type float32, or float64 columns for inputs of all other numeric stypes.

Parameters
x
FExpr

One or more numeric columns.

return
FExpr

Expression that computes the ceil() function for each row and column in x.

datatable.math.copysign()

Return a float with the magnitude of x and the sign of y.

datatable.math.cos()

Compute the trigonometric cosine of angle x measured in radians.

This function can only be applied to numeric columns (real, integer, or boolean), and produces a float64 result, except when the argument x is float32, in which case the result is float32 as well.

See also
  • sin(x) – the trigonometric sine function;

  • arccos(x) – the inverse cosine function.

datatable.math.cosh()

The hyperbolic cosine of x, defined as \(\cosh x = \frac12(e^x + e^{-x})\).

See also
  • sinh – hyperbolic sine;

  • arcosh – inverse hyperbolic cosine.

datatable.math.deg2rad()

Convert angle measured in degrees into radians: \(\operatorname{deg2rad}(x) = x\cdot\frac{\tau}{360}\).

datatable.math.e

The base of the natural logarithm \(e\), also known as the Euler’s number. This number is defined as the limit \(e = \lim_{n\to\infty}(1 + 1/n)^n\).

The value is stored at float64 precision, and is equal to 2.718281828459045.

See Also
  • math.e – The Euler’s number in the Python math module;

datatable.math.erf()

Error function erf(x), which is defined as the integral

\[\operatorname{erf}(x) = \sqrt{\frac{8}{\tau}} \int^x_0 e^{-t^2}dt\]

This function is used in computing probabilities arising from the normal distribution.

See also
  • erfc(x) – the complementary error function.

datatable.math.erfc()

Complementary error function erfc(x) = 1 - erf(x).

The complementary error function is defined as the integral

\[\operatorname{erfc}(x) = \sqrt{\frac{8}{\tau}} \int^{\infty}_x e^{-t^2}dt\]

Although mathematically erfc(x) = 1-erf(x), in practice the RHS suffers catastrophic loss of precision at large values of x. This function, however, does not have such a drawback.
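
Python's math module exhibits the same effect: 1 - math.erf(x) underflows to exactly 0 for moderately large x, while math.erfc(x) still returns a meaningful value:

```python
import math

x = 10.0
naive = 1.0 - math.erf(x)  # erf(10) rounds to 1.0, so this is exactly 0.0
stable = math.erfc(x)      # a tiny but nonzero probability

assert naive == 0.0
assert stable > 0.0
```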

See also
  • erf(x) – the error function.

datatable.math.exp()

The exponent of x, that is \(e^x\).

See also
  • e – the Euler’s number;

  • expm1(x) – exponent function minus one;

  • exp2(x) – binary exponent;

datatable.math.exp2()

Binary exponent of x, same as \(2^x\).

See also
  • exp(x) – base-\(e\) exponent.

datatable.math.expm1()

The exponent of x minus 1, that is \(e^x - 1\). This function is more accurate for arguments x close to zero.

datatable.math.fabs()

The absolute value of x, returned as float.

datatable.math.floor()

The largest integer value not greater than x, returned as float.

This function produces a float32 column if the input is of type float32, or float64 columns for inputs of all other numeric stypes.

Parameters
x
FExpr

One or more numeric columns.

return
FExpr

Expression that computes the floor() function for each row and column in x.

datatable.math.fmod()

Floating-point remainder of the division x/y. The result is always a float, even if the arguments are integers. This function uses std::fmod() from the standard C++ library; its convention for handling negative numbers may differ from Python’s.

datatable.math.gamma()

Euler Gamma function of x.

The gamma function is defined for all x except for the negative integers. For positive x it can be computed via the integral

\[\Gamma(x) = \int_0^\infty t^{x-1}e^{-t}dt\]

For negative x it can be computed as

\[\Gamma(x) = \frac{\Gamma(x + k)}{x(x+1)\cdot...\cdot(x+k-1)}\]

where \(k\) is any integer such that \(x+k\) is positive.

If x is a positive integer, then \(\Gamma(x) = (x - 1)!\).
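
Both properties can be spot-checked with Python's math.gamma:

```python
import math

# Positive integers: gamma(x) == (x - 1)!
assert math.isclose(math.gamma(5), 24.0)  # 4! == 24

# A classic non-integer value: gamma(1/2) == sqrt(pi)
assert math.isclose(math.gamma(0.5), math.sqrt(math.pi))
```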

datatable.math.golden

The golden ratio \(\varphi = (1 + \sqrt{5})/2\), also known as golden section. This is a number such that if \(a = \varphi b\), for some non-zero \(a\) and \(b\), then it must also be true that \(a + b = \varphi a\).

The constant is stored with float64 precision, and its value is 1.618033988749895.

datatable.math.hypot()

The length of the hypotenuse of a right triangle with sides x and y, or in math notation \(\operatorname{hypot}(x, y) = \sqrt{x^2 + y^2}\).

datatable.math.inf

Number representing positive infinity \(\infty\). Write -inf for negative infinity.

datatable.math.isclose()
isclose(x, y, *, rtol=1e-5, atol=1e-8)

Compare two numbers x and y, and return True if they are close within the requested relative/absolute tolerance. This function only returns True/False, never NA.

More specifically, isclose(x, y) is True if either of the following are true:

  • x == y (including the case when x and y are NAs),

  • abs(x - y) <= atol + rtol * abs(y) and neither x nor y are NA

The tolerance parameters rtol, atol must be positive floats, and cannot be expressions.
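
The definition above can be transcribed almost verbatim into Python (a sketch; note that the rtol * abs(y) term is asymmetric in x and y, unlike the symmetric formula used by Python's math.isclose):

```python
def isclose(x, y, rtol=1e-5, atol=1e-8):
    """Follows the two rules above, with None standing in for NA."""
    if x is None or y is None:
        return x is None and y is None  # NA compared to NA counts as close
    return x == y or abs(x - y) <= atol + rtol * abs(y)

assert isclose(None, None)       # both NA: close
assert not isclose(None, 0.0)    # the result is never NA, just False
assert isclose(1.0, 1.0 + 1e-9)  # within tolerance
assert not isclose(1.0, 1.1)
```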

datatable.math.isfinite()

Returns True if x has a finite value, and False if x is infinity or NaN. This function is equivalent to !(isna(x) or isinf(x)).

datatable.math.isinf()

Returns True if the argument is +/- infinity, and False otherwise. Note that isinf(NA) == False.

datatable.math.isna()

Returns True if the argument is NA, and False otherwise.

datatable.math.ldexp()

Multiply x by 2 raised to the power y, i.e. compute x * 2**y. Column x is expected to be float, and y integer.

datatable.math.lgamma()

Natural logarithm of the absolute value of the Euler Gamma function of x.

datatable.math.log()

Natural logarithm of x, aka \(\ln x\). This function is the inverse of exp().

datatable.math.log10()

Decimal (base-10) logarithm of x, which is \(\lg(x)\) or \(\log_{10} x\). This function is the inverse of pow(10, x).

See also
  • log() – natural logarithm;

  • log2() – binary logarithm.

datatable.math.log1p()

Natural logarithm of 1 plus x, or \(\ln(1 + x)\). This function has improved numeric precision for small values of x.

datatable.math.log2()

Binary (base-2) logarithm of x, which in mathematics is \(\log_2 x\).

See also
  • log() – natural logarithm;

  • log10() – decimal logarithm.

datatable.math.logaddexp()

The logarithm of the sum of exponents of x and y. This function is equivalent to log(exp(x) + exp(y)), but does not suffer from catastrophic precision loss for small values of x and y.
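
The standard way to avoid this precision loss is to factor out the larger argument before exponentiating; a pure-Python sketch:

```python
import math

def logaddexp(x, y):
    """log(exp(x) + exp(y)) without overflow/underflow for extreme x, y."""
    m = max(x, y)
    return m + math.log(math.exp(x - m) + math.exp(y - m))

# The naive form overflows for large inputs; the stable form does not
assert math.isclose(logaddexp(1000.0, 1000.0), 1000.0 + math.log(2.0))
assert math.isclose(logaddexp(0.0, 0.0), math.log(2.0))
```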

datatable.math.logaddexp2()

Binary logarithm of the sum of binary exponents of x and y. This function is equivalent to log2(exp2(x) + exp2(y)), but does not suffer from catastrophic precision loss for small values of x and y.

datatable.math.nan

Not-a-number, a special floating-point constant that denotes a missing number. In most datatable functions you can use None instead of nan.

datatable.math.pi

Mathematical constant \(\pi = \frac12\tau\), also known as Archimedes’ constant, equal to the length of a semicircle with radius 1, or equivalently the arc-length of a \(180^\circ\) angle [1].

The constant is stored at float64 precision, and its value is 3.141592653589793.

See Also
  • tau – mathematical constant \(\tau = 2\pi\);

  • math.pi – The \(\pi\) constant in the Python math module;

datatable.math.pow()

Number x raised to the power y. The return value will be float, even if the arguments x and y are integers.

This function is equivalent to x ** y.

datatable.math.rad2deg()

Convert angle measured in radians into degrees: \(\operatorname{rad2deg}(x) = x\cdot\frac{360}{\tau}\).

datatable.math.rint()

Round the value x to the nearest integer.

datatable.math.round()
New in version 0.11

Round the values in cols to the specified number of digits of precision, ndigits. If the number of digits is omitted, round to the nearest integer.

Generally, this operation is equivalent to:

rint(col * 10**ndigits) / 10**ndigits

where function rint() rounds to the nearest integer.
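
The equivalence can be checked in plain Python, using the built-in round() as a stand-in for rint() (both round halves to even):

```python
def dt_round(value, ndigits):
    """rint(value * 10**ndigits) / 10**ndigits, with round() as rint()."""
    scale = 10.0 ** ndigits
    return round(value * scale) / scale

assert dt_round(123.45, 1) == 123.4   # one digit after the decimal point
assert dt_round(123.45, -1) == 120.0  # rounded to the nearest ten
```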

Parameters
cols
FExpr

Input data for rounding. This could be an expression yielding either a single or multiple columns. The round() function will apply to each column independently and produce as many columns in the output as there were in the input.

Only numeric columns are allowed: boolean, integer or float. An exception will be raised if cols contains a non-numeric column.

ndigits
int | None

The number of precision digits to retain. This parameter could be either positive or negative (or None). If positive then it gives the number of digits after the decimal point. If negative, then it rounds the result up to the corresponding power of 10.

For example, 123.45 rounded to ndigits=1 is 123.4, whereas rounded to ndigits=-1 it becomes 120.0.

return
FExpr

f-expression that rounds the values in its first argument to the specified number of precision digits.

Each input column will produce the column of the same stype in the output; except for the case when ndigits is None and the input is either float32 or float64, in which case an int64 column is produced (similarly to python’s round()).

Notes

Values that are exactly halfway between their rounded neighbors are rounded towards the nearest even value. For example, both 7.5 and 8.5 are rounded to 8, whereas 6.5 is rounded to 6.

Rounding integer columns may produce unexpected results for values that are close to the min/max value of that column’s storage type. For example, when an int8 value 127 is rounded to the nearest 10, it becomes 130. However, since 130 cannot be represented as int8, a wrap-around occurs and the result becomes -126.

Rounding an integer column to a positive ndigits is a no-op: the column will be returned unchanged.

Rounding an integer column to a large negative ndigits will produce a constant all-0 column.
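The int8 wrap-around described above can be emulated in plain Python by reinterpreting the low 8 bits as a signed byte via the struct module (illustration only):

```python
import struct

def to_int8(value):
    # keep the low 8 bits and reinterpret them as a signed byte
    return struct.unpack("b", struct.pack("B", value & 0xFF))[0]

rounded = round(127 / 10) * 10   # 127 rounded to the nearest 10 gives 130
to_int8(rounded)                 # -126: 130 does not fit into int8
```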

datatable.math.sign()

The sign of x, returned as float.

This function returns 1.0 if x is positive (including positive infinity), -1.0 if x is negative, 0.0 if x is zero, and NA if x is NA.
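These cases can be summarized with a one-line pure-Python equivalent (NA handling omitted):

```python
# Sketch of sign()'s behavior: the result is always float.
def sign(x):
    return float((x > 0) - (x < 0))

sign(5)            # 1.0
sign(-3.2)         # -1.0
sign(0)            # 0.0
sign(float("inf")) # 1.0
```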

datatable.math.signbit()

Returns True if x is negative (its sign bit is set), and False if x is positive. This function is able to distinguish between -0.0 and +0.0, returning True/False respectively. If x is an NA value, this function will also return NA.
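In plain Python the same behavior can be obtained with math.copysign, which distinguishes -0.0 from +0.0 where an ordinary `x < 0` comparison cannot (NA handling omitted):

```python
import math

def signbit(x):
    # True when the sign bit of x is set, including for -0.0
    return math.copysign(1.0, x) < 0

signbit(-0.0)  # True
signbit(0.0)   # False
signbit(-3.5)  # True
```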

datatable.math.sin()

Compute the trigonometric sine of angle x measured in radians.

This function can only be applied to numeric columns (real, integer, or boolean), and produces a float64 result, except when the argument x is float32, in which case the result is float32 as well.

See also
  • cos(x) – the trigonometric cosine function;

  • arcsin(x) – the inverse sine function.

datatable.math.sinh()

Hyperbolic sine of x, defined as \(\sinh x = \frac12(e^x - e^{-x})\).

See also
  • cosh – hyperbolic cosine;

  • arsinh – inverse hyperbolic sine.

datatable.math.sqrt()

The square root of x, same as x ** 0.5.

datatable.math.square()

The square of x, same as x ** 2.0. As with all other math functions, the result is floating-point, even if the argument x is integer.

datatable.math.tan()

Compute the trigonometric tangent of x, which is the ratio sin(x)/cos(x).

This function can only be applied to numeric columns (real, integer, or boolean), and produces a float64 result, except when the argument x is float32, in which case the result is float32 as well.

See also
  • arctan(x) – the inverse tangent function.

datatable.math.tanh()

Hyperbolic tangent of x, defined as \(\tanh x = \frac{\sinh x}{\cosh x} = \frac{e^x-e^{-x}}{e^x+e^{-x}}\).

See also
  • artanh – inverse hyperbolic tangent.

datatable.math.tau

Mathematical constant \(\tau\), also known as a turn, equal to the circumference of a circle with a unit radius.

The constant is stored at float64 precision, and its value is 6.283185307179586.

See Also
  • pi – mathematical constant \(\pi = \frac12\tau\);

  • math.tau – The \(\tau\) constant in the Python math module;

  • Tau manifesto

datatable.math.trunc()

The nearest integer value not greater than x in magnitude.

If x is integer or boolean, then trunc() will return this value converted to float64. If x is floating-point, then trunc(x) acts as floor(x) for positive values of x, and as ceil(x) for negative values of x. This rounding mode is known as rounding towards zero.
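Python's own math.trunc demonstrates this "rounding towards zero" mode: it agrees with floor for positive values and with ceil for negative ones:

```python
import math

math.trunc(2.7)    # 2, same as math.floor(2.7)
math.trunc(-2.7)   # -2, same as math.ceil(-2.7)
```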

datatable.models

Classes

Ftrl

FTRL-Proximal online learning model.

Functions

aggregate()

Aggregate a frame.

kfold()

Perform k-fold split.

kfold_random()

Perform randomized k-fold split.

datatable.models.Ftrl

Follow the Regularized Leader (FTRL) model.

FTRL model is a datatable implementation of the FTRL-Proximal online learning algorithm for binomial logistic regression. Multinomial classification and regression for continuous targets are also implemented, though these implementations are experimental. This model is fully parallel and is based on the Hogwild approach for parallelization.

The model supports datasets with both numerical (boolean, integer and float types) and string features. To vectorize features a hashing trick is employed, such that all the values are hashed with a 64-bit hashing function. This function is implemented as follows:

  • for booleans and integers the hashing function is essentially an identity function;

  • for floats the hashing function trims mantissa, taking into account mantissa_nbits, and interprets the resulting bit representation as a 64-bit unsigned integer;

  • for strings the 64-bit Murmur2 hashing function is used.

To compute the final hash, the Murmur2-hashed feature name is added to the hashed feature value, and the result is taken modulo the number of requested bins, i.e. nbins.
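A minimal sketch of this hashing trick in pure Python. The real implementation uses a 64-bit Murmur2 hash; Python's built-in hash() stands in for it here purely for illustration:

```python
# Hedged sketch of the hashing trick described above (not datatable's code).
NBINS = 1_000_000

def hash_feature(name, value):
    # combine the hashed feature name with the hashed feature value,
    # then take the result modulo the number of bins
    return (hash(name) + hash(value)) % NBINS

# every feature value lands in a valid bin index [0, NBINS)
0 <= hash_feature("Angle", 2.5) < NBINS
```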

For each hashed row of data, according to Ad Click Prediction: a View from the Trenches, the following FTRL-Proximal algorithm is employed:

Per-coordinate FTRL-Proximal online learning algorithm

When trained, the model can be used to make predictions, or it can be re-trained on new datasets as many times as needed, improving model weights from run to run.

Construction

Ftrl()

Construct the Ftrl object.

Methods

fit()

Train the model.

predict()

Predict for a trained model.

reset()

Reset the model.

Properties

alpha

\(\alpha\) in per-coordinate FTRL-Proximal algorithm.

beta

\(\beta\) in per-coordinate FTRL-Proximal algorithm.

colnames

Column names of the training frame, i.e. features.

colname_hashes

Hashes of the column names.

double_precision

An option to control precision of the internal computations.

feature_importances

Feature importances calculated during training.

interactions

Feature interactions.

labels

Classification labels.

lambda1

L1 regularization parameter, \(\lambda_1\) in per-coordinate FTRL-Proximal algorithm.

lambda2

L2 regularization parameter, \(\lambda_2\) in per-coordinate FTRL-Proximal algorithm.

mantissa_nbits

Number of mantissa bits for hashing floats.

model

The model’s z and n coefficients.

model_type

A model type Ftrl should build.

model_type_trained

A model type Ftrl has built.

nbins

Number of bins for the hashing trick.

negative_class

An option to indicate if the “negative” class should be created for multinomial classification.

nepochs

Number of training epochs.

params

All the input model parameters as a named tuple.

datatable.models.Ftrl.__init__()

Create a new Ftrl object.

Parameters
alpha
float

\(\alpha\) in per-coordinate FTRL-Proximal algorithm, should be positive.

beta
float

\(\beta\) in per-coordinate FTRL-Proximal algorithm, should be non-negative.

lambda1
float

L1 regularization parameter, \(\lambda_1\) in per-coordinate FTRL-Proximal algorithm. It should be non-negative.

lambda2
float

L2 regularization parameter, \(\lambda_2\) in per-coordinate FTRL-Proximal algorithm. It should be non-negative.

nbins
int

Number of bins to be used for the hashing trick, should be positive.

mantissa_nbits
int

Number of mantissa bits to take into account when hashing floats. It should be non-negative and less than or equal to 52, which is the number of mantissa bits allocated for a C++ 64-bit double.

nepochs
float

Number of training epochs, should be non-negative. When nepochs is an integer, the model will train on all the data provided to the fit() method nepochs times. If nepochs has a fractional part {nepochs}, the model will train on all the data [nepochs] times, i.e. the integer part of nepochs, and then perform an additional training iteration on the {nepochs} fraction of the data.

double_precision
bool

An option to indicate whether double precision, i.e. float64, or single precision, i.e. float32, arithmetic should be used for computations. It is not guaranteed that setting double_precision to True will automatically improve the model accuracy. It will, however, roughly double the memory footprint of the Ftrl object.

negative_class
bool

An option to indicate if a “negative” class should be created in the case of multinomial classification. For the “negative” class the model will train on all the negatives, and if a new label is encountered in the target column, its weights will be initialized to the current “negative” class weights. If negative_class is set to False, the initial weights become zeros.

interactions
List[List[str] | Tuple[str]] | Tuple[List[str] | Tuple[str]]

A list or a tuple of interactions. In turn, each interaction should be a list or a tuple of feature names, where each feature name is a column name from the training frame. Each interaction should have at least one feature.

model_type
"binomial" | "multinomial" | "regression" | "auto"

The model type to be built. When this option is "auto" then the model type will be automatically chosen based on the target column stype.

params
FtrlParams

Named tuple of the above parameters. One can pass either this tuple, or any combination of the individual parameters to the constructor, but not both at the same time.

except
ValueError

The exception is raised if both the params and one of the individual model parameters are passed at the same time.

datatable.models.Ftrl.alpha

\(\alpha\) in per-coordinate FTRL-Proximal algorithm.

Parameters
return
float

Current alpha value.

newalpha
float

New alpha value, should be positive.

except
ValueError

The exception is raised when newalpha is not positive.

datatable.models.Ftrl.beta

\(\beta\) in per-coordinate FTRL-Proximal algorithm.

Parameters
return
float

Current beta value.

newbeta
float

New beta value, should be non-negative.

except
ValueError

The exception is raised when newbeta is negative.

datatable.models.Ftrl.colnames

Column names of the training frame, i.e. the feature names.

Parameters
return
List[str]

A list of the column names.

datatable.models.Ftrl.colname_hashes

Hashes of the column names used for the hashing trick as described in the Ftrl class description.

Parameters
return
List[int]

A list of the column name hashes.

See also
  • colnames – the column names of the training frame, i.e. the feature names.

datatable.models.Ftrl.double_precision

An option to indicate whether double precision, i.e. float64, or single precision, i.e. float32, arithmetic should be used for computations. This option is read-only and can only be set during the Ftrl object construction.

Parameters
return
bool

Current double_precision value.

datatable.models.Ftrl.feature_importances
feature_importances

Feature importances as calculated during the model training and normalized to [0; 1]. The normalization is done by dividing the accumulated feature importances over the maximum value.

Parameters
return
Frame

A frame with two columns: feature_name that has stype str32, and feature_importance that has stype float32 or float64 depending on whether the double_precision option is False or True.

datatable.models.Ftrl.fit()

Train FTRL model on a dataset.

Parameters
X_train
Frame

Training frame.

y_train
Frame

Target frame having as many rows as X_train and one column.

X_validation
Frame

Validation frame having the same number of columns as X_train.

y_validation
Frame

Validation target frame of shape (nrows, 1).

nepochs_validation
float

Parameter that specifies how often, in epoch units, validation error should be checked.

validation_error
float

The improvement of the relative validation error that should be demonstrated by the model within nepochs_validation epochs, otherwise the training will stop.

validation_average_niterations
int

Number of iterations that is used to average the validation error. Each iteration corresponds to nepochs_validation epochs.

return
FtrlFitOutput

FtrlFitOutput is a Tuple[float, float] with two fields: epoch and loss, representing the final fitting epoch and the final loss, respectively. If a validation dataset is not provided, the returned epoch equals nepochs and the loss is float('nan').

datatable.models.Ftrl.interactions

The feature interactions to be used for model training. This option is read-only for a trained model.

Parameters
return
Tuple

Current interactions value.

newinteractions
List[List[str] | Tuple[str]] | Tuple[List[str] | Tuple[str]]

New interactions value. Each particular interaction should be a list or a tuple of feature names, where each feature name is a column name from the training frame.

except
ValueError

The exception is raised when

  • trying to change this option for a model that has already been trained;

  • one of the interactions has zero features.

except
TypeError

The exception is raised when newinteractions has a wrong type.

datatable.models.Ftrl.labels

Classification labels the model was trained on.

Parameters
return
Frame

A one-column frame with the classification labels. In the case of numeric regression, the label is the target column name.

datatable.models.Ftrl.lambda1

L1 regularization parameter, \(\lambda_1\) in per-coordinate FTRL-Proximal algorithm.

Parameters
return
float

Current lambda1 value.

newlambda1
float

New lambda1 value, should be non-negative.

except
ValueError

The exception is raised when newlambda1 is negative.

datatable.models.Ftrl.lambda2

L2 regularization parameter, \(\lambda_2\) in per-coordinate FTRL-Proximal algorithm.

Parameters
return
float

Current lambda2 value.

newlambda2
float

New lambda2 value, should be non-negative.

except
ValueError

The exception is raised when newlambda2 is negative.

datatable.models.Ftrl.model

Trained model weights, i.e. z and n coefficients in per-coordinate FTRL-Proximal algorithm.

Parameters
return
Frame

A frame of shape (nbins, 2 * nlabels), where nlabels is the total number of labels the model was trained on, and nbins is the number of bins used for the hashing trick. Odd and even columns represent the z and n model coefficients, respectively.

datatable.models.Ftrl.model_type

A type of the model Ftrl should build:

  • "binomial" for binomial classification;

  • "multinomial" for multinomial classification;

  • "regression" for numeric regression;

  • "auto" for automatic model type detection based on the target column stype.

This option is read-only for a trained model.

Parameters
return
str

Current model_type value.

newmodel_type
"binomial" | "multinomial" | "regression" | "auto"

New model_type value.

except
ValueError

The exception is raised when

  • trying to change this option for a model that has already been trained;

  • newmodel_type value is not one of the following: "binomial", "multinomial", "regression" or "auto".

datatable.models.Ftrl.model_type_trained

The model type Ftrl has built.

Parameters
return
str

Could be one of the following: "regression", "binomial", "multinomial" or "none" for an untrained model.

datatable.models.Ftrl.mantissa_nbits

Number of mantissa bits to take into account for hashing floats. This option is read-only for a trained model.

Parameters
return
int

Current mantissa_nbits value.

newmantissa_nbits
int

New mantissa_nbits value, should be non-negative and less than or equal to 52, which is the number of mantissa bits in a C++ 64-bit double.

except
ValueError

The exception is raised when

  • trying to change this option for a model that has already been trained;

  • newmantissa_nbits value is negative or larger than 52.

datatable.models.Ftrl.nbins

Number of bins to be used for the hashing trick. This option is read-only for a trained model.

Parameters
return
int

Current nbins value.

newnbins
int

New nbins value, should be positive.

except
ValueError

The exception is raised when

  • trying to change this option for a model that has already been trained;

  • newnbins value is not positive.

datatable.models.Ftrl.negative_class

An option to indicate if a “negative” class should be created in the case of multinomial classification. For the “negative” class the model will train on all the negatives, and if a new label is encountered in the target column, its weights are initialized to the current “negative” class weights. If negative_class is set to False, the initial weights become zeros.

This option is read-only for a trained model.

Parameters
return
bool

Current negative_class value.

newnegative_class
bool

New negative_class value.

except
ValueError

The exception is raised when trying to change this option for a model that has already been trained.

except
TypeError

The exception is raised when newnegative_class is not bool.

datatable.models.Ftrl.nepochs

Number of training epochs. When nepochs is an integer, the model will train on all the data provided to the fit() method nepochs times. If nepochs has a fractional part {nepochs}, the model will train on all the data [nepochs] times, i.e. the integer part of nepochs, and then perform an additional training iteration on the {nepochs} fraction of the data.
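For example, the schedule for a fractional nepochs can be broken down as follows (illustration only):

```python
import math

nepochs = 2.6
full_passes = math.floor(nepochs)        # 2 full passes over the data
extra_fraction = nepochs - full_passes   # then one pass over 60% of the rows
```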

Parameters
return
float

Current nepochs value.

newnepochs
float

New nepochs value, should be non-negative.

except
ValueError

The exception is raised when newnepochs value is negative.

datatable.models.Ftrl.params

Ftrl model parameters as a named tuple FtrlParams, see Ftrl.__init__() for more details. This option is read-only for a trained model.

Parameters
return
FtrlParams

Current params value.

newparams
FtrlParams

New params value.

except
ValueError

The exception is raised when

  • trying to change this option for a model that has already been trained;

  • individual parameter values are incompatible with the corresponding setters.

datatable.models.Ftrl.predict()

Make predictions for a dataset.

Parameters
X
Frame

A frame to make predictions for. It should have the same number of columns as the training frame.

return
Frame

A new frame of shape (X.nrows, nlabels) with the predicted probabilities for each row of frame X and each of nlabels labels the model was trained for.

See also
  • fit() – train model on a dataset.

  • reset() – reset the model.

datatable.models.Ftrl.reset()

Reset Ftrl model by resetting all the model weights, labels and feature importance information.

Parameters
return
None
See also
  • fit() – train model on a dataset.

  • predict() – predict on a dataset.

datatable.models.aggregate()

Aggregate a frame into clusters. Each cluster consists of a set of members, i.e. a subset of the input frame, and is represented by an exemplar, i.e. one of the members.

For one- and two-column frames the aggregation is based on the standard equal-interval binning for numeric columns, and grouping for string columns.

When the input frame has more than two columns, a parallel one-pass Ad-Hoc algorithm is employed, see description of Aggregator<T>::group_nd() method for more details. This algorithm takes into account the numeric columns only, and all the string columns are ignored.
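The equal-interval binning used for 1D aggregation can be sketched in pure Python (illustration only; datatable's implementation is in C++):

```python
# Assign each numeric value to one of n_bins equal-width intervals
# spanning [min, max] of the data.
def bin_1d(values, n_bins):
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0   # avoid division by zero for constant columns
    return [min(int((v - lo) / span * n_bins), n_bins - 1) for v in values]

bin_1d([0.0, 2.5, 5.0, 10.0], 4)  # [0, 1, 2, 3]
```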

Parameters
frame
Frame

The input frame containing numeric or string columns.

min_rows
int

Minimum number of rows the input frame should have to be aggregated. If frame has fewer rows than min_rows, aggregation is bypassed, in the sense that all the input rows become exemplars.

n_bins
int

Number of bins for 1D aggregation.

nx_bins
int

Number of bins for the first column for 2D aggregation.

ny_bins
int

Number of bins for the second column for 2D aggregation.

nd_max_bins
int

Maximum number of exemplars for ND aggregation. It is guaranteed that the ND algorithm will return fewer than nd_max_bins exemplars, but the exact number may vary from run to run due to parallelization.

max_dimensions
int

Number of columns at which the projection method is used for ND aggregation.

seed
int

Seed to be used for the projection method.

double_precision
bool

An option to indicate whether double precision, i.e. float64, or single precision, i.e. float32, arithmetic should be used for computations.

fixed_radius
float

Fixed radius for ND aggregation, use it with caution. If set, nd_max_bins will have no effect and in the worst case the number of exemplars may be equal to the number of rows in the data. For big data this may result in extremely large execution times. Since all the columns are normalized to [0, 1), the fixed_radius value should be chosen accordingly.

return
Tuple[Frame, Frame]

The first element in the tuple is the aggregated frame, i.e. the frame containing exemplars, with the shape of (nexemplars, frame.ncols + 1), where nexemplars is the number of gathered exemplars. The first frame.ncols columns are the columns from the input frame, and the last column, members_count, has stype int32 and contains the number of members per exemplar.

The second element in the tuple is the members frame with the shape of (frame.nrows, 1), each row in this frame corresponds to the row with the same id in the input frame. The only column exemplar_id has an stype of int32 and contains the exemplar ids a particular member belongs to. These ids are effectively the ids of the exemplar’s rows from the input frame.

except
ValueError

The exception is raised if the input frame is missing.

except
TypeError

The exception is raised when one of the frame’s columns has an unsupported stype, i.e. the column is both non-numeric and non-string.

datatable.models.kfold()

Perform k-fold split of data with nrows rows into nsplits train/test subsets. The dataset itself is not passed to this function: it is sufficient to know only the number of rows in order to decide how the data should be split.

The range [0; nrows) is split into nsplits approximately equal parts, i.e. folds, and then the i-th split uses the i-th fold as the test part and all the remaining rows as the train part. Thus, the i-th split consists of:

  • train rows: [0; i*nrows/nsplits) + [(i+1)*nrows/nsplits; nrows);

  • test rows: [i*nrows/nsplits; (i+1)*nrows/nsplits).

where integer division is assumed.
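The fold boundaries above can be computed directly with integer division:

```python
def kfold_bounds(nrows, nsplits):
    # test rows of the i-th split: [i*nrows//nsplits, (i+1)*nrows//nsplits)
    return [(i * nrows // nsplits, (i + 1) * nrows // nsplits)
            for i in range(nsplits)]

kfold_bounds(10, 3)  # [(0, 3), (3, 6), (6, 10)]
```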

Parameters
nrows
int

The number of rows in the frame that is going to be split.

nsplits
int

Number of folds, must be at least 2, but not larger than nrows.

return
List[Tuple]

This function returns a list of nsplits tuples (train_rows, test_rows), where each component of the tuple is a rows selector that can be applied to any frame with nrows rows to select the desired folds. Some of these row selectors will be simple python ranges, others will be single-column Frame objects.

See Also

kfold_random() – Perform randomized k-fold split.

datatable.models.kfold_random()

Perform randomized k-fold split of data with nrows rows into nsplits train/test subsets. The dataset itself is not passed to this function: it is sufficient to know only the number of rows in order to decide how the data should be split.

The train/test subsets produced by this function will have the following properties:

  • all test folds will be of approximately the same size nrows/nsplits;

  • all observations have equal ex-ante chance of getting assigned into each fold;

  • the row indices in all train and test folds will be sorted.

The function uses single-pass parallelized algorithm to construct the folds.

Parameters
nrows
int

The number of rows in the frame that you want to split.

nsplits
int

Number of folds, must be at least 2, but not larger than nrows.

seed
int

Seed value for the random number generator used by this function. Calling the function several times with the same seed values will produce same results each time.

return
List[Tuple]

This function returns a list of nsplits tuples (train_rows, test_rows), where each component of the tuple is a rows selector that can be applied to any frame with nrows rows to select the desired folds.

See Also

kfold() – Perform k-fold split.

datatable.options

This namespace contains the following datatable option groups:

.debug

Debug options.

.display

Display options.

.frame

Frame related options.

.fread

fread() related options.

.progress

Progress reporting options.

It also contains the following individual options:

.nthreads

Number of threads used by datatable for parallel computations.

datatable.options.debug

This namespace contains the following debug options:

.arg_max_size

The number of characters to use per function/method argument.

.enabled

Switch that enables logging of the debug information.

.logger

The custom logger object.

.report_args

Switch that enables logging of the function/method arguments.

datatable.options.debug.arg_max_size

When the debug.report_args is True, this option will limit the display size of each argument in order to prevent potentially huge outputs. This option’s value cannot be less than 10.

datatable.options.debug.enabled

If True, all calls to datatable core functions will be logged, together with their timings.

datatable.options.debug.logger

The logger object used for reporting calls to datatable core functions. If None, then the default (built-in) logger will be used. This option has no effect if debug.enabled is False.

datatable.options.debug.report_args

Controls whether log messages about function and method calls contain information about the arguments of those calls.

datatable.options.display

This namespace contains the following display options:

.allow_unicode

Switch that controls if the unicode characters are allowed.

.head_nrows

The number of top rows to display when the frame view is truncated.

.interactive

Switch that controls if the interactive view is enabled or not.

.max_column_width

The threshold for the column’s width to be truncated.

.max_nrows

The threshold for the number of rows in a frame to be truncated.

.tail_nrows

The number of bottom rows to display when the frame view is truncated.

.use_colors

Switch that controls if colors should be used in the console.

datatable.options.display.allow_unicode

If True, datatable will allow unicode characters (encoded as UTF-8) to be printed into the output. If False, then unicode characters will either be avoided, or hex-escaped as necessary.

datatable.options.display.head_nrows

The number of rows from the top of a frame to be displayed when the frame’s output is truncated due to the total number of frame’s rows exceeding display.max_nrows value.

datatable.options.display.interactive

This option controls the behavior of a Frame when it is viewed in a text console. When True, the Frame will be shown in the interactive mode, allowing you to navigate the rows/columns with the keyboard. When False, the Frame will be shown in regular, non-interactive mode (you can still call DT.view() to enter the interactive mode manually).

datatable.options.display.max_column_width

A column’s name or values that exceed max_column_width in size will be truncated. This option applies both to rendering a frame in a terminal, and to rendering in a Jupyter notebook. The smallest allowed max_column_width is 2. Setting the value to None indicates that the column’s content should never be truncated.

datatable.options.display.max_nrows

A frame with more rows than this will be displayed truncated when the frame is printed to the console: only its first display.head_nrows and last display.tail_nrows rows will be printed. It is recommended to have head_nrows + tail_nrows <= max_nrows. Setting this option to None (or a negative value) will cause all rows in a frame to be printed, which may cause the console to become unresponsive.

datatable.options.display.tail_nrows

The number of rows from the bottom of a frame to be displayed when the frame’s output is truncated due to the total number of frame’s rows exceeding display.max_nrows value.

datatable.options.display.use_colors

Whether to use colors when printing various messages into the console. Turn this off if your terminal is unable to display ANSI escape sequences, or if the colors make output not legible.

datatable.options.frame

This namespace contains the following Frame options:

.names_auto_index

Initial value of the default column name index.

.names_auto_prefix

Default column name prefix.

datatable.options.frame.names_auto_index

When Frame needs to auto-name columns, they will be assigned names C0, C1, C2, etc. by default. This option allows you to control the starting index in this sequence. For example, setting dt.options.frame.names_auto_index=1 will cause the columns to be named C1, C2, C3, etc.

datatable.options.frame.names_auto_prefix

When Frame needs to auto-name columns, they will be assigned names C0, C1, C2, etc. by default. This option allows you to control the prefix used in this sequence. For example, setting dt.options.frame.names_auto_prefix='Z' will cause the columns to be named Z0, Z1, Z2, etc.

datatable.options.fread

This namespace contains the following fread() option groups:

.log

Logging related options.

datatable.options.fread.log

This property controls the following logging options:

.anonymize

Switch that controls if the logs should be anonymized.

.escape_unicode

Switch that controls if the unicode characters should be escaped.

datatable.options.fread.log.anonymize

If True, any snippets of data being read that are printed in the log will be first anonymized by converting all non-0 digits to 1, all lowercase letters to a, all uppercase letters to A, and all unicode characters to U.

This option is useful in production systems when reading sensitive data that must not accidentally leak into log files or be printed with the error messages.

datatable.options.fread.log.escape_unicode

If True, all unicode characters in the verbose log will be written in hexadecimal notation. Use this option if your terminal cannot print unicode, or if the output gets somehow corrupted because of the unicode characters.

datatable.options.progress

This namespace contains the following progress reporting options:

.allow_interruption

Switch that controls if the datatable tasks could be interrupted.

.callback

A custom progress-reporting function.

.clear_on_success

Switch that controls if the progress bar is cleared on success.

.enabled

Switch that controls if the progress reporting is enabled.

.min_duration

The minimum duration of a task to show the progress bar.

.updates_per_second

The progress bar update frequency.

datatable.options.progress.allow_interruption

If True, allow datatable to handle the SIGINT signal to interrupt long-running tasks.

datatable.options.progress.callback

If None, then the built-in progress-reporting function will be used. Otherwise, this value specifies a function to be called at each progress event. The function takes a single parameter p, which is a namedtuple with the following fields:

  • p.progress is a float in the range 0.0 .. 1.0;

  • p.status is a string, one of ‘running’, ‘finished’, ‘error’ or ‘cancelled’; and

  • p.message is a custom string describing the operation currently being performed.
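A hedged sketch of a custom callback consuming those fields. The ProgressEvent namedtuple here is only a stand-in for the object datatable actually passes:

```python
from collections import namedtuple

# stand-in for the namedtuple datatable passes to the callback
ProgressEvent = namedtuple("ProgressEvent", ["progress", "status", "message"])

def progress_callback(p):
    # render a 20-character text progress bar from p.progress
    bar = "#" * int(p.progress * 20)
    return f"[{bar:<20}] {p.status}: {p.message}"

progress_callback(ProgressEvent(0.5, "running", "sorting"))
# '[##########          ] running: sorting'
```

Such a function could then be assigned to dt.options.progress.callback.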

datatable.options.progress.clear_on_success

If True, clear progress bar when job finished successfully.

datatable.options.progress.enabled

When False, progress reporting functionality will be turned off. This option is True by default if stdout is connected to a terminal or a Jupyter Notebook, and False otherwise.

datatable.options.progress.min_duration

Do not show progress bar if the duration of an operation is smaller than this value. If this setting is non-zero, then the progress bar will only be shown for long-running operations, whose duration (estimated or actual) exceeds this threshold.

datatable.options.progress.updates_per_second

Number of times per second the display of the progress bar should be updated.

datatable.options.nthreads

The number of threads used by datatable internally.

Many calculations in datatable module are parallelized. This setting controls how many threads will be used during such calculations.

Initially, this option is set to the value returned by C++ call std::thread::hardware_concurrency(). This is usually equal to the number of available cores.

You can set nthreads to a value greater or smaller than the initial setting. For example, setting nthreads = 1 will force the library into a single-threaded mode. Setting nthreads to 0 will restore the initial value equal to the number of processor cores. Setting nthreads to a value less than 0 is equivalent to requesting that many fewer threads than the maximum.

datatable.FExpr

FExpr is an object that encapsulates computations to be done on a frame.

FExpr objects are rarely constructed directly (though it is possible); more commonly they are created as the inputs/outputs of various functions in datatable.

Consider the following example:

math.sin(2 * f.Angle)

Here accessing column “Angle” in namespace f creates an FExpr. Multiplying this FExpr by a python scalar 2 creates a new FExpr. And finally, applying the sine function creates yet another FExpr. The resulting expression can be applied to a frame via the DT[i,j] method, which will compute that expression using the data of that particular frame.

Thus, an FExpr is a stored computation, which can later be applied to a Frame, or to multiple frames.

Because of its delayed nature, an FExpr checks its correctness at the time when it is applied to a frame, not sooner. In particular, it is possible for the same expression to work with one frame, but fail with another. In the example above, the expression may raise an error if there is no column named “Angle” in the frame, or if the column exists but has non-numeric type.

Most functions in datatable that accept an FExpr as an input, return a new FExpr as an output, thus creating a tree of FExprs as the resulting evaluation graph.

Also, all functions that accept FExprs as arguments, will also accept certain other python types as an input, essentially converting them into FExprs. Thus, we will sometimes say that a function accepts FExpr-like objects as arguments.

All binary operators op(x, y) listed below work when either x or y, or both are FExprs.

Construction

.__init__(e)

Create an FExpr.

.extend()

Append another FExpr.

.remove()

Remove columns from the FExpr.

Arithmetic operators

__add__(x, y)

Addition x + y.

__sub__(x, y)

Subtraction x - y.

__mul__(x, y)

Multiplication x * y.

__truediv__(x, y)

Division x / y.

__floordiv__(x, y)

Integer division x // y.

__mod__(x, y)

Modulus x % y (the remainder after integer division).

__pow__(x, y)

Power x ** y.

__pos__(x)

Unary plus +x.

__neg__(x)

Unary minus -x.

Bitwise operators

__and__(x, y)

Bitwise AND x & y.

__or__(x, y)

Bitwise OR x | y.

__xor__(x, y)

Bitwise XOR x ^ y.

__invert__(x)

Bitwise NOT ~x.

__lshift__(x, y)

Left shift x << y.

__rshift__(x, y)

Right shift x >> y.

Relational operators

__eq__(x, y)

Equal x == y.

__ne__(x, y)

Not equal x != y.

__lt__(x, y)

Less than x < y.

__le__(x, y)

Less than or equal x <= y.

__gt__(x, y)

Greater than x > y.

__ge__(x, y)

Greater than or equal x >= y.

Miscellaneous

.__bool__()

Implicitly convert FExpr into a boolean value.

.__repr__()

Used by Python function repr().

.len()

String length.

.re_match(pattern)

Check whether the string column matches a pattern.

datatable.FExpr.__add__()

Add two FExprs together, which corresponds to python operator +.

If x or y are multi-column expressions, then they must have the same number of columns, and the + operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The result of adding two columns with different stypes will have the following stype:

  • max(x.stype, y.stype, int32) if both columns are numeric (i.e. bool, int or float);

  • str32/str64 if at least one of the columns is a string. In this case the + operator implements string concatenation, same as in Python.

Parameters
x, y: FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return: FExpr

An expression that evaluates x + y.

datatable.FExpr.__and__()

Compute bitwise AND of x and y.

If x or y are multi-column expressions, then they must have the same number of columns, and the & operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The AND operator can only be applied to integer or boolean columns. The resulting column will have stype equal to the larger of the stypes of its arguments.

When both x and y are boolean, then the bitwise AND operator is equivalent to logical AND. This can be used to combine several logical conditions into a compound (since Python doesn’t allow overloading of operator and). Beware, however, that & has higher precedence than and, so it is advisable to always use parentheses:

DT[(f.x >= 0) & (f.x <= 1), :]
Parameters
x, y: FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return: FExpr

An expression that evaluates x & y.

Notes

Warning

Use x & y in order to AND two boolean FExprs. Using standard Python keyword and will result in an error.

datatable.FExpr.__bool__()

Using this operator will result in a TypeError.

The boolean-cast operator is used by Python whenever it wants to know whether the object is equivalent to a single True or False value. This is not applicable for a FExpr, which represents stored computation on a column or multiple columns. As such, an error is raised.

In order to convert a column into the boolean stype, you can use the type-cast operator dt.bool8(x).

datatable.FExpr.__eq__()

Compare whether values in columns x and y are equal.

Like all other FExpr operators, the equality operator is elementwise: it produces a column where each element is the result of comparison x[i] == y[i].

If x or y are multi-column expressions, then they must have the same number of columns, and the == operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The equality operator can be applied to columns of any type, and the types of x and y are allowed to be different. In the latter case the columns will be converted into a common stype before the comparison. In practice this means, for example, that the comparison 1 == "1" evaluates to True.

Lastly, the comparison x == None is exactly equivalent to the isna() function.

Parameters
x, y: FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return: FExpr

An expression that evaluates x == y. The produced column will have stype bool8.

datatable.FExpr.__floordiv__()

Perform integer division of two FExprs, i.e. x // y.

The modulus and integer division together satisfy the identity that x == (x // y) * y + (x % y) for all non-zero values of y.

If x or y are multi-column expressions, then they must have the same number of columns, and the // operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The integer division operation can only be applied to integer columns. The resulting column will have stype equal to the largest of the stypes of both columns, but at least int32.

Parameters
x, y: FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return: FExpr

An expression that evaluates x // y.

See also
  • x / y – regular division operator.

datatable.FExpr.__ge__()

Compare whether x >= y.

Like all other FExpr operators, the greater-than-or-equal operator is elementwise: it produces a column where each element is the result of comparison x[i] >= y[i].

If x or y are multi-column expressions, then they must have the same number of columns, and the >= operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The greater-than-or-equal operator can be applied to columns of any type, and the types of x and y are allowed to be different. In the latter case the columns will be converted into a common stype before the comparison.

Parameters
x, y: FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return: FExpr

An expression that evaluates x >= y. The produced column will have stype bool8.

datatable.FExpr.__gt__()

Compare whether x > y.

Like all other FExpr operators, the greater-than operator is elementwise: it produces a column where each element is the result of comparison x[i] > y[i].

If x or y are multi-column expressions, then they must have the same number of columns, and the > operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The greater-than operator can be applied to columns of any type, and the types of x and y are allowed to be different. In the latter case the columns will be converted into a common stype before the comparison.

Parameters
x, y: FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return: FExpr

An expression that evaluates x > y. The produced column will have stype bool8.

datatable.FExpr.__init__()

Create a new FExpr object out of e.

The FExpr serves as a simple wrapper around the underlying object, allowing it to be combined with other FExprs.

This constructor almost never needs to be called manually by the user.

Parameters
e: None | bool | int | str | float | slice | list | tuple | dict | type | stype | ltype | Generator | FExpr | Frame | range | pd.DataFrame | pd.Series | np.array | np.ma.masked_array

The argument that will be converted into an FExpr.

datatable.FExpr.__invert__()

Compute bitwise NOT of x, which corresponds to python operation ~x.

If x is a multi-column expression, then the ~ operator will be applied to each column in turn.

Bitwise NOT can only be applied to integer or boolean columns. The resulting column will have the same stype as its argument.

When the argument x is a boolean column, then ~x is equivalent to logical NOT. This can be used to negate a condition, similar to python operator not (which is not overloadable).

Parameters
x: FExpr

Either an FExpr, or any object that can be converted into FExpr.

return: FExpr

An expression that evaluates ~x.

Notes

Warning

Use ~x in order to negate a boolean FExpr. Using standard Python keyword not will result in an error.

datatable.FExpr.__le__()

Compare whether x <= y.

Like all other FExpr operators, the less-than-or-equal operator is elementwise: it produces a column where each element is the result of comparison x[i] <= y[i].

If x or y are multi-column expressions, then they must have the same number of columns, and the <= operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The less-than-or-equal operator can be applied to columns of any type, and the types of x and y are allowed to be different. In the latter case the columns will be converted into a common stype before the comparison.

Parameters
x, y: FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return: FExpr

An expression that evaluates x <= y. The produced column will have stype bool8.

datatable.FExpr.__lshift__()

Shift x by y bits to the left, i.e. x << y. Mathematically this is equivalent to \(x\cdot 2^y\).

If x or y are multi-column expressions, then they must have the same number of columns, and the << operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The left-shift operator can only be applied to integer columns, and the resulting column will have the same stype as its argument.

Parameters
x, y: FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return: FExpr

An expression that evaluates x << y.

See also
  • x >> y – right-shift operator.
datatable.FExpr.__lt__()

Compare whether x < y.

Like all other FExpr operators, the less-than operator is elementwise: it produces a column where each element is the result of comparison x[i] < y[i].

If x or y are multi-column expressions, then they must have the same number of columns, and the < operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The less-than operator can be applied to columns of any type, and the types of x and y are allowed to be different. In the latter case the columns will be converted into a common stype before the comparison.

Parameters
x, y: FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return: FExpr

An expression that evaluates x < y. The produced column will have stype bool8.

datatable.FExpr.__mod__()

Compute the remainder of division of two FExprs, i.e. x % y.

The modulus and integer division together satisfy the identity that x == (x // y) * y + (x % y) for all non-zero values of y. In addition, the result of x % y is always in the range [0; y) for positive y, and in the range (y; 0] for negative y.

If x or y are multi-column expressions, then they must have the same number of columns, and the % operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The modulus operation can only be applied to integer columns. The resulting column will have stype equal to the largest of the stypes of both columns, but at least int32.

Parameters
x, y: FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return: FExpr

An expression that evaluates x % y.

See also
  • x // y – integer division operator.

datatable.FExpr.__mul__()

Multiply two FExprs together, which corresponds to python operator *.

If x or y are multi-column expressions, then they must have the same number of columns, and the * operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The multiplication operation can only be applied to numeric columns. The resulting column will have stype equal to the larger of the stypes of its arguments, but at least int32.

Parameters
x, y: FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return: FExpr

An expression that evaluates x * y.

datatable.FExpr.__ne__()

Compare whether values in columns x and y are not equal.

Like all other FExpr operators, the inequality operator is elementwise: it produces a column where each element is the result of comparison x[i] != y[i].

If x or y are multi-column expressions, then they must have the same number of columns, and the != operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The inequality operator can be applied to columns of any type, and the types of x and y are allowed to be different. In the latter case the columns will be converted into a common stype before the comparison.

Parameters
x, y: FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return: FExpr

An expression that evaluates x != y. The produced column will have stype bool8.

datatable.FExpr.__neg__()

Unary minus, which corresponds to python operation -x.

If x is a multi-column expression, then the - operator will be applied to each column in turn.

Unary minus can only be applied to numeric columns. The resulting column will have the same stype as its argument, but not less than int32.

Parameters
x: FExpr

Either an FExpr, or any object that can be converted into FExpr.

return: FExpr

An expression that evaluates -x.

datatable.FExpr.__or__()

Compute bitwise OR of x and y.

If x or y are multi-column expressions, then they must have the same number of columns, and the | operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The OR operator can only be applied to integer or boolean columns. The resulting column will have stype equal to the larger of the stypes of its arguments.

When both x and y are boolean, then the bitwise OR operator is equivalent to logical OR. This can be used to combine several logical conditions into a compound (since Python doesn’t allow overloading of operator or). Beware, however, that | has higher precedence than or, so it is advisable to always use parentheses:

DT[(f.x < -1) | (f.x > 1), :]
Parameters
x, y: FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return: FExpr

An expression that evaluates x | y.

Notes

Warning

Use x | y in order to OR two boolean FExprs. Using standard Python keyword or will result in an error.

datatable.FExpr.__pos__()

Unary plus, which corresponds to python operation +x.

If x is a multi-column expression, then the + operator will be applied to each column in turn.

Unary plus can only be applied to numeric columns. The resulting column will have the same stype as its argument, but not less than int32.

Parameters
x: FExpr

Either an FExpr, or any object that can be converted into FExpr.

return: FExpr

An expression that evaluates +x.

datatable.FExpr.__pow__()

Raise x to the power y, or in math notation \(x^y\).

If x or y are multi-column expressions, then they must have the same number of columns, and the ** operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The power operator can only be applied to numeric columns, and the resulting column will have stype float64 in all cases except when both arguments are float32 (in which case the result is also float32).

Parameters
x, y: FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return: FExpr

An expression that evaluates x ** y.

datatable.FExpr.__repr__()

Return string representation of this object. This method is used by Python’s built-in function repr().

The returned string has the following format:

"FExpr<...>"

where ... will attempt to match the expression used to construct this FExpr.

Examples
repr(3 + 2*(f.A + f["B"]))
"FExpr<3 + 2 * (f.A + f['B'])>"
datatable.FExpr.__rshift__()

Shift x by y bits to the right, i.e. x >> y. Mathematically this is equivalent to \(\lfloor x\cdot 2^{-y} \rfloor\).

If x or y are multi-column expressions, then they must have the same number of columns, and the >> operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The right-shift operator can only be applied to integer columns, and the resulting column will have the same stype as its argument.

Parameters
x, y: FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return: FExpr

An expression that evaluates x >> y.

See also
  • x << y – left-shift operator.
datatable.FExpr.__sub__()

Subtract two FExprs, which corresponds to python operation x - y.

If x or y are multi-column expressions, then they must have the same number of columns, and the - operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The subtraction operation can only be applied to numeric columns. The resulting column will have stype equal to the larger of the stypes of its arguments, but at least int32.

Parameters
x, y: FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return: FExpr

An expression that evaluates x - y.

datatable.FExpr.__truediv__()

Divide two FExprs, which corresponds to python operation x / y.

If x or y are multi-column expressions, then they must have the same number of columns, and the / operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The division operation can only be applied to numeric columns. The resulting column will have stype float64 in all cases except when both arguments have stype float32 (in which case the result is also float32).

Parameters
x, y: FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return: FExpr

An expression that evaluates x / y.

See also
  • x // y – integer division operator.

datatable.FExpr.__xor__()

Compute bitwise XOR of x and y.

If x or y are multi-column expressions, then they must have the same number of columns, and the ^ operator will be applied to each corresponding pair of columns. If either x or y are single-column while the other is multi-column, then the single-column expression will be repeated to the same number of columns as its opponent.

The XOR operator can only be applied to integer or boolean columns. The resulting column will have stype equal to the larger of the stypes of its arguments.

When both x and y are boolean, then the bitwise XOR operator is equivalent to logical XOR. This can be used to combine several logical conditions into a compound (since Python doesn’t allow overloading of operator xor). Beware, however, that ^ has higher precedence than xor, so it is advisable to always use parentheses:

DT[(f.x == 0) ^ (f.y == 0), :]
Parameters
x, y: FExpr

The arguments must be either FExprs, or expressions that can be converted into FExprs.

return: FExpr

An expression that evaluates x ^ y.

datatable.FExpr.extend()

Append FExpr arg to the current FExpr.

Each FExpr represents a collection of columns, or a columnset. This method takes two such columnsets and combines them into a single one, similar to cbind().

Parameters
arg: FExpr

The expression to append.

return: FExpr

New FExpr which is a combination of the current FExpr and arg.

See also
  • remove() – remove columns from a columnset.

datatable.FExpr.len()

Deprecated since version 0.11.

Return the string length for a string column. This method can only be applied to string columns, and it returns an integer column as a result.

Since version 1.0 this function will be available in the str. submodule.

datatable.FExpr.re_match()

Deprecated since version 0.11.

Test whether values in a string column match a regular expression.

Since version 1.0 this function will be available in the re. submodule.

Parameters
pattern: str

The regular expression that will be tested against each value in the current column.

flags: int

[unused]

return: FExpr

Return an expression that produces boolean column that tells whether the value in each row of the current column matches the pattern or not.

datatable.FExpr.remove()

Remove columns arg from the current FExpr.

Each FExpr represents a collection of columns, or a columnset. Some of those columns are computed while others are specified “by reference”, for example f.A, f[:3] or f[int]. This method allows you to remove by-reference columns from an existing FExpr.

Parameters
arg: FExpr

The columns to remove. These must be “columns-by-reference”, i.e. they cannot be computed columns.

return: FExpr

New FExpr which is obtained from the current FExpr by removing the columns in arg.

See also
  • extend() – append columns to a columnset.

datatable.Frame

Two-dimensional column-oriented container of data. This is the primary data structure in the datatable module.

A Frame is two-dimensional in the sense that it is comprised of rows and columns of data. Each data cell can be located via a pair of its coordinates: (irow, icol). We do not support frames with more or less than two dimensions.

A Frame is column-oriented in the sense that internally the data is stored separately for each column. Each column has its own name and type. Types may be different for different columns but cannot vary within each column.

Thus, the dimensions of a Frame are not symmetrical: a Frame is not a matrix. Internally the class is optimized for the use case when the number of rows significantly exceeds the number of columns.

A Frame can be viewed as a list of columns: the standard Python function len() will return the number of columns in the Frame, and frame[j] will return the column at index j (each “column” will be a Frame with ncols == 1). Similarly, you can iterate over the columns of a Frame in a loop, or use it in a *-expansion:

for column in frame:
    ...

list_of_columns = [*frame]

A Frame can also be viewed as a dict of columns, where the key associated with each column is its name. Thus, frame[name] will return the column with the requested name. A Frame can also work with standard python **-expansion:

dict_of_columns = {**frame}
Construction

Frame(*args, **kws)

Construct the frame from various Python sources.

dt.fread(src)

Read an external file and convert into a Frame.

.copy()

Create a copy of the frame.

Frame manipulation

frame[i, j, ...]

Primary method for extracting data from a frame.

frame[i, j, ...] = values

Update data within the frame.

del frame[i, j, ...]

Remove rows/columns/values from the frame.

.cbind(*frames)

Append columns of other frames to this frame.

.rbind(*frames)

Append other frames at the bottom of the current.

.replace(what, with)

Search and replace values in the frame.

.sort(cols)

Sort the frame by the specified columns.

Convert into other formats

.to_csv(file)

Write the frame’s data into CSV format.

.to_dict()

Convert the frame into a Python dictionary, by columns.

.to_jay(file)

Store the frame’s data into a binary file in Jay format.

.to_list()

Return the frame’s data as a list of lists, by columns.

.to_numpy()

Convert the frame into a numpy array.

.to_pandas()

Convert the frame into a pandas DataFrame.

.to_tuples()

Return the frame’s data as a list of tuples, by rows.

Properties

.key

The primary key for the Frame, if any.

.ltypes

Logical types (ltypes) of all columns.

.names

The names of all columns in the frame.

.ncols

Number of columns in the frame.

.nrows

Number of rows in the frame.

.shape

A tuple (number of rows, number of columns).

.source

Where this frame was loaded from.

.stype

The common stype for the entire frame.

.stypes

Storage types (stypes) of all columns.

Other methods

.colindex(name)

Find the position of a column in the frame by its name.

.export_names()

Create python variables for each column of the frame.

.head()

Return the first few rows of the frame.

.materialize()

Make sure all frame’s data is physically written to memory.

.tail()

Return the last few rows of the frame.

Special methods

These methods are not intended to be called manually, instead they provide a way for datatable to interoperate with other Python modules or builtin functions.

.__copy__()

Used by Python module copy.

.__deepcopy__()

Used by Python module copy.

.__delitem__()

Method that implements the del DT[...] call.

.__getitem__()

Method that implements the DT[...] call.

.__getstate__()

Used by Python module pickle.

.__init__(...)

The constructor function.

.__iter__()

Used by Python function iter(), or when the frame is used as a target in a loop.

.__len__()

Used by Python function len().

.__repr__()

Used by Python function repr().

.__reversed__()

Used by Python function reversed().

.__setitem__()

Method that implements the DT[...] = expr call.

.__setstate__()

Used by Python module pickle.

.__sizeof__()

Used by sys.getsizeof().

.__str__()

Used by Python function str.

._repr_html_()

Used to display the frame in Jupyter Lab.

._repr_pretty_()

Used to display the frame in an IPython console.

datatable.Frame.__init__()
Frame(_data=None, *, names=None, stypes=None, stype=None, **cols)

Create a new Frame from a single or multiple sources.

Argument _data (or **cols) contains the source data for Frame’s columns. Column names are either derived from the data, given explicitly via the names argument, or generated automatically. Either way, the constructor ensures that column names are unique, non-empty, and do not contain certain special characters (see Name mangling for details).

Parameters
_data: Any

The first argument to the constructor represents the source from which to construct the Frame. If this argument is given, then the varkwd arguments **cols should not be used.

This argument can accept a wide range of data types; see the “Details” section below.

**cols: Any

Sequence of varkwd column initializers. The keys become column names, and the values contain column data. Using varkwd arguments is equivalent to passing a dict as the _data argument.

When varkwd initializers are used, the names parameter may not be given.

names: List[str|None]

Explicit list (or tuple) of column names. The number of elements in the list must be the same as the number of columns being constructed.

This parameter should not be used when constructing the frame from **cols.

stypes: List[stype-like] | Dict[str, stype-like]

Explicit list (or tuple) of column types. The number of elements in the list must be the same as the number of columns being constructed.

stype: stype | type

Similar to stypes, but provide a single type that will be used for all columns. This option cannot be specified together with stypes.

return: Frame

A Frame object is constructed and returned.

except: ValueError

The exception is raised if the lengths of names or stypes lists are different from the number of columns created, or when creating several columns and they have incompatible lengths.

Details

The shape of the constructed Frame depends on the type of the source argument _data (or **cols). The argument _data and varkwd arguments **cols are mutually exclusive: they cannot be used at the same time. However, it is possible to use neither and construct an empty frame:

dt.Frame()       # empty 0x0 frame
dt.Frame(None)   # same
dt.Frame([])     # same

The varkwd arguments **cols can be used to construct a Frame by columns. In this case the keys become column names, and the values are column initializers. This form is mostly used for convenience, it is equivalent to converting cols into a dict and passing as the first argument:

dt.Frame(A = range(7),
         B = [0.1, 0.3, 0.5, 0.7, None, 1.0, 1.5],
         C = ["red", "orange", "yellow", "green", "blue", "indigo", "violet"])
# equivalent to
dt.Frame({"A": range(7), "B": [0.1, 0.3, ...], "C": ["red", "orange", ...]})

The argument _data accepts a wide range of input types. The following list describes possible choices:

List[List | Frame | np.array | pd.DataFrame | pd.Series | range | typed_list]

When the source is a non-empty list containing other lists or compound objects, then each item will be interpreted as a column initializer, and the resulting frame will have as many columns as the number of items in the list.

Each element in the list must produce a single column. Thus, it is not allowed to use multi-column Frames, or multi-dimensional numpy arrays or pandas DataFrames.

>>> dt.Frame([[1, 3, 5, 7, 11],
...           [12.5, None, -1.1, 3.4, 9.17]])
   | C0     C1
-- + --  -----
 0 |  1  12.5
 1 |  3  NA
 2 |  5  -1.1
 3 |  7   3.4
 4 | 11   9.17
--
[5 rows x 2 columns]

Note that unlike pandas and numpy, we treat a list of lists as a list of columns, not a list of rows. If you need to create a Frame from a row-oriented store of data, you can use a list of dictionaries or a list of tuples as described below.

List[Dict]

If the source is a list of dict objects, then each element in this list is interpreted as a single row. The keys in each dictionary are column names, and the values contain contents of each individual cell.

The rows don’t have to have the same number or order of entries: all missing elements will be filled with NAs:

>>> dt.Frame([{"A": 3, "B": 7},
...           {"A": 0, "B": 11, "C": -1},
...           {"C": 5}])
   |  A   B   C
-- + --  --  --
 0 |  3   7  NA
 1 |  0  11  -1
 2 | NA  NA   5
--
[3 rows x 3 columns]

If the names parameter is given, then only the keys given in the list of names will be taken into account, all extra fields will be discarded.

List[Tuple]

If the source is a list of tuples, then each tuple represents a single row. The tuples must have the same size, otherwise an exception will be raised:

>>> dt.Frame([(39, "Mary"),
...           (17, "Jasmine"),
...           (23, "Lily")], names=['age', 'name'])
   | age  name
-- + ---  -------
 0 |  39  Mary
 1 |  17  Jasmine
 2 |  23  Lily
--
[3 rows x 2 columns]

If the tuples are in fact namedtuples, then the field names will be used for the column names in the resulting Frame. No check is made whether the named tuples in fact belong to the same class.

List[Any]

If the list’s first element does not match any of the cases above, then it is considered a “list of primitives”. Such a list will be parsed as a single column.

The entries are typically bools, ints, floats, strs, or Nones; numpy scalars are also allowed. If the list has elements of heterogeneous types, then we will attempt to convert them to the smallest common stype.

If the list contains only boolean values (or Nones), then it will create a column of type bool8.

If the list contains only integers (or Nones), then the resulting column will be int8 if all integers are 0 or 1; int32 if all entries are less than \(2^{31}\) in magnitude; int64 if all entries are less than \(2^{63}\) in magnitude; or float64 otherwise.

If the list contains floats, then the resulting column will have stype float64. Both None and math.nan can be used to input NA values.

Finally, if the list contains strings, then the column produced will have stype str32 if the total size of the character data is less than 2 GB, or str64 otherwise.

typed_list

A typed list can be created by taking a regular list and dividing it by an stype. It behaves similarly to a simple list of primitives, except that it is parsed into the specific stype.

>>> dt.Frame([1.5, 2.0, 3.87] / dt.float32).stype
stype.float32
Dict[str, Any]

The keys are column names, and values can be any objects from which a single-column frame can be constructed: list, range, np.array, single-column Frame, pandas series, etc.

Constructing a frame from a dictionary d is exactly equivalent to calling dt.Frame(list(d.values()), names=list(d.keys())).

range

Same as if the range was expanded into a list of integers, except that the column created from a range is virtual and its creation time is nearly instant regardless of the range’s length.

Frame

If the argument is a Frame, then a shallow copy of that frame will be created, same as copy().

str

If the source is a simple string, then the frame is created by fread-ing this string. In particular, if the string contains the name of a file, the data will be loaded from that file; if it is a URL, the data will be downloaded and parsed from that URL. Lastly, the string may simply contain a table of data.

>>> DT1 = dt.Frame("train.csv")
>>> DT2 = dt.Frame("""
...    Name    Age
...    Mary     39
...    Jasmine  17
...    Lily     23
... """)
pd.DataFrame | pd.Series

A pandas DataFrame (Series) will be converted into a datatable Frame. Column names will be preserved.

Column types will generally be the same, assuming they have a corresponding stype in datatable. If not, the column will be converted. For example, pandas date/time column will get converted into string, while float16 will be converted into float32.

If a pandas frame has an object column, we will attempt to refine it into a more specific stype. In particular, we can detect a string or boolean column stored as object in pandas.

np.array

A numpy array will get converted into a Frame of the same shape (provided it has at most 2 dimensions) and the same type.

If possible, we will create a Frame without copying the data (this is, however, subject to numpy’s approval). The resulting frame will have copy-on-write semantics.

None

When the source is not given at all, then a 0x0 frame will be created; unless a names parameter is provided, in which case the resulting frame will have 0 rows but as many columns as given in the names list.

datatable.Frame.__copy__()

This method facilitates copying of a Frame via the python standard module copy. See copy() for more details.

datatable.Frame.__delitem__()
del DT[i, j, ...]

This method deletes rows and columns that would have been selected from the frame if not for the del keyword.

All parameters have the same meaning as in the getter DT[i, j, ...], with the only restriction that j must select columns from the main frame only (i.e. not from the joined frame(s)), and it must select them by reference. Selecting by reference means it should be possible to tell where each column was in the original frame.

There are several modes of delete operation, depending on whether i or j are “slice-all” symbols:

  • del DT[:, :] removes everything from the frame, making it 0x0;

  • del DT[:, j] removes columns j from the frame;

  • del DT[i, :] removes rows i from the frame;

  • del DT[i, j] the shape of the frame remains the same, but the elements at [i, j] locations are replaced with NAs.

datatable.Frame.__getitem__()

The main method for accessing data and computing on the frame. Sometimes we also refer to it as the DT[i, j, ...] call.

Since Python does not support keyword arguments inside square brackets, all arguments are positional. The first is the row selector i, the second is the column selector j, and the rest are optional. Thus, DT[i, j] selects rows i and columns j from frame DT.

If an additional by argument is present, then the selectors i and j work within groups generated by the by() expression. The sort argument reorders the rows of the frame, and the join argument allows performing SQL joins between several frames.

The signature listed here is the most generic. But there are also special-case signatures DT[j] and DT[i, j] described below.

Parameters
i
int | slice | Frame | FExpr | List[bool] | List[Any]

The row selector.

If this is an integer or a slice, then the behavior is the same as in Python when working on a list with nrows elements. In particular, the integer value must be within the range [-nrows; nrows). On the other hand when i is a slice, then either its start or end or both may be safely outside the row-range of the frame. The trivial slice : always selects all rows.

i may also be a single-column boolean Frame. It must have the same number of rows as the current frame, and it serves as a mask for which rows are to be selected: True indicates that the row should be included in the result, while False and None skips the row.

i may also be a single-column integer Frame. Such column specifies directly which row indices are to be selected. This is more flexible compared to a boolean column: the rows may be repeated, reordered, omitted, etc. All values in the column i must be in the range [0; nrows) or an error will be thrown. In particular, negative indices are not allowed. Also, if the column contains NA values, then it would produce an “invalid row”, i.e. a row filled with NAs.

i may also be an expression, which must evaluate into a single column, either boolean or integer. In this case the result is the same as described above for a single-column frame.

When i is a list of booleans, then it is equivalent to a single-column boolean frame. In particular, the length of the list must be equal to nrows.

Finally, i can be a list of any of the above (integers, slices, frames, expressions, etc), in which case each element of the list is evaluated separately and then all selected rows are put together. The list may contain Nones, which will be simply skipped.

j
int | str | slice | list | dict | type | FExpr | update

This argument may either select columns, or perform computations with the columns.

int

Select a single column at the specified index. An IndexError is raised if j is not in the range [-ncols; ncols).

str

Select a single column by name. A KeyError is raised if the column with such a name does not exist.

:

This is a trivial slice, and it means “select everything”; it is roughly equivalent to SQL’s *. In the simple case of a DT[i, j] call, “selecting everything” means all columns from frame DT. However, when a by() clause is added, : will select all columns except those used in the groupby. And if the expression has a join(), then “selecting everything” will produce all columns from all frames, excluding those that were duplicated during a natural join.

slice[int]

An integer slice can be used to select a subset of columns. The behavior of a slice is exactly the same as in base Python.

slice[str]

A string slice is an expression like "colA":"colZ". In this case all columns from "colA" to "colZ" inclusive are selected. And if "colZ" appears before "colA" in the frame, then the returned columns will be in the reverse order.

Both endpoints of the slice must be valid columns (or omitted), or otherwise a KeyError will be raised.

type | stype | ltype

Select only columns of the matching type.

FExpr

An expression formula is computed within the current evaluation context (i.e. it takes into account the current frame, the filter i, the presence of groupby/join parameters, etc). The result of this evaluation is used as if that column existed in the frame.

List[bool]

If j is a list of boolean values, then it must have the length of ncols, and it describes which columns are to be selected into the result.

List[Any]

The j can also be a list of elements of any other type listed above, with the only restriction that the items must be homogeneous. For example, you can mix ints and slice[int]s, but not ints and FExprs, or ints and strs.

Each item in the list will be evaluated separately (as if each was the sole element in j), and then all the results will be put together.

Dict[str, FExpr]

A dictionary can be used to select columns/expressions similarly to a list, but assigning them explicit names.

update

As a special case, the j argument may be the update() function, which turns the selection operation into an update. That is, instead of returning the chosen rows/columns, they will be updated instead with the user-supplied values.

by
by

When by() clause is present in the square brackets, the rest of the computations are carried out within the “context of a groupby”. This should generally be equivalent to (a) splitting the frame into separate sub-frames corresponding to each group, (b) applying DT[i, j] separately within each group, (c) row-binding the results for each group. In practice the following operations are affected:

  • all reduction operators such as dt.min() or dt.sum() now work separately within each group. Thus, instead of computing sum over the entire column, it is computed separately within each group in by(), and the resulting column will have as many rows as the number of groups.

  • certain i expressions are re-interpreted as being applied within each group. For example, if i is an integer or a slice, then it will now be selecting row(s) within each group.

  • certain functions (such as dt.shift()) are also “group-aware”, and produce results that take into account the groupby context. Check documentation for each individual function to find out whether it has special treatment for groupby contexts.

In addition, by() also affects the order of columns in the output frame. Specifically, all columns listed as the groupby keys will be automatically placed at the front of the resulting frame, and also excluded from : or f[:] within j.

sort
sort

This argument can be used to rearrange rows in the resulting frame. See sort() for details.

join
join

Performs a JOIN operation with another frame. The join() clause will calculate how the rows of the current frame match against the rows of the joined frame, and allow you to refer to the columns of the joined frame within i, j or by. In order to access columns of the joined frame, use the g. namespace.

This parameter may be listed multiple times if you need to join with several frames.

return
Frame | None

If j is an update() clause then current frame is modified in-place and nothing is returned.

In all other cases, the returned value is a Frame object constructed from the selected rows and columns (including the computed columns) of the current frame.

Details

The order of evaluation of expressions is that first the join clause(s) are computed, creating a mapping between the rows of the current frame and the joined frame(s). After that we evaluate by+sort. Next, the i filter is applied creating the final index of rows that will be selected. Lastly, we evaluate the j part, taking into account the current groupby and row index(es).

When evaluating j, it is essentially converted into a tree (DAG) of expressions, where each expression is evaluated from the bottom up. That is, we start evaluating from the leaf nodes (which are usually column selectors such as f[0]), and at each node convert the incoming set of columns into a new set. Importantly, each subexpression node may produce columns of 3 types: “scalar”, “grouped”, and “full-size”. Whenever subexpressions of different levels are mixed together, they are upgraded to the highest level. Thus, a scalar may be reused for each group, and a grouped column can interoperate with a regular column by auto-expanding in such a way that it becomes constant within each group.

If, after the j is fully evaluated, it produces a column set of type “grouped”, then the resulting frame will have as many rows as there are groups. If, on the other hand, the column set is “full-size”, then the resulting frame will have as many rows as the original frame.

See Also

Extract a single column j from the frame.

The single-argument version of DT[i, j] works only for j being either an integer (indicating column index) or a string (column name). If you need any other way of addressing column(s) of the frame, use the more versatile DT[:, j] form.

Parameters
j
int | str

The index or name of a column to retrieve.

return
Frame

Single-column frame containing the column at the specified index or with the given name.

except
KeyError

The exception is raised if the column with the given name does not exist in the frame.

except
IndexError

The exception is raised if the column does not exist at the provided index j.

Extract a single value from the frame.

Parameters
i
int

The index of a row

j
int | str

The index or name of a column.

return
None | bool | int | float | str | object

A single value from the frame’s row i and column j.

datatable.Frame.__getstate__()

This method allows the frame to be pickle-able.

Pickling a Frame involves saving it into a bytes object in Jay format, but may be less efficient than saving into a file directly because Python creates a copy of the data for the bytes object.

See to_jay() for more details and caveats about saving into Jay format.

datatable.Frame.__iter__()

Returns an iterator over the frame’s columns.

The iterator is a light-weight object of type frame_iterator, which yields consecutive columns of the frame with each iteration.

Thus, the iterator produces the sequence frame[0], frame[1], frame[2], ... until the end of the frame. This works even if the user adds or deletes columns in the frame while iterating. Be careful when inserting/deleting columns at an index that was already iterated over, as it will cause some columns to be skipped or visited more than once.

This method is not intended for manual use. Instead, it is invoked by Python runtime either when you call iter(), or when you use the frame in a loop:

for column in frame:
    # column is a Frame of shape (frame.nrows, 1)
    ...
See Also
datatable.Frame.__len__()

Returns the number of columns in the Frame, same as ncols property.

This special method is used by the Python built-in function len(), and allows the Frame class to satisfy the Python Iterable interface.

datatable.Frame.__repr__()

Returns a simple representation of the frame as a string. This method is used by Python’s built-in function repr().

The returned string has the following format:

"<Frame#{ID} {nrows}x{ncols}>"

where {ID} is the value of id(frame) in hex format. Thus, each frame has its own unique id, though after one frame is deleted its id may be reused by another frame.

See Also
datatable.Frame.__reversed__()

Returns an iterator over the frame’s columns in reverse order.

This is similar to __iter__(), except that the columns are returned in the reverse order, i.e. frame[-1], frame[-2], frame[-3], ....

This function is not intended for manual use. Instead, it is invoked by Python builtin function reversed().

datatable.Frame.__setitem__()

This method updates values within the frame, or adds new columns to the frame.

All parameters have the same meaning as in the getter DT[i, j, ...], with the only restriction that j must select columns by reference (i.e. it may not contain any computed columns). On the other hand, j may contain columns that do not exist in the frame yet: these columns will be created.

Parameters
i
...

Row selector.

j
...

Column selector. Computed columns are forbidden, but not-existing (new) columns are allowed.

by
by

Groupby condition.

join
join

Join criterion.

R
FExpr | List[FExpr] | Frame | type | None | bool | int | float | str

The replacement for the selection on the left-hand-side.

None | bool | int | float | str

A simple python scalar can be assigned to any-shape selection on the LHS. If i selects all rows (i.e. the assignment is of the form DT[:, j] = R), then each column in j will be replaced with a constant column containing the value R.

If, on the other hand, i selects only some rows, then the type of R must be consistent with the type of column(s) selected in j. In this case only cells in subset [i, j] will be updated with the value of R; the columns may be promoted within their ltype if the value of R is large in magnitude.

type | stype | ltype

Assigning a type to one or more columns will change the types of those columns. The row selector i must be “slice-all” :.

Frame | FExpr | List[FExpr]

When a frame or an expression is assigned, then the shape of the RHS must match the shape of the LHS. Similarly to the assignment of scalars, types must be compatible when assigning to a subset of rows.

See Also
  • update() – An alternative way to update values in the frame within DT[i, j] getter.

  • Frame.replace() – Search and replace for certain values within the entire frame.

A simplified form of the setter, suitable for a single-column replacement. In this case j may only be an integer or a string.

datatable.Frame.__sizeof__()

Return the size of this Frame in memory.

The function attempts to compute the total memory size of the frame as precisely as possible. In particular, it takes into account not only the size of data in columns, but also sizes of all auxiliary internal structures.

Special cases: if frame is a view (say, d2 = DT[:1000, :]), then the reported size will not contain the size of the data, because that data “belongs” to the original datatable and is not copied. However if a frame selects only a subset of columns (say, d3 = DT[:, :5]), then a view is not created and instead the columns are copied by reference. Frame d3 will report the “full” size of its columns, even though they do not occupy any extra memory compared to DT. This behavior may be changed in the future.

This function is not intended for manual use. Instead, in order to get the size of a frame DT, call sys.getsizeof(DT).

datatable.Frame.__str__()

Returns a string with the Frame’s data formatted as a table, i.e. the same representation as displayed when trying to inspect the frame from Python console.

Different aspects of the stringification process can be controlled via dt.options.display options; but under the default settings the returned string will be sufficiently small to fit into a typical terminal window. If the frame has too many rows/columns, then only a small sample near the start+end of the frame will be rendered.

See Also
datatable.Frame.cbind()

Append columns of one or more frames to the current Frame.

For example, if the current frame has n columns, and you are appending another frame with k columns, then after this method succeeds, the current frame will have n + k columns. Thus, this method is roughly equivalent to pandas.concat(axis=1).

The frames being cbound must all either have the same number of rows, or some of them may have only a single row. Such single-row frames will be automatically expanded, replicating the value as needed. This makes it easy to create constant columns or to append reduction results (such as min/max/mean/etc) to the current Frame.

If some of the frames have an incompatible number of rows, then the operation will fail with an InvalidOperationError. However, if you set the flag force to True, then the error will no longer be raised; instead, all frames that are shorter than the others will be padded with NAs.

If the frames being appended have the same column names as the current frame, then those names will be mangled to ensure that the column names in the current frame remain unique. A warning will also be issued in this case.

Parameters
frames
Frame | List[Frame] | None

The list/tuple/sequence/generator expression of Frames to append to the current frame. The list may also contain None values, which will be simply skipped.

force
bool

If True, allows Frames to be appended even if they have unequal numbers of rows. The resulting Frame will have its number of rows equal to the largest among all Frames. Frames which have fewer than the largest number of rows will be padded with NAs (with the exception of Frames having just 1 row, which will be replicated instead of being filled with NAs).

return
None

This method alters the current frame in-place, and doesn’t return anything.

except
InvalidOperationError

If trying to cbind frames with the number of rows different from the current frame’s, and the option force is not set.

Notes

Cbinding frames is a very cheap operation: the columns are copied by reference, which means the complexity of the operation depends only on the number of columns, not on the number of rows. Still, if you are planning to cbind a large number of frames, it will be beneficial to collect them in a list first and then call a single cbind() instead of cbinding them one-by-one.

It is possible to cbind frames using the standard DT[i,j] syntax:

df[:, update(**frame1, **frame2, ...)]

Or, if you need to append just a single column:

df["newcol"] = frame1
Examples
DT = dt.Frame(A=[1, 2, 3], B=[4, 7, 0])
frame1 = dt.Frame(N=[-1, -2, -5])
DT.cbind(frame1)
DT
   |  A   B   N
-- + --  --  --
 0 |  1   4  -1
 1 |  2   7  -2
 2 |  3   0  -5
--
[3 rows x 3 columns]
See also
  • dt.cbind() – function for cbinding frames “out-of-place” instead of in-place;

  • rbind() – method for row-binding frames.

datatable.Frame.colindex()

Return the position of the column in the Frame.

The index of the first column is 0, just as with regular python lists.

Parameters
column
str | int | FExpr

If string, then this is the name of the column whose index you want to find.

If integer, then this represents a column’s index. The return value is thus the same as the input argument column, provided that it is in the correct range. If the column argument is negative, then it is interpreted as counting from the end of the frame. In this case the positive value column + ncols is returned.

Lastly, column argument may also be an f-expression such as f.A or f[3]. This case is treated as if the argument was simply "A" or 3. More complicated f-expressions are not allowed and will result in a TypeError.

return
int

The numeric index of the provided column. This will be an integer between 0 and self.ncols - 1.

except
KeyError | IndexError

If the column argument is a string, and the column with such name does not exist in the frame, then a KeyError is raised. When this exception is thrown, the error message may contain suggestions for up to 3 similarly looking column names that actually exist in the Frame.

If the column argument is an integer that is either greater than or equal to ncols or less than -ncols, then an IndexError is raised.

Examples
df = dt.Frame(A=[3, 14, 15], B=["enas", "duo", "treis"],
              C=[0, 0, 0])
df.colindex("B")
1
df.colindex(-1)
2
from datatable import f
df.colindex(f.A)
0
datatable.Frame.copy()

Make a copy of the frame.

The returned frame will be an identical copy of the original, including column names, types, and keys.

By default, copying is shallow with copy-on-write semantics. This means that only the minimal information about the frame is copied, while all the internal data buffers are shared between the copies. Nevertheless, due to the copy-on-write semantics, any changes made to one of the frames will not propagate to the other; instead, the data will be copied whenever the user attempts to modify it.

It is also possible to explicitly request a deep copy of the frame by setting the parameter deep to True. With this flag, the returned copy will be truly independent from the original. The returned frame will also be fully materialized in this case.

Parameters
deep
bool

Flag indicating whether to return a “shallow” (default), or a “deep” copy of the original frame.

return
Frame

A new Frame, which is the copy of the current frame.

Examples
DT1 = dt.Frame(range(5))
DT2 = DT1.copy()
DT2[0, 0] = -1
DT2.to_list()
[[-1, 1, 2, 3, 4]]
DT1.to_list()
[[0, 1, 2, 3, 4]]
Notes
  • Non-deep frame copy is a very low-cost operation: its speed depends on the number of columns only, not on the number of rows. On a regular laptop copying a 100-column frame takes about 30-50µs.

  • Deep copying is more expensive, since the data has to be physically written to new memory, and if the source columns are virtual, then they need to be computed too.

  • Another way to create a copy of the frame is using a DT[i, j] expression (however, this will not copy the key property):

    DT[:, :]
    
  • Frame class also supports copying via the standard Python library copy:

    import copy
    DT_shallow_copy = copy.copy(DT)
    DT_deep_copy = copy.deepcopy(DT)
    
datatable.Frame.export_names()
New in version 0.10

Return a tuple of f-expressions for all columns of the frame.

For example, if the frame has columns “A”, “B”, and “C”, then this method will return a tuple of expressions (f.A, f.B, f.C). If you assign these to, say, variables A, B, and C, then you will be able to write column expressions using the column names directly, without using the f symbol:

A, B, C = DT.export_names()
DT[A + B > C, :]

The variables that are “exported” refer to each column by name. This means that you can use the variables even after reordering the columns. In addition, the variables will work not only for the frame they were exported from, but also for any other frame that has columns with the same names.

Parameters
return
Tuple[Expr, ...]

The length of the tuple is equal to the number of columns in the frame. Each element of the tuple is a datatable expression, and can be used primarily with the DT[i,j] notation.

Notes
  • This method is effectively equivalent to:

    def export_names(self):
        return tuple(f[name] for name in self.names)
    
  • If you want to export only a subset of column names, then you can either subset the frame first, or use *-notation to ignore the names that you do not plan to use:

    A, B = DT[:, :2].export_names()  # export the first two columns
    A, B, *_ = DT.export_names()     # same
    
  • Variables that you use in code do not have to have the same names as the columns:

    Price, Quantity = DT[:, ["sale price", "quant"]].export_names()
    
datatable.Frame.head()

Return the first n rows of the frame.

If the number of rows in the frame is less than n, then all rows are returned.

This is a convenience function and it is equivalent to DT[:n, :].

Parameters
n
int

The maximum number of rows to return, 10 by default. This number cannot be negative.

return
Frame

A frame containing the first up to n rows from the original frame, and same columns.

Examples
DT = dt.Frame(A=["apples", "bananas", "cherries", "dates",
                 "eggplants", "figs", "grapes", "kiwi"])
DT.head(4)
   | A
-- + --------
 0 | apples
 1 | bananas
 2 | cherries
 3 | dates
--
[4 rows x 1 column]
See also
  • tail() – return the last n rows of the Frame.

datatable.Frame.key

The tuple of column names that are the primary key for this frame.

If the frame has no primary key, this property returns an empty tuple.

The primary key columns are always located at the beginning of the frame, and therefore the following holds:

DT.key == DT.names[:len(DT.key)]

Assigning to this property will make the Frame keyed by the specified column(s). The key columns will be moved to the front, and the Frame will be sorted. The values in the key columns must be unique.

Parameters
return
Tuple[str, ...]

When used as a getter, returns the tuple of names of the primary key columns.

newkey
str | List[str] | Tuple[str, ...] | None

Specify a column or a list of columns that will become the new primary key of the Frame. Object columns cannot be used for a key. The values in the key column must be unique; if multiple columns are assigned as the key, then their combined (tuple-like) values must be unique.

If newkey is None, then this is equivalent to deleting the key. When the key is deleted, the key columns remain in the frame, they merely stop being marked as “key”.

except
ValueError

Raised when the values in the key column(s) are not unique.

except
KeyError

Raised when newkey contains a column name that doesn’t exist in the Frame.

Examples
DT = dt.Frame(A=range(5), B=['one', 'two', 'three', 'four', 'five'])
DT.key = 'B'
DT
B     |  A
----- + --
five  |  4
four  |  3
one   |  0
three |  2
two   |  1
--
[5 rows x 2 columns]
datatable.Frame.keys()

Returns a tuple of column names, same as names property.

This method is not intended for public use. It is needed in order for Frame to satisfy Python’s Mapping interface.

datatable.Frame.ltypes

The tuple of each column’s ltypes (“logical types”).

Parameters
return
Tuple[ltype, ...]

The length of the tuple is the same as the number of columns in the frame.

See also
  • stypes – tuple of columns’ storage types

datatable.Frame.materialize()

Force all data in the Frame to be laid out physically.

In datatable, a Frame may contain “virtual” columns, i.e. columns whose data is computed on-the-fly. This allows us to have better performance for certain types of computations, while also reducing the total memory footprint. The use of virtual columns is generally transparent to the user, and datatable will materialize them as needed.

However, there could be situations where you might want to materialize your Frame explicitly. In particular, materialization will carry out all delayed computations and break internal references on other Frames’ data. Thus, for example if you subset a large frame to create a smaller subset, then the new frame will carry an internal reference to the original, preventing it from being garbage-collected. However, if you materialize the small frame, then the data will be physically copied, allowing the original frame’s memory to be freed.

Parameters
to_memory
bool

If True, then, in addition to de-virtualizing all columns, this method will also copy all memory-mapped columns into the RAM.

When you open a Jay file, the Frame that is created will contain memory-mapped columns whose data still resides on disk. Calling .materialize(to_memory=True) will force the data to be loaded into the main memory. This may be beneficial if you are concerned about the disk speed, or if the file is on a removable drive, or if you want to delete the source file.

return
None

This operation modifies the frame in-place.

datatable.Frame.names

The tuple of names of all columns in the frame.

Each name is a non-empty string not containing any ASCII control characters, and jointly the names are unique within the frame.

This property is also assignable: setting DT.names has the effect of renaming the frame’s columns without changing their order. When renaming, the length of the new list of names must be the same as the number of columns in the frame. It is also possible to rename just a few of the columns by assigning a dictionary {oldname: newname}. Any column not listed in the dictionary will keep its old name.

When setting new column names, we will verify whether they satisfy the requirements mentioned above. If not, a warning will be emitted and the names will be automatically mangled.

Parameters
return
Tuple[str, ...]

When used in getter form, this property returns the names of all frame’s columns, as a tuple. The length of the tuple is equal to the number of columns in the frame, ncols.

newnames
List[str?] | Tuple[str?, ...] | Dict[str, str?] | None

The most common form is to assign the list or tuple of new column names. The length of the new list must be equal to the number of columns in the frame. Some (or all) elements in the list may be None’s, indicating that that column should have an auto-generated name.

If newnames is a dictionary, then it provides a mapping from old to new column names. The dictionary may contain fewer entries than the number of columns in the frame: the columns not mentioned in the dictionary will retain their names.

Setting the .names to None is equivalent to using the del keyword: the names will be set to their default values, which are usually C0, C1, ....

except
ValueError

If the length of the list/tuple newnames does not match the number of columns in the frame.

except
KeyError

If newnames is a dictionary containing entries that do not match any of the existing columns in the frame.

Examples
DT = dt.Frame([[1], [2], [3]])
DT.names = ['A', 'B', 'C']
DT.names
('A', 'B', 'C')
DT.names = {'B': 'middle'}
DT.names
('A', 'middle', 'C')
del DT.names
DT.names
('C0', 'C1', 'C2')
datatable.Frame.ncols

Number of columns in the frame.

Parameters
return
int

The number of columns can be either zero or a positive integer.

Notes

The expression len(DT) also returns the number of columns in the frame DT. Such usage, however, is not recommended.

See also
  • nrows: getter for the number of rows of the frame.

datatable.Frame.nrows

Number of rows in the Frame.

Assigning to this property will change the height of the Frame, either by truncating if the new number of rows is smaller than the current, or filling with NAs if the new number of rows is greater.

Increasing the number of rows of a keyed Frame is not allowed.

Parameters
return
int

The number of rows can be either zero or a positive integer.

n
int

The new number of rows for the frame; this should be a non-negative integer.

See also
  • ncols: getter for the number of columns of the frame.

datatable.Frame.rbind()

Append rows of frames to the current frame.

This is equivalent to list.extend() in Python: the frames are combined by rows, i.e. rbinding a frame of shape [n x k] to a Frame of shape [m x k] produces a frame of shape [(m + n) x k].

This method modifies the current frame in-place. If you do not want the current frame modified, then use the rbind() function.

If frame(s) being appended have columns of types different from the current frame, then these columns will be promoted to the largest of their types: bool -> int -> float -> string.

If you need to append multiple frames, then it is more efficient to collect them into an array first and then do a single rbind(), than it is to append them one-by-one in a loop.

Appending data to a frame opened from disk will force loading the current frame into memory, which may fail with an OutOfMemory exception if the frame is sufficiently big.

Parameters
frames
Frame | List[Frame]

One or more frames to append. These frames should have the same columnar structure as the current frame (unless option force is used).

force
bool

If True, then the frames are allowed to have mismatching set of columns. Any gaps in the data will be filled with NAs.

bynames
bool

If True (default), the columns in frames are matched by their names. For example, if one frame has columns [“colA”, “colB”, “colC”] and the other [“colB”, “colA”, “colC”] then we will swap the order of the first two columns of the appended frame before performing the append. However if bynames is False, then the column names will be ignored, and the columns will be matched according to their order, i.e. i-th column in the current frame to the i-th column in each appended frame.

return
None
datatable.Frame.replace()

Replace given value(s) replace_what with replace_with in the entire Frame.

For each replace value, this method operates only on columns of types appropriate for that value. For example, if replace_what is a list [-1, math.inf, None, "??"], then the value -1 will be replaced in integer columns only, math.inf only in real columns, None in columns of all types, and finally "??" only in string columns.

The replacement value must match the type of the target being replaced, otherwise an exception will be thrown. That is, a bool must be replaced with a bool, an int with an int, a float with a float, and a string with a string. The None value (representing NA) matches any column type, and therefore can be used as either replacement target, or replace value for any column. In particular, the following is valid: DT.replace(None, [-1, -1.0, ""]). This will replace NA values in int columns with -1, in real columns with -1.0, and in string columns with an empty string.

The replace operation never causes a column to change its logical type. Thus, an integer column will remain integer, string column remain string, etc. However, replacing may cause a column to change its stype, provided that ltype remains constant. For example, replacing 0 with -999 within an int8 column will cause that column to be converted into the int32 stype.

Parameters
replace_what
None | bool | int | float | list | dict

Value(s) to search for and replace.

replace_with
single value | list

The replacement value(s). If replace_what is a single value, then this must be a single value too. If replace_what is a list, then this could be either a single value, or a list of the same length. If replace_what is a dict, then this value should not be passed.

return
None

Nothing is returned, the replacement is performed in-place.

Examples
df = dt.Frame([1, 2, 3] * 3)
df.replace(1, -1)
df.to_list()
[[-1, 2, 3, -1, 2, 3, -1, 2, 3]]
df.replace({-1: 100, 2: 200, "foo": None})
df.to_list()
[[100, 200, 3, 100, 200, 3, 100, 200, 3]]
datatable.Frame.shape

Tuple with (nrows, ncols) dimensions of the frame.

This property is read-only.

Parameters
return
Tuple[int, int]

Tuple with two integers: the first is the number of rows, the second is the number of columns.

See also
  • nrows – getter for the number of rows;

  • ncols – getter for the number of columns.

datatable.Frame.sort()

Sort frame by the specified column(s).

Parameters
cols
List[str | int]

Names or indices of the columns to sort by. If no columns are given, the Frame will be sorted on all columns.

return
Frame

New Frame sorted by the provided column(s). The current frame remains unmodified.

datatable.Frame.source
New in version 0.11

The name of the file where this frame was loaded from.

This is a read-only property that describes the origin of the frame. When a frame is loaded from a Jay or CSV file, this property will contain the name of that file. Similarly, if the frame was opened from a URL or from a shell command, the source will report the original URL or the command.

Certain sources may be converted into a Frame only partially, in such case the source property will attempt to reflect this fact. For example, when opening a multi-file zip archive, the source will contain the name of the file within the archive. Similarly, when opening an XLS file with several worksheets, the source property will contain the name of the XLS file, the name of the worksheet, and possibly even the range of cells that were read.

Parameters
return
str | None

If the frame was loaded from a file or similar resource, the name of that file is returned. If the frame was computed, or its data modified, the property will return None.

datatable.Frame.stype
New in version v0.10.0

The common stype for all columns.

This property is well-defined only for frames where all columns have the same stype.

Parameters
return
stype | None

For frames where all columns have the same stype, this common stype is returned. If a frame has 0 columns, None will be returned.

except
InvalidOperationError

This exception will be raised if the columns in the frame have different stypes.
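A brief sketch of both outcomes (the exception type is caught generically here for illustration):

```python
import datatable as dt

# All columns share one stype, so .stype is well-defined
DT = dt.Frame(A=[1, 2], B=[3, 4])
assert DT.stype == dt.stype.int32

# Columns of different stypes: accessing .stype raises an error
mixed = dt.Frame(A=[1, 2], B=[0.5, 1.5])
failed = False
try:
    mixed.stype
except Exception:      # InvalidOperationError
    failed = True
assert failed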

See also
  • stypes – tuple of stypes for all columns.

datatable.Frame.stypes

The tuple of each column’s stype (“storage type”).

Parameters
return
Tuple[stype, ...]

The length of the tuple is the same as the number of columns in the frame.

See also
  • stype – common stype for all columns

  • ltypes – tuple of columns’ logical types

datatable.Frame.tail()

Return the last n rows of the frame.

If the number of rows in the frame is less than n, then all rows are returned.

This is a convenience function and it is equivalent to DT[-n:, :] (except when n is 0).

Parameters
n
int

The maximum number of rows to return, 10 by default. This number cannot be negative.

return
Frame

A frame containing the last up to n rows from the original frame, and same columns.

Examples
DT = dt.Frame(A=["apples", "bananas", "cherries", "dates",
                 "eggplants", "figs", "grapes", "kiwi"])
DT.tail(3)
A
▪▪▪▪
0figs
1grapes
2kiwi
See also
  • head() – return the first n rows of the Frame.

datatable.Frame.to_csv()
to_csv(path=None, *, quoting="minimal", append=False, header="auto", bom=False, hex=False, compression="auto", verbose=False, method="auto")

Write the contents of the Frame into a CSV file.

This method uses multiple threads to serialize the Frame’s data. The number of threads can be configured using the global option dt.options.nthreads.

The method supports simple writing to file, appending to an existing file, or creating a python string if no filename was provided. Optionally, the output could be gzip-compressed.

Parameters
path
str

Path to the output CSV file that will be created. If the file already exists, it will be overwritten. If no path is given, then the Frame will be serialized into a string, and that string will be returned.

quoting
csv.QUOTE_* | "minimal" | "all" | "nonnumeric" | "none"
"minimal" | csv.QUOTE_MINIMAL

quote the string fields only as necessary, i.e. if the string starts or ends with whitespace, or contains quote characters, the separator, or any of the C0 control characters (including newlines).

"all" | csv.QUOTE_ALL

all fields will be quoted: string, numeric, and boolean alike.

"nonnumeric" | csv.QUOTE_NONNUMERIC

all string fields will be quoted.

"none" | csv.QUOTE_NONE

none of the fields will be quoted. This option must be used at the user’s own risk: the file produced may not be valid CSV.

append
bool

If True, the file given in the path parameter will be opened for appending (i.e. mode=”a”), or created if it doesn’t exist. If False (default), the file given in the path will be overwritten if it already exists.

bom
bool

If True, then insert the byte-order mark into the output file (the option is False by default). Even if the option is True, the BOM will not be written when appending data to an existing file.

According to Unicode standard, including BOM into text files is “neither required nor recommended”. However, some programs (e.g. Excel) may not be able to recognize file encoding without this mark.

hex
bool

If True, then all floating-point values will be printed in hex format (equivalent to %a format in C printf). This format is around 3 times faster to write/read compared to usual decimal representation, so its use is recommended if you need maximum speed.

compression
None | "gzip" | "auto"

Which compression method to use for the output stream. The default is “auto”, which tries to infer the compression method from the output file’s name. The only compression format currently supported is “gzip”. Compression may not be used when append is True.

verbose
bool

If True, some extra information will be printed to the console, which may help to debug the inner workings of the algorithm.

method
"mmap" | "write" | "auto"

Which method to use for writing to disk. On certain systems ‘mmap’ gives better performance; on other OSes ‘mmap’ may not work at all.

return
None | str | bytes

None if path is non-empty. This is the most common case: the output is written to the file provided.

String containing the CSV text as if it would have been written to a file, if the path is empty or None. If the compression is turned on, a bytes object will be returned instead.

datatable.Frame.to_dict()

Convert the frame into a dictionary of lists, by columns.

In Python 3.6+ the order of records in the dictionary will be the same as the order of columns in the frame.

Parameters
return
Dict[str, List]

Dictionary with ncols records. Each record represents a single column: the key is the column’s name, and the value is the list with the column’s data.

Examples
DT = dt.Frame(A=[1, 2, 3], B=["aye", "nay", "tain"])
DT.to_dict()
{"A": [1, 2, 3], "B": ["aye", "nay", "tain"]}
See also
  • to_list(): convert the frame into a list of lists

  • to_tuples(): convert the frame into a list of tuples by rows

datatable.Frame.to_jay()
to_jay(path=None, method='auto')

Save this frame to a binary file on disk, in .jay format.

Parameters
path
str | None

The destination file name. Although not necessary, we recommend using extension “.jay” for the file. If the file exists, it will be overwritten. If this argument is omitted, the file will be created in memory instead, and returned as a bytes object.

method
'mmap' | 'write' | 'auto'

Which method to use for writing the file to disk. The “write” method is more portable across different operating systems, but may be slower. This parameter has no effect when path is omitted.

return
None | bytes

If the path parameter is given, this method returns nothing. However, if path was omitted, the return value is a bytes object containing encoded frame’s data.

datatable.Frame.to_list()

Convert the frame into a list of lists, by columns.

Parameters
return
List[List]

A list of ncols lists, each inner list representing one column of the frame.

Examples
DT = dt.Frame(A=[1, 2, 3], B=["aye", "nay", "tain"])
DT.to_list()
[[1, 2, 3], ["aye", "nay", "tain"]]
dt.Frame(id=range(10)).to_list()
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
datatable.Frame.to_numpy()

Convert frame into a 2D numpy array, optionally forcing it into the specified stype/dtype.

In a limited set of circumstances the returned numpy array will be created as a data view, avoiding copying the data. This happens if all of these conditions are met:

  • the frame has only 1 column, which is not virtual;

  • the column’s type is not string;

  • the stype argument was not used.

In all other cases the returned numpy array will contain a copy of the frame’s data. If the frame has multiple columns of different stypes, then the values will be upcast into the smallest common stype.

If the frame has any NA values, then the returned numpy array will be an instance of numpy.ma.masked_array.

Parameters
stype
stype | numpy.dtype | str | type

Cast frame into this stype before converting it into a numpy array.

column
int

Convert only the specified column; the returned value will be a 1D-array instead of a regular 2D-array.

return
numpy.array
except
ImportError

If the numpy module is not installed.

datatable.Frame.to_pandas()

Convert this frame to a pandas DataFrame.

Parameters
return
pandas.DataFrame
except
ImportError

If the pandas module is not installed.

datatable.Frame.to_tuples()

Convert the frame into a list of tuples, by rows.

Parameters
return
List[Tuple]

Returns a list having nrows tuples, where each tuple has length ncols and contains data from each respective row of the frame.

Examples
DT = dt.Frame(A=[1, 2, 3], B=["aye", "nay", "tain"])
DT.to_tuples()
[(1, "aye"), (2, "nay"), (3, "tain")]
datatable.Frame.view()

This function is currently not working properly

datatable.ltype

class
ltype

Enumeration of possible “logical” types of a column.

Logical type is the type stripped away from the details of its physical storage. For example, ltype.int represents an integer. Under the hood, this integer can be stored in several “physical” formats: from stype.int8 to stype.int64. Thus, there is a one-to-many relationship between ltypes and stypes.

Values

The following ltype values are currently available:

  • ltype.bool

  • ltype.int

  • ltype.real

  • ltype.str

  • ltype.time

  • ltype.obj

Methods

ltype(x)

Find ltype corresponding to value x.

.stypes

The list of stypes that correspond to this ltype.

Examples
dt.ltype.bool
ltype.bool
dt.ltype("int32")
ltype.int

For each ltype, you can find the set of stypes that correspond to it:

dt.ltype.real.stypes
[stype.float32, stype.float64]
dt.ltype.time.stypes
[]
datatable.ltype.__new__()

Find an ltype corresponding to value.

This method is similar to stype.__new__(), except that it returns an ltype instead of an stype.

datatable.ltype.stypes()

List of stypes that represent this ltype.

Parameters
return
List[stype]

datatable.Namespace

class
Namespace

A namespace is an environment that provides lazy access to columns of a frame when performing computations within DT[i,j,...].

This class should not be instantiated directly, instead use the singleton instances f and g exported from the datatable module.

Special methods

__getattribute__(attr)

Access columns as attributes.

__getitem__(item)

Access columns by their names / indices.

datatable.Namespace.__getitem__()

Retrieve column(s) by their indices/names/types.

By “retrieve” we actually mean that an expression is created such that when that expression is used within the DT[i,j] call, it would locate and return the specified column(s).

Parameters
item
int | str | slice | None | type | stype | ltype

The column selector:

int

Retrieve the column at the specified index. For example, f[0] denotes the first column, while f[-1] is the last.

str

Retrieve a column by name.

slice

Retrieve a slice of columns from the namespace. Both integer and string slices are supported.

None

Retrieve no columns (an empty columnset).

type | stype | ltype

Retrieve columns matching the specified type.

return
FExpr

An expression that selects the specified column from a frame.

See also
  • __getattribute__() – retrieve a column as an attribute.
datatable.Namespace.__getattribute__()

Retrieve a column from the namespace by name.

This is a convenience form that can be used to access simply-named columns. For example: f.Age denotes a column called "Age", and is exactly equivalent to f['Age'].

Parameters
name
str

Name of the column to select.

return
FExpr

An expression that selects the specified column from a frame.

See also
  • __getitem__() – retrieve columns by their names / indices.

datatable.stype

class
stype

Enumeration of possible “storage” types of columns in a Frame.

Each column in a Frame is a vector of values of the same type. We call this column’s type the “stype”. Most stypes correspond to primitive C types, such as int32_t or double. However some stypes (corresponding to strings and categoricals) have a more complicated underlying structure.

Notably, datatable does not support arbitrary structures as elements of a Column, so the set of stypes is small.

Values

The following stype values are currently available:

  • stype.bool8

  • stype.int8

  • stype.int16

  • stype.int32

  • stype.int64

  • stype.float32

  • stype.float64

  • stype.str32

  • stype.str64

  • stype.obj64

They are available either as properties of the dt.stype class, or directly as constants in the dt. namespace. For example:

>>> dt.stype.int32
stype.int32
>>> dt.int64
stype.int64
Methods

stype(x)

Find stype corresponding to value x.

<stype>(col)

Cast a column into the specific stype.

.ctype

ctypes type corresponding to this stype.

.dtype

numpy dtype corresponding to this stype.

.ltype

ltype corresponding to this stype.

.struct

struct string corresponding to this stype.

.min

The smallest numeric value for this stype.

.max

The largest numeric value for this stype.

datatable.stype.__call__()

Cast column col into the new stype.

An stype can be used as a function that converts columns into that specific stype. In the same way as you could write int(3.14) in Python to convert a float value into integer, you can likewise write dt.int32(f.A) to convert column A into stype int32.

Parameters
col
FExpr

A single- or multi- column expression. All columns will be converted into the desired stype.

return
FExpr

Expression that converts its inputs into the current stype.

datatable.stype.__new__()

Find an stype corresponding to value.

This method is called when you attempt to construct a new stype object, for example dt.stype(int). Instead of actually creating any new stypes, we return one of the existing stype values.

Parameters
value
str | type | np.dtype

An object that will be converted into an stype. This could be a string such as "integer" or "int" or "int8", a python type such as bool or float, or a numpy dtype.

return
stype

An stype that corresponds to the input value.

except
ValueError

Raised if value does not correspond to any stype.

Examples
dt.stype(str)
stype.str64
dt.stype("double")
stype.float64
dt.stype(numpy.dtype("object"))
stype.obj64
dt.stype("int64")
stype.int64
datatable.stype.ctype

ctypes class that describes the C-level type of each element in a column with this stype.

For non-fixed-width columns (such as str32) this will return the ctype of only the fixed-width component of that column. Thus, stype.str32.ctype == ctypes.c_int32.

datatable.stype.dtype

numpy.dtype object that corresponds to this stype.

datatable.stype.ltype

ltype corresponding to this stype. Several stypes may map to the same ltype, whereas each stype is described by exactly one ltype.

datatable.stype.max

The largest finite value that this stype can represent.

datatable.stype.min

The smallest finite value that this stype can represent.

datatable.stype.struct

struct format string corresponding to this stype.

For non-fixed-width columns (such as str32) this will return the format string of only the fixed-width component of that column. Thus, stype.str32.struct == '=i'.

datatable.build_info

build_info

This is a python struct that contains information about the installed datatable module. The following fields are available:

.version
str

The version string of the current build. Several formats of the version string are possible:

  • {MAJOR}.{MINOR}.{MICRO} – the release version string, such as "0.11.0".

  • {RELEASE}a{DEVNUM} – version string for the development build of datatable, where {RELEASE} is the normal release string and {DEVNUM} is an integer that is incremented with each build. For example: "0.11.0a1776".

  • {RELEASE}a0+{SUFFIX} – version string for a PR build of datatable, where the {SUFFIX} is formed from the PR number and the build sequence number. For example, "0.11.0a0+pr2602.13".

  • {RELEASE}a0+{FLAVOR}.{TIMESTAMP}.{USER} – version string used for local builds. This contains the “flavor” of the build, such as normal build, or debug, or coverage, etc; the unix timestamp of the build; and lastly the system user name of the user who made the build.

.build_date
str

UTC timestamp (date + time) of the build.

.git_revision
str

Git-hash of the revision from which the build was made, as obtained from git rev-parse HEAD.

.git_branch
str

Name of the git branch from where the build was made. This will be obtained from environment variable CHANGE_BRANCH if defined, or from command git rev-parse --abbrev-ref HEAD otherwise.

.git_date
str
New in version 0.11

Timestamp of the git commit from which the build was made.

.git_diff
str
New in version 0.11

If the source tree contains any uncommitted changes (compared to the checked out git revision), then the summary of these changes will be in this field, as reported by git diff HEAD --stat --no-color. Otherwise, this field is an empty string.

datatable.by()

Group-by clause for use in Frame’s square-bracket selector.

Whenever a by() object is present inside a DT[i, j, ...] expression, it causes all other expressions to be evaluated in group-by mode. This mode changes the evaluation semantics as follows:

  • A “Groupby” object will be computed for the frame DT, grouping it by columns specified as the arguments to the by() call. This object keeps track of which rows of the frame belong to which group.

  • If an i expression is present (row filter), it will be interpreted within each group. For example, if i is a slice, then the slice will be applied separately to each group. Similarly, if i expression contains a formula with reduce functions, then those functions will be evaluated for each group. For example:

    DT[f.A == max(f.A), :, by(f.group_id)]
    

    will select those rows where column A reaches its peak value within each group (there could be multiple such rows within each group).

  • Before j is evaluated, the by() clause adds all its columns at the start of j (unless add_columns argument is False). If j is a “select-all” slice (i.e. :), then those columns will also be excluded from the list of all columns so that they will be present in the output only once.

  • During evaluation of j, the reducer functions, such as min(), sum(), etc, will be evaluated by-group, that is they will find the minimal value in each group, the sum of values in each group, and so on. If a reducer expression is combined with a regular column expression, then the reduced column will be auto-expanded into a column that is constant within each group.

  • Note that if both i and j contain reducer functions, then those functions will have slightly different notion of groups: the reducers in i will see each group “in full”, whereas the reducers in j will see each group after it was filtered by the expression in i (and possibly not even see some of the groups at all, if they were filtered out completely).

  • If j contains only reducer expressions, then the final result will be a Frame containing just a single row for each group. This resulting frame will also be keyed by the grouped-by columns.

The by() function expects a single column or a sequence of columns as the argument(s). It accepts either a column name, or an f-expression. In particular, you can perform a group-by on a dynamically computed expression:

DT[:, :, by(dt.math.floor(f.A/100))]

The default behavior of groupby is to sort the groups in the ascending order, with NA values appearing before any other values. As a special case, if you group by an expression -f.A, then it will be treated as if you requested to group by the column “A” sorting it in the descending order. This will work even with column types that are not arithmetic, for example “A” could be a string column here.

datatable.cbind()

Create a new Frame by appending columns from several frames.

This function is exactly equivalent to:

dt.Frame().cbind(*frames, force=force)
Parameters
frames
Frame | List[Frame] | None
force
bool
return
Frame
See also
  • rbind() – function for row-binding several frames.

  • Frame.cbind() – Frame method for cbinding some frames to another.

datatable.corr()

Calculate the Pearson correlation between col1 and col2.

Parameters
col1
,
col2
Expr

Input columns.

return
Expr

f-expression having one row, one column and the correlation coefficient as the value. If one of the columns is non-numeric, the value is NA. The column stype is float32 if both col1 and col2 are float32, and float64 in all the other cases.

See Also
  • cov() – function to calculate covariance between two columns.

datatable.count()

Calculate the number of non-missing values for each column from cols.

Parameters
cols
Expr

Input columns.

return
Expr

f-expression having one row, and the same names and number of columns as in cols. All the returned column stypes are int64.

except
TypeError

The exception is raised when one of the columns from cols has a non-numeric and non-string type.

See Also
  • sum() – function to calculate the sum of values.

datatable.cov()

Calculate covariance between col1 and col2.

Parameters
col1
,
col2
Expr

Input columns.

return
Expr

f-expression having one row, one column and the covariance between col1 and col2 as the value. If one of the input columns is non-numeric, the value is NA. The output column stype is float32 if both col1 and col2 are float32, and float64 in all the other cases.

See Also
  • corr() – function to calculate correlation between two columns.

datatable.cut()

New in version 0.11

Cut all the columns from cols by binning their values into equal-width discrete intervals.

Parameters
cols
FExpr

Input data for equal-width interval binning.

nbins
int | List[int]

When a single number is specified, this number of bins will be used to bin each column from cols. When a list or a tuple is provided, each column will be binned by using its own number of bins. In the latter case, the list/tuple length must be equal to the number of columns in cols.

right_closed
bool

Each binning interval is half-open. This flag indicates which side of the interval is closed.

return
FExpr

f-expression that converts input columns into the columns filled with the respective bin ids.

See also

qcut() – function for equal-population binning.

datatable.dt

This is the datatable module itself.

The purpose of exporting this symbol is so that you can easily import all the things you need from the datatable module in one go:

from datatable import dt, f, g, by, join, mean

Note: while it is possible to write

test = dt.dt.dt.dt.dt.dt.dt.dt.dt.fread('test.jay')
train = dt.dt.dt.dt.dt.dt.dt.dt.dt.dt.dt.dt.dt.fread('train.jay')

we do not in fact recommend doing so (except possibly on April 1st).

datatable.f

The main Namespace object.

The function of this object is that during the evaluation of a DT[i,j] call, the variable f represents the columns of frame DT.

Specifically, within expression DT[i, j] the following is true:

  • f.A means “column A” of frame DT;

  • f[2] means “3rd column” of frame DT;

  • f[int] means “all integer columns” of DT;

  • f[:] means “all columns” of DT.

See also
  • g – namespace for joined frames.

datatable.first()

Return the first row for each column from cols.

Parameters
cols
Expr

Input columns.

return
Expr

f-expression having one row, and the same names, stypes and number of columns as in cols.

See Also
  • last() – function that returns the last row.

datatable.fread()

fread(anysource=None, *, file=None, text=None, cmd=None, url=None, columns=None, sep=None, dec=".", ..., header=None, ..., verbose=False, fill=False, ..., tempdir=None, ..., logger=None, ...)

This function is capable of reading data from a variety of input formats, producing a Frame as the result. The recognized formats are: CSV, Jay, XLSX, and plain text. In addition, the data may be inside an archive such as .tar, .gz, .zip, .bz2, or .tgz.

Parameters
anysource
str | bytes | file | Pathlike | List

The first (unnamed) argument to fread is the input source. Multiple types of sources are supported, and they can be named explicitly: file, text, cmd, and url. When the source is not named, fread will attempt to guess its type. The most common type is file, but sometimes the argument is resolved as text (if the string contains newlines) or url (if the string starts with https:// or similar).

Only one argument out of anysource, file, text, cmd or url can be specified at once.

file
str | file | Pathlike

A file source can be either the name of the file on disk, or a python “file-like” object – i.e. any object having method .read().

Generally, specifying a file name should be preferred, since reading from a Python file can only be done in single-threaded mode.

This argument also supports addressing files inside an archive, or sheets inside an Excel workbook. Simply write the name of the file as if the archive was a folder: "data.zip/train.csv".

text
str | bytes

Instead of reading data from file, this argument provides the data as a simple in-memory blob.

cmd
str

A command that will be executed in the shell and its output then read as text.

url
str

This parameter can be used to specify the URL of the input file. The data will first be downloaded into a temporary directory and then read from there. In the end the temporary files will be removed.

We use the standard urllib.request module to download the data. Changing the settings of that module, for example installing proxy, password, or cookie managers will allow you to customize the download process.

columns
...

Limit which columns to read from the input file.

sep
str | None

Field separator in the input file. If this value is None (default) then the separator will be auto-detected. Otherwise it must be a single-character string. When sep='\n', then the data will be read in single-column mode. Characters ["'`0-9a-zA-Z] are not allowed as the separator, as well as any non-ASCII characters.

dec
"." | ","

Decimal point symbol for floating-point numbers.

max_nrows
int

The maximum number of rows to read from the file. Setting this parameter to any negative number is equivalent to having no limit at all. Currently this parameter doesn’t always work correctly.

na_strings
List[str]

The list of strings that were used in the input file to represent NA values.

fill
bool

If True then the lines of the CSV file are allowed to have uneven number of fields. All missing fields will be filled with NAs in the resulting frame.

encoding
str | None

If this parameter is provided, then the input will be recoded from this encoding into UTF-8 before reading. Any encoding registered with the python codec module can be used.

skip_to_string
str | None

Start reading the file from the line containing this string. All previous lines will be skipped and discarded. This parameter cannot be used together with skip_to_line.

skip_to_line
int

If this setting is given, then this many lines in the file will be skipped before we start to parse the file. This can be used for example when several first lines in the file contain non-CSV data and therefore must be skipped. This parameter cannot be used together with skip_to_string.

skip_blank_lines
bool

If True then any empty lines in the input will be skipped. If this parameter is False then: (a) in single-column mode empty lines are kept as empty lines; otherwise (b) if fill=True then empty lines produce a single line filled with NAs in the output; otherwise (c) an IOError is raised.

strip_whitespace
bool

If True, then the leading/trailing whitespace will be stripped from unquoted string fields. Whitespace is always skipped from numeric fields.

quotechar
'"' | "'" | "`"

The character that was used to quote fields in the CSV file. By default the double-quote mark '"' is assumed.

tempdir
str | None

Use this directory for storing temporary files as needed. If not provided then the system temporary directory will be used, as determined via the tempfile Python module.

nthreads
int | None

Number of threads to use when reading the file. This number cannot exceed the number of threads in the pool dt.options.nthreads. If zero or a negative number of threads is requested, it will be interpreted as that many threads less than the maximum. By default all threads in the thread pool are used.

verbose
bool

If True, then print detailed information about the internal workings of fread to stdout (or to logger if provided).

logger
object

Logger object that will receive verbose information about fread’s progress. When this parameter is specified, verbose mode will be turned on automatically.

multiple_sources
"warn" | "error" | "ignore"

Action that should be taken when the input resolves to multiple distinct sources. By default ("warn") a warning will be issued and only the first source will be read and returned as a Frame. The "ignore" action is similar, except that the extra sources will be discarded without a warning. Lastly, an IOError can be raised if the value of this parameter is "error".

If you want all sources to be read instead of only the first one then consider using iread().

memory_limit
int

Try not to exceed this amount of memory allocation (in bytes) when reading the data. This limit is advisory and not enforced very strictly.

This setting is useful when reading data from a file that is substantially larger than the amount of RAM available on your machine.

When this parameter is specified and fread sees that it needs more RAM than the limit in order to read the input file, it will dump the data read so far into a temporary file in binary format. In the end the returned Frame will be composed partially from data located on disk and partially from data in memory. It is advised to either store this data as a Jay file, or to filter and materialize the frame; otherwise performance may suffer.

return
Frame

A single Frame object is always returned.

Changed in version 0.11.0: Previously a dict of Frames was returned when multiple input sources were provided.

except
IOError
See Also

datatable.g

Secondary Namespace object.

The function of this object is that during the evaluation of a DT[..., join(X)] call, the variable g represents the columns of the joined frame X. In SQL this would have been equivalent to ... JOIN tableX AS g ....

See also
  • f – main column namespace.

datatable.init_styles()

init_styles()

Inject datatable’s stylesheets into the Jupyter notebook. This function does nothing when it runs in a normal Python environment outside of Jupyter.

When datatable runs in a Jupyter notebook, it renders its Frames as HTML tables. The appearance of these tables is enhanced using a custom stylesheet, which must be injected into the notebook at any point on the page. This is exactly what this function does.

Normally, this function is called automatically when datatable is imported. However, in some circumstances Jupyter erases these stylesheets (for example, if you re-run the cell containing import datatable). In such cases you may need to call this function manually.

datatable.ifelse()

Produce a column that chooses one of the two values based on the condition.

This function will only compute those values that are needed for the result. Thus, for each row we will evaluate either expr_if_true or expr_if_false (based on the condition value) but not both. This may be relevant when one of the expressions is expensive to compute, or would produce errors on some of the rows.

Parameters
condition
FExpr

An expression yielding a single boolean column.

expr_if_true
FExpr

Values that will be used when the condition evaluates to True. This must be a single column.

expr_if_false
FExpr

Values that will be used when the condition evaluates to False. This must be a single column.

return
FExpr

The resulting expression is a single column whose stype is the stype which is common for expr_if_true and expr_if_false, i.e. it is the smallest stype into which both exprs can be upcasted.

datatable.intersect()

Find the intersection of sets of values in the frames.

Each frame should have only a single column or be empty. The values in each frame will be treated as a set, and this function will perform the intersection operation on these sets, returning those values that are present in each of the provided frames.

Parameters
*frames
Frame | Frame | ...

Input single-column frames.

return
Frame

A single-column frame. The column stype is the smallest common stype of columns in the frames.

except
ValueError

The exception is raised when one of the input frames has more than one column.

except
NotImplementedError

The exception is raised when one of the frame columns has stype obj64.

See Also
  • setdiff() – calculate the set difference between the frames.

  • symdiff() – calculate the symmetric difference between the sets of values in the frames.

  • union() – calculate the union of values in the frames.

  • unique() – find unique values in a frame.

datatable.iread()

iread(anysource=None, *, file=None, text=None, cmd=None, url=None,
      columns=None, sep=None, dec=".", max_nrows=None, header=None,
      na_strings=None, verbose=False, fill=False, encoding=None,
      skip_to_string=None, skip_to_line=None, skip_blank_lines=False,
      strip_whitespace=True, quotechar='"', tempdir=None, nthreads=None,
      logger=None, errors="warn", memory_limit=None)

This function is similar to fread(), but allows reading multiple sources at once. For example, this can be used when the input is a list of files, or a glob pattern, or a multi-file archive, or multi-sheet XLSX file, etc.

Parameters
...
...

Most parameters are the same as in fread(). All parse parameters will be applied to all input files.

errors
"warn" | "raise" | "ignore" | "store"

What action to take when one of the input sources produces an error. Possible actions are: "warn" – each error is converted into a warning and emitted to user, the source that produced the error is then skipped; "raise" – the errors are raised immediately and the iteration stops; "ignore" – the erroneous sources are silently ignored; "store" – when an error is raised, it is captured and returned to the user, then the iterator continues reading the subsequent sources.

return
Iterator[Frame] | Iterator[Frame|Exception]

The returned object is an iterator that produces Frames. The iterator is lazy: each frame is read only as needed, after the previous frame was “consumed” by the user. Thus, the user can interrupt the iterator without having to read all the frames.

Each Frame produced by the iterator has a .source attribute that describes the source of each frame as best as possible. Each source depends on the type of the input: either a file name, or a URL, or the name of the file in an archive, etc.

If the errors parameter is "store" then the iterator may produce either Frames or exception objects.

See Also

datatable.join()

Join clause for use in Frame’s square-bracket selector.

This clause is equivalent to the SQL JOIN, though for the moment datatable only supports left outer joins. In order to join, the frame must be keyed first, and then joined to another frame DT as

DT[:, :, join(X)]

provided that DT has the column(s) with the same name(s) as the key in frame.

Parameters
frame
Frame

An input keyed frame to be joined to the current one.

return
Join Object

In most cases the returned object is used directly in the Frame’s square-bracket selector.

except
TypeError

The exception is raised if the input frame is missing.

except
ValueError

The exception is raised if frame is not keyed.

datatable.last()

Return the last row for each column from cols.

Parameters
cols
Expr

Input columns.

return
Expr

f-expression having one row, and the same names, stypes and number of columns as in cols.

See Also
  • first() – function that returns the first row.

datatable.max()

Calculate the maximum value for each column from cols. It is recommended to use it as dt.max() to prevent conflict with the Python built-in max() function.

Parameters
cols
Expr

Input columns.

return
Expr

f-expression having one row and the same names, stypes and number of columns as in cols.

except
TypeError

The exception is raised when one of the columns from cols has a non-numeric type.

See Also
  • min() – function to calculate minimum values.

datatable.mean()

Calculate the mean value for each column from cols.

Parameters
cols
Expr

Input columns.

return
Expr

f-expression having one row, and the same names and number of columns as in cols. The column stypes are float32 for float32 columns, and float64 for all the other numeric types.

except
TypeError

The exception is raised when one of the columns from cols has a non-numeric type.

See Also
  • median() – function to calculate median values.

  • sd() – function to calculate standard deviation.

datatable.median()

Calculate the median value for each column from cols.

Parameters
cols
Expr

Input columns.

return
Expr

f-expression having one row, and the same names, stypes and number of columns as in cols.

except
TypeError

The exception is raised when one of the columns from cols has a non-numeric type.

See Also
  • mean() – function to calculate mean values.

  • sd() – function to calculate standard deviation.

datatable.min()

Calculate the minimum value for each column from cols. It is recommended to use it as dt.min() to prevent conflict with the Python built-in min() function.

Parameters
cols
Expr

Input columns.

return
Expr

f-expression having one row and the same names, stypes and number of columns as in cols.

except
TypeError

The exception is raised when one of the columns from cols has a non-numeric type.

See Also
  • max() – function to calculate maximum values.

datatable.qcut()

New in version 0.11

Bin all the columns from cols into intervals with approximately equal populations. Thus, the intervals are chosen according to the sample quantiles of the data.

If there are duplicate values in the data, they will all be placed into the same bin. In extreme cases this may cause the bins to be highly unbalanced.

Parameters
cols
FExpr

Input data for quantile binning.

nquantiles
int | List[int]

When a single number is specified, this number of quantiles will be used to bin each column from cols.

When a list or a tuple is provided, each column will be binned by using its own number of quantiles. In the latter case, the list/tuple length must be equal to the number of columns in cols.

return
FExpr

f-expression that converts input columns into the columns filled with the respective quantile ids.

See also

cut() – function for equal-width interval binning.

datatable.rowall()

For each row in cols return True if all values in that row are True, or otherwise return False.

Parameters
cols
Expr

Input boolean columns.

return
Expr

f-expression consisting of one boolean column that has the same number of rows as in cols.

except
TypeError

The exception is raised when one of the columns from cols has a non-boolean type.

See Also

datatable.rowany()

For each row in cols return True if any of the values in that row are True, or otherwise return False. The function uses shortcut evaluation: if the True value is found in one of the columns, then the subsequent columns are skipped.

Parameters
cols
Expr

Input boolean columns.

return
Expr

f-expression consisting of one boolean column that has the same number of rows as in cols.

except
TypeError

The exception is raised when one of the columns from cols has a non-boolean type.

See Also

datatable.rowcount()

For each row, count the number of non-missing values in cols.

Parameters
cols
Expr

Input columns.

return
Expr

f-expression consisting of one int32 column and the same number of rows as in cols.

See Also
  • rowsum() – sum of all values row-wise.

datatable.rowfirst()

For each row, find the first non-missing value in cols. If all values in a row are missing, then this function will also produce a missing value.

Parameters
cols
Expr

Input columns.

return
Expr

f-expression consisting of one column and the same number of rows as in cols.

except
TypeError

The exception is raised when input columns have incompatible types.

See Also
  • rowlast() – find the last non-missing value row-wise.

datatable.rowlast()

For each row, find the last non-missing value in cols. If all values in a row are missing, then this function will also produce a missing value.

Parameters
cols
Expr

Input columns.

return
Expr

f-expression consisting of one column and the same number of rows as in cols.

except
TypeError

The exception is raised when input columns have incompatible types.

See Also
  • rowfirst() – find the first non-missing value row-wise.

datatable.rowmax()

For each row, find the largest value among the columns from cols.

Parameters
cols
Expr

Input columns.

return
Expr

f-expression consisting of one column that has the same number of rows as in cols. The column stype is the smallest common stype for cols, but not less than int32.

except
TypeError

The exception is raised when cols has non-numeric columns.

See Also
  • rowmin() – find the smallest element row-wise.

datatable.rowmean()

For each row, find the mean value among the columns from cols, skipping missing values. If a row contains only missing values, this function will produce a missing value too.

Parameters
cols
Expr

Input columns.

return
Expr

f-expression consisting of one column that has the same number of rows as in cols. The column stype is float32 when all the cols are float32, and float64 in all the other cases.

except
TypeError

The exception is raised when cols has non-numeric columns.

See Also
  • rowsd() – calculate the standard deviation row-wise.

datatable.rowmin()

For each row, find the smallest value among the columns from cols, excluding missing values.

Parameters
cols
Expr

Input columns.

return
Expr

f-expression consisting of one column that has the same number of rows as in cols. The column stype is the smallest common stype for cols, but not less than int32.

except
TypeError

The exception is raised when cols has non-numeric columns.

See Also
  • rowmax() – find the largest element row-wise.

datatable.rowsd()

For each row, find the standard deviation among the columns from cols, skipping missing values. If a row contains only missing values, this function will produce a missing value too.

Parameters
cols
Expr

Input columns.

return
Expr

f-expression consisting of one column that has the same number of rows as in cols. The column stype is float32 when all the cols are float32, and float64 in all the other cases.

except
TypeError

The exception is raised when cols has non-numeric columns.

See Also
  • rowmean() – calculate the mean value row-wise.

datatable.rowsum()

For each row, calculate the sum of all values in cols. Missing values are treated as zeros and skipped during the calculation.

Parameters
cols
Expr

Input columns.

return
Expr

f-expression consisting of one column and the same number of rows as in cols. The stype of the resulting column will be the smallest common stype calculated for cols, but not less than int32.

except
TypeError

The exception is raised when one of the columns from cols has a non-numeric type.

See Also
  • rowcount() – count non-missing values row-wise.

datatable.rbind()

Produce a new frame by appending rows of frames.

This function is equivalent to:

dt.Frame().rbind(*frames, force=force, by_names=by_names)
Parameters
frames
Frame | List[Frame] | None
force
bool
by_names
bool
return
Frame
See also
  • cbind() – function for col-binding several frames.

  • Frame.rbind() – Frame method for rbinding some frames to another.

datatable.repeat()

Concatenate n copies of the frame by rows and return the result.

This is equivalent to dt.rbind([frame] * n).

datatable.sd()

Calculate the standard deviation for each column from cols.

Parameters
cols
Expr

Input columns.

return
Expr

f-expression having one row, and the same names and number of columns as in cols. The column stypes are float32 for float32 columns, and float64 for all the other numeric types.

except
TypeError

The exception is raised when one of the columns from cols has a non-numeric type.

See Also
  • mean() – function to calculate mean values.

  • median() – function to calculate median values.

datatable.setdiff()

Find the set difference between frame0 and the other frames.

Each frame should have only a single column or be empty. The values in each frame will be treated as a set, and this function will compute the set difference between the frame0 and the union of the other frames, returning those values that are present in the frame0, but not present in any of the frames.

Parameters
frame0
Frame

Input single-column frame.

*frames
Frame | Frame | ...

Input single-column frames.

return
Frame

A single-column frame. The column stype is the smallest common stype of columns from the frames.

except
ValueError

The exception is raised when one of the input frames, i.e. frame0 or any from the frames, has more than one column.

except
NotImplementedError

The exception is raised when one of the frame columns has stype obj64.

See Also
  • intersect() – calculate the set intersection of values in the frames.

  • symdiff() – calculate the symmetric difference between the sets of values in the frames.

  • union() – calculate the union of values in the frames.

  • unique() – find unique values in a frame.

datatable.shift()

Produce a column obtained from col shifting it n rows forward.

The shift amount, n, can be either positive or negative. If positive, a “lag” column is created; if negative, a “lead” column.

The shifted column will have the same number of rows as the original column, with n observations in the beginning becoming missing, and n observations at the end discarded.

This function is group-aware, i.e. in the presence of a groupby it will perform the shift separately within each group.

datatable.sort()

Sort clause for use in Frame’s square-bracket selector.

When a sort() object is present inside a DT[i, j, ...] expression, it will sort the rows of the resulting Frame according to the columns cols passed as the arguments to sort().

When used together with by(), the sort clause applies after the group-by, i.e. we sort elements within each group. Note, however, that because we use stable sorting, the operations of grouping and sorting are commutative: the result of applying groupby and then sort is the same as the result of sorting first and then doing groupby.

When used together with i (row filter), the i filter is applied after the sorting. For example,:

DT[:10, :, sort(f.Highscore, reverse=True)]

will select the first 10 records from the frame DT ordered by the Highscore column.

datatable.split_into_nhot()

split_into_nhot(frame, sep=",", sort=False)

Split and nhot-encode a single-column frame.

The input frame must have a single string column. Each value in that column is split according to the provided separator sep, the whitespace is trimmed, and the resulting pieces (labels) are converted into the individual columns of the output frame.

Parameters
frame
Frame

An input single-column frame. The column stype must be either str32 or str64.

sep
str

Single-character separator to be used for splitting.

sort
bool

An option to control whether the resulting column names, i.e. labels, should be sorted. If set to True, the column names are returned in alphabetical order, otherwise their order is not guaranteed due to the algorithm parallelization.

return
Frame

The output frame. It will have as many rows as the input frame, and as many boolean columns as there were unique labels found. The labels will also become the output column names.

except
ValueError

The exception is raised if the input frame is missing or it has more than one column. It is also raised if sep is not a single-character string.

except
TypeError

The exception is raised if the single column of frame has a type different from string.

Examples
DT = dt.Frame(["cat,dog", "mouse", "cat,mouse", "dog,rooster", "mouse,dog,cat"])

   | C0
-- + -------------
 0 | cat,dog
 1 | mouse
 2 | cat,mouse
 3 | dog,rooster
 4 | mouse,dog,cat

split_into_nhot(DT)

   | cat  dog  mouse  rooster
-- + ---  ---  -----  -------
 0 |   1    1      0        0
 1 |   0    0      1        0
 2 |   1    0      1        0
 3 |   0    1      0        1
 4 |   1    1      1        0

datatable.symdiff()

Find the symmetric difference between the sets of values in all frames.

Each frame should have only a single column or be empty. The values in each frame will be treated as a set, and this function will perform the symmetric difference operation on these sets.

The symmetric difference of two frames contains those values that are present in either of the frames, but not in both. The symmetric difference of more than two frames contains those values that are present in an odd number of frames.

Parameters
*frames
Frame | Frame | ...

Input single-column frames.

return
Frame

A single-column frame. The column stype is the smallest common stype of columns from the frames.

except
ValueError

The exception is raised when one of the input frames has more than one column.

except
NotImplementedError

The exception is raised when one of the frame columns has stype obj64.

See Also
  • intersect() – calculate the set intersection of values in the frames.

  • setdiff() – calculate the set difference between the frames.

  • union() – calculate the union of values in the frames.

  • unique() – find unique values in a frame.

datatable.sum()

Calculate the sum of values for each column from cols.

Parameters
cols
Expr

Input columns.

return
Expr

f-expression having one row, and the same names and number of columns as in cols. The column stypes are int64 for boolean and integer columns, float32 for float32 columns and float64 for float64 columns.

except
TypeError

The exception is raised when one of the columns from cols has a non-numeric type.

See Also
  • count() – function to calculate a number of non-missing values.

datatable.union()

Find the union of values in all frames.

Each frame should have only a single column or be empty. The values in each frame will be treated as a set, and this function will perform the union operation on these sets.

The dt.union(*frames) operation is equivalent to dt.unique(dt.rbind(*frames)).

Parameters
*frames
Frame | Frame | ...

Input single-column frames.

return
Frame

A single-column frame. The column stype is the smallest common stype of columns in the frames.

except
ValueError

The exception is raised when one of the input frames has more than one column.

except
NotImplementedError

The exception is raised when one of the frame columns has stype obj64.

See Also
  • intersect() – calculate the set intersection of values in the frames.

  • setdiff() – calculate the set difference between the frames.

  • symdiff() – calculate the symmetric difference between the sets of values in the frames.

  • unique() – find unique values in a frame.

datatable.unique()

Find the unique values in all the columns of the frame.

This function sorts the values in order to find the uniques, so the return values will be ordered. However, this should be considered an implementation detail: in the future datatable may switch to a different algorithm, such as hash-based, which may return the results in a different order.

Parameters
frame
Frame

Input frame.

return
Frame

A single-column frame consisting of unique values found in frame. The column stype is the smallest common stype for all the frame columns.

except
NotImplementedError

The exception is raised when one of the frame columns has stype obj64.

See Also
  • intersect() – calculate the set intersection of values in the frames.

  • setdiff() – calculate the set difference between the frames.

  • symdiff() – calculate the symmetric difference between the sets of values in the frames.

  • union() – calculate the union of values in the frames.

datatable.update()

Create new or update existing columns within a frame.

This expression is intended to be used at “j” place in DT[i, j] call. It takes an arbitrary number of key/value pairs each describing a column name and the expression for how that column has to be created/updated.

Development

Creating a new FExpr

The majority of functions available from datatable module are implemented via the FExpr mechanism. These functions have the same common API: they accept one or more FExprs (or fexpr-like objects) as arguments and produce an FExpr as the output. The resulting FExprs can then be used inside the DT[...] call to apply these expressions to a particular frame.

In this document we describe how to create such an FExpr-based function. In particular, we describe adding the gcd(a, b) function for computing the greatest common divisor of two integers.

C++ “backend” class

The core of the functionality will reside within a class derived from the class dt::expr::FExpr. So let’s create the file expr/fexpr_gcd.cc and declare the skeleton of our class:

#include "expr/fexpr_func.h"
#include "expr/eval_context.h"
#include "expr/workframe.h"
namespace dt {
namespace expr {

class FExpr_Gcd : public FExpr_Func {
  private:
    ptrExpr a_;
    ptrExpr b_;

  public:
    FExpr_Gcd(ptrExpr&& a, ptrExpr&& b)
      : a_(std::move(a)), b_(std::move(b)) {}

    std::string repr() const override;
    Workframe evaluate_n(EvalContext& ctx) const override;
};

}}

In this example we are inheriting from FExpr_Func, which is a slightly more specialized version of FExpr.

You can also see that the two arguments in gcd(a, b) are stored within the class as ptrExpr a_, b_. This ptrExpr is actually a typedef for std::shared_ptr<FExpr>, which means that arguments to our FExpr are also FExprs.

The first method that needs to be implemented is repr(), which is more-or-less equivalent to python’s __repr__. The returned string should not have the name of the class in it, instead it must be ready to be combined with reprs of other expressions:

std::string repr() const override {
  std::string out = "gcd(";
  out += a_->repr();
  out += ", ";
  out += b_->repr();
  out += ')';
  return out;
}

We construct our repr out of reprs of a_ and b_. They are joined with a comma, which has the lowest precedence in python. For some other FExprs we may need to take into account the precedence of the arguments as well, in order to properly set up parentheses around subexpressions.

The second method to implement is evaluate_n(). The _n suffix here stands for “normal”. If you look into the source of FExpr class, you’ll see that there are other evaluation methods too: evaluate_i(), evaluate_j(), etc. However all of those are not needed when implementing a simple function.

The method evaluate_n() takes an EvalContext object as the argument. This object contains information about the current evaluation environment. The output from evaluate_n() should be a Workframe object. A workframe can be thought of as a “work-in-progress” frame. In our case it is sufficient to treat it as a simple vector of columns.

We begin implementing evaluate_n() by evaluating the arguments a_ and b_ and then making sure that those frames are compatible with each other (i.e. have the same number of columns and rows). After that we compute the result by iterating through the columns of both frames and calling a simple method evaluate1(Column&&, Column&&) (that we still need to implement):

Workframe evaluate_n(EvalContext& ctx) const override {
  Workframe awf = a_->evaluate_n(ctx);
  Workframe bwf = b_->evaluate_n(ctx);
  if (awf.ncols() == 1) awf.repeat_column(bwf.ncols());
  if (bwf.ncols() == 1) bwf.repeat_column(awf.ncols());
  if (awf.ncols() != bwf.ncols()) {
    throw TypeError() << "Incompatible number of columns in " << repr()
        << ": the first argument has " << awf.ncols() << ", while the "
        << "second has " << bwf.ncols();
  }
  awf.sync_grouping_mode(bwf);

  auto gmode = awf.get_grouping_mode();
  Workframe outputs(ctx);
  for (size_t i = 0; i < awf.ncols(); ++i) {
    Column rescol = evaluate1(awf.retrieve_column(i),
                              bwf.retrieve_column(i));
    outputs.add_column(std::move(rescol), std::string(), gmode);
  }
  return outputs;
}
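The column-count reconciliation performed above can be mirrored in a few lines of python (a sketch only; reconcile_ncols is a made-up name):

```python
def reconcile_ncols(na, nb):
    # A 1-column workframe is broadcast to match the other side,
    # just like the two repeat_column() calls in evaluate_n().
    if na == 1: na = nb
    if nb == 1: nb = na
    if na != nb:
        raise TypeError("Incompatible number of columns: the first "
                        "argument has %d, while the second has %d"
                        % (na, nb))
    return na

print(reconcile_ncols(1, 5))  # 5
print(reconcile_ncols(4, 1))  # 4
```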

The method evaluate1() takes a pair of columns and produces an output column containing the result of the gcd(a, b) calculation. We must take into account the stypes of both columns and decide which stypes are acceptable for our function:

Column evaluate1(Column&& a, Column&& b) const {
  SType stype1 = a.stype();
  SType stype2 = b.stype();
  SType stype0 = common_stype(stype1, stype2);
  switch (stype0) {
    case SType::BOOL:
    case SType::INT8:
    case SType::INT16:
    case SType::INT32: return make<int32_t>(std::move(a), std::move(b), SType::INT32);
    case SType::INT64: return make<int64_t>(std::move(a), std::move(b), SType::INT64);
    default:
        throw TypeError() << "Invalid columns of types " << stype1 << " and "
            << stype2 << " in " << repr();
  }
}

template <typename T>
Column make(Column&& a, Column&& b, SType stype0) const {
  a.cast_inplace(stype0);
  b.cast_inplace(stype0);
  return Column(new Column_Gcd<T>(std::move(a), std::move(b)));
}
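The promotion rule implemented by evaluate1() and make() can be summarized in a small sketch (the stype names here are plain strings used for illustration, not the real SType enum):

```python
def gcd_result_stype(s1, s2):
    # bool8 and all integer stypes narrower than int64 promote to int32;
    # int64 wins over everything else; any other stype is rejected.
    order = ["bool8", "int8", "int16", "int32", "int64"]
    if s1 not in order or s2 not in order:
        raise TypeError("Invalid columns of types %s and %s" % (s1, s2))
    common = order[max(order.index(s1), order.index(s2))]
    return "int64" if common == "int64" else "int32"

print(gcd_result_stype("bool8", "int16"))  # int32
print(gcd_result_stype("int32", "int64"))  # int64
```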

As you can see, the job of the FExpr_Gcd class is to produce a workframe containing one or more Column_Gcd virtual columns. This is where the actual calculation of GCD values will take place, and we shall declare this class too. It can be done either in a separate file in the core/column/ folder, or inside the current file expr/fexpr_gcd.cc.

#include "column/virtual.h"

template <typename T>
class Column_Gcd : public Virtual_ColumnImpl {
  private:
    Column acol_;
    Column bcol_;

  public:
    Column_Gcd(Column&& a, Column&& b)
      : Virtual_ColumnImpl(a.nrows(), a.stype()),
        acol_(std::move(a)), bcol_(std::move(b))
    {
      xassert(acol_.nrows() == bcol_.nrows());
      xassert(acol_.stype() == bcol_.stype());
      xassert(compatible_type<T>(acol_.stype()));
    }

    ColumnImpl* clone() const override {
      return new Column_Gcd(Column(acol_), Column(bcol_));
    }

    size_t n_children() const noexcept override { return 2; }
    const Column& child(size_t i) const override { return i==0? acol_ : bcol_; }

    bool get_element(size_t i, T* out) const override {
      T a, b;
      bool avalid = acol_.get_element(i, &a);
      bool bvalid = bcol_.get_element(i, &b);
      if (avalid && bvalid) {
        while (b) {
          T tmp = b;
          b = a % b;
          a = tmp;
        }
        *out = a;
        return true;
      }
      return false;
    }
};
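The per-element logic of get_element() translates almost verbatim into python; in this sketch None stands in for an invalid (NA) element:

```python
def gcd_element(a, b):
    # Mirrors Column_Gcd::get_element(): if either input is invalid (None),
    # the output is invalid too; otherwise run Euclid's algorithm.
    if a is None or b is None:
        return None
    while b:
        a, b = b, a % b
    return a

print(gcd_element(12, 18))   # 6
print(gcd_element(7, 0))     # 7
print(gcd_element(None, 4))  # None
```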

Python-facing gcd() function

Now that we have created the FExpr_Gcd class, we also need to have a python function responsible for creating these objects. This is done in 4 steps:

First, declare a function with the signature py::oobj(const py::XArgs&). The py::XArgs object here encapsulates all parameters that were passed to the function, while the return type py::oobj is a simple wrapper around python’s PyObject*.

static py::oobj py_gcd(const py::XArgs& args) {
  auto a = args[0].to_oobj();
  auto b = args[1].to_oobj();
  return PyFExpr::make(new FExpr_Gcd(as_fexpr(a), as_fexpr(b)));
}

This function takes the python arguments, validates and converts them into C++ objects as necessary, then creates a new FExpr_Gcd object and returns it wrapped into a PyFExpr (the python equivalent of the generic FExpr class).

In the second step, we declare the signature and the docstring of this python function:

static const char* doc_gcd =
R"(gcd(a, b)
--

Compute the greatest common divisor of `a` and `b`.

Parameters
----------
a, b: FExpr
    Only integer columns are supported.

return: FExpr
    The returned column will have stype int64 if either `a` or `b` is
    of type int64, or otherwise it will be int32.
)";

DECLARE_PYFN(&py_gcd)
    ->name("gcd")
    ->docs(doc_gcd)
    ->arg_names({"a", "b"})
    ->n_positional_args(2)
    ->n_required_args(2);

At this point the function will be visible from python in the _datatable module. The next step is to import it into the main datatable module: open src/datatable/__init__.py and add

from .lib._datatable import (
    ...
    gcd,
    ...
)
...
__all__ = (
    ...
    "gcd",
    ...
)

Tests

Any new functionality must be properly tested. We recommend creating a dedicated test file for each new function: create the file tests/expr/test-gcd.py and add some tests to it. We use the pytest framework for testing. In this framework, each test is a single function (whose name starts with test_) that performs some actions and then asserts the validity of the results.

import pytest
import random
from datatable import dt, f, gcd
from tests import assert_equals  # checks equality of Frames
from math import gcd as math_gcd

def test_equal_columns():
    DT = dt.Frame(A=[1, 2, 3, 4, 5])
    RES = DT[:, gcd(f.A, f.A)]
    assert_equals(RES, dt.Frame([1, 1, 1, 1, 1]/dt.int32))

@pytest.mark.parametrize("seed", [random.getrandbits(63)])
def test_random(seed):
    random.seed(seed)
    n = 100
    src1 = [random.randint(1, 1000) for i in range(n)]
    src2 = [random.randint(1, 100) for i in range(n)]
    DT = dt.Frame(A=src1, B=src2)
    RES = DT[:, gcd(f.A, f.B)]
    assert_equals(RES, dt.Frame([math_gcd(src1[i], src2[i])
                                 for i in range(n)]))

When writing tests, try to cover any corner cases you can think of. For example, what if one of the numbers is 0? Or negative? Add tests for various column types, including invalid ones.
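For reference, python’s standard math.gcd pins down the conventional answers to these corner cases (the assertions below describe math.gcd itself; whether the new C++ kernel matches them for negative inputs is exactly what such tests should decide):

```python
from math import gcd as math_gcd

print(math_gcd(0, 5))     # 5: gcd with zero is the other operand
print(math_gcd(0, 0))     # 0: conventionally gcd(0, 0) == 0
print(math_gcd(-12, 18))  # 6: python's math.gcd is always non-negative
```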

Documentation

The final piece of the puzzle is the documentation. We’ve already written the documentation for our function: the doc_gcd variable declared earlier. However, for now this is only visible from python when you run help(gcd). We also want the documentation to be visible on our official readthedocs website, which requires a few more steps. So:

First, create the file docs/api/dt/gcd.rst. It should contain just a few lines:

.. xfunction:: datatable.gcd
    :doc: src/core/fexpr/fexpr_gcd.cc doc_gcd
    :src: src/core/fexpr/fexpr_gcd.cc py_gcd
    :tests: tests/expr/test-gcd.py

These lines declare in which source file the docstring can be found and what the name of its variable is: the documentation generator will be looking for a static const char* doc_gcd variable in that source. We also declare the name of the function that provides the gcd functionality; the generator will look for a function with that name in the specified source file and create a link to its source in the output doc file. Lastly, the :tests: parameter says which file contains the tests dedicated to this function; this will also become a link in the generated documentation.

This RST file now needs to be added to the toctree: open the file docs/api/index-api.rst and add it into the .. toctree:: list at the bottom, and also add it to the table of all functions.

Lastly, open docs/releases/v{LATEST}.rst (this is our changelog) and write a brief paragraph about the new function:

Frame
-----
...

-[new] Added new function :func:`gcd()` to compute the greatest common
  divisor of two columns. [#NNNN]

The [#NNNN] is a link to the GitHub issue where the gcd() function was requested.

Submodules

Some functions are declared within submodules of the datatable module. For example, math-related functions can be found in dt.math, string functions in dt.str, etc. Declaring such functions is not much different from what was described above. For example, if we wanted our gcd() function to live in the dt.math submodule, we’d make the following changes:

  • Create file expr/math/fexpr_gcd.cc instead of expr/fexpr_gcd.cc;

  • Instead of importing the function in src/datatable/__init__.py, we’d import it in src/datatable/math.py;

  • The test file name can be tests/math/test-gcd.py instead of tests/expr/test-gcd.py;

  • The doc file name can be docs/api/math/gcd.rst instead of docs/api/dt/gcd.rst, and it should be added to the toctree in docs/api/math.rst.

Release History

Contributors

This page lists all people who have contributed to the development of datatable. We take into account both code and documentation contributions, as well as contributions in the form of bug reports and feature requests.

More specifically, a code contribution is considered any PR (pull request) that was merged into the codebase. The “complexity” of the PR is not taken into account as it is highly subjective. Next, an issue contribution is any closed issue except for those that are tagged as “question”, “wont-fix” or “cannot-reproduce”. Issues are attributed according to their closing date, not their creation date.

In the table, contributors are sorted by their total contribution score, which is a weighted sum of each user’s code and issue contribution counts. Code contributions carry more weight than issue contributions, and more recent contributions carry more weight than older ones.

Versions: 0.11 · 0.10 · 0.9 · 0.8 · 0.7 · 0.6 · past
Pasha Stetsenko
Oleksiy Kononenko
Michal Malohlava
Nishant Kalonia
Michal Raška
Samuel Oranyeli
Arno Candel
Anmol Bal
Jan Gorecki
Pradeep Krishnamurthy
Jonathan McKinney
Siddhesh Poyarekar
Viktor Demin
Bryce Boe
Liu Chi
Juliano Faccioni
Wes Morgan
Achraf Merzouki
Bijan Pourhamzeh
Angela Bartz
Michael Frasco
Mallesham Yamulla
Junghoo Cho
Corey Levinson
Jan Gamec
Suman Khanal
Navdeep Gill
Tom Kraljevic
Ben Gorman
Stephen Boesch
Jose Luis Avilez
Yu Zhu
Patrick Rice
NachiGithub
Hawk Berry
Olivier Grellier
Michael Moroz
Ashrith Barthur
Chrinide
Lucas Jamar
Suren Mohanathas
Toby Dylan Hocking
Timothy Salazar
Megan Kurka
Andy Troiano
Martin Dvorak
Igor Šušić
Koray AL
Todd
Nick Kim
Zmnako Awrahman
Sinan
Andres Torrubia
Matt Dancho
Mateusz Dymczyk
sentieonycdev
Mathias Müller
Joseph Granados
Qiang Kou (KK)
Achille M.
Govind Mohan
Hemen Kapadia
Leland Wilkinson
Sri Ambati
Mark Chan

Developer’s note: This table is auto-generated from the contributor lists in each of the version files, specified via the .. contributors:: directive. In turn, the list of contributors for each version has to be generated via the script ci/gh.py at the time of each release. The issues/PRs are filtered according to their milestone; thus, issues/PRs that are not tagged with any milestone are not taken into account.