Frame

class datatable.Frame

Two-dimensional column-oriented table of data. Each column has its own name and type. Types may vary across columns (unlike in a Numpy array) but cannot vary within each column (unlike in Pandas DataFrame).

Internally the data is stored as C primitives, and processed using multithreaded native C++ code.

This is the primary data structure of the datatable module.

cbind()

Append the columns of the Frames in frames to the current Frame.

This is equivalent to pandas.concat(axis=1): the Frames are combined by columns, i.e. cbinding a Frame of shape [n x m] to a Frame of shape [n x k] produces a Frame of shape [n x (m + k)].

As a special case, if you cbind a single-row Frame, then that row will be replicated as many times as there are rows in the current Frame. This makes it easy to create constant columns, or to append reduction results (such as min/max/mean/etc) to the current Frame.

If the Frame(s) being appended have different numbers of rows (with the exception of Frames having 1 row), then the operation will fail by default. You can force cbinding these Frames anyway by passing the option force=True: this will pad all ‘short’ Frames with NAs. Thus Frames with 1 row are treated differently from Frames with any other number of rows.

Parameters:
  • frames (sequence or list of Frames) – One or more Frames to append. They should have the same number of rows (unless option force is also used).
  • force (boolean) – If True, allows Frames to be appended even if they have unequal numbers of rows. The resulting Frame will have as many rows as the largest of the Frames. Frames with fewer rows than that will be padded with NAs (with the exception of Frames having just 1 row, which will be replicated instead of being padded with NAs).
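The row-count rules above can be sketched in plain Python. This is only a model of the documented behaviour, not datatable's implementation: a "frame" here is a hypothetical stand-in (a list of column lists), None plays the role of NA, and treating the current Frame by the same rules as the appended ones is an assumption.

```python
# Pure-Python model of the cbind row-count rules; not datatable code.
# A "frame" is a list of columns, each column a list of values.

def frame_nrows(frame):
    """Number of rows of a model frame (0 if it has no columns)."""
    return len(frame[0]) if frame else 0


def cbind(current, *frames, force=False):
    """Return the columns of all frames side by side.

    Assumption: the target height is the largest row count among all
    frames, and the current frame follows the same rules as the others.
    """
    all_frames = [current, *frames]
    target = max(frame_nrows(f) for f in all_frames)
    result = []
    for frame in all_frames:
        n = frame_nrows(frame)
        for column in frame:
            if n == target:
                result.append(list(column))
            elif n == 1:
                # A single-row frame is replicated to the target height.
                result.append(list(column) * target)
            elif force:
                # A "short" frame is padded with NAs (None in this model).
                result.append(list(column) + [None] * (target - n))
            else:
                raise ValueError(
                    f"cannot cbind a frame with {n} rows to {target} rows "
                    "without force=True"
                )
    return result


base = [[1, 2, 3], ["a", "b", "c"]]   # 3 rows x 2 columns
const = [[0]]                         # 1 row x 1 column
short = [[10, 20]]                    # 2 rows x 1 column

print(cbind(base, const))               # the constant column is replicated
print(cbind(base, short, force=True))   # the short column is padded with None
```

Note that the model returns a new list of columns, which keeps the sketch short; it says nothing about whether the real method works in place.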
colindex()

Return index of the column name.

Parameters:
  • name – name of the column to find the index for. This can also be the index of a column, in which case it is checked to be within bounds, and a negative index is converted into the corresponding non-negative one.
Raises:
  • ValueError – if the requested column does not exist.
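As an illustration of the lookup described above, here is a minimal stand-alone sketch operating on a plain tuple of names. The function below is a hypothetical model, not the library's code; in particular, raising ValueError for an out-of-bounds index is an assumption.

```python
def colindex(names, name):
    """Model of column lookup by name or by (possibly negative) index."""
    ncols = len(names)
    if isinstance(name, int):
        # Index given: check the bounds, then normalise a negative index.
        if not -ncols <= name < ncols:
            raise ValueError(
                f"column index {name} is out of bounds for {ncols} columns"
            )
        return name % ncols
    try:
        return names.index(name)
    except ValueError:
        raise ValueError(f"column {name!r} does not exist") from None


names = ("A", "B", "C")
print(colindex(names, "B"))   # position of column "B"
print(colindex(names, -1))    # negative index converted to a non-negative one
```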
copy()

Make a copy of this Frame.

This method creates a shallow copy of the current Frame: only references are copied, not the data itself. However, due to copy-on-write semantics any changes made to one of the Frames will not propagate to the other. Thus, for all intents and purposes the copied Frame will behave as if it was deep-copied.
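The copy-on-write idea can be shown with a small pure-Python model. TinyFrame and set_value are hypothetical names invented for this sketch; the model also copies a column on every write, whereas a real implementation would presumably copy only when the data is shared.

```python
class TinyFrame:
    """Hypothetical stand-in for a Frame: a list of column lists."""

    def __init__(self, columns):
        # Keep references to the given column lists (no data is copied).
        self._columns = list(columns)

    def copy(self):
        # Shallow copy: the new object refers to the very same column lists.
        return TinyFrame(self._columns)

    def set_value(self, col, row, value):
        # Copy-on-write: duplicate the column before changing it, so any
        # other TinyFrame still referring to the old list is unaffected.
        column = list(self._columns[col])
        column[row] = value
        self._columns[col] = column

    def to_list(self):
        return [list(column) for column in self._columns]


original = TinyFrame([[1, 2, 3]])
duplicate = original.copy()
shared_before_write = duplicate._columns[0] is original._columns[0]

duplicate.set_value(0, 0, 99)
print(shared_before_write)    # the data was shared right after copy()
print(original.to_list())     # the original is unchanged by the write
print(duplicate.to_list())
```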

countna()

Get the number of NA values in each column.

Returns:
  • A new datatable of shape (1, ncols) containing the number of NA values in each column.
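To make the (1, ncols) result shape concrete, the following sketch counts NAs column by column on plain Python lists. It is a model only: the columns are ordinary lists, None stands in for NA, and the result is returned as one single-element list per column.

```python
def countna(columns):
    """Return a 1-row model frame: one single-element column per input column."""
    return [[sum(value is None for value in column)] for column in columns]


columns = [[1, None, 3], [None, None, "x"], [0.5, 1.5, 2.5]]
result = countna(columns)
print(result)
print((len(result[0]), len(result)))   # shape as (nrows, ncols)
```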
countna1()
head()

Return the first n rows of the Frame, same as self[:n, :].

key

Tuple of column names that serve as a primary key for this Frame.

If the Frame is not keyed, this will return an empty tuple.

Assigning to this property will make the Frame keyed by the specified column(s). The key columns will be moved to the front, and the Frame will be sorted. The values in the key columns must be unique.
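The three effects of assigning a key (uniqueness check, key columns moved to the front, rows sorted) can be modelled as follows. This is a sketch on a plain dict of column lists; set_key is a hypothetical helper, and both the ascending sort order and the ValueError for duplicate keys are assumptions.

```python
def set_key(frame, key_names):
    """Return a keyed copy of a model frame (dict: column name -> list)."""
    key_names = tuple(key_names)
    nrows = len(next(iter(frame.values()), []))
    # One tuple of key values per row.
    key_rows = [tuple(frame[name][i] for name in key_names) for i in range(nrows)]
    if len(set(key_rows)) != nrows:
        raise ValueError("values in the key columns must be unique")
    # Row order that sorts the frame by its key (ascending, by assumption).
    order = sorted(range(nrows), key=key_rows.__getitem__)
    # Key columns first, remaining columns in their original order.
    ordered_names = list(key_names) + [n for n in frame if n not in key_names]
    return {name: [frame[name][i] for i in order] for name in ordered_names}


frame = {"val": ["c", "a", "b"], "id": [3, 1, 2]}
keyed = set_key(frame, ["id"])
print(list(keyed))   # column order after keying
print(keyed)
```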

ltypes

The tuple of each column’s ltypes (“logical types”).

materialize()
max()

Get the maximum value of each column.

Returns:
  • A new datatable of shape (1, ncols) containing the computed maximum values for each column (or NA if not applicable).
max1()
mean()

Get the mean of each column.

Returns:
  • A new datatable of shape (1, ncols) containing the computed mean values for each column (or NA if not applicable).
mean1()
min()

Get the minimum value of each column.

Returns:
  • A new datatable of shape (1, ncols) containing the computed minimum values for each column (or NA if not applicable).
min1()
mode()

Get the modal value of each column.

Returns:
  • A new datatable of shape (1, ncols) containing the most frequent value of each column.
mode1()
names

Tuple of column names.

You can rename the Frame’s columns by assigning a new list/tuple of names to this property. The length of the new list of names must be the same as the number of columns in the Frame.

It is also possible to rename just a few columns by assigning a dictionary {oldname: newname, ...}. Any column not listed in the dictionary will retain its name.

Examples

>>> d0 = dt.Frame([[1], [2], [3]])
>>> d0.names = ['A', 'B', 'C']
>>> d0.names
('A', 'B', 'C')
>>> d0.names = {'B': 'middle'}
>>> d0.names
('A', 'middle', 'C')
>>> del d0.names
>>> d0.names
('C0', 'C1', 'C2')
ncols

Number of columns in the Frame.

nmodal()

Get the number of modal values in each column.

Returns:
  • A new datatable of shape (1, ncols) containing, for each column, the number of times its most frequent value occurs.
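The difference between the modal value and the modal count is easy to see on a single column. The sketch below uses collections.Counter from the standard library on a plain list; skipping NAs (None) is an assumption of this model, and the example avoids ties because tie-breaking is not specified here.

```python
from collections import Counter


def mode_and_nmodal(column):
    """Return (most frequent value, number of its occurrences) for a list."""
    counts = Counter(value for value in column if value is not None)
    if not counts:
        # No non-NA values: no mode, and a count of zero.
        return None, 0
    value, count = counts.most_common(1)[0]
    return value, count


column = [3, 1, 3, None, 2, 3]
modal_value, modal_count = mode_and_nmodal(column)
print(modal_value)   # the value a mode-style reduction reports
print(modal_count)   # the count an nmodal-style reduction reports
```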
nmodal1()
nrows

Number of rows in the Frame.

Assigning to this property will change the height of the Frame, either by truncating if the new number of rows is smaller than the current, or filling with NAs if the new number of rows is greater.

Increasing the number of rows of a keyed Frame is not allowed.
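The truncate-or-pad behaviour can be modelled in a few lines. This is a sketch on plain column lists with None standing in for NA; resize is a hypothetical helper, and the restriction on keyed Frames is not modelled.

```python
def resize(columns, new_nrows):
    """Truncate each column to new_nrows, or pad it with None up to new_nrows."""
    return [
        column[:new_nrows] + [None] * (new_nrows - len(column))
        for column in columns
    ]


columns = [[1, 2, 3], ["a", "b", "c"]]
print(resize(columns, 2))   # fewer rows: data is truncated
print(resize(columns, 5))   # more rows: new rows are filled with None
```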

nunique()

Get the number of unique values in each column.

Returns:
  • A new datatable of shape (1, ncols) containing the counted number of
  • unique values in each column.
nunique1()
rbind(*frames, force=False, bynames=True)

Append rows of frames to the current Frame.

This is equivalent to list.extend() in Python: the Frames are combined by rows, i.e. rbinding a Frame of shape [n x k] to a Frame of shape [m x k] produces a Frame of shape [(m + n) x k].

This method modifies the current Frame in-place. If you do not want the current Frame modified, then append all Frames to an empty Frame: dt.Frame().rbind(frame1, frame2).

If the Frame(s) being appended have columns of types different from the current Frame, then these columns will be promoted to the larger of the two types: bool -> int -> float -> string.

If you need to append multiple Frames, then it is more efficient to collect them into a list first and then do a single rbind() than it is to append them one-by-one.

Appending data to a Frame opened from disk will force loading the current Frame into memory, which may fail with an OutOfMemory exception.

Parameters:
  • frames (sequence or list of Frames) – One or more Frames to append. These Frames should have the same columnar structure as the current Frame (unless option force is used).
  • force (boolean, default False) – If True, then the Frames are allowed to have mismatching sets of columns. Any gaps in the data will be filled with NAs.
  • bynames (boolean, default True) – If True, the columns in Frames are matched by their names. For example, if one Frame has columns ["colA", "colB", "colC"] and the other ["colB", "colA", "colC"] then we will swap the order of the first two columns of the appended Frame before performing the append. However, if bynames is False, then the column names will be ignored, and the columns will be matched according to their order, i.e. the i-th column in the current Frame to the i-th column in each appended Frame.
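Two of the rules above, type promotion and the bynames matching, can be sketched in plain Python. The names below (PROMOTION_ORDER, promote, and this rbind function) are hypothetical and model only the documented rules: a frame is a list of (name, values) pairs, the force option is not modelled, and the function returns a new list rather than modifying anything in place.

```python
# Promotion chain taken from the text: bool -> int -> float -> string.
PROMOTION_ORDER = ["bool", "int", "float", "string"]


def promote(type_a, type_b):
    """Return the larger of two type names according to the promotion chain."""
    return max(type_a, type_b, key=PROMOTION_ORDER.index)


def rbind(current, other, bynames=True):
    """Append the rows of `other` to `current` (lists of (name, values) pairs)."""
    if bynames:
        # Reorder the appended columns to follow the current frame's names.
        lookup = dict(other)
        other = [(name, lookup[name]) for name, _ in current]
    # Otherwise columns are matched purely by position.
    return [
        (name, values + extra)
        for (name, values), (_, extra) in zip(current, other)
    ]


current = [("colA", [1]), ("colB", [2]), ("colC", [3])]
other = [("colB", [20]), ("colA", [10]), ("colC", [30])]

print(promote("int", "bool"))
print(rbind(current, other))                  # matched by name
print(rbind(current, other, bynames=False))   # matched by position
```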
replace()

Replace given value(s) replace_what with replace_with in the entire Frame.

For each replace value, this method operates only on columns of types appropriate for that value. For example, if replace_what is a list [-1, math.inf, None, "??"], then the value -1 will be replaced in integer columns only, math.inf only in real columns, None in columns of all types, and finally "??" only in string columns.

The replacement value must match the type of the target being replaced, otherwise an exception will be thrown. That is, a bool must be replaced with a bool, an int with an int, a float with a float, and a string with a string. The None value (representing NA) matches any column type, and therefore can be used as either the replacement target or the replacement value for any column. In particular, the following is valid: DT.replace(None, [-1, -1.0, ""]). This will replace NA values in int columns with -1, in real columns with -1.0, and in string columns with an empty string.

The replace operation never causes a column to change its logical type. Thus, an integer column will remain integer, a string column will remain string, etc. However, replacing may cause a column to change its stype, provided that the ltype remains constant. For example, replacing 0 with -999 within an int8 column will cause that column to be converted into the int32 stype.

Parameters:
  • replace_what (None, bool, int, float, str, list, or dict) – Value(s) to search for and replace.
  • replace_with (single value, or list) – The replacement value(s). If replace_what is a single value, then this must be a single value too. If replace_what is a list, then this could be either a single value, or a list of the same length. If replace_what is a dict, then this value should not be passed.
Returns:
  • Nothing; the replacement is performed in-place.

Examples

>>> df = dt.Frame([1, 2, 3] * 3)
>>> df.replace(1, -1)
>>> df.to_list()
[[-1, 2, 3, -1, 2, 3, -1, 2, 3]]
>>> df.replace({-1: 100, 2: 200, "foo": None})
>>> df.to_list()
[[100, 200, 3, 100, 200, 3, 100, 200, 3]]
save(dest=None, format='jay', _strategy='auto')

Save Frame in binary NFF/Jay format.

Parameters:
  • dest – destination where the Frame should be saved.
  • format – either “nff” or “jay”
  • _strategy – one of “mmap”, “write” or “auto”
sd()

Get the standard deviation of each column.

Returns:
  • A new datatable of shape (1, ncols) containing the computed standard deviation values for each column (or NA if not applicable).
sd1()
shape

Tuple with (nrows, ncols) dimensions of the Frame.

stypes

The tuple of each column’s stypes (“storage types”).

sum()

Get the sum of each column.

Returns:
  • A new datatable of shape (1, ncols) containing the computed sums for each column (or NA if not applicable).
sum1()
tail()

Return the last n rows of the Frame, same as self[-n:, :].

to_csv(path='', nthreads=0, hex=False, verbose=False, **kwargs)

Write the Frame into the provided file in CSV format.

Parameters:
  • dt (Frame) – Frame object to write into CSV.
  • path (str) – Path to the output CSV file that will be created. If the file already exists, it will be overwritten. If path is not given, then the Frame will be serialized into a string, and that string will be returned.
  • nthreads (int) – How many threads to use for writing. The value of 0 means to use all available threads. Negative values mean to use that many threads less than the maximum available.
  • hex (bool) – If True, then all floating-point values will be printed in hex format (equivalent to %a format in C printf). This format is around 3 times faster to write/read compared to usual decimal representation, so its use is recommended if you need maximum speed.
  • verbose (bool) – If True, some extra information will be printed to the console, which may help to debug the inner workings of the algorithm.
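Two of these parameters lend themselves to a short illustration using only the Python standard library. The helper resolve_nthreads and its max_threads argument are hypothetical and merely model the documented meaning of nthreads (clamping the result to at least 1 and at most the maximum is an assumption). The second part uses float.hex(), which produces the same kind of hexadecimal floating-point notation as C's %a, to show that such text round-trips exactly.

```python
import os


def resolve_nthreads(nthreads, max_threads=None):
    """Model of how an nthreads argument maps to an actual thread count."""
    if max_threads is None:
        max_threads = os.cpu_count() or 1
    if nthreads <= 0:
        # 0 means "all threads"; -k means "k fewer than the maximum".
        return max(1, max_threads + nthreads)
    return min(nthreads, max_threads)


print(resolve_nthreads(0, max_threads=8))    # all available threads
print(resolve_nthreads(-2, max_threads=8))   # two fewer than the maximum

# Hexadecimal float notation, as used when hex=True is requested.
value = 0.1
as_hex = value.hex()
print(as_hex)
print(float.fromhex(as_hex) == value)        # the hex text round-trips exactly
```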
to_dict()

Convert the Frame into a dictionary of lists, by columns.

Returns a dictionary with ncols entries, each being a colname: coldata pair, where colname is a string, and coldata is a list of the column’s data.

Examples

>>> DT = dt.Frame(A=[1, 2, 3], B=["aye", "nay", "tain"])
>>> DT.to_dict()
{"A": [1, 2, 3], "B": ["aye", "nay", "tain"]}
to_list()

Convert the Frame into a list of lists, by columns.

Returns a list of ncols lists, each inner list representing one column of the Frame.

Examples

>>> DT = dt.Frame(A=[1, 2, 3], B=["aye", "nay", "tain"])
>>> DT.to_list()
[[1, 2, 3], ["aye", "nay", "tain"]]
to_numpy(stype=None)

Convert Frame into a numpy array, optionally forcing it into a specific stype/dtype.

Parameters:
  • stype (datatable.stype, numpy.dtype or str) – Cast the Frame into this dtype before converting it into a numpy array.
to_pandas()

Convert the Frame to a pandas DataFrame. Raises an error if the pandas module is not installed.

to_tuples()

Convert the Frame into a list of tuples, by rows.

Returns a list having nrows tuples, where each tuple has length ncols and contains data from each respective row of the Frame.

Examples

>>> DT = dt.Frame(A=[1, 2, 3], B=["aye", "nay", "tain"])
>>> DT.to_tuples()
[(1, "aye"), (2, "nay"), (3, "tain")]
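The three conversions above are related by a simple transposition, which can be shown on plain Python data without datatable itself. The lists below reuse the values from the examples; names and columns are ordinary variables standing in for the Frame's column names and the by-column list.

```python
names = ["A", "B"]
columns = [[1, 2, 3], ["aye", "nay", "tain"]]   # by columns, as in to_list()

# By rows: transpose the columns with zip().
rows = list(zip(*columns))
print(rows)

# By name: pair each column name with its column.
as_dict = dict(zip(names, columns))
print(as_dict)

# Transposing the rows again recovers the by-column layout.
back_to_columns = [list(column) for column in zip(*rows)]
print(back_to_columns == columns)
```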