Frame

class datatable.Frame

Two-dimensional column-oriented table of data. Each column has its own name and type. Types may vary across columns (unlike in a NumPy array) but cannot vary within a column (unlike in a pandas DataFrame).

Internally the data is stored as C primitives, and processed using multithreaded native C++ code.

This is the primary data structure of the datatable module.

cbind()

Append the columns of one or more Frames (frames) to the current Frame.

This is equivalent to pandas.concat(axis=1): the Frames are combined by columns, i.e. cbinding a Frame of shape [n x m] to a Frame of shape [n x k] produces a Frame of shape [n x (m + k)].

As a special case, if you cbind a single-row Frame, then that row will be replicated as many times as there are rows in the current Frame. This makes it easy to create constant columns, or to append reduction results (such as min/max/mean/etc) to the current Frame.

If the Frame(s) being appended have different numbers of rows (with the exception of Frames having 1 row), then the operation will fail by default. You can force cbinding these Frames anyway by passing the option force=True: this will fill all ‘short’ Frames with NAs. Thus Frames with 1 row are treated differently from Frames with any other number of rows: the former are replicated, the latter are padded.

Parameters:
  • frames (sequence or list of Frames) – One or more Frames to append. They should have the same number of rows (unless option force is also used).
  • force (boolean) – If True, allows Frames to be appended even if they have unequal numbers of rows. The resulting Frame will have as many rows as the tallest of the Frames. Frames with fewer rows than that will be padded with NAs (with the exception of Frames having just 1 row, which will be replicated instead of being filled with NAs).
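
The row-count rules above can be modelled in a few lines of plain Python. This is only an illustrative sketch and not datatable's implementation: the helper cbind_columns is hypothetical, each frame is represented as a non-empty list of columns (plain lists), and None stands in for NA.

```python
def cbind_columns(frames, force=False):
    """Model of cbind row-count handling (hypothetical helper, not datatable API).

    Each frame is a non-empty list of columns; each column is a list of values.
    """
    heights = [len(frame[0]) for frame in frames]
    target = max(heights)
    result = []
    for frame, height in zip(frames, heights):
        for column in frame:
            if height == target:
                result.append(list(column))
            elif height == 1:
                # A single-row frame is replicated to the full height.
                result.append(list(column) * target)
            elif force:
                # Shorter frames are padded with NA (None) only when forced.
                result.append(list(column) + [None] * (target - height))
            else:
                raise ValueError("frames have different numbers of rows")
    return result


current = [[1, 2, 3], [4, 5, 6]]   # 3 rows, 2 columns
constant = [[0]]                   # 1 row, 1 column
short = [[7, 8]]                   # 2 rows, 1 column

replicated = cbind_columns([current, constant])
padded = cbind_columns([current, short], force=True)
```

In this model the single-row frame becomes a constant column of height 3, while the two-row frame is accepted only with force=True and receives one trailing None.
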
colindex()

Return the index of the column name, or raise a ValueError if the requested column does not exist.

Parameters: name (str) – The name of the column whose index is sought. This can also be the index of a column, in which case it is checked to be within bounds, and a negative index is converted into a positive one.
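
The lookup rule can be sketched in plain Python as follows. The function below is a hypothetical stand-in operating on a tuple of names, not datatable's own code; in particular, raising ValueError for an out-of-bounds integer index is an assumption, since the text above only specifies the error for a missing name.

```python
def colindex(names, name):
    """Model of column lookup (hypothetical helper, not datatable's code)."""
    ncols = len(names)
    if isinstance(name, int):
        if not -ncols <= name < ncols:
            # Assumption: an out-of-bounds index is reported as ValueError too.
            raise ValueError(f"column index {name} is out of bounds")
        # Negative indices count from the end, as in Python sequences.
        return name + ncols if name < 0 else name
    if name not in names:
        raise ValueError(f"column {name!r} does not exist")
    return names.index(name)


names = ("A", "B", "C")
by_name = colindex(names, "B")
from_end = colindex(names, -1)
```
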
copy()

Make a copy of this frame.

This method creates a shallow copy of the current frame: only references are copied, not the data itself. However, due to copy-on-write semantics any changes made to one of the frames will not propagate to the other. Thus, for all intents and purposes the copied frame will behave as if it was deep-copied.
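
A toy model of this behaviour, written in plain Python, may help. The CowFrame class below is hypothetical and much simpler than datatable's internals: it always duplicates a column before writing to it, whereas a real copy-on-write scheme would only do so when the column is shared.

```python
class CowFrame:
    """Toy copy-on-write frame (hypothetical; not datatable's implementation)."""

    def __init__(self, columns):
        self._columns = columns   # dict: column name -> list of values

    def copy(self):
        # Shallow copy: a new dict that references the same column lists.
        return CowFrame(dict(self._columns))

    def column(self, name):
        return self._columns[name]

    def set_value(self, name, row, value):
        # Copy-on-write: duplicate the column before modifying it, so that
        # other frames referencing the old list are unaffected.
        updated = list(self._columns[name])
        updated[row] = value
        self._columns[name] = updated


original = CowFrame({"A": [1, 2, 3]})
duplicate = original.copy()
shared_before_write = duplicate.column("A") is original.column("A")
duplicate.set_value("A", 0, 99)
```

Before the write both frames reference the same list, so the copy costs almost nothing; after the write only the modified frame sees the new value.
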

countna()
countna1()
head()

Return the first n rows of the frame, same as self[:n, :].

key

Tuple of column names that serve as a primary key for this Frame.

If the Frame is not keyed, this will return an empty tuple.

Assigning to this property will make the Frame keyed by the specified column(s). The key columns will be moved to the front, and the Frame will be sorted. The values in the key columns must be unique.
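
The effect of assigning a key can be modelled in plain Python. The helper set_key below is hypothetical and represents a frame as a dict of column lists; it assumes that "sorted" means sorted by the key columns, and the ValueError it raises for duplicate keys is likewise an assumption.

```python
def set_key(columns, key):
    """Model of keying a frame (hypothetical helper, not datatable API).

    columns: dict mapping column name -> list of values (insertion-ordered).
    key: tuple of column names that form the key.
    """
    key_rows = list(zip(*(columns[name] for name in key)))
    if len(set(key_rows)) != len(key_rows):
        raise ValueError("values in the key columns must be unique")
    # Row order that sorts the frame by the key columns.
    order = sorted(range(len(key_rows)), key=key_rows.__getitem__)
    # Key columns move to the front; the rest keep their relative order.
    new_names = list(key) + [name for name in columns if name not in key]
    return {name: [columns[name][i] for i in order] for name in new_names}


frame = {"value": [10, 20, 30], "id": ["c", "a", "b"]}
keyed = set_key(frame, ("id",))
```
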

ltypes

The tuple of each column’s ltype (“logical type”).

materialize()

Convert a “view” frame into a regular data frame.

Certain datatable operations produce frames that contain “view” columns. These columns refer to the data in some other column, via a RowIndex object that describes which values from the other column should be picked. This is done in order to improve performance and reduce memory usage of certain operations: a view column avoids copying data from its parent column.

Usually view columns are created transparently to the user, and they are materialized by datatable when necessary. This method, on the other hand, will force all view columns in the frame to be materialized immediately.
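
The idea of a view column can be illustrated with a small plain-Python model. The ViewColumn class below is hypothetical: it stores a reference to the parent data plus a list of row positions (standing in for the RowIndex), and materialize() copies the selected values into a standalone list.

```python
class ViewColumn:
    """Toy view column (hypothetical; not datatable's internal structure)."""

    def __init__(self, parent, rowindex):
        self.parent = parent        # data of the parent column (not copied)
        self.rowindex = rowindex    # which parent rows this view exposes

    def __len__(self):
        return len(self.rowindex)

    def __getitem__(self, i):
        # Every access is redirected through the row index.
        return self.parent[self.rowindex[i]]

    def materialize(self):
        # Copy the selected values so the result no longer needs the parent.
        return [self.parent[i] for i in self.rowindex]


parent = [10, 20, 30, 40, 50]
view = ViewColumn(parent, rowindex=[4, 2, 0])
standalone = view.materialize()
```

Creating the view allocates only the row index, while the materialized list holds its own copy of the three selected values.
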

max()
max1()
mean()
mean1()
min()
min1()
mode()
mode1()
names

Tuple of column names.

You can rename the Frame’s columns by assigning a new list/tuple of names to this property. The length of the new list of names must be the same as the number of columns in the Frame.

It is also possible to rename just a few columns by assigning a dictionary {oldname: newname, ...}. Any column not listed in the dictionary will retain its name.

Examples

>>> import datatable as dt
>>> d0 = dt.Frame([[1], [2], [3]])
>>> d0.names = ['A', 'B', 'C']
>>> d0.names
('A', 'B', 'C')
>>> d0.names = {'B': 'middle'}
>>> d0.names
('A', 'middle', 'C')
>>> del d0.names
>>> d0.names
('C0', 'C1', 'C2')
ncols

Number of columns in the Frame.

nmodal()
nmodal1()
nrows

Number of rows in the Frame.

Assigning to this property will change the height of the Frame, either by truncating if the new number of rows is smaller than the current, or filling with NAs if the new number of rows is greater.

Increasing the number of rows of a keyed Frame is not allowed.
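
The resizing rule can be written down as a short plain-Python sketch for a single column. The helper resize_column is hypothetical, None stands in for NA, and the ValueError raised for a keyed frame is an assumption (the text above only says the operation is not allowed).

```python
def resize_column(column, new_nrows, keyed=False):
    """Model of assigning to nrows (hypothetical helper, not datatable API)."""
    if new_nrows > len(column):
        if keyed:
            # Assumption: the exact exception type is not specified above.
            raise ValueError("cannot increase the number of rows of a keyed frame")
        # Growing pads the column with NA (None).
        return column + [None] * (new_nrows - len(column))
    # Shrinking (or keeping the size) truncates the column.
    return column[:new_nrows]


truncated = resize_column([1, 2, 3], 2)
extended = resize_column([1, 2, 3], 5)
```
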

nunique()
nunique1()
rbind()

Append rows of frames to the current frame.

This is equivalent to list.extend() in Python: the frames are combined by rows, i.e. rbinding a frame of shape [n x k] to a Frame of shape [m x k] produces a frame of shape [(m + n) x k].

This method modifies the current frame in-place. If you do not want the current frame modified, then use dt.rbind() function.

If frame(s) being appended have columns of types different from the current frame, then these columns will be promoted to the largest of their types: bool -> int -> float -> string.

If you need to append multiple frames, then it is more efficient to collect them into a list first and then do a single rbind() than it is to append them one by one.

Appending data to a frame opened from disk will force loading the current frame into memory, which may fail with an OutOfMemory exception if the frame is sufficiently big.

Parameters:
  • frames (sequence or list of Frames) – One or more frames to append. These frames should have the same columnar structure as the current frame (unless option force is used).
  • force (bool) – If True, then the frames are allowed to have mismatching sets of columns. Any gaps in the data will be filled with NAs.
  • bynames (bool) – If True (default), the columns in frames are matched by their names. For example, if one frame has columns ["colA", "colB", "colC"] and the other ["colB", "colA", "colC"], then the first two columns of the appended frame will be swapped before performing the append. However, if bynames is False, then the column names will be ignored, and the columns will be matched according to their order, i.e. the i-th column in the current frame to the i-th column in each appended frame.
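
The interaction of force and bynames can be modelled in plain Python. The helper rbind_frames below is a hypothetical sketch, not datatable's implementation: frames are dicts of column lists with at least one column each, None stands in for NA, type promotion is not modelled, and positional matching is only supported for frames with equal column counts.

```python
def rbind_frames(current, other, force=False, bynames=True):
    """Model of rbind column matching (hypothetical helper, not datatable API)."""
    n_current = len(next(iter(current.values())))
    n_other = len(next(iter(other.values())))
    if not bynames:
        if len(other) != len(current):
            raise ValueError("this sketch needs equal column counts when bynames=False")
        # Ignore the names of `other`: match its i-th column to the i-th
        # column of `current`.
        other = dict(zip(current, other.values()))
    if set(other) != set(current) and not force:
        raise ValueError("frames have mismatching columns; use force=True")
    names = list(current) + [name for name in other if name not in current]
    # A column missing from one side is filled with NA (None) for its rows.
    return {
        name: current.get(name, [None] * n_current) + other.get(name, [None] * n_other)
        for name in names
    }


a = {"colA": [1, 2], "colB": ["x", "y"]}
b = {"colB": ["z"], "colA": [3]}
c = {"colA": [4], "colC": [True]}

matched_by_name = rbind_frames(a, b)
matched_by_position = rbind_frames(a, b, bynames=False)
forced = rbind_frames(a, c, force=True)
```

With bynames=True the columns of b are reordered before appending; with bynames=False its first column lands under colA regardless of its name.
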
replace()

Replace given value(s) replace_what with replace_with in the entire Frame.

For each replace value, this method operates only on columns of types appropriate for that value. For example, if replace_what is a list [-1, math.inf, None, "??"], then the value -1 will be replaced in integer columns only, math.inf only in real columns, None in columns of all types, and finally "??" only in string columns.

The replacement value must match the type of the target being replaced, otherwise an exception will be thrown. That is, a bool must be replaced with a bool, an int with an int, a float with a float, and a string with a string. The None value (representing NA) matches any column type, and therefore can be used as either replacement target, or replace value for any column. In particular, the following is valid: DT.replace(None, [-1, -1.0, ""]). This will replace NA values in int columns with -1, in real columns with -1.0, and in string columns with an empty string.

The replace operation never causes a column to change its logical type. Thus, an integer column will remain integer, string column remain string, etc. However, replacing may cause a column to change its stype, provided that ltype remains constant. For example, replacing 0 with -999 within an int8 column will cause that column to be converted into the int32 stype.
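
The type-matching rules can be summarised with a small plain-Python sketch. Both helpers below are hypothetical and only model the rules described above; the strings "bool", "int", "real" and "str" are used here simply as labels for the logical types.

```python
import math

ALL_LTYPES = {"bool", "int", "real", "str"}


def target_ltypes(value):
    """Which column ltypes a replace_what value applies to (illustrative model)."""
    if value is None:
        return set(ALL_LTYPES)      # NA matches columns of every type
    if isinstance(value, bool):     # checked before int: bool subclasses int
        return {"bool"}
    if isinstance(value, int):
        return {"int"}
    if isinstance(value, float):
        return {"real"}
    if isinstance(value, str):
        return {"str"}
    raise TypeError(f"unsupported replace value: {value!r}")


def is_valid_pair(replace_what, replace_with):
    """True if the replacement has the same type as the target, or either is NA."""
    if replace_what is None or replace_with is None:
        return True
    return target_ltypes(replace_what) == target_ltypes(replace_with)


plan = [target_ltypes(v) for v in [-1, math.inf, None, "??"]]
```
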

Parameters:
  • replace_what (None, bool, int, float, str, list, or dict) – Value(s) to search for and replace.
  • replace_with (single value, or list) – The replacement value(s). If replace_what is a single value, then this must be a single value too. If replace_what is a list, then this could be either a single value, or a list of the same length. If replace_what is a dict, then this value should not be passed.
Returns: Nothing; the replacement is performed in-place.

Examples

>>> import datatable as dt
>>> df = dt.Frame([1, 2, 3] * 3)
>>> df.replace(1, -1)
>>> df.to_list()
[[-1, 2, 3, -1, 2, 3, -1, 2, 3]]
>>> df.replace({-1: 100, 2: 200, "foo": None})
>>> df.to_list()
[[100, 200, 3, 100, 200, 3, 100, 200, 3]]
save(path, format='jay', _strategy='auto')
sd()
sd1()
shape

Tuple with the (nrows, ncols) dimensions of the Frame.

stypes

The tuple of each column’s stype (“storage type”).

sum()
sum1()
tail()

Return the last n rows of the frame, same as self[-n:, :].

to_csv()

Write the Frame into the provided file in CSV format.

Parameters:
  • path (str) – Path to the output CSV file that will be created. If the file already exists, it will be overwritten. If no path is given, then the Frame will be serialized into a string, and that string will be returned.
  • nthreads (int) – How many threads to use for writing. The value 0 means to use all available threads. A negative value means to use that many threads fewer than the maximum available. If this parameter is omitted, then dt.options.nthreads will be used.
  • hex (bool) – If True, then all floating-point values will be printed in hex format (equivalent to %a format in C printf). This format is around 3 times faster to write/read compared to usual decimal representation, so its use is recommended if you need maximum speed.
  • verbose (bool) – If True, some extra information will be printed to the console, which may help to debug the inner workings of the algorithm.
  • _strategy ("mmap" | "write" | "auto") – Which method to use for writing to disk. On certain systems ‘mmap’ gives a better performance; on other OSes ‘mmap’ may not work at all.
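
Two of the parameters above can be illustrated without datatable itself. The sketch below first models how an nthreads value might be resolved (the helper resolve_nthreads is hypothetical, and clamping the result to at least 1 is an assumption), and then shows the hex floating-point notation using Python's built-in float.hex(), which writes the same kind of hexadecimal notation as C's %a.

```python
def resolve_nthreads(nthreads, max_threads, default):
    """Model of the nthreads rule (hypothetical helper, not datatable API)."""
    if nthreads is None:
        return default                       # omitted: use the global option
    if nthreads <= 0:
        # 0 means all threads; -k means k fewer than the maximum.
        # Assumption: never go below one thread.
        return max(1, max_threads + nthreads)
    return nthreads


all_threads = resolve_nthreads(0, max_threads=8, default=4)
two_fewer = resolve_nthreads(-2, max_threads=8, default=4)

# Hex notation is an exact representation of the binary value, so it can be
# parsed back without any loss of precision.
value = 0.1
hex_text = value.hex()
restored = float.fromhex(hex_text)
```

Here hex_text is '0x1.999999999999ap-4', and parsing it back yields exactly the original float, which is why no decimal rounding work is needed in either direction.
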
to_dict()

Convert the Frame into a dictionary of lists, by columns.

Returns a dictionary with ncols entries, each being a colname: coldata pair, where colname is a string and coldata is a list of the column’s data.

Examples

>>> import datatable as dt
>>> DT = dt.Frame(A=[1, 2, 3], B=["aye", "nay", "tain"])
>>> DT.to_dict()
{'A': [1, 2, 3], 'B': ['aye', 'nay', 'tain']}
to_jay()

Save this frame to a binary file on disk, in .jay format.

Parameters:
  • path (str) – The destination file name. Although not necessary, we recommend using extension “.jay” for the file. If the file exists, it will be overwritten. If this argument is omitted, the file will be created in memory instead, and returned as a bytes object.
  • _strategy ('mmap' | 'write' | 'auto') – Which method to use for writing the file to disk. The “write” method is more portable across different operating systems, but may be slower. This parameter has no effect when path is omitted.
to_list()

Convert the Frame into a list of lists, by columns.

Returns a list of ncols lists, each inner list representing one column of the Frame.

Examples

>>> import datatable as dt
>>> DT = dt.Frame(A=[1, 2, 3], B=["aye", "nay", "tain"])
>>> DT.to_list()
[[1, 2, 3], ['aye', 'nay', 'tain']]
to_numpy()

Convert frame into a 2D numpy array, optionally forcing it into the specified stype/dtype.

In a limited set of circumstances the returned numpy array will be created as a data view, avoiding copying the data. This happens if all of these conditions are met:

  • the frame is not a view;
  • the frame has only 1 column;
  • the column’s type is not string;
  • the stype argument was not used.

In all other cases the returned numpy array will hold a copy of the frame’s data. If the frame has multiple columns of different stypes, then the values will be upcast into the smallest common stype.

If the frame has any NA values, then the returned numpy array will be an instance of numpy.ma.masked_array.

Parameters:
  • stype (datatable.stype, numpy.dtype or str) – Cast frame into this stype before converting it into a numpy array.
  • column (int) – Convert only the specified column; the returned value will be a 1D-array instead of a regular 2D-array.
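
The conditions under which a view (rather than a copy) is returned can be restated as a single predicate. The function below is a hypothetical plain-Python summary of the list above; its name and arguments are illustrative and do not correspond to anything in datatable.

```python
def returns_view(frame_is_view, ncols, column_is_string, stype_given):
    """True when to_numpy() could avoid copying, per the conditions above."""
    return (
        not frame_is_view          # the frame itself must not be a view
        and ncols == 1             # exactly one column
        and not column_is_string   # string columns are always copied
        and not stype_given        # no explicit stype conversion requested
    )


zero_copy = returns_view(False, 1, False, False)
copied_multi_column = returns_view(False, 2, False, False)
```
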
to_pandas()

Convert this frame to a pandas DataFrame.

The pandas module is required to run this function.

to_tuples()

Convert the Frame into a list of tuples, by rows.

Returns a list having nrows tuples, where each tuple has length ncols and contains data from each respective row of the Frame.

Examples

>>> import datatable as dt
>>> DT = dt.Frame(A=[1, 2, 3], B=["aye", "nay", "tain"])
>>> DT.to_tuples()
[(1, 'aye'), (2, 'nay'), (3, 'tain')]
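
The relationship between the by-column and by-row conversions can be expressed in plain Python, starting from the column names and the list-of-columns form. This is only a model of the output shapes using built-in functions; it does not call datatable.

```python
names = ("A", "B")
columns = [[1, 2, 3], ["aye", "nay", "tain"]]   # same shape as to_list() output

# Pairing each name with its column gives the to_dict() shape.
as_dict = dict(zip(names, columns))

# Transposing the columns gives the to_tuples() shape: one tuple per row.
as_tuples = list(zip(*columns))
```
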