Version 0.10.0¶

Release date: 2019-12-02
Next release: Version 0.10.1
Previous release: Version 0.9.0
Wheels (MacOS): python-3.5, python-3.6, python-3.7
Wheels (Linux x86-64): python-3.5, python-3.6, python-3.7

Columnsets¶

Row-functions¶

We added a family of new “row-wise” aggregation functions, which operate on any number of columns in a frame and produce a single column as a result. These functions aggregate across rows instead of columns (unlike traditional reducers), and therefore they all have the prefix row:

rowall()    # are all values True?
rowany()    # is there any True value?
rowcount()  # count non-missing values in a row
rowfirst()  # find the first non-missing value
rowlast()   # the last non-missing value
rowmax()    # largest value in a row
rowmean()   # mean of all values in a row
rowmin()    # smallest value in a row
rowsd()     # standard deviation of all row values
rowsum()    # the sum of values in a row

These functions are equivalent to pandas’ aggregation functions with parameter (axis=1). In addition, the first two of these functions (rowall() and rowany()) use short-circuit evaluation.
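As a rough illustration of the row-wise semantics listed above, here is a plain-Python model over lists of row values. It is not datatable's implementation: the helper names are invented for this sketch, missing values are represented by None, and it assumes that the sum skips missing values.

```python
# Plain-Python model of a few row-wise reducers; None stands for a missing value.
# These helpers are illustrative only and are not part of datatable.

def row_count(row):
    """Number of non-missing values in the row."""
    return sum(1 for v in row if v is not None)

def row_first(row):
    """First non-missing value in the row, or None if all values are missing."""
    return next((v for v in row if v is not None), None)

def row_sum(row):
    """Sum of the non-missing values in the row (assumption: NAs are skipped)."""
    return sum(v for v in row if v is not None)

def row_any(row):
    """True if any value is True; any() stops at the first True it encounters."""
    return any(v is True for v in row)

rows = [
    [1, None, 3],
    [None, None, 7],
    [4, 5, 6],
]
counts = [row_count(r) for r in rows]
firsts = [row_first(r) for r in rows]
sums = [row_sum(r) for r in rows]
```

Each helper takes one row and returns one value, so applying it to every row yields a single result column, which is the shape of output described above.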

Mathematical functions¶

The new module dt.math has implementations of most mathematical functions found in numpy or in the standard Python math module. These functions work with Frame objects, and can be used in DT[i,j,...] expressions:

DT[:, f.X * dt.math.cos(f.Phi) + f.Y * dt.math.sin(f.Phi)]

There are 48 functions in total.

Frame¶

Added function dt.update(), which can be used in the j place of a DT[i, j] expression. It allows the user to create new columns in a Frame, or update the existing ones:

DT = dt.Frame(A=range(5))
DT[:, update(A=f.A * 2, B=dt.str32(f.A), Z=0)]
Added method .export_names() which returns a tuple of variables referencing each column in the Frame:

PROC_ID, SORT_NR, *other = DT.export_names()
DT[(PROC_ID == "A") & (SORT_NR > 2), :]

If you need to export only a subset of columns, you can select those columns first via the standard DT[i, j] syntax:

# Only create variables for the first 5 columns
A, B, C, D, E = DT[:, :5].export_names()
Added frame property .stype which is similar to .stypes except that it returns a single stype instead of a tuple. This property can only be used on a frame where all columns have the same stype, or where there is only one column.
When a frame is displayed in a console, it will now display the first 15 + the last 5 rows, similarly to how it is rendered in Jupyter notebook. Also, if the frame has 30 rows or fewer, it will be shown in full.

These parameters are configurable via the options dt.options.display.head_nrows, dt.options.display.tail_nrows and dt.options.display.max_nrows.
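The display rule can be sketched as a small function. This is only a model of the behaviour described above: the function name is hypothetical, and the parameter names and defaults simply mirror the three options and the numbers quoted in the text.

```python
def displayed_rows(nrows, head_nrows=15, tail_nrows=5, max_nrows=30):
    """Return the row indices a console view would show under the rule above."""
    if nrows <= max_nrows:
        # Small frames are shown in full.
        return list(range(nrows))
    # Otherwise show the first head_nrows rows and the last tail_nrows rows.
    head = list(range(head_nrows))
    tail = list(range(nrows - tail_nrows, nrows))
    return head + tail

small = displayed_rows(30)     # all 30 rows
large = displayed_rows(1000)   # rows 0..14 followed by rows 995..999
```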
Method .copy() now has a new parameter deep=False. When set to True, it will create a deep copy of the frame instead of the usual shallow one.

In addition, standard python functions copy.copy() and copy.deepcopy() will now defer to the Frame.copy() method too.
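For readers unfamiliar with how the copy module can defer to a class's own method, the sketch below shows the general Python protocol with a hypothetical TinyFrame class. It says nothing about how datatable stores or shares column data internally; only the `__copy__` / `__deepcopy__` hooks are the point.

```python
import copy

class TinyFrame:
    """Hypothetical stand-in for a frame: a dict of column name -> list."""

    def __init__(self, columns):
        self.columns = columns

    def copy(self, deep=False):
        if deep:
            # Deep copy: duplicate the column data as well.
            return TinyFrame({k: list(v) for k, v in self.columns.items()})
        # Shallow copy: new container, same underlying column objects.
        return TinyFrame(dict(self.columns))

    def __copy__(self):
        # Called by copy.copy()
        return self.copy()

    def __deepcopy__(self, memo):
        # Called by copy.deepcopy()
        return self.copy(deep=True)

orig = TinyFrame({"A": [1, 2, 3]})
shallow = copy.copy(orig)
deep = copy.deepcopy(orig)
```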
It is now possible to create a Frame from a list of numpy integers/floats. The resulting Frame will have the stype corresponding to the largest dtype among all elements in the list:

import numpy as np
DT = dt.Frame([np.int32(1), np.int32(3), np.int64(8)])
assert DT.shape == (3, 1)
assert DT.stype == dt.int64
When an integer column is used to select rows from a Frame, that column is now allowed to contain NA values, which produces a row filled with missing values:

DT = dt.Frame(A=['a', 'b', 'c', 'd', 'e'])
rows = dt.Frame([2, 0, None, 1, 2])
assert DT[rows, :].to_list() == [['c', 'a', None, 'b', 'c']]
Added option display.max_column_width. Cells whose content is larger than this value will be automatically truncated when a Frame is rendered into a terminal.
When selecting the key column from a keyed frame DT[key], the resulting single-column frame will now retain its “keyed” property.
Method .to_csv() gains two new boolean parameters: header= and append=. The header= parameter controls whether or not to write into the output the header row with column names. The append= parameter allows the CSV content to be appended to an existing file instead of overwriting it:

DT.to_csv("out.log", append=True) # infer that header=False if file exists
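The header inference mentioned in the comment above can be modelled with the standard csv module. This is a sketch of the idea only, not of .to_csv() itself: the append_rows helper is hypothetical, and it assumes the header is written exactly when the target file does not exist yet.

```python
import csv
import os
import tempfile

def append_rows(path, names, rows, header=None):
    """Append rows to a CSV file, writing the header only when asked or inferred."""
    if header is None:
        # Inferred default: no header if the file already exists.
        header = not os.path.exists(path)
    with open(path, "a", newline="") as out:
        writer = csv.writer(out)
        if header:
            writer.writerow(names)
        writer.writerows(rows)

path = os.path.join(tempfile.mkdtemp(), "out.log")
append_rows(path, ["id", "value"], [[1, "a"], [2, "b"]])  # creates the file, writes the header
append_rows(path, ["id", "value"], [[3, "c"]])            # appends without a second header
with open(path, newline="") as src:
    lines = src.read().splitlines()
```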
Range objects can now be used directly in DT[i, j] expressions in any place where a column could be expected:

DT["id"] = range(1000)
Implemented ability to select a specific row within each group, using the syntax:

DT[2, :, by(f.GRP)]

If the index is invalid for some of the groups, those groups will be discarded.
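To make the "discarded groups" rule concrete, here is a plain-Python model that works on a list of dicts. The function name is hypothetical, only non-negative indices are modelled, and returning groups in sorted key order is an assumption of this sketch.

```python
def nth_row_per_group(rows, key, n):
    """Pick the n-th row (0-based) of each group; groups that are too short are dropped."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    result = []
    for group_key in sorted(groups):       # assumption: groups come out sorted by key
        members = groups[group_key]
        if 0 <= n < len(members):          # index valid for this group
            result.append(members[n])
    return result

rows = [
    {"GRP": "a", "V": 1},
    {"GRP": "b", "V": 2},
    {"GRP": "a", "V": 3},
    {"GRP": "a", "V": 5},
    {"GRP": "b", "V": 4},
]
third = nth_row_per_group(rows, "GRP", 2)   # group "b" has only 2 rows and is discarded
first = nth_row_per_group(rows, "GRP", 0)   # both groups contribute a row
```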
Assigning a python type or an stype to a column or set of columns will now perform a type-cast on those columns:

DT["A"] = int            # Column A in frame DT will become integer
DT[:, int] = dt.float64  # All integer columns will be converted to float64
Method Frame.materialize() gains a new option to_memory=False. If set to True, it will force the Frame’s data to be lifted from disk into the main memory (if the frame was opened from disk):

DT = dt.fread("data.jay")
DT.materialize(to_memory=True)
The name deduplication algorithm now starts looking for candidate names starting from name + dt.options.frame.name_auto_index. For example, if you’re creating a Frame with column names ["A", "A", "A"], then those names will be modified to ensure uniqueness. Before, they were changed into ["A", "A.1", "A.2"]; now they are changed into ["A", "A.0", "A.1"] (assuming the value of option frame.name_auto_index is 0).
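A simplified model of this renaming rule is sketched below. It reproduces the ["A", "A", "A"] example from the text, but it is not datatable's actual algorithm, which may treat other cases (for instance names that already end in a number) differently. The function name is hypothetical; the "." separator is taken from the example.

```python
def deduplicate_names(names, name_auto_index=0):
    """Make names unique by appending '.<index>', counting up from name_auto_index."""
    seen = set()
    result = []
    for name in names:
        candidate = name
        if candidate in seen:
            index = name_auto_index
            candidate = f"{name}.{index}"
            # Keep increasing the index until an unused name is found.
            while candidate in seen:
                index += 1
                candidate = f"{name}.{index}"
        seen.add(candidate)
        result.append(candidate)
    return result

new_style = deduplicate_names(["A", "A", "A"], name_auto_index=0)
old_style = deduplicate_names(["A", "A", "A"], name_auto_index=1)
```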
Frame created from a python list of small integers will now have stype int32, instead of int8 or int16 as before. One can still create a column of type int8 by requesting this stype explicitly:

DT1 = dt.Frame([1, 2, 3])
DT2 = dt.Frame([1, 2, 3], stype=dt.int8)
assert DT1.stype == dt.int32
assert DT2.stype == dt.int8

Thanks to @Viktor-Demin for the contribution. #2127
Keyed columns are now styled distinctly from regular columns when rendering the Frame into a Jupyter notebook. #1636
In a Jupyter notebook, Frame’s stylesheets are now injected when datatable is imported. This makes it less likely that the stylesheets will get accidentally removed from the page. However, if that still occurs, there is now also a method to load those styles directly: dt.init_styles(). #1871
Fixed error when displaying help(dt). #1931
fread(cmd=) now throws an error if one occurred while running the provided command cmd in the shell. Previously the error was silently discarded. #1935
Creating a Frame from a degenerate range now produces an empty Frame instead of a 1-row Frame. #1942
Fixed crash when computing mode stat for a view frame. #1953
Fixed a bug where creating a new column via assignment would crash if the RHS of the assignment contained an expression that tried to use the column that was being created. #1983
Fixed a crash when joining a frame that had 0 rows. #1988
Increasing the number of rows in a keyed Frame was documented as invalid, but didn’t actually throw any errors. Now it does. #2021
Operations on a 0-row frame containing string columns will no longer cause an infinite loop. #2043
Fixed conversion of a Frame into a masked numpy array, which was sometimes done incorrectly when some columns in the frame contained NAs while others did not. #2050
Groupby operation on an empty (0-rows) frame now works correctly, returning a 0-row result frame. #2078 For example:

DT = dt.Frame(Id=[], Value=[])   # create a 0x2 frame
DT[:, sum(f.Value), by(f.Id)]    # produces a 0x2 frame
DT[:, sum(f.Value)]              # produces a 1x1 frame
Deleting columns from a keyed Frame no longer results in a crash when the deleted columns are part of the key. #2083
The dt.count() reducer now always produces a column with stype int64. Before, it sometimes produced an int32 column, and sometimes an int64 column.
Setting a key on a copied frame no longer affects the original frame. #2095
When a Frame has a string column containing special characters (such as newlines, tabs, or others from C0/C1 blocks), they will now be properly escaped when the frame is printed in a console. In addition, we now attempt to detect and properly handle 0-width and double-width characters in strings, so that when a column containing such unicode characters is displayed, it should not cause mis-alignment issues.
Option dt.options.display.allow_unicode is now respected when printing a Frame containing string columns with unicode data. These values will now be properly escaped if the option value is False.
Function dt.math.isna() now returns correct result for a column obtained from joining another frame, provided that the join was only partially successful. #2109
Fixed creation of a Frame from a numpy array which was obtained from another numpy array as a slice with a negative stride. #2163

General¶

We no longer export symbols dt.open(), dt.abs(), min(), max() and sum() from datatable module when doing from datatable import *. They are still available when looked up explicitly, i.e. dt.open() will still work.
Function dt.open() is marked as deprecated, and is scheduled to be removed in version 0.12. Instead we recommend using the fread() function to open Jay files.
Support for the NFF format was removed. This was datatable’s old format for storing data frames on disk, and it was deprecated in favor of Jay over a year ago. If you still have any data stored in NFF format, we recommend re-saving it in Jay using datatable 0.9.
Datatable module now exports symbol dt, which is the handle to the module itself. For example, you can now write:

from datatable import dt, f, by, join

The symbol dt is also exported by default, i.e. it will be available if you do from datatable import *.
Added functions cov() and corr() to compute the covariance and Pearson correlation coefficient between columns of a Frame. These functions can be used in a group-by too:

# Compute correlation of columns A and B, group-wise by ID
DT[:, corr(f.A, f.B), by(f.ID)]
Added function shift() which can be used to generate lags/leads of a column. For example:

DT[:, {"lag2": shift(f.A, n=2),
       "lag1": shift(f.A),       # same as shift(f.A, n=1)
       "lag0": f.A,              # same as shift(f.A, n=0)
       "lead1": shift(f.A, -1),
       "lead2": shift(f.A, -2),
      }]

This function is group-aware: when used in an expression containing a groupby, it will apply the shift separately within each group.
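The lag/lead and group-aware behaviour can be modelled on plain lists as follows. This is a sketch, not datatable's shift(): positions vacated by the shift are filled with None, and the grouped helper (a hypothetical name) keeps rows in their original order for simplicity.

```python
def shift(values, n=1):
    """Shift values by n positions: n > 0 gives a lag, n < 0 a lead; gaps become None."""
    size = len(values)
    if n == 0:
        return list(values)
    if abs(n) >= size:
        return [None] * size
    if n > 0:
        return [None] * n + list(values[:size - n])
    return list(values[-n:]) + [None] * (-n)

def shift_by_group(values, groups, n=1):
    """Apply shift() separately within each group, keeping the original row order."""
    positions = {}
    for i, g in enumerate(groups):
        positions.setdefault(g, []).append(i)
    result = [None] * len(values)
    for idx in positions.values():
        shifted = shift([values[i] for i in idx], n)
        for i, v in zip(idx, shifted):
            result[i] = v
    return result

lag1 = shift([1, 2, 3, 4])         # lag by one row
lead2 = shift([1, 2, 3, 4], -2)    # lead by two rows
grouped = shift_by_group([1, 2, 3, 4, 5], ["x", "x", "y", "y", "y"])
```

In the grouped case the first row of every group receives None, because no earlier row exists within that group.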
Fixed memory leak when writing a Frame into a CSV file. #2119
Fixed memory leak when converting a numpy array with string values into a Frame. #2123
Fixed memory leak during reduce operations. #2125
Column method .len() for computing string length now handles unicode strings correctly and returns the number of codepoints in the string instead of the number of bytes. #2160
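The difference between counting codepoints and counting bytes is easy to see in plain Python, where len() on a str counts codepoints. The snippet below only illustrates that distinction, using UTF-8 as the byte encoding for the comparison; it does not call datatable.

```python
text = "na\u00efve caf\u00e9"            # "naïve café", with two non-ASCII characters

codepoints = len(text)                   # number of Unicode codepoints
utf8_bytes = len(text.encode("utf-8"))   # "ï" and "é" each take 2 bytes in UTF-8
```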

Internal¶

Function dt.internal.frame_column_rowindex(DT, i) was removed and replaced with dt.internal.frame_columns_virtual(DT). The latter returns a tuple of True/False indicators of whether each column in a Frame is virtual or not.
C API version increased to 2.
Removed C API methods and macros related to retrieval of a column’s rowindex:
- DtFrame_ColumnRowindex(),
- DtRowindex_Check(),
- DtRowindex_Type(),
- DtRowindex_Size(),
- DtRowindex_UnpackSlice(),
- DtRowindex_ArrayData(),
- DtRowindex_NONE,
- DtRowindex_ARR32,
- DtRowindex_ARR64,
- DtRowindex_SLICE
Added C API method DtFrame_ColumnIsVirtual() which returns a boolean indicating whether a column in a Frame is virtual or not.

Contributors¶

This release was created with the help of 7 people who contributed code and documentation, and 14 more people who submitted bug reports and feature requests.

Code & documentation contributors:

Issues contributors: