Version 0.10.0¶
Version 0.10.0 | |
---|---|
Release date: | 2019-12-02 |
Next release: | Version 0.10.1 |
Previous release: | Version 0.9.0 |
Wheels | |
MacOS | python-3.5, python-3.6, python-3.7 |
Linux x86-64 | python-3.5, python-3.6, python-3.7 |
Columnsets¶
The f-symbol syntax has been extended to allow selecting multiple columns
from a frame at once, so-called columnsets. The primary use cases are
selecting a slice of columns and selecting columns based on their type:
f[:] # select all columns
f[:5] # select the first 5 columns
f["A":"Z"] # select columns from 'A' to 'Z'
f[float] # select all floating point columns
f[dt.int32] # select all columns with stype int32
f[int if select_ints else
float if select_floats else
None]
In addition, columnsets can be added to or subtracted from one another, which makes it possible to express richer column selections:
f[int].extend(f[float]) # all integers and floating point columns
f[:].remove(f[str]) # all columns except those of string type
f[:10].extend(f[-1]) # first 10, plus the last column
Columnsets can be used wherever a list/sequence of columns is expected,
such as the j node of DT[i,j,...], the by() function, etc.
Row-functions¶
We added a family of new “row-wise” aggregation functions, which operate on
any number of columns in a frame and produce a single column as a result. These
functions aggregate the values within each row rather than within each column
(as traditional reducers do), and therefore they all have the prefix row:
rowall() # are all values True?
rowany() # is there any True value?
rowcount() # count non-missing values in a row
rowfirst() # find the first non-missing value
rowlast() # the last non-missing value
rowmax() # largest value in a row
rowmean() # mean of all values in a row
rowmin() # smallest value in a row
rowsd() # standard deviation of all row values
rowsum() # the sum of values in a row
These functions are equivalent to pandas’ aggregation functions with the
parameter axis=1. In addition, the first two of these functions, rowall() and
rowany(), use short-circuit evaluation.
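To make the row-wise semantics concrete, here is a minimal pure-Python sketch of what rowcount(), rowfirst(), rowlast() and rowany() compute. It does not call datatable: the helper names (`iter_rows`, `py_rowcount` and so on) are hypothetical, columns are plain lists, and `None` stands in for a missing value.

```python
# Pure-Python illustration of row-wise aggregation; this is not datatable code.
# Columns are plain lists of equal length, and None stands in for NA.
columns = {
    "A": [1, None, 3],
    "B": [None, None, 6],
    "C": [7, 8, None],
}


def iter_rows(cols):
    """Yield one tuple per row, taking one value from each column."""
    return zip(*cols.values())


def py_rowcount(cols):
    """Number of non-missing values in each row."""
    return [sum(v is not None for v in row) for row in iter_rows(cols)]


def py_rowfirst(cols):
    """First non-missing value in each row, or None if the whole row is NA."""
    return [next((v for v in row if v is not None), None)
            for row in iter_rows(cols)]


def py_rowlast(cols):
    """Last non-missing value in each row, or None if the whole row is NA."""
    return [next((v for v in reversed(row) if v is not None), None)
            for row in iter_rows(cols)]


def py_rowany(bool_cols):
    """True if a row contains a True value; any() stops at the first one."""
    return [any(v is True for v in row) for row in iter_rows(bool_cols)]


counts = py_rowcount(columns)
firsts = py_rowfirst(columns)
lasts = py_rowlast(columns)
anys = py_rowany({"X": [True, False], "Y": [False, False]})
```

Each helper returns one value per row, which mirrors the "many columns in, one column out" shape described above. The use of `any()` in `py_rowany` shows the short-circuit idea: evaluation of a row stops as soon as a True value is found.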
Mathematical functions¶
The new module datatable.math now has implementations of most mathematical
functions found in numpy or in the standard Python math module. These functions
work with Frame objects, and can be used in DT[i,j,...] expressions:
DT[:, f.X * dt.math.cos(f.Phi) + f.Y * dt.math.sin(f.Phi)]
There are 48 functions in total.
Frame¶
Added function update(), which can be used in the j place of a DT[i, j] expression. It allows the user to create new columns in a Frame, or to update existing ones:
DT = dt.Frame(A=range(5))
DT[:, update(A=f.A * 2, B=dt.str32(f.A), Z=0)]
Added method .export_names(), which returns a tuple of variables referencing each column in the Frame:
PROC_ID, SORT_NR, *other = DT.export_names()
DT[(PROC_ID == "A") & (SORT_NR > 2), :]
If you need to export only a subset of columns, you can select those columns first via the standard DT[i,j] syntax:
# Only create variables for the first 5 columns
A, B, C, D, E = DT[:, :5].export_names()
Added frame property .stype, which is similar to .stypes except that it returns a single stype instead of a tuple. This property can only be used on a frame where all columns have the same stype, or where there is only one column.
When a frame is displayed in a console, it will now display the first 15 + the last 5 rows, similarly to how it is rendered in a Jupyter notebook. Also, if the frame has 30 rows or fewer, it will be shown in full.
These parameters are configurable via the options dt.options.display.head_nrows, dt.options.display.tail_nrows and dt.options.display.max_nrows.
Method Frame.copy() now has a new parameter deep=False. When set to True, it will create a deep copy of the frame instead of the usual shallow one.
In addition, the standard Python functions copy.copy() and copy.deepcopy() will now defer to the Frame.copy() method too.
It is now possible to create a Frame from a list of numpy integers/floats. The resulting Frame will have the stype corresponding to the largest dtype among all elements in the list:
import numpy as np
DT = dt.Frame([np.int32(1), np.int32(3), np.int64(8)])
assert DT.shape == (3, 1)
assert DT.stype == dt.int64
When an integer column is used to select rows from a Frame, that column is now allowed to contain NA values, which produces a row filled with missing values:
DT = dt.Frame(A=['a', 'b', 'c', 'd', 'e'])
rows = dt.Frame([2, 0, None, 1, 2])
assert DT[rows, :].to_list() == [['c', 'a', None, 'b', 'c']]
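The rule for NA row indices can be restated in a few lines of plain Python. This is only a sketch of the semantics, not datatable code: `take_rows` is a hypothetical helper, and `None` stands in for NA.

```python
# Pure-Python illustration of selecting rows with an index that contains NAs.
def take_rows(column, indices):
    """Pick values by integer position; a missing index yields a missing value."""
    return [None if i is None else column[i] for i in indices]


col_a = ["a", "b", "c", "d", "e"]
selected = take_rows(col_a, [2, 0, None, 1, 2])
```

The result has the same length as the index list, and the position that held the missing index holds a missing value.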
Added option display.max_column_width. Cells whose content is larger than this value will be automatically truncated when a Frame is rendered into a terminal.
When selecting the key column from a keyed frame with DT[key], the resulting single-column frame will now retain its “keyed” property.
Method .to_csv() gains two new boolean parameters: header= and append=. The header= parameter controls whether or not to write the header row with column names into the output. The append= parameter allows the CSV content to be appended to an existing file instead of overwriting it:
DT.to_csv("out.log", append=True) # infer that header=False if file exists
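The "append, and skip the header if the file already exists" behaviour can be sketched with the standard csv module. This illustrates the idea only and says nothing about how .to_csv() is implemented: `append_rows` is a hypothetical helper, and treating an empty file like a missing one is an assumption of this sketch.

```python
import csv
import os
import tempfile


def append_rows(path, names, rows):
    """Append rows to a CSV file, writing the header only for a new file.

    Assumption of this sketch: an existing but empty file also gets a header.
    """
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as fh:
        writer = csv.writer(fh)
        if write_header:
            writer.writerow(names)
        writer.writerows(rows)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.log")
    append_rows(path, ["id", "msg"], [[1, "start"]])  # creates file, writes header
    append_rows(path, ["id", "msg"], [[2, "stop"]])   # appends, no second header
    with open(path, newline="") as fh:
        lines = fh.read().splitlines()
```

After the two calls the file contains a single header line followed by both data rows, which is the outcome one usually wants when appending to a log-style CSV.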
Range objects can now be used directly in DT[i,j] expressions in any place where a column could be expected:
DT["id"] = range(1000)
Implemented ability to select a specific row within each group, using the syntax:
DT[2, :, by(f.GRP)]
If the index is invalid for some of the groups, those groups will be discarded.
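As a rough illustration of the "n-th row within each group" rule, the sketch below groups plain dictionaries and keeps the row at position n for every group that is long enough. It is not datatable code: `nth_row_per_group` is a hypothetical helper, only non-negative indices are handled, and groups are emitted in order of first appearance, which may differ from the order datatable uses.

```python
from collections import defaultdict


def nth_row_per_group(rows, key, n):
    """Return the n-th row (0-based) of each group; too-short groups are dropped."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    # Keep only the groups for which position n actually exists.
    return [members[n] for members in groups.values() if 0 <= n < len(members)]


rows = [
    {"GRP": "x", "V": 10},
    {"GRP": "x", "V": 11},
    {"GRP": "x", "V": 12},
    {"GRP": "y", "V": 20},
    {"GRP": "z", "V": 30},
    {"GRP": "z", "V": 31},
    {"GRP": "z", "V": 32},
]
picked = nth_row_per_group(rows, "GRP", 2)
```

Group "y" has a single row, so index 2 is invalid for it and the group does not appear in the result.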
Assigning a python type or an stype to a column or set of columns will now perform a type-cast on those columns:
DT["A"] = int # Column A in frame DT will become integer
DT[:, int] = dt.float64 # All integer columns will be converted to float64
Method Frame.materialize() gains a new option to_memory=False. If set to True, it will force the Frame’s data to be loaded from disk into main memory (if the frame was opened from disk):
DT = dt.fread("data.jay")
DT.materialize(to_memory=True)
The name deduplication algorithm now starts looking for candidate names starting from name + dt.options.frame.name_auto_index. For example, if you’re creating a Frame with column names ["A", "A", "A"], then those names will be modified to ensure uniqueness. Before, they were changed into ["A", "A.1", "A.2"]; now they are changed into ["A", "A.0", "A.1"] (assuming the value of option frame.name_auto_index is 0).
A Frame created from a python list of small integers will now have stype int32, instead of int8 or int16 as before. One can still create a column of type int8 by requesting this stype explicitly:
DT1 = dt.Frame([1, 2, 3])
DT2 = dt.Frame([1, 2, 3], stype=dt.int8)
assert DT1.stype == dt.int32
assert DT2.stype == dt.int8
Thanks to @Viktor-Demin for the contribution. #2127
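The name deduplication rule described earlier in this section can be sketched in plain Python. This is a simplified model, not datatable's actual algorithm: `deduplicate_names` and its `start_index` parameter are hypothetical, the "." separator is taken from the example in the text, and special cases (such as names that already end in a number) are ignored.

```python
def deduplicate_names(names, start_index=0):
    """Make names unique by appending '.<k>', with k counted from start_index."""
    used = set()
    result = []
    for name in names:
        candidate = name
        k = start_index
        # Try name.<start_index>, name.<start_index + 1>, ... until one is free.
        while candidate in used:
            candidate = f"{name}.{k}"
            k += 1
        used.add(candidate)
        result.append(candidate)
    return result


new_style = deduplicate_names(["A", "A", "A"], start_index=0)
old_style = deduplicate_names(["A", "A", "A"], start_index=1)
```

With `start_index=0` the sketch reproduces the new naming from the example, and with `start_index=1` it reproduces the previous one.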
Keyed columns are now styled distinctly from regular columns when rendering the Frame into a Jupyter notebook. #1636
In a Jupyter notebook, Frame’s stylesheets are now injected during the datatable import. This makes it less likely that the stylesheets will get accidentally removed from the page. However, if that still occurs, there is now also a method to load those styles directly: init_styles(). #1871
Fixed error when displaying help(dt). #1931
fread(cmd=) now throws an error if one occurred while running the provided command cmd in the shell. Previously the error was silently discarded. #1935
Creating a Frame from a degenerate range now produces an empty Frame instead of a 1-row Frame. #1942
Fixed crash when computing mode stat for a view frame. #1953
Fixed a bug where creating a new column via assignment would crash if the RHS of the assignment contained an expression that tried to use the column that was being created. #1983
Fixed a crash when joining a frame that had 0 rows. #1988
Increasing the number of rows in a keyed Frame was documented as invalid, but didn’t actually throw any errors. Now it does. #2021
Operations on a 0-row frame containing string columns will no longer cause an infinite loop. #2043
Conversion of a Frame into a masked numpy array was sometimes done incorrectly when some columns in the frame contained NAs, while others did not. #2050
Groupby operation on an empty (0-rows) frame now works correctly, returning a 0-row result frame. #2078 For example:
DT = dt.Frame(Id=[], Value=[]) # create a 0x2 frame
DT[:, sum(f.Value), by(f.Id)] # produces a 0x2 frame
DT[:, sum(f.Value)] # produces a 1x1 frame
Deleting columns from a keyed Frame no longer results in a crash when the deleted columns are part of the key. #2083
The count() reducer now always produces a column with stype int64. Before, it sometimes produced an int32 column, and sometimes an int64 column.
Setting a key on a copied frame no longer affects the original frame. #2095
When a Frame has a string column containing special characters (such as newlines, tabs, or others from C0/C1 blocks), they will now be properly escaped when the frame is printed in a console. In addition, we now attempt to detect and properly handle 0-width and double-width characters in strings, so that when a column containing such unicode characters is displayed, it should not cause mis-alignment issues.
Option dt.options.display.allow_unicode is now respected when printing a Frame containing string columns with unicode data. These values will now be properly escaped if the option value is False.
Function isna() now returns the correct result for a column obtained from joining another frame in the case where the join was only partially successful. #2109
Fixed creation of a Frame from a numpy array which was obtained from another numpy array as a slice with a negative stride. #2163
General¶
We no longer export symbols open(), abs(), min(), max() and sum() from the datatable module when doing from datatable import *. They are still available when looked up explicitly, i.e. dt.open() will still work.
Function open() is marked as deprecated, and is scheduled to be removed in version 0.12. Instead, we recommend using the fread() function to open Jay files.
Support for the NFF format was removed. This was datatable’s old format for storing data frames on disk, and it was deprecated in favor of Jay over a year ago. If you still have any data stored in NFF format, we recommend re-saving it in Jay using datatable 0.9.
The datatable module now exports the symbol dt, which is the handle to the module itself. For example, you can now write:
from datatable import dt, f, by, join
The symbol dt is also exported by default, i.e. it will be available if you do from datatable import *.
Added functions cov() and corr() to compute the covariance and Pearson correlation coefficient between columns of a Frame. These functions can be used in a group-by too:
# Compute correlation of columns A and B, group-wise by ID
DT[:, corr(f.A, f.B), by(f.ID)]
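For readers who want to see what a group-wise Pearson correlation amounts to, here is a small pure-Python sketch. It is not datatable code and makes no claim about how corr() is implemented: `pearson` and `corr_by_group` are hypothetical helpers, missing values are not handled, and a group with zero variance would raise a ZeroDivisionError.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def corr_by_group(ids, a, b):
    """Compute the correlation of a and b separately for every value of ids."""
    groups = defaultdict(lambda: ([], []))
    for key, x, y in zip(ids, a, b):
        groups[key][0].append(x)
        groups[key][1].append(y)
    return {key: pearson(xs, ys) for key, (xs, ys) in groups.items()}


ids = [1, 1, 1, 2, 2, 2]
a = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0, 3.0, 2.0, 1.0]
result = corr_by_group(ids, a, b)
```

In this sample data, group 1 has a perfectly increasing relationship and group 2 a perfectly decreasing one, so the two coefficients sit at opposite ends of the range.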
Added function shift(), which can be used to generate lags/leads of a column. For example:
DT[:, {"lag2": shift(f.A, n=2),
       "lag1": shift(f.A), # same as shift(f.A, n=1)
       "lag0": f.A, # same as shift(f.A, n=0)
       "lead1": shift(f.A, -1),
       "lead2": shift(f.A, -2),
      }]
This function is group-aware: when used in an expression containing a groupby, it will apply the shift separately within each group.
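The lag/lead behaviour, including the group-aware case, can be modelled with plain lists. The sketch below is an illustration only, not datatable code: `shift_list` and `shift_by_group` are hypothetical helpers, `None` stands in for NA, and the grouped variant keeps rows in their original order, which may differ from the row order datatable returns.

```python
from collections import defaultdict


def shift_list(values, n=1):
    """Lag by n rows for n > 0, lead by -n rows for n < 0; gaps become None."""
    size = len(values)
    if n >= 0:
        k = min(n, size)
        return [None] * k + values[:size - k]
    k = min(-n, size)
    return values[k:] + [None] * k


def shift_by_group(values, groups, n=1):
    """Apply shift_list separately within each group, keeping row positions."""
    positions = defaultdict(list)
    for pos, key in enumerate(groups):
        positions[key].append(pos)
    out = [None] * len(values)
    for idx in positions.values():
        shifted = shift_list([values[p] for p in idx], n)
        for p, v in zip(idx, shifted):
            out[p] = v
    return out


lag1 = shift_list([1, 2, 3, 4], 1)
lead1 = shift_list([1, 2, 3, 4], -1)
grouped = shift_by_group([1, 2, 3, 4, 5], ["a", "a", "b", "b", "b"], 1)
```

In the grouped call, the first row of every group receives a missing value, because there is no earlier row inside that group to take the lag from.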
Fixed memory leak when writing a Frame into a CSV file. #2119
Fixed memory leak when converting a numpy array with string values into a Frame. #2123
Fixed memory leak during reduce operations. #2125
Column method .len() for computing string length now handles unicode strings correctly and returns the number of codepoints in the string instead of the number of bytes. #2160
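The difference between counting codepoints and counting bytes is easy to see in plain Python, where `len()` on a `str` counts codepoints. The sample string is an arbitrary choice for this sketch, and the byte count assumes UTF-8 encoding.

```python
# "naïve café", written with escapes so the codepoints are unambiguous.
text = "na\u00efve caf\u00e9"

codepoints = len(text)                  # str length counts codepoints
utf8_bytes = len(text.encode("utf-8"))  # each accented letter takes two bytes
```

The two accented letters account for the gap between the two numbers: each is one codepoint but two bytes in UTF-8.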
Internal¶
Function dt.internal.frame_column_rowindex(DT, i) was removed and replaced with dt.internal.frame_columns_virtual(DT). The latter returns a tuple of True/False indicators of whether each column in a Frame is virtual or not.
C API version increased to 2.
Removed C API methods and macros related to retrieval of a column’s rowindex: DtFrame_ColumnRowindex(), DtRowindex_Check(), DtRowindex_Type(), DtRowindex_Size(), DtRowindex_UnpackSlice(), DtRowindex_ArrayData(), DtRowindex_NONE, DtRowindex_ARR32, DtRowindex_ARR64, DtRowindex_SLICE.
Added C API method DtFrame_ColumnIsVirtual(), which returns a boolean indicator of whether the column in a Frame is virtual or not.
Contributors¶
This release was created with the help of 7 people who contributed code and documentation, and 14 more people who submitted bug reports and feature requests.
Code & documentation contributors:
- Pasha Stetsenko
- Oleksiy Kononenko
- Siddhesh Poyarekar
- Anmol Bal
- Michal Malohlava
- Viktor Demin
- Achraf Merzouki
Issues contributors: