Using datatable¶
This section describes common functionality and commands that you can run in datatable.
Create Frame¶
You can create a Frame from a variety of sources, including numpy arrays, pandas DataFrames, raw Python objects, etc:
import datatable as dt
import numpy as np
np.random.seed(1)
dt.Frame(np.random.randn(1000000))
| C0 | |
|---|---|
| ▪▪▪▪▪▪▪▪ | |
| 0 | 1.62435 |
| 1 | −0.611756 |
| 2 | −0.528172 |
| 3 | −1.07297 |
| 4 | 0.865408 |
| 5 | −2.30154 |
| 6 | 1.74481 |
| 7 | −0.761207 |
| 8 | 0.319039 |
| 9 | −0.24937 |
| ⋮ | ⋮ |
| 999,995 | 0.0595784 |
| 999,996 | 0.140349 |
| 999,997 | −0.596161 |
| 999,998 | 1.18604 |
| 999,999 | 0.313398 |
import pandas as pd
pf = pd.DataFrame({"A": range(1000)})
dt.Frame(pf)
| A | |
|---|---|
| ▪▪▪▪▪▪▪▪ | |
| 0 | 0 |
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
| 5 | 5 |
| 6 | 6 |
| 7 | 7 |
| 8 | 8 |
| 9 | 9 |
| ⋮ | ⋮ |
| 995 | 995 |
| 996 | 996 |
| 997 | 997 |
| 998 | 998 |
| 999 | 999 |
dt.Frame({"n": [1, 3], "s": ["foo", "bar"]})
| n | s | |
|---|---|---|
| ▪ | ▪▪▪▪ | |
| 0 | 1 | foo |
| 1 | 3 | bar |
Convert a Frame¶
Convert an existing Frame into a numpy array, a pandas DataFrame, or a pure Python object:
nparr = df1.to_numpy()
pddfr = df1.to_pandas()
pyobj = df1.to_list()
Parse Text (csv) Files¶
datatable provides fast and convenient parsing of text (csv) files:
df = dt.fread("train.csv")
The datatable parser
- Automatically detects separators, headers, column types, quoting rules, etc.
- Reads from file, URL, shell, raw text, archives, glob
- Provides multi-threaded file reading for maximum speed
- Includes a progress indicator when reading large files
- Reads both RFC4180-compliant and non-compliant files
Write the Frame¶
Write the Frame’s content into a csv file (also multi-threaded):
df.to_csv("out.csv")
Save a Frame¶
Save a Frame into a binary format on disk, then open it later instantly, regardless of the data size:
df.to_jay("out.jay")
df2 = dt.open("out.jay")
Basic Frame Properties¶
Basic Frame properties include:
print(df.shape) # (nrows, ncols)
print(df.names) # column names
print(df.stypes) # column types
Compute Per-Column Summary Stats¶
Compute per-column summary stats using:
df.sum()
df.max()
df.min()
df.mean()
df.sd()
df.mode()
df.nmodal()
df.nunique()
Select Subsets of Rows/Columns¶
Select subsets of rows and/or columns using:
df[:, "A"] # select 1 column
df[:10, :] # first 10 rows
df[::-1, "A":"D"] # reverse rows order, columns from A to D
df[27, 3] # single element in row 27, column 3 (0-based)
Delete Rows/Columns¶
Delete rows and or columns using:
del df[:, "D"] # delete column D
del df[f.A < 0, :] # delete rows where column A has negative values
Filter Rows¶
Filter rows via an expression using the following. In this example, mean, sd, f are all symbols imported from datatable.
df[(f.x > mean(f.y) + 2.5 * sd(f.y)) | (f.x < -mean(f.y) - sd(f.y)), :]
Compute Columnar Expressions¶
Compute columnar expressions using:
df[:, {"x": f.x, "y": f.y, "x+y": f.x + f.y, "x-y": f.x - f.y}]
Append Rows/Columns¶
Append rows / columns to a Frame using:
df1.cbind(df2, df3)
df1.rbind(df4, force=True)