# Getting started¶

## Install datatable¶

Let’s begin by installing the latest stable version of `datatable`

from PyPI:

```
$ pip install datatable
```

If this didn’t work for you, or if you want to install the bleeding edge version of the library, please check the Installation page.

Assuming the installation was successful, you can now import the library in a JupyterLab notebook or in a Python console:

```
import datatable as dt
print(dt.__version__)
```

## Loading data¶

The fundamental unit of analysis in datatable is a data `Frame`

. It is the
same notion as a pandas DataFrame or SQL table: data arranged in a
two-dimensional array with rows and columns.

You can create a `Frame`

object from a variety of data sources: from a python
list or dictionary, from a numpy array, or from a pandas DataFrame:

```
DT1 = dt.Frame(A=range(5), B=[1.7, 3.4, 0, None, -math.inf],
stypes={"A": dt.int64})
DT2 = dt.Frame(pandas_dataframe)
DT3 = dt.Frame(numpy_array)
```

You can also load a CSV/text/Excel file, or open a previously saved binary
`.jay`

file:

```
DT4 = dt.fread("~/Downloads/dataset_01.csv")
DT5 = dt.open("data.jay")
```

The `fread()`

function shown above is both powerful and extremely fast. It can
automatically detect parse parameters for the majority of text files, load data
from .zip archives or URLs, read Excel files, and much more.

## Data manipulation¶

Once the data is loaded into a Frame, you may want to do certain operations with it: extract/remove/modify subsets of the data, perform calculations, reshape, group, join with other datasets, etc. In datatable, the primary vehicle for all these operations is the square-bracket notation inspired by traditional matrix indexing but overcharged with power (this notation was pioneered in R data.table and is the main axis of intersection between these two libraries).

In short, almost all operations with a Frame can be expressed as

```
DT[i, j, ...]
```

where `i`

is the row selector, `j`

is the column selector, and `...`

indicates
that additional modifiers might be added. If this looks familiar to you,
that’s because it is. Exactly the same `DT[i, j]`

notation is used in
mathematics when indexing matrices, in C/C++, in R, in pandas, in numpy, etc.
The only difference that datatable introduces is that it allows
`i`

to be anything that can conceivably be
interpreted as a row selector: an integer to select just one row, a slice,
a range, a list of integers, a list of slices, an expression, a boolean-valued
Frame, an integer-valued Frame, an integer numpy array, a generator, and so on.

The `j`

column selector is even more versatile.
In the simplest case, you can select just a single column by its index or name. But
also accepted are a list of columns, a slice, a string slice (of the form `"A":"Z"`

), a
list of booleans indicating which columns to pick, an expression, a list of
expressions, and a dictionary of expressions. (The keys will be used as new names
for the columns being selected.) The `j`

expression can even be a python type (such as `int`

or `dt.float32`

),
selecting all columns matching that type.

In addition to the selector expression shown above, we support the update and delete statements too:

```
DT[i, j] = r
del DT[i, j]
```

The first expression will replace values in the subset `[i, j]`

of Frame
`DT`

with the values from `r`

, which could be either a constant, or a
suitably-sized Frame, or an expression that operates on frame `DT`

.

The second expression deletes values in the subset `[i, j]`

. This is
interpreted as follows: if `i`

selects all rows, then the columns given by
`j`

are removed from the Frame; if `j`

selects all columns, then the rows
given by `i`

are removed; if neither `i`

nor `j`

span all rows/columns
of the Frame, then the elements in the subset `[i, j]`

are replaced with
NAs.

## What the f.?¶

You may have noticed already that we mentioned several times the possibility
of using expressions in `i`

or `j`

and in other places. In the simplest form
an expression looks like:

```
f.ColA
```

which indicates a column `ColA`

in some Frame. Here `f`

is a variable that
has to be imported from the datatable module. This variable provides a convenient
way to reference any column in a Frame. In addition to the notation above, the
following is also supported:

```
f[3]
f["ColB"]
```

denoting the fourth column and the column `ColB`

respectively.

These f-expression support arithmetic operations as well as various mathematical and
aggregate functions. For example, in order to select the values from column
`A`

normalized to range `[0; 1]`

we can write the following:

```
from datatable import f, min, max
DT[:, (f.A - min(f.A))/(max(f.A) - min(f.A))]
```

This is equivalent to the following SQL query:

```
SELECT (f.A - MIN(f.A))/(MAX(f.A) - MIN(f.A)) FROM DT AS f
```

So, what exactly is `f`

? We call it a “frame proxy”, as it becomes a
simple way to refer to the Frame that we currently operate on. More precisely,
whenever `DT[i, j]`

is evaluated and we encounter an `f`

-expression there,
that `f`

becomes replaced with the frame `DT`

, and the columns are looked
up on that Frame. The same expression can later on be applied to a different
Frame, and it will refer to the columns in that other Frame.

At some point you may notice that that datatable also exports symbol `g`

. This
`g`

is also a frame proxy; however it already refers to the *second* frame in
the evaluated expression. This second frame appears when you are *joining* two
or more frames together (more on that later). When that happens, symbol `g`

is
used to refer to the columns of the joined frame.

## Groupbys/joins¶

In the Data Manipulation section we mentioned that the `DT[i, j, ...]`

selector
can take zero or more modifiers, which we denoted as “`...`

”. The available
modifiers are `by()`

, `join()`

and `sort()`

. Thus, the full form of the
square-bracket selector is:

```
DT[i, j, by(), sort(), join()]
```

### by(…)¶

This modifier splits the frame into groups by the provided column(s), and then
applies `i`

and `j`

within
each group. This mostly affects aggregator functions such as `sum()`

,
`min()`

or `sd()`

, but may also apply in other circumstances. For example,
if `i`

is a slice that takes the first 5 rows of a frame,
then in the presence of the `by()`

modifier it will take the first 5 rows of
each group.

For example, in order to find the total amount of each product sold, write:

```
from datatable import f, by, sum
DT = dt.fread("transactions.csv")
DT[:, sum(f.quantity), by(f.product_id)]
```

### sort(…)¶

This modifier controls the order of the rows in the result, much like SQL clause
`ORDER BY`

. If used in conjunction with `by()`

, it will order the rows
within each group.

### join(…)¶

As the name suggests, this operator allows you to join another frame to the
current, equivalent to the SQL `JOIN`

operator. Currently we support only
left outer joins.

In order to join frame `X`

, it must be keyed. A keyed frame is conceptually
similar to a SQL table with a unique primary key. This key may be either a
single column, or several columns:

```
X.key = "id"
```

Once a frame is keyed, it can be joined to another frame `DT`

, provided that
`DT`

has the column(s) with the same name(s) as the key in `X`

:

```
DT[:, :, join(X)]
```

This has the semantics of a natural left outer join. The `X`

frame can be
considered as a dictionary, where the key column contains the keys, and all
other columns are the corresponding values. Then during the join each row of
`DT`

will be matched against the row of `X`

with the same value of the
key column, and if there are no such value in `X`

, with an all-NA row.

The columns of the joined frame can be used in expressions using the `g.`

prefix, for example:

```
DT[:, sum(f.quantity * g.price), join(products)]
```

Note

In the future, we will expand the syntax of the join operator to allow other kinds of joins and also to remove the limitation that only keyed frames can be joined.

## Offloading data¶

Just as our work has started with loading some data into `datatable`

, eventually
you will want to do the opposite: store or move the data somewhere else. We
support multiple mechanisms for this.

First, the data can be converted into a pandas DataFrame or into a numpy array. (Obviously, you have to have pandas or numpy libraries installed.):

```
DT.to_pandas()
DT.to_numpy()
```

A frame can also be converted into python native data structures: a dictionary, keyed by the column names; a list of columns, where each column is itself a list of values; or a list of rows, where each row is a tuple of values:

```
DT.to_dict()
DT.to_list()
DT.to_tuples()
```

You can also save a frame into a CSV file, or into a binary .jay file:

```
DT.to_csv("out.csv")
DT.to_jay("data.jay")
```