A Python library for efficient multi-threaded data processing, with support for out-of-memory datasets.


Data is everywhere. From the smallest photon interactions to galaxy collisions, from mouse movements on a screen to the economic development of countries, we are surrounded by a sea of information. The human mind cannot comprehend this data in all its complexity; since ancient times, people have found it much easier to reduce the dimensionality, to impose a strict order, to arrange the data points neatly on a rectangular grid: to make a data table.

But once the data has been collected into a table, it has been tamed. It may still need some grooming and exercise, but essentially it is no longer scary. Even if it is really Big Data, with the right tools you can approach it, play with it, bend it to your will, master it.

The Python datatable module is the right tool for the task. It is a library that implements a wide (and growing) range of operators for manipulating two-dimensional data frames. It focuses on big data support, high performance, both in-memory and out-of-memory datasets, and multi-threaded algorithms. In addition, datatable strives to provide a good user experience, helpful error messages, and a powerful API similar to R data.table’s.