This function is capable of reading data from a variety of input formats (text files, plain text, files embedded in archives, excel files, …), producing a Frame as the result. You can even read in data from the command line.

See fread() for all the available parameters.

Note: If you wish to read in multiple files, use iread(); it returns an iterator of Frames.

from datatable import dt, fread fread('iris.csv') 
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
float64float64float64float64str32
05.13.51.40.2setosa
14.931.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
453.61.40.2setosa
55.43.91.70.4setosa
64.63.41.40.3setosa
753.41.50.2setosa
84.42.91.40.2setosa
94.93.11.50.1setosa
105.43.71.50.2setosa
114.83.41.60.2setosa
124.831.40.1setosa
134.331.10.1setosa
145.841.20.2setosa
1456.735.22.3virginica
1466.32.551.9virginica
1476.535.22virginica
1486.23.45.42.3virginica
1495.935.11.8virginica

data = ('col1,col2,col3\n' 'a,b,1\n' 'a,b,2\n' 'c,d,3') fread(data) 
col1col2col3
str32str32int32
0ab1
1ab2
2cd3

url = "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv" fread(url) 
yearmonthdaydep_delayarr_delaycarrierorigindestair_timedistancehour
int32int32int32int32int32str32str32str32int32int32int32
02014111413AAJFKLAX35924759
1201411-313AAJFKLAX363247511
220141129AAJFKLAX351247519
3201411-8-26AALGAPBI15710357
420141121AAJFKLAX350247513
520141140AAEWRLAX339245418
6201411-2-18AAJFKLAX338247521
7201411-3-14AAJFKLAX356247515
8201411-1-17AAJFKMIA161108915
9201411-2-14AAJFKSEA349242218
10201411-5-17AAEWRMIA161108516
112014117-5AAJFKSFO365258617
1220141131AAJFKBOS3918712
13201411142133AAJFKLAX345247519
14201411-5-26AAJFKBOS3518717
253311201410311-30UALGAIAH201141614
25331220141031-5-14UAEWRIAH18914008
25331320141031-816MQLGARDU8343111
25331520141031-51MQLGASDF1106598

Read from an archive (if there are multiple files, only the first will be read; you can specify the path to the specific file you are interested in):

fread("data.zip/mtcars.csv") 

Note: Use iread() if you wish to read in multiple files in an archive; an iterator of Frames is returned.

Read from .xls or .xlsx files

fread("excel.xlsx") 

For excel files, you can specify the sheet to be read:

fread("excel.xlsx/Sheet1") 
Note:
• xlrd must be installed to read in excel files.

• Use iread() if you wish to read in multiple sheets; an iterator of Frames is returned.

Read in data from the command line. Simply pass the command line statement to the cmd parameter:

# https://blog.jpalardy.com/posts/awk-tutorial-part-2/ # You specify the cmd parameter # Here we filter data for the year 2015 fread(cmd = """cat netflix.tsv | awk 'NR==1; /^2015-/'""") 

The command line can be very handy with large data; you can do some of the preprocessing before reading in the data to datatable.

## Detect Thousand Separator¶

Fread handles thousand separator, with the assumption that the separator is a ,:

fread("""Name|Salary|Position James|256,000|evangelist Ragnar|1,000,000|conqueror Loki|250360|trickster""") 
NameSalaryPosition
str32int32str32
0James256000evangelist
1Ragnar1000000conqueror
2Loki250360trickster

## Specify the Delimiter¶

You can specify the delimiter via the sep parameter. Note that the separator must be a single character string; non-ASCII characters are not allowed as the separator, as well as any characters in ["'0-9a-zA-Z]:

data = """ 1:2:3:4 5:6:7:8 9:10:11:12 """ 
C0C1C2C3
int32int32int32int32
01234
15678
29101112

## Dealing with Null Values and Blank Rows¶

You can pass a list of values to be treated as null, via the na_strings parameter:

data = """ ID|Charges|Payment_Method 634-VHG|28|Cheque 365-DQC|33.5|Credit card 264-PPR|631|-- 845-AJO|42.3| 789-KPO|56.9|Bank Transfer """ fread(data, na_strings=['--', '']) 
IDChargesPayment_Method
str32float64str32
0634-VHG28Cheque
1365-DQC33.5Credit card
2264-PPR631NA
3845-AJO42.3NA
4789-KPO56.9Bank Transfer

For rows with less values than in other rows, you can set fill=True; fread will fill with NA:

data = ('a,b,c,d\n' '1,2,3,4\n' '5,6,7,8\n' '9,10,11') fread(data, fill=True) 
abcd
int32int32int32int32
01234
15678
291011NA

You can skip empty lines:

data = ('a,b,c,d\n' '\n' '1,2,3,4\n' '5,6,7,8\n' '\n' '9,10,11,12') fread(data, skip_blank_lines=True) 
abcd
int32int32int32int32
01234
15678
29101112

## Dealing with Column Names¶

If the data has no headers, fread will assign default column names:

data = ('1,2\n' '3,4\n') fread(data) 
C0C1
int32int32
012
134

You can pass in column names via the columns parameter:

fread(data, columns=['A','B']) 
AB
int32int32
012
134

You can change column names:

data = ('a,b,c,d\n' '1,2,3,4\n' '5,6,7,8\n' '9,10,11,12') fread(data, columns=["A","B","C","D"]) 
ABCD
int32int32int32int32
01234
15678
29101112

You can change some of the column names via a dictionary:

fread(data, columns={"a":"A", "b":"B"}) 
ABcd
int32int32int32int32
01234
15678
29101112

Fread uses heuristics to determine whether the first row is data or not; occasionally it may guess incorrectly, in which case, you can set the header parameter to False:

fread(data, header=False) 
C0C1C2C3
str32str32str32str32
0abcd
11234
25678
39101112

You can pass a new list of column names as well:

fread(data, header=False, columns=["A","B","C","D"]) 
ABCD
str32str32str32str32
0abcd
11234
25678
39101112

## Row Selection¶

Fread has a skip_to_line parameter, where you can specify what line to read the data from:

data = ('skip this line\n' 'a,b,c,d\n' '1,2,3,4\n' '5,6,7,8\n' '9,10,11,12') fread(data, skip_to_line=2) 
abcd
int32int32int32int32
01234
15678
29101112

You can also skip to a line containing a particular string with the skip_to_string parameter, and start reading data from that line. Note that skip_to_string and skip_to_line cannot be combined; you can only use one:

data = ('skip this line\n' 'a,b,c,d\n' 'first, second, third, last\n' '1,2,3,4\n' '5,6,7,8\n' '9,10,11,12') fread(data, skip_to_string='first') 
firstsecondthirdlast
int32int32int32int32
01234
15678
29101112

You can set the maximum number of rows to read with the max_nrows parameter:

data = ('a,b,c,d\n' '1,2,3,4\n' '5,6,7,8\n' '9,10,11,12') fread(data, max_nrows=2) 
abcd
int32int32int32int32
01234
15678
data = ('skip this line\n' 'a,b,c,d\n' '1,2,3,4\n' '5,6,7,8\n' '9,10,11,12') fread(data, skip_to_line=2, max_nrows=2) 
abcd
int32int32int32int32
01234
15678

## Setting Column Type¶

You can determine the data types via the columns parameter:

data = ('a,b,c,d\n' '1,2,3,4\n' '5,6,7,8\n' '9,10,11,12') # this is useful when you are interested in only a subset of the columns fread(data, columns={"a":dt.float32, "b":dt.str32}) 
abcd
float64str32int32int32
01234
15678
29101112

You can also pass in the data types by position:

fread(data, columns = [dt.int32, dt.str32, None, dt.float32]) 
abd
int32str32float64
0124
1568
291012

You can also change all the column data types with a single assignment:

fread(data, columns = dt.float32) 
abcd
float64float64float64float64
01234
15678
29101112

You can change the data type for a slice of the columns (here slice(3) is equivalent to [:3]):

# this changes the data type to float for the first three columns fread(data, columns={float:slice(3)}) 
abcd
float64float64float64int32
01234
15678
29101112

## Selecting Columns¶

There are various ways to select columns in fread :

• Select with a dictionary:

data = ('a,b,c,d\n' '1,2,3,4\n' '5,6,7,8\n' '9,10,11,12') # pass Ellipsis : None or ... : None, # to discard any columns that are not needed fread(data, columns={"a":"a", ... : None}) 
a
int32
01
15
29

Selecting via a dictionary makes more sense when selecting and renaming columns at the same time.

• Select columns with a set:

fread(data, columns={"a","b"}) 
ab
int32int32
012
156
2910
• Select range of columns with slice:

# select the second and third column fread(data, columns=slice(1,3)) 
bc
int32int32
023
167
21011
# select the first column # jump two hoops and # select the third column fread(data, columns = slice(None,3,2)) 
ac
int32int32
013
157
2911
• Select range of columns with range:

fread(data, columns = range(1,3)) 
bc
int32int32
023
167
21011
• Boolean Selection:

fread(data, columns=[False, False, True, True]) 
cd
int32int32
034
178
21112
• Select with a list comprehension:

fread(data, columns=lambda cols:[col.name in ("a","c") for col in cols]) 
ac
int32int32
013
157
2911
• Exclude columns with None:

fread(data, columns = ['a',None,None,'d']) 
int32int32
014
158
2912
• Exclude columns with list comprehension:

fread(data, columns=lambda cols:[col.name not in ("a","c") for col in cols]) 
bd
int32int32
024
168
21012
• Drop columns by assigning None to the columns via a dictionary:

data = ("A,B,C,D\n" "1,3,5,7\n" "2,4,6,8\n") fread(data, columns={"B":None,"D":None}) 
AC
int32int32
015
126
• Drop a column and change data type:

fread(data, columns={"B":None, "C":str}) 
ACD
int32str32int32
0157
1268
• Change column name and type, and drop a column:

# pass a tuple, where the first item in the tuple is the new column name, # and the other item is the new data type. fread(data, columns={"A":("first", float), "B":None,"D":None}) 
firstC
float64int32
015
126

You can also select which columns to read dynamically, based on the names/types of the columns in the file:

def colfilter(columns): return [col.name=='species' or "length" in col.name for col in columns] fread('iris.csv', columns=colfilter, max_nrows=5) 
sepal_lengthpetal_lengthspecies
float64float64str32
05.11.4setosa
14.91.4setosa
24.71.3setosa
34.61.5setosa
451.4setosa

The same approach can be used to auto-rename columns as they are read from the file:

def rename(columns): return [col.name.upper() for col in columns] fread('iris.csv', columns=rename, max_nrows=5) `
SEPAL_LENGTHSEPAL_WIDTHPETAL_LENGTHPETAL_WIDTHSPECIES
float64float64float64float64str32
05.13.51.40.2setosa
14.931.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
453.61.40.2setosa