Fread Examples

This function is capable of reading data from a variety of input formats (text files, plain text, files embedded in archives, excel files, …), producing a Frame as the result. You can even read in data from the command line.

See fread() for all the available parameters.

Note: If you wish to read in multiple files, use iread(); it returns an iterator of Frames.

Read data

Read from a text file:

from datatable import dt, fread fread('iris.csv')
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
float64float64float64float64str32
05.13.51.40.2setosa
14.931.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
453.61.40.2setosa
55.43.91.70.4setosa
64.63.41.40.3setosa
753.41.50.2setosa
84.42.91.40.2setosa
94.93.11.50.1setosa
105.43.71.50.2setosa
114.83.41.60.2setosa
124.831.40.1setosa
134.331.10.1setosa
145.841.20.2setosa
1456.735.22.3virginica
1466.32.551.9virginica
1476.535.22virginica
1486.23.45.42.3virginica
1495.935.11.8virginica

Read text data directly:

data = ('col1,col2,col3\n' 'a,b,1\n' 'a,b,2\n' 'c,d,3') fread(data)
col1col2col3
str32str32int32
0ab1
1ab2
2cd3

Read from a url:

url = "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv" fread(url)
yearmonthdaydep_delayarr_delaycarrierorigindestair_timedistancehour
int32int32int32int32int32str32str32str32int32int32int32
02014111413AAJFKLAX35924759
1201411-313AAJFKLAX363247511
220141129AAJFKLAX351247519
3201411-8-26AALGAPBI15710357
420141121AAJFKLAX350247513
520141140AAEWRLAX339245418
6201411-2-18AAJFKLAX338247521
7201411-3-14AAJFKLAX356247515
8201411-1-17AAJFKMIA161108915
9201411-2-14AAJFKSEA349242218
10201411-5-17AAEWRMIA161108516
112014117-5AAJFKSFO365258617
1220141131AAJFKBOS3918712
13201411142133AAJFKLAX345247519
14201411-5-26AAJFKBOS3518717
253311201410311-30UALGAIAH201141614
25331220141031-5-14UAEWRIAH18914008
25331320141031-816MQLGARDU8343111
25331420141031-415MQLGADTW7550211
25331520141031-51MQLGASDF1106598

Read from an archive (if there are multiple files, only the first will be read; you can specify the path to the specific file you are interested in):

fread("data.zip/mtcars.csv")

Note: Use iread() if you wish to read in multiple files in an archive; an iterator of Frames is returned.

Read from .xls or .xlsx files

fread("excel.xlsx")

For excel files, you can specify the sheet to be read:

fread("excel.xlsx/Sheet1")
Note:
  • xlrd must be installed to read in excel files.

  • Use iread() if you wish to read in multiple sheets; an iterator of Frames is returned.

Read in data from the command line. Simply pass the command line statement to the cmd parameter:

# https://blog.jpalardy.com/posts/awk-tutorial-part-2/ # You specify the `cmd` parameter # Here we filter data for the year 2015 fread(cmd = """cat netflix.tsv | awk 'NR==1; /^2015-/'""")

The command line can be very handy with large data; you can do some of the preprocessing before reading in the data to datatable.

Detect Thousand Separator

Fread handles thousand separator, with the assumption that the separator is a ,:

fread("""Name|Salary|Position James|256,000|evangelist Ragnar|1,000,000|conqueror Loki|250360|trickster""")
NameSalaryPosition
str32int32str32
0James256000evangelist
1Ragnar1000000conqueror
2Loki250360trickster

Specify the Delimiter

You can specify the delimiter via the sep parameter. Note that the separator must be a single character string; non-ASCII characters are not allowed as the separator, as well as any characters in ["'`0-9a-zA-Z]:

data = """ 1:2:3:4 5:6:7:8 9:10:11:12 """
>>> >>> fread(data, sep=":")
C0C1C2C3
int32int32int32int32
01234
15678
29101112

Dealing with Null Values and Blank Rows

You can pass a list of values to be treated as null, via the na_strings parameter:

data = """ ID|Charges|Payment_Method 634-VHG|28|Cheque 365-DQC|33.5|Credit card 264-PPR|631|-- 845-AJO|42.3| 789-KPO|56.9|Bank Transfer """ fread(data, na_strings=['--', ''])
IDChargesPayment_Method
str32float64str32
0634-VHG28Cheque
1365-DQC33.5Credit card
2264-PPR631NA
3845-AJO42.3NA
4789-KPO56.9Bank Transfer

For rows with less values than in other rows, you can set fill=True; fread will fill with NA:

data = ('a,b,c,d\n' '1,2,3,4\n' '5,6,7,8\n' '9,10,11') fread(data, fill=True)
abcd
int32int32int32int32
01234
15678
291011NA

You can skip empty lines:

data = ('a,b,c,d\n' '\n' '1,2,3,4\n' '5,6,7,8\n' '\n' '9,10,11,12') fread(data, skip_blank_lines=True)
abcd
int32int32int32int32
01234
15678
29101112

Dealing with Column Names

If the data has no headers, fread will assign default column names:

data = ('1,2\n' '3,4\n') fread(data)
C0C1
int32int32
012
134

You can pass in column names via the columns parameter:

fread(data, columns=['A','B'])
AB
int32int32
012
134

You can change column names:

data = ('a,b,c,d\n' '1,2,3,4\n' '5,6,7,8\n' '9,10,11,12') fread(data, columns=["A","B","C","D"])
ABCD
int32int32int32int32
01234
15678
29101112

You can change some of the column names via a dictionary:

fread(data, columns={"a":"A", "b":"B"})
ABcd
int32int32int32int32
01234
15678
29101112

Fread uses heuristics to determine whether the first row is data or not; occasionally it may guess incorrectly, in which case, you can set the header parameter to False:

fread(data, header=False)
C0C1C2C3
str32str32str32str32
0abcd
11234
25678
39101112

You can pass a new list of column names as well:

fread(data, header=False, columns=["A","B","C","D"])
ABCD
str32str32str32str32
0abcd
11234
25678
39101112

Row Selection

Fread has a skip_to_line parameter, where you can specify what line to read the data from:

data = ('skip this line\n' 'a,b,c,d\n' '1,2,3,4\n' '5,6,7,8\n' '9,10,11,12') fread(data, skip_to_line=2)
abcd
int32int32int32int32
01234
15678
29101112

You can also skip to a line containing a particular string with the skip_to_string parameter, and start reading data from that line. Note that skip_to_string and skip_to_line cannot be combined; you can only use one:

data = ('skip this line\n' 'a,b,c,d\n' 'first, second, third, last\n' '1,2,3,4\n' '5,6,7,8\n' '9,10,11,12') fread(data, skip_to_string='first')
firstsecondthirdlast
int32int32int32int32
01234
15678
29101112

You can set the maximum number of rows to read with the max_nrows parameter:

data = ('a,b,c,d\n' '1,2,3,4\n' '5,6,7,8\n' '9,10,11,12') fread(data, max_nrows=2)
abcd
int32int32int32int32
01234
15678
data = ('skip this line\n' 'a,b,c,d\n' '1,2,3,4\n' '5,6,7,8\n' '9,10,11,12') fread(data, skip_to_line=2, max_nrows=2)
abcd
int32int32int32int32
01234
15678

Setting Column Type

You can determine the data types via the columns parameter:

data = ('a,b,c,d\n' '1,2,3,4\n' '5,6,7,8\n' '9,10,11,12') # this is useful when you are interested in only a subset of the columns fread(data, columns={"a":dt.float32, "b":dt.str32})
abcd
float64str32int32int32
01234
15678
29101112

You can also pass in the data types by position:

fread(data, columns = [dt.int32, dt.str32, None, dt.float32])
abd
int32str32float64
0124
1568
291012

You can also change all the column data types with a single assignment:

fread(data, columns = dt.float32)
abcd
float64float64float64float64
01234
15678
29101112

You can change the data type for a slice of the columns (here slice(3) is equivalent to [:3]):

# this changes the data type to float for the first three columns fread(data, columns={float:slice(3)})
abcd
float64float64float64int32
01234
15678
29101112

Selecting Columns

There are various ways to select columns in fread :

  • Select with a dictionary:

    data = ('a,b,c,d\n' '1,2,3,4\n' '5,6,7,8\n' '9,10,11,12') # pass ``Ellipsis : None`` or ``... : None``, # to discard any columns that are not needed fread(data, columns={"a":"a", ... : None})
    a
    int32
    01
    15
    29

Selecting via a dictionary makes more sense when selecting and renaming columns at the same time.

  • Select columns with a set:

    fread(data, columns={"a","b"})
    ab
    int32int32
    012
    156
    2910
  • Select range of columns with slice:

    # select the second and third column fread(data, columns=slice(1,3))
    bc
    int32int32
    023
    167
    21011
    # select the first column # jump two hoops and # select the third column fread(data, columns = slice(None,3,2))
    ac
    int32int32
    013
    157
    2911
  • Select range of columns with range:

    fread(data, columns = range(1,3))
    bc
    int32int32
    023
    167
    21011
  • Boolean Selection:

    fread(data, columns=[False, False, True, True])
    cd
    int32int32
    034
    178
    21112
  • Select with a list comprehension:

    fread(data, columns=lambda cols:[col.name in ("a","c") for col in cols])
    ac
    int32int32
    013
    157
    2911
  • Exclude columns with None:

    fread(data, columns = ['a',None,None,'d'])
    ad
    int32int32
    014
    158
    2912
  • Exclude columns with list comprehension:

    fread(data, columns=lambda cols:[col.name not in ("a","c") for col in cols])
    bd
    int32int32
    024
    168
    21012
  • Drop columns by assigning None to the columns via a dictionary:

    data = ("A,B,C,D\n" "1,3,5,7\n" "2,4,6,8\n") fread(data, columns={"B":None,"D":None})
    AC
    int32int32
    015
    126
  • Drop a column and change data type:

    fread(data, columns={"B":None, "C":str})
    ACD
    int32str32int32
    0157
    1268
  • Change column name and type, and drop a column:

    # pass a tuple, where the first item in the tuple is the new column name, # and the other item is the new data type. fread(data, columns={"A":("first", float), "B":None,"D":None})
    firstC
    float64int32
    015
    126

You can also select which columns to read dynamically, based on the names/types of the columns in the file:

def colfilter(columns): return [col.name=='species' or "length" in col.name for col in columns] fread('iris.csv', columns=colfilter, max_nrows=5)
sepal_lengthpetal_lengthspecies
float64float64str32
05.11.4setosa
14.91.4setosa
24.71.3setosa
34.61.5setosa
451.4setosa

The same approach can be used to auto-rename columns as they are read from the file:

def rename(columns): return [col.name.upper() for col in columns] fread('iris.csv', columns=rename, max_nrows=5)
SEPAL_LENGTHSEPAL_WIDTHPETAL_LENGTHPETAL_WIDTHSPECIES
float64float64float64float64str32
05.13.51.40.2setosa
14.931.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
453.61.40.2setosa