Fread Examples¶

This function is capable of reading data from a variety of input formats (text files, plain text, files embedded in archives, excel files, …), producing a Frame as the result. You can even read in data from the command line.

See fread() for all the available parameters.

Note: If you wish to read in multiple files, use iread(); it returns an iterator of Frames.

Read Data¶

Read from text file:

from datatable import dt, fread

result = fread('iris.csv')
result.head(5)

    sepal_length    sepal_width     petal_length    petal_width     species
0   5.1                 3.5             1.4             0.2         setosa
1   4.9                 3               1.4             0.2         setosa
2   4.7                 3.2             1.3             0.2         setosa
3   4.6                 3.1             1.5             0.2         setosa
4   5                   3.6             1.4             0.2         setosa

Read text data directly:

data = ('col1,col2,col3\n'
        'a,b,1\n'
        'a,b,2\n'
        'c,d,3')

fread(data)

col1  col2  col3
 a     b     1
 a     b     2
 a     b     3

Read from a url:

url = "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"
fread(url)

Read from an archive:

If there are multiple files, only the first will be read; you can specify the path to the specific file you are interested in:

fread("data.zip/mtcars.csv")

Note: Use iread() if you wish to read in multiple files in an archive; an iterator of Frames is returned.

Read from .xls or .xlsx files
```
fread("excel.xlsx")
```

For excel files, you can specify the sheet to be read:

fread("excel.xlsx/Sheet1")

Note:

xlrd must be installed to read in excel files.
Use iread() if you wish to read in multiple sheets; an iterator of Frames is returned.

Read in data from the command line. Simply pass the command line statement to the cmd parameter:

#https://blog.jpalardy.com/posts/awk-tutorial-part-2/
#You specify the `cmd` parameter
#Here we filter data for the year 2015
fread(cmd = """cat netflix.tsv | awk 'NR==1; /^2015-/'""")

The command line can be very handy with large data; you can do some of the preprocessing before reading in the data to datatable.

Detect Thousand Separator¶

Fread handles thousand separator, with the assumption that the separator is a ,:

data = """Name|Salary|Position
         James|256,000|evangelist
        Ragnar|1,000,000|conqueror
          Loki|250360|trickster"""

fread(data)

    Name    Salary  Position
0   James   256000  evangelist
1   Ragnar  1000000 conqueror
2   Loki    250360  trickster

Specify the Delimiter¶

You can specify the delimiter via the sep parameter. Note that the separator must be a single character string; non-ASCII characters are not allowed as the separator, as well as any characters in ["'`0-9a-zA-Z]:

data = """
       1:2:3:4
       5:6:7:8
       9:10:11:12
       """

fread(data, sep=":")

    C0      C1      C2      C3
0   1       2       3       4
1   5       6       7       8
2   9       10      11      12

Dealing with Null Values and Blank Rows¶

You can pass a list of values to be treated as null, via the na_strings parameter:

data = """
       ID|Charges|Payment_Method
       634-VHG|28|Cheque
       365-DQC|33.5|Credit card
       264-PPR|631|--
       845-AJO|42.3|
       789-KPO|56.9|Bank Transfer
       """

fread(data, na_strings=['--', ''])

    ID          Charges  Payment_Method
0   634-VHG     28       Cheque
1   365-DQC     33.5     Credit card
2   264-PPR     631      NA
3   845-AJO     42.3     NA
4   789-KPO     56.9     Bank Transfer

For rows with less values than in other rows, you can set fill=True; fread will fill with NA:

data = ('a,b,c,d\n'
        '1,2,3,4\n'
        '5,6,7,8\n'
        '9,10,11')

fread(data, fill=True)

    a       b       c       d
0   1       2       3       4
1   5       6       7       8
2   9       10      11      NA

You can skip empty lines:

data = ('a,b,c,d\n'
        '\n'
        '1,2,3,4\n'
        '5,6,7,8\n'
        '\n'
        '9,10,11,12')

fread(data, skip_blank_lines=True)

    a       b       c       d
0   1       2       3       4
1   5       6       7       8
2   9       10      11      12

Dealing with Column Names¶

If the data has no headers, fread will assign default column names:

data = ('1,2\n'
        '3,4\n')

fread(data)

    C0      C1
0   1       2
1   3       4

You can pass in column names via the columns parameter:

fread(data, columns=['A','B'])

    A       B
0   1       2
1   3       4

You can change column names:

data = ('a,b,c,d\n'
        '1,2,3,4\n'
        '5,6,7,8\n'
        '9,10,11,12')

fread(data, columns=["A","B","C","D"])

    A       B       C       D
0   1       2       3       4
1   5       6       7       8
2   9       10      11      12

You can change some of the column names via a dictionary:

fread(data, columns={"a":"A", "b":"B"})

    A       B       c       d
0   1       2       3       4
1   5       6       7       8
2   9       10      11      12

Fread uses heuristics to determine whether the first row is data or not; occasionally it may guess incorrectly, in which case, you can set the header parameter to False:

fread(data,  header=False)


    C0      C1      C2      C3
0   a       b       c       d
1   1       2       3       4
2   5       6       7       8
3   9       10      11      12

You can pass a new list of column names as well:

fread(data,  header=False, columns=["A","B","C","D"])

    A       B       C       D
0   a       b       c       d
1   1       2       3       4
2   5       6       7       8
3   9       10      11      12

Row Selection¶

Fread has a skip_to_line parameter, where you can specify what line to read the data from:

data = ('skip this line\n'
        'a,b,c,d\n'
        '1,2,3,4\n'
        '5,6,7,8\n'
        '9,10,11,12')

fread(data, skip_to_line=2)

    a       b       c       d
0   1       2       3       4
1   5       6       7       8
2   9       10      11      12

You can also skip to a line containing a particular string with the skip_to_string parameter, and start reading data from that line. Note that skip_to_string and skip_to_line cannot be combined; you can only use one:

data = ('skip this line\n'
        'a,b,c,d\n'
        'first, second, third, last\n'
        '1,2,3,4\n'
        '5,6,7,8\n'
        '9,10,11,12')

fread(data, skip_to_string='first')


    first   second  third   last
0   1       2       3       4
1   5       6       7       8
2   9       10      11      12

You can set the maximum number of rows to read with the max_nrows parameter:

data = ('a,b,c,d\n'
        '1,2,3,4\n'
        '5,6,7,8\n'
        '9,10,11,12')

fread(data, max_nrows=2)


    a       b       c       d
0   1       2       3       4
1   5       6       7       8

data = ('skip this line\n'
        'a,b,c,d\n'
        '1,2,3,4\n'
        '5,6,7,8\n'
        '9,10,11,12')

fread(data, skip_to_line=2, max_nrows=2)

    a       b       c       d
0   1       2       3       4
1   5       6       7       8

Setting Column Type¶

You can determine the data types via the columns parameter:

data = ('a,b,c,d\n'
        '1,2,3,4\n'
        '5,6,7,8\n'
        '9,10,11,12')

#this is useful when you are interested in only a subset of the columns
fread(data, columns={"a":dt.float32, "b":dt.str32})

You can also pass in the data types by position:

fread(data, columns = (stype.int32, stype.str32, stype.float32))

You can also change all the column data types with a single assignment:

fread(data, columns = dt.float32)

You can change the data type for a slice of the columns:

#this changes the data type to float for the first three columns
fread(data, columns={float:slice(3)})

Note that there are a small number of stypes within datatable (int8, int16, int32, int64, float32, float64, str32, str64, obj64, bool8)

Selecting Columns¶

There are various ways to select columns in fread :

Select with a dictionary:

    data = ('a,b,c,d\n'
            '1,2,3,4\n'
            '5,6,7,8\n'
            '9,10,11,12')

    #pass ``Ellipsis : None`` or ``... : None``,
    #to discard any columns that are not needed
    fread(data, columns={"a":"a", ... : None})

    a
0   1
1   5
2   9

Selecting via a dictionary makes more sense when selecting and renaming columns at the same time.

Select columns with a set:

fread(data, columns={"a","b"})

    a       b
0   1       2
1   5       6
2   9       10

Select range of columns with slice:

#select the second and third column
fread(data, columns=slice(1,3))

    b       c
0   2       3
1   6       7
2   10      11

#select the first column
#jump two hoops and
#select the third column
fread(data, columns = slice(None,3,2))

    a       c
0   1       3
1   5       7
2   9       11

Select range of columns with range:

fread(data, columns = range(1,3))

    b       c
0   2       3
1   6       7
2   10      11

Boolean Selection:

fread(data, columns=[False, False, True, True])

    c       d
0   3       4
1   7       8
2   11      12

Select with a list comprehension:

fread(data, columns=lambda cols:[col.name in ("a","c") for col in cols])

    a       c
0   1       3
1   5       7
2   9       11

Exclude columns with None:

fread(data, columns = ['a',None,None,'d'])

    a       d
0   1       4
1   5       8
2   9       12

Exclude columns with list comprehension:

fread(data, columns=lambda cols:[col.name not in ("a","c") for col in cols])


    b       d
0   2       4
1   6       8
2   10      12

Drop columns by assigning None to the columns via a dictionary:

data = ("A,B,C,D\n"
        "1,3,5,7\n"
        "2,4,6,8\n")

fread(data, columns={"B":None,"D":None})

    A       C
0   1       5
1   2       6

Drop a column and change data type:

fread(data, columns={"B":None, "C":str})

    A       C       D
0   1       5       7
1   2       6       8

Change column name and type, and drop a column:

 #pass a tuple, where the first item in the tuple is the new column name,
 #and the other item is the new data type.
fread(data, columns={"A":("first", float), "B":None,"D":None})

    first   C
0   1       5
1   2       6

With list comprehensions, you can dynamically select columns:

#select columns that have length, and species column
fread('iris.csv',
  #use a boolean list comprehension to get the required columns
  columns = lambda cols : [(col.name=='species')
                           or ("length" in col.name)
                           for col in cols],
  max_nrows=5)

  sepal_length      petal_length    species
0   5.1                 1.4         setosa
1   4.9                 1.4         setosa
2   4.7                 1.3         setosa
3   4.6                 1.5         setosa
4   5                   1.4         setosa


#select columns by position
fread('iris.csv',
       columns = lambda cols : [ind in (1,4) for ind, col in enumerate(cols)],
       max_nrows=5)

    sepal_length    petal_length    petal_width
0   5.1                     1.4         0.2
1   4.9                     1.4         0.2
2   4.7                     1.3         0.2
3   4.6                     1.5         0.2
4   5                       1.4         0.2