Fread Examples¶
This function is capable of reading data from a variety of input formats (text files, plain text, files embedded in archives, excel files, …), producing a Frame as the result. You can even read in data from the command line.
See fread() for all the available parameters.
Note: If you wish to read in multiple files, use iread(); it
returns an iterator of Frames.
Read data¶
Read from a text file:
from datatable import dt, fread
fread('iris.csv')
| sepal_length | sepal_width | petal_length | petal_width | species | ||
|---|---|---|---|---|---|---|
| float64 | float64 | float64 | float64 | str32 | ||
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | |
| 1 | 4.9 | 3 | 1.4 | 0.2 | setosa | |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | |
| 4 | 5 | 3.6 | 1.4 | 0.2 | setosa | |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | setosa | |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | setosa | |
| 7 | 5 | 3.4 | 1.5 | 0.2 | setosa | |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa | |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | setosa | |
| 10 | 5.4 | 3.7 | 1.5 | 0.2 | setosa | |
| 11 | 4.8 | 3.4 | 1.6 | 0.2 | setosa | |
| 12 | 4.8 | 3 | 1.4 | 0.1 | setosa | |
| 13 | 4.3 | 3 | 1.1 | 0.1 | setosa | |
| 14 | 5.8 | 4 | 1.2 | 0.2 | setosa | |
| … | … | … | … | … | … | |
| 145 | 6.7 | 3 | 5.2 | 2.3 | virginica | |
| 146 | 6.3 | 2.5 | 5 | 1.9 | virginica | |
| 147 | 6.5 | 3 | 5.2 | 2 | virginica | |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica | |
| 149 | 5.9 | 3 | 5.1 | 1.8 | virginica |
Read text data directly:
data = ('col1,col2,col3\n'
'a,b,1\n'
'a,b,2\n'
'c,d,3')
fread(data)
| col1 | col2 | col3 | ||
|---|---|---|---|---|
| str32 | str32 | int32 | ||
| 0 | a | b | 1 | |
| 1 | a | b | 2 | |
| 2 | c | d | 3 |
Read from a url:
url = "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"
fread(url)
| year | month | day | dep_delay | arr_delay | carrier | origin | dest | air_time | distance | hour | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| int32 | int32 | int32 | int32 | int32 | str32 | str32 | str32 | int32 | int32 | int32 | ||
| 0 | 2014 | 1 | 1 | 14 | 13 | AA | JFK | LAX | 359 | 2475 | 9 | |
| 1 | 2014 | 1 | 1 | -3 | 13 | AA | JFK | LAX | 363 | 2475 | 11 | |
| 2 | 2014 | 1 | 1 | 2 | 9 | AA | JFK | LAX | 351 | 2475 | 19 | |
| 3 | 2014 | 1 | 1 | -8 | -26 | AA | LGA | PBI | 157 | 1035 | 7 | |
| 4 | 2014 | 1 | 1 | 2 | 1 | AA | JFK | LAX | 350 | 2475 | 13 | |
| 5 | 2014 | 1 | 1 | 4 | 0 | AA | EWR | LAX | 339 | 2454 | 18 | |
| 6 | 2014 | 1 | 1 | -2 | -18 | AA | JFK | LAX | 338 | 2475 | 21 | |
| 7 | 2014 | 1 | 1 | -3 | -14 | AA | JFK | LAX | 356 | 2475 | 15 | |
| 8 | 2014 | 1 | 1 | -1 | -17 | AA | JFK | MIA | 161 | 1089 | 15 | |
| 9 | 2014 | 1 | 1 | -2 | -14 | AA | JFK | SEA | 349 | 2422 | 18 | |
| 10 | 2014 | 1 | 1 | -5 | -17 | AA | EWR | MIA | 161 | 1085 | 16 | |
| 11 | 2014 | 1 | 1 | 7 | -5 | AA | JFK | SFO | 365 | 2586 | 17 | |
| 12 | 2014 | 1 | 1 | 3 | 1 | AA | JFK | BOS | 39 | 187 | 12 | |
| 13 | 2014 | 1 | 1 | 142 | 133 | AA | JFK | LAX | 345 | 2475 | 19 | |
| 14 | 2014 | 1 | 1 | -5 | -26 | AA | JFK | BOS | 35 | 187 | 17 | |
| … | … | … | … | … | … | … | … | … | … | … | … | |
| 253311 | 2014 | 10 | 31 | 1 | -30 | UA | LGA | IAH | 201 | 1416 | 14 | |
| 253312 | 2014 | 10 | 31 | -5 | -14 | UA | EWR | IAH | 189 | 1400 | 8 | |
| 253313 | 2014 | 10 | 31 | -8 | 16 | MQ | LGA | RDU | 83 | 431 | 11 | |
| 253314 | 2014 | 10 | 31 | -4 | 15 | MQ | LGA | DTW | 75 | 502 | 11 | |
| 253315 | 2014 | 10 | 31 | -5 | 1 | MQ | LGA | SDF | 110 | 659 | 8 |
Read from an archive (if there are multiple files, only the first will be read; you can specify the path to the specific file you are interested in):
fread("data.zip/mtcars.csv")
Note: Use iread() if you wish to read in multiple files in an
archive; an iterator of Frames is returned.
Read from .xls or .xlsx files
fread("excel.xlsx")
For excel files, you can specify the sheet to be read:
fread("excel.xlsx/Sheet1")
- Note:
Read in data from the command line. Simply pass the command line statement to
the cmd parameter:
# https://blog.jpalardy.com/posts/awk-tutorial-part-2/
# You specify the `cmd` parameter
# Here we filter data for the year 2015
fread(cmd = """cat netflix.tsv | awk 'NR==1; /^2015-/'""")
The command line can be very handy with large data; you can do some of the
preprocessing before reading in the data to datatable.
Detect Thousand Separator¶
Fread handles thousand separator, with the assumption that the separator
is a ,:
fread("""Name|Salary|Position
James|256,000|evangelist
Ragnar|1,000,000|conqueror
Loki|250360|trickster""")
| Name | Salary | Position | ||
|---|---|---|---|---|
| str32 | int32 | str32 | ||
| 0 | James | 256000 | evangelist | |
| 1 | Ragnar | 1000000 | conqueror | |
| 2 | Loki | 250360 | trickster |
Specify the Delimiter¶
You can specify the delimiter via the sep parameter.
Note that the separator must be a single character string; non-ASCII characters are not allowed as the separator, as well as any characters in ["'`0-9a-zA-Z]:
data = """
1:2:3:4
5:6:7:8
9:10:11:12
"""
| C0 | C1 | C2 | C3 | ||
|---|---|---|---|---|---|
| int32 | int32 | int32 | int32 | ||
| 0 | 1 | 2 | 3 | 4 | |
| 1 | 5 | 6 | 7 | 8 | |
| 2 | 9 | 10 | 11 | 12 |
Dealing with Null Values and Blank Rows¶
You can pass a list of values to be treated as null, via the na_strings parameter:
data = """
ID|Charges|Payment_Method
634-VHG|28|Cheque
365-DQC|33.5|Credit card
264-PPR|631|--
845-AJO|42.3|
789-KPO|56.9|Bank Transfer
"""
fread(data, na_strings=['--', ''])
| ID | Charges | Payment_Method | ||
|---|---|---|---|---|
| str32 | float64 | str32 | ||
| 0 | 634-VHG | 28 | Cheque | |
| 1 | 365-DQC | 33.5 | Credit card | |
| 2 | 264-PPR | 631 | NA | |
| 3 | 845-AJO | 42.3 | NA | |
| 4 | 789-KPO | 56.9 | Bank Transfer |
For rows with less values than in other rows, you can set fill=True; fread will fill with NA:
data = ('a,b,c,d\n'
'1,2,3,4\n'
'5,6,7,8\n'
'9,10,11')
fread(data, fill=True)
| a | b | c | d | ||
|---|---|---|---|---|---|
| int32 | int32 | int32 | int32 | ||
| 0 | 1 | 2 | 3 | 4 | |
| 1 | 5 | 6 | 7 | 8 | |
| 2 | 9 | 10 | 11 | NA |
You can skip empty lines:
data = ('a,b,c,d\n'
'\n'
'1,2,3,4\n'
'5,6,7,8\n'
'\n'
'9,10,11,12')
fread(data, skip_blank_lines=True)
| a | b | c | d | ||
|---|---|---|---|---|---|
| int32 | int32 | int32 | int32 | ||
| 0 | 1 | 2 | 3 | 4 | |
| 1 | 5 | 6 | 7 | 8 | |
| 2 | 9 | 10 | 11 | 12 |
Dealing with Column Names¶
If the data has no headers, fread will assign default column names:
data = ('1,2\n'
'3,4\n')
fread(data)
| C0 | C1 | ||
|---|---|---|---|
| int32 | int32 | ||
| 0 | 1 | 2 | |
| 1 | 3 | 4 |
You can pass in column names via the columns parameter:
fread(data, columns=['A','B'])
| A | B | ||
|---|---|---|---|
| int32 | int32 | ||
| 0 | 1 | 2 | |
| 1 | 3 | 4 |
You can change column names:
data = ('a,b,c,d\n'
'1,2,3,4\n'
'5,6,7,8\n'
'9,10,11,12')
fread(data, columns=["A","B","C","D"])
| A | B | C | D | ||
|---|---|---|---|---|---|
| int32 | int32 | int32 | int32 | ||
| 0 | 1 | 2 | 3 | 4 | |
| 1 | 5 | 6 | 7 | 8 | |
| 2 | 9 | 10 | 11 | 12 |
You can change some of the column names via a dictionary:
fread(data, columns={"a":"A", "b":"B"})
| A | B | c | d | ||
|---|---|---|---|---|---|
| int32 | int32 | int32 | int32 | ||
| 0 | 1 | 2 | 3 | 4 | |
| 1 | 5 | 6 | 7 | 8 | |
| 2 | 9 | 10 | 11 | 12 |
Fread uses heuristics to determine whether the first row is data or not;
occasionally it may guess incorrectly, in which case, you can set the
header parameter to False:
fread(data, header=False)
| C0 | C1 | C2 | C3 | ||
|---|---|---|---|---|---|
| str32 | str32 | str32 | str32 | ||
| 0 | a | b | c | d | |
| 1 | 1 | 2 | 3 | 4 | |
| 2 | 5 | 6 | 7 | 8 | |
| 3 | 9 | 10 | 11 | 12 |
You can pass a new list of column names as well:
fread(data, header=False, columns=["A","B","C","D"])
| A | B | C | D | ||
|---|---|---|---|---|---|
| str32 | str32 | str32 | str32 | ||
| 0 | a | b | c | d | |
| 1 | 1 | 2 | 3 | 4 | |
| 2 | 5 | 6 | 7 | 8 | |
| 3 | 9 | 10 | 11 | 12 |
Row Selection¶
Fread has a skip_to_line parameter, where you can specify what line to
read the data from:
data = ('skip this line\n'
'a,b,c,d\n'
'1,2,3,4\n'
'5,6,7,8\n'
'9,10,11,12')
fread(data, skip_to_line=2)
| a | b | c | d | ||
|---|---|---|---|---|---|
| int32 | int32 | int32 | int32 | ||
| 0 | 1 | 2 | 3 | 4 | |
| 1 | 5 | 6 | 7 | 8 | |
| 2 | 9 | 10 | 11 | 12 |
You can also skip to a line containing a particular string with the
skip_to_string parameter, and start reading data from that line. Note that
skip_to_string and skip_to_line cannot be combined; you can only use
one:
data = ('skip this line\n'
'a,b,c,d\n'
'first, second, third, last\n'
'1,2,3,4\n'
'5,6,7,8\n'
'9,10,11,12')
fread(data, skip_to_string='first')
| first | second | third | last | ||
|---|---|---|---|---|---|
| int32 | int32 | int32 | int32 | ||
| 0 | 1 | 2 | 3 | 4 | |
| 1 | 5 | 6 | 7 | 8 | |
| 2 | 9 | 10 | 11 | 12 |
You can set the maximum number of rows to read with the max_nrows parameter:
data = ('a,b,c,d\n'
'1,2,3,4\n'
'5,6,7,8\n'
'9,10,11,12')
fread(data, max_nrows=2)
| a | b | c | d | ||
|---|---|---|---|---|---|
| int32 | int32 | int32 | int32 | ||
| 0 | 1 | 2 | 3 | 4 | |
| 1 | 5 | 6 | 7 | 8 |
data = ('skip this line\n'
'a,b,c,d\n'
'1,2,3,4\n'
'5,6,7,8\n'
'9,10,11,12')
fread(data, skip_to_line=2, max_nrows=2)
| a | b | c | d | ||
|---|---|---|---|---|---|
| int32 | int32 | int32 | int32 | ||
| 0 | 1 | 2 | 3 | 4 | |
| 1 | 5 | 6 | 7 | 8 |
Setting Column Type¶
You can determine the data types via the columns parameter:
data = ('a,b,c,d\n'
'1,2,3,4\n'
'5,6,7,8\n'
'9,10,11,12')
# this is useful when you are interested in only a subset of the columns
fread(data, columns={"a":dt.float32, "b":dt.str32})
| a | b | c | d | ||
|---|---|---|---|---|---|
| float64 | str32 | int32 | int32 | ||
| 0 | 1 | 2 | 3 | 4 | |
| 1 | 5 | 6 | 7 | 8 | |
| 2 | 9 | 10 | 11 | 12 |
You can also pass in the data types by position:
fread(data, columns = [dt.int32, dt.str32, None, dt.float32])
| a | b | d | ||
|---|---|---|---|---|
| int32 | str32 | float64 | ||
| 0 | 1 | 2 | 4 | |
| 1 | 5 | 6 | 8 | |
| 2 | 9 | 10 | 12 |
You can also change all the column data types with a single assignment:
fread(data, columns = dt.float32)
| a | b | c | d | ||
|---|---|---|---|---|---|
| float64 | float64 | float64 | float64 | ||
| 0 | 1 | 2 | 3 | 4 | |
| 1 | 5 | 6 | 7 | 8 | |
| 2 | 9 | 10 | 11 | 12 |
You can change the data type for a slice of the columns (here slice(3)
is equivalent to [:3]):
# this changes the data type to float for the first three columns
fread(data, columns={float:slice(3)})
| a | b | c | d | ||
|---|---|---|---|---|---|
| float64 | float64 | float64 | int32 | ||
| 0 | 1 | 2 | 3 | 4 | |
| 1 | 5 | 6 | 7 | 8 | |
| 2 | 9 | 10 | 11 | 12 |
Selecting Columns¶
There are various ways to select columns in fread :
Select with a dictionary:
data = ('a,b,c,d\n' '1,2,3,4\n' '5,6,7,8\n' '9,10,11,12') # pass ``Ellipsis : None`` or ``... : None``, # to discard any columns that are not needed fread(data, columns={"a":"a", ... : None})a int32 0 1 1 5 2 9
Selecting via a dictionary makes more sense when selecting and renaming columns at the same time.
Select columns with a set:
fread(data, columns={"a","b"})a b int32 int32 0 1 2 1 5 6 2 9 10 Select range of columns with slice:
# select the second and third column fread(data, columns=slice(1,3))b c int32 int32 0 2 3 1 6 7 2 10 11 # select the first column # jump two hoops and # select the third column fread(data, columns = slice(None,3,2))a c int32 int32 0 1 3 1 5 7 2 9 11 Select range of columns with range:
fread(data, columns = range(1,3))b c int32 int32 0 2 3 1 6 7 2 10 11 Boolean Selection:
fread(data, columns=[False, False, True, True])c d int32 int32 0 3 4 1 7 8 2 11 12 Select with a list comprehension:
fread(data, columns=lambda cols:[col.name in ("a","c") for col in cols])a c int32 int32 0 1 3 1 5 7 2 9 11 Exclude columns with None:
fread(data, columns = ['a',None,None,'d'])a d int32 int32 0 1 4 1 5 8 2 9 12 Exclude columns with list comprehension:
fread(data, columns=lambda cols:[col.name not in ("a","c") for col in cols])b d int32 int32 0 2 4 1 6 8 2 10 12 Drop columns by assigning None to the columns via a dictionary:
data = ("A,B,C,D\n" "1,3,5,7\n" "2,4,6,8\n") fread(data, columns={"B":None,"D":None})A C int32 int32 0 1 5 1 2 6 Drop a column and change data type:
fread(data, columns={"B":None, "C":str})A C D int32 str32 int32 0 1 5 7 1 2 6 8 Change column name and type, and drop a column:
# pass a tuple, where the first item in the tuple is the new column name, # and the other item is the new data type. fread(data, columns={"A":("first", float), "B":None,"D":None})first C float64 int32 0 1 5 1 2 6
You can also select which columns to read dynamically, based on the names/types of the columns in the file:
def colfilter(columns):
return [col.name=='species' or "length" in col.name
for col in columns]
fread('iris.csv', columns=colfilter, max_nrows=5)
| sepal_length | petal_length | species | ||
|---|---|---|---|---|
| float64 | float64 | str32 | ||
| 0 | 5.1 | 1.4 | setosa | |
| 1 | 4.9 | 1.4 | setosa | |
| 2 | 4.7 | 1.3 | setosa | |
| 3 | 4.6 | 1.5 | setosa | |
| 4 | 5 | 1.4 | setosa |
The same approach can be used to auto-rename columns as they are read from the file:
def rename(columns):
return [col.name.upper() for col in columns]
fread('iris.csv', columns=rename, max_nrows=5)
| SEPAL_LENGTH | SEPAL_WIDTH | PETAL_LENGTH | PETAL_WIDTH | SPECIES | ||
|---|---|---|---|---|---|---|
| float64 | float64 | float64 | float64 | str32 | ||
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | |
| 1 | 4.9 | 3 | 1.4 | 0.2 | setosa | |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | |
| 4 | 5 | 3.6 | 1.4 | 0.2 | setosa |