Fread Examples¶
This function is capable of reading data from a variety of input formats (text files, plain text, files embedded in archives, excel files, …), producing a Frame as the result. You can even read in data from the command line.
See fread()
for all the available parameters.
Note: If you wish to read in multiple files, use iread()
; it returns an iterator of Frames.
Read Data¶
Read from text file:
from datatable import dt, fread result = fread('iris.csv') result.head(5) sepal_length sepal_width petal_length petal_width species 0 5.1 3.5 1.4 0.2 setosa 1 4.9 3 1.4 0.2 setosa 2 4.7 3.2 1.3 0.2 setosa 3 4.6 3.1 1.5 0.2 setosa 4 5 3.6 1.4 0.2 setosa
Read text data directly:
data = ('col1,col2,col3\n' 'a,b,1\n' 'a,b,2\n' 'c,d,3') fread(data) col1 col2 col3 a b 1 a b 2 a b 3
Read from a url:
url = "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv" fread(url)
Read from an archive:
If there are multiple files, only the first will be read; you can specify the path to the specific file you are interested in:
fread("data.zip/mtcars.csv")
Note: Use iread()
if you wish to read in multiple files in an archive; an iterator of Frames is returned.
Read from
.xls
or.xlsx
filesfread("excel.xlsx")
For excel files, you can specify the sheet to be read:
fread("excel.xlsx/Sheet1")
- Note:
Read in data from the command line. Simply pass the command line statement to the
cmd
parameter:#https://blog.jpalardy.com/posts/awk-tutorial-part-2/ #You specify the `cmd` parameter #Here we filter data for the year 2015 fread(cmd = """cat netflix.tsv | awk 'NR==1; /^2015-/'""")
The command line can be very handy with large data; you can do some of the preprocessing before reading in the data to datatable
.
Detect Thousand Separator¶
Fread
handles thousand separator, with the assumption that the separator is a ,
:
data = """Name|Salary|Position
James|256,000|evangelist
Ragnar|1,000,000|conqueror
Loki|250360|trickster"""
fread(data)
Name Salary Position
0 James 256000 evangelist
1 Ragnar 1000000 conqueror
2 Loki 250360 trickster
Specify the Delimiter¶
You can specify the delimiter via the sep
parameter.
Note that the separator must be a single character string; non-ASCII characters are not allowed as the separator, as well as any characters in ["'`0-9a-zA-Z]
:
data = """
1:2:3:4
5:6:7:8
9:10:11:12
"""
fread(data, sep=":")
C0 C1 C2 C3
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
Dealing with Null Values and Blank Rows¶
You can pass a list of values to be treated as null, via the na_strings
parameter:
data = """
ID|Charges|Payment_Method
634-VHG|28|Cheque
365-DQC|33.5|Credit card
264-PPR|631|--
845-AJO|42.3|
789-KPO|56.9|Bank Transfer
"""
fread(data, na_strings=['--', ''])
ID Charges Payment_Method
0 634-VHG 28 Cheque
1 365-DQC 33.5 Credit card
2 264-PPR 631 NA
3 845-AJO 42.3 NA
4 789-KPO 56.9 Bank Transfer
For rows with less values than in other rows, you can set fill=True
; fread
will fill with NA
:
data = ('a,b,c,d\n'
'1,2,3,4\n'
'5,6,7,8\n'
'9,10,11')
fread(data, fill=True)
a b c d
0 1 2 3 4
1 5 6 7 8
2 9 10 11 NA
You can skip empty lines:
data = ('a,b,c,d\n'
'\n'
'1,2,3,4\n'
'5,6,7,8\n'
'\n'
'9,10,11,12')
fread(data, skip_blank_lines=True)
a b c d
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
Dealing with Column Names¶
If the data has no headers, fread
will assign default column names:
data = ('1,2\n'
'3,4\n')
fread(data)
C0 C1
0 1 2
1 3 4
You can pass in column names via the columns
parameter:
fread(data, columns=['A','B'])
A B
0 1 2
1 3 4
You can change column names:
data = ('a,b,c,d\n'
'1,2,3,4\n'
'5,6,7,8\n'
'9,10,11,12')
fread(data, columns=["A","B","C","D"])
A B C D
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
You can change some of the column names via a dictionary:
fread(data, columns={"a":"A", "b":"B"})
A B c d
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
Fread
uses heuristics to determine whether the first row is data or not; occasionally it may guess incorrectly, in which case, you can set the header
parameter to False:
fread(data, header=False)
C0 C1 C2 C3
0 a b c d
1 1 2 3 4
2 5 6 7 8
3 9 10 11 12
You can pass a new list of column names as well:
fread(data, header=False, columns=["A","B","C","D"])
A B C D
0 a b c d
1 1 2 3 4
2 5 6 7 8
3 9 10 11 12
Row Selection¶
Fread
has a skip_to_line
parameter, where you can specify what line to read the data from:
data = ('skip this line\n'
'a,b,c,d\n'
'1,2,3,4\n'
'5,6,7,8\n'
'9,10,11,12')
fread(data, skip_to_line=2)
a b c d
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
You can also skip to a line containing a particular string with the skip_to_string
parameter, and start reading data from that line. Note that skip_to_string
and skip_to_line
cannot be combined; you can only use one:
data = ('skip this line\n'
'a,b,c,d\n'
'first, second, third, last\n'
'1,2,3,4\n'
'5,6,7,8\n'
'9,10,11,12')
fread(data, skip_to_string='first')
first second third last
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
You can set the maximum number of rows to read with the max_nrows
parameter:
data = ('a,b,c,d\n'
'1,2,3,4\n'
'5,6,7,8\n'
'9,10,11,12')
fread(data, max_nrows=2)
a b c d
0 1 2 3 4
1 5 6 7 8
data = ('skip this line\n'
'a,b,c,d\n'
'1,2,3,4\n'
'5,6,7,8\n'
'9,10,11,12')
fread(data, skip_to_line=2, max_nrows=2)
a b c d
0 1 2 3 4
1 5 6 7 8
Setting Column Type¶
You can determine the data types via the columns
parameter:
data = ('a,b,c,d\n'
'1,2,3,4\n'
'5,6,7,8\n'
'9,10,11,12')
#this is useful when you are interested in only a subset of the columns
fread(data, columns={"a":dt.float32, "b":dt.str32})
You can also pass in the data types by position:
fread(data, columns = (stype.int32, stype.str32, stype.float32))
You can also change all the column data types with a single assignment:
fread(data, columns = dt.float32)
You can change the data type for a slice of the columns:
#this changes the data type to float for the first three columns
fread(data, columns={float:slice(3)})
Note that there are a small number of stypes within datatable
(int8, int16, int32, int64, float32, float64, str32, str64, obj64, bool8)
Selecting Columns¶
There are various ways to select columns in fread
:
Select with a dictionary:
data = ('a,b,c,d\n' '1,2,3,4\n' '5,6,7,8\n' '9,10,11,12') #pass ``Ellipsis : None`` or ``... : None``, #to discard any columns that are not needed fread(data, columns={"a":"a", ... : None}) a 0 1 1 5 2 9
Selecting via a dictionary makes more sense when selecting and renaming columns at the same time.
Select columns with a set:
fread(data, columns={"a","b"}) a b 0 1 2 1 5 6 2 9 10
Select range of columns with slice:
#select the second and third column fread(data, columns=slice(1,3)) b c 0 2 3 1 6 7 2 10 11 #select the first column #jump two hoops and #select the third column fread(data, columns = slice(None,3,2)) a c 0 1 3 1 5 7 2 9 11
Select range of columns with range:
fread(data, columns = range(1,3)) b c 0 2 3 1 6 7 2 10 11
Boolean Selection:
fread(data, columns=[False, False, True, True]) c d 0 3 4 1 7 8 2 11 12
Select with a list comprehension:
fread(data, columns=lambda cols:[col.name in ("a","c") for col in cols]) a c 0 1 3 1 5 7 2 9 11
Exclude columns with None:
fread(data, columns = ['a',None,None,'d']) a d 0 1 4 1 5 8 2 9 12
Exclude columns with list comprehension:
fread(data, columns=lambda cols:[col.name not in ("a","c") for col in cols]) b d 0 2 4 1 6 8 2 10 12
Drop columns by assigning None to the columns via a dictionary:
data = ("A,B,C,D\n" "1,3,5,7\n" "2,4,6,8\n") fread(data, columns={"B":None,"D":None}) A C 0 1 5 1 2 6
Drop a column and change data type:
fread(data, columns={"B":None, "C":str}) A C D 0 1 5 7 1 2 6 8
Change column name and type, and drop a column:
#pass a tuple, where the first item in the tuple is the new column name, #and the other item is the new data type. fread(data, columns={"A":("first", float), "B":None,"D":None}) first C 0 1 5 1 2 6
With list comprehensions, you can dynamically select columns:
#select columns that have length, and species column
fread('iris.csv',
#use a boolean list comprehension to get the required columns
columns = lambda cols : [(col.name=='species')
or ("length" in col.name)
for col in cols],
max_nrows=5)
sepal_length petal_length species
0 5.1 1.4 setosa
1 4.9 1.4 setosa
2 4.7 1.3 setosa
3 4.6 1.5 setosa
4 5 1.4 setosa
#select columns by position
fread('iris.csv',
columns = lambda cols : [ind in (1,4) for ind, col in enumerate(cols)],
max_nrows=5)
sepal_length petal_length petal_width
0 5.1 1.4 0.2
1 4.9 1.4 0.2
2 4.7 1.3 0.2
3 4.6 1.5 0.2
4 5 1.4 0.2