This function can read data from a variety of input formats and returns a
Frame as the result. The recognized formats are:
CSV, Jay, XLSX, and plain text. In addition, the data may be inside an
archive.
The first argument to fread is the input source.
Multiple types of sources are supported, and they can be named
explicitly: file, text, cmd, and url. When the source is
not named, fread will attempt to guess its type. The most common
type is file, but sometimes the argument is resolved as text
(if the string contains newlines) or
url (if the string starts with
s3:// or similar).
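The resolution rules above can be illustrated with a small sketch. This is not fread's actual implementation: the function name guess_source_type, the order of the checks, and the exact list of URL prefixes are assumptions made for illustration only.

```python
def guess_source_type(src: str) -> str:
    """Roughly classify an unnamed string source as text, url, or file."""
    # Assumed prefix list; the description only says "s3:// or similar".
    url_prefixes = ("http://", "https://", "ftp://", "s3://")
    if "\n" in src:
        return "text"  # a string containing newlines is treated as raw data
    if src.startswith(url_prefixes):
        return "url"
    return "file"      # the most common case


kind = guess_source_type("s3://bucket/data.csv")
```

Naming the source explicitly avoids relying on this kind of guessing altogether.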
A file source can be either the name of the file on disk, or a
Python “file-like” object, i.e. any object having a read() method.
Generally, specifying a file name should be preferred, since
reading from a Python
file can only be done in single-threaded mode.
This argument also supports addressing files inside an archive,
or sheets inside an Excel workbook. Simply write the name of the
file as if the archive were a folder.
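To make the "archive as a folder" idea concrete, here is a minimal standard-library sketch of how such a combined path could be split and resolved for a zip archive. It is an illustration of the addressing scheme, not fread's own code; the helper read_from_archive_path and the names data.zip and train.csv are hypothetical.

```python
import os
import tempfile
import zipfile


def read_from_archive_path(path: str) -> bytes:
    """Split 'archive.zip/member' and return the member's raw bytes."""
    marker = ".zip/"
    idx = path.find(marker)
    if idx == -1:
        raise ValueError("no .zip archive component in path")
    archive = path[: idx + len(".zip")]  # path of the archive on disk
    member = path[idx + len(marker):]    # name of the file inside it
    with zipfile.ZipFile(archive) as zf:
        return zf.read(member)


# Build a small archive in a temporary directory, then address a file
# inside it as if the archive were a folder.
with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "data.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("train.csv", "a,b\n1,2\n")
    content = read_from_archive_path(zip_path + "/train.csv")
```

With fread the combined path is passed directly as the file source, so no manual splitting is needed.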
Instead of reading data from a file, this argument provides the data as a simple in-memory blob.
A command that will be executed in the shell and its output then read as text.
This parameter can be used to specify the URL of the input file. The data will first be downloaded into a temporary directory and then read from there. The temporary files are removed afterwards.
A path to a public S3 bucket is also supported; internally, however, it is first converted into the corresponding https URL.
We use the standard
urllib.request module to download the
data. Changing the settings of that module, for example by installing
proxy, password, or cookie managers, will allow you to customize
the download process.
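As an illustration of such customization, the sketch below installs a global opener that sends https requests through a proxy. The proxy address is a placeholder, and the example only demonstrates the standard urllib.request mechanism; it makes no claim about which urllib.request calls fread uses internally.

```python
import urllib.request

# Hypothetical proxy address; replace it with your own.
proxy = urllib.request.ProxyHandler({"https": "http://proxy.example.com:8080"})
opener = urllib.request.build_opener(proxy)

# After this call, urllib.request.urlopen() uses the opener defined above.
urllib.request.install_opener(opener)
```

Password and cookie handling can be added the same way, by passing additional handlers (for example urllib.request.HTTPBasicAuthHandler or urllib.request.HTTPCookieProcessor) to build_opener.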
Limit which columns to read from the CSV file.
Field separator in the input file. If this value is None
(default) then the separator will be auto-detected. Otherwise it
must be a single-character string. When
sep='\n', the
data will be read in single-column mode. Characters
["'`0-9a-zA-Z] are not allowed as the separator, and neither are
any non-ASCII characters.
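The constraints on the separator can be summarized as a small validation function. This is a sketch of the rules as stated here, not the check that fread performs; the name is_valid_sep is hypothetical.

```python
import re

# Characters listed as forbidden: quotes, backtick, digits, ASCII letters.
_FORBIDDEN = re.compile(r"""["'`0-9a-zA-Z]""")


def is_valid_sep(sep) -> bool:
    """Return True if `sep` satisfies the documented separator rules."""
    if sep is None:
        return True   # None requests auto-detection
    if not isinstance(sep, str) or len(sep) != 1:
        return False  # must be a single-character string
    if ord(sep) > 127:
        return False  # non-ASCII characters are not allowed
    return _FORBIDDEN.match(sep) is None


checks = {s: is_valid_sep(s) for s in [",", "\t", "\n", "a", "7", ";;"]}
```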
Decimal point symbol for floating-point numbers.
The maximum number of rows to read from the file. Setting this parameter to any negative number is equivalent to having no limit at all. Currently this parameter doesn’t always work correctly.
If True then the first line of the CSV file contains the header.
If False then there is no header. By default the presence of the
header is heuristically determined from the contents of the file.
The list of strings that were used in the input file to represent NA values.
If True then the lines of the CSV file are allowed to have
an uneven number of fields. All missing fields will be filled with
NAs in the resulting frame.
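The effect of filling can be pictured with the standard csv module: short rows are padded on the right until every row has the same width. This is only a model of the described behavior, with None standing in for NA; read_with_fill is a hypothetical helper.

```python
import csv
import io


def read_with_fill(text: str) -> list:
    """Parse CSV text and pad short rows with None (standing in for NA)."""
    rows = list(csv.reader(io.StringIO(text)))
    width = max((len(r) for r in rows), default=0)
    return [r + [None] * (width - len(r)) for r in rows]


rows = read_with_fill("a,b,c\n1,2\n3\n")
```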
If this parameter is provided, then the input will be recoded
from this encoding into UTF-8 before reading. Any encoding
registered with the Python
codecs module can be used.
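Recoding of this kind can be sketched with the standard library as follows. The helper recode_to_utf8 is hypothetical and only illustrates the decode-then-encode step; codecs.lookup is used to confirm that the encoding name is registered.

```python
import codecs


def recode_to_utf8(raw: bytes, encoding: str) -> bytes:
    """Decode `raw` from `encoding` and re-encode it as UTF-8."""
    codecs.lookup(encoding)  # raises LookupError for an unknown encoding
    return raw.decode(encoding).encode("utf-8")


latin1_bytes = "caf\u00e9".encode("latin-1")  # one byte per character
utf8_bytes = recode_to_utf8(latin1_bytes, "latin-1")
```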
Start reading the file from the line containing this string. All
previous lines will be skipped and discarded. This parameter
cannot be used together with skip_to_line.
If this setting is given, then this many lines in the file will
be skipped before we start to parse the file. This can be used,
for example, when the first several lines in the file contain non-CSV
data and therefore must be skipped. This parameter cannot be
used together with skip_to_string.
If True, then any empty lines in the input will be skipped. If
this parameter is
False then: (a) in single-column mode empty
lines are kept as empty lines; otherwise (b) if fill=True then
empty lines produce a single line filled with NAs in the output;
otherwise (c) a
dt.exceptions.IOError is raised.
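The three-way rule for blank lines can be written out as a decision function. This is a sketch of the rule as described, not fread's code: the function name and the returned labels are hypothetical, the fill flag is assumed to be the condition in case (b), and the built-in IOError stands in for dt.exceptions.IOError.

```python
def handle_blank_line(skip_blank_lines: bool, single_column: bool, fill: bool) -> str:
    """Return the action taken when a blank line is encountered."""
    if skip_blank_lines:
        return "skip"        # the line is dropped
    if single_column:
        return "keep-empty"  # (a) kept as an empty line
    if fill:
        return "na-row"      # (b) becomes a row filled with NAs
    raise IOError("blank line encountered")  # (c)


action = handle_blank_line(skip_blank_lines=False, single_column=False, fill=True)
```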
If True, then the leading/trailing whitespace will be stripped
from unquoted string fields. Whitespace is always skipped from
numeric fields.
The character that was used to quote fields in the CSV file. By
default the double-quote mark
'"' is assumed.
Use this directory for storing temporary files as needed. If not
provided then the system temporary directory will be used, as
determined via the
tempfile Python module.
Number of threads to use when reading the file. This number cannot
exceed the number of threads in the pool, dt.options.nthreads. If
0 or a negative number of threads is requested, then it will be
treated as that many threads less than the maximum. By default
all threads in the thread pool are used.
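The thread-count rule can be expressed in a few lines. This is a sketch under stated assumptions, not fread's implementation: resolve_nthreads is a hypothetical name, requests above the pool size are assumed to be capped rather than rejected, and the result is assumed never to drop below one thread.

```python
def resolve_nthreads(requested, pool_size: int) -> int:
    """Turn a requested thread count into the number actually used."""
    if requested is None:
        return pool_size                      # default: use the whole pool
    if requested <= 0:
        # "that many threads less than the maximum"
        return max(1, pool_size + requested)  # assumption: at least 1 thread
    return min(requested, pool_size)          # assumption: capped at pool size


used = resolve_nthreads(-2, pool_size=8)
```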
If True, then print detailed information about the internal
workings of fread to stdout (or to
logger if provided).
Logger object that will receive verbose information about fread’s
progress. When this parameter is specified,
verbose mode will
be turned on automatically.
Action that should be taken when the input resolves to multiple
distinct sources. By default
("warn"), a warning will be issued
and only the first source will be read and returned as a Frame.
The "ignore" action is similar, except that the extra sources
will be discarded without a warning. Lastly, an error
can be raised if the value of this parameter is "error".
If you want all sources to be read instead of only the first one,
then consider using iread().
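The behavior of the different actions can be modeled as follows. This is an illustrative sketch only: pick_source is a hypothetical helper, the value "error" for the raising action is an assumption of this sketch, and the built-in IOError stands in for the exception that fread would raise.

```python
import warnings


def pick_source(sources: list, multiple_sources: str = "warn"):
    """Return the first source, handling extra sources per the chosen action."""
    if len(sources) > 1:
        if multiple_sources == "error":
            raise IOError("input resolved to %d sources" % len(sources))
        if multiple_sources == "warn":
            warnings.warn("multiple sources found; only the first is read")
        # "ignore": extra sources are dropped silently
    return sources[0]


first = pick_source(["a.csv", "b.csv"], multiple_sources="ignore")
```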
Try not to exceed this amount of memory allocation (in bytes) when reading the data. This limit is advisory and not enforced very strictly.
This setting is useful when reading data from a file that is substantially larger than the amount of RAM available on your machine.
When this parameter is specified and fread sees that it needs more RAM than the limit in order to read the input file, it will dump the data read so far into a temporary file in binary format. The returned Frame will then be composed partly of data located on disk and partly of data in memory. It is advised to either store this data as a Jay file or filter and materialize the frame; otherwise performance may be slow.