A library for reading and writing tabular data (csv/xls/json/etc).
- Supports most common tabular formats: CSV, XLS, ODS, JSON, Google Sheets, SQL, and others. See complete list below.
- Loads local and remote data: Supports HTTP and FTP.
- Low memory usage: Only the current row is kept in memory, so you can process large datasets.
- Supports compressed files: Using ZIP or GZIP algorithms.
- Extensible: You can add support for custom file formats and loaders (e.g. FTP).
- Features
- Getting started
- Documentation
- API Reference
- Contributing
- Changelog
$ pip install tabulator

Tabulator ships with a simple CLI called tabulator to read tabular data. For example:
$ tabulator https://github.com/frictionlessdata/tabulator-py/raw/4c1b3943ac98be87b551d87a777d0f7ca4904701/data/table.csv.gz
id,name
1,english
2,中国人

You can see all supported options by running tabulator --help.
from tabulator import Stream

with Stream('data.csv', headers=1) as stream:
    stream.headers # [header1, header2, ..]
    for row in stream:
        print(row) # [value1, value2, ..]

You can find other examples in the examples directory.
In the following sections, we'll walk through some usage examples of this library. All examples were tested with Python 3.6, but should run fine with Python 3.3+.
The Stream class represents a tabular stream. It takes the file path as the
source argument. For example:
<scheme>://path/to/file.<format>
It uses this path to determine the file format (e.g. CSV or XLS) and scheme
(e.g. HTTP or postgresql). It also supports format extraction from URLs like http://example.com?format=csv. If necessary, you can also define these explicitly.
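For instance, if the file extension doesn't match the actual contents, you could pass the format explicitly. This is a minimal sketch (the file name 'data.txt' is just an illustration of a CSV file with a misleading extension):

from tabulator import Stream

# The extension says .txt, but the contents are CSV, so we set the format explicitly.
with Stream('data.txt', format='csv', headers=1) as stream:
    print(stream.headers)
    print(stream.read())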
Let's try it out. First, we create a Stream object, passing the path to a CSV file.
import tabulator
stream = tabulator.Stream('data.csv')

At this point, the file hasn't been read yet. Let's open the stream so we can read the contents.
try:
    stream.open()
except tabulator.TabulatorException as e:
    pass # Handle exception

This will open the underlying data stream, read a small sample to detect the file encoding, and prepare the data to be read. We catch tabulator.TabulatorException here, in case something goes wrong.
We can now read the file contents. To iterate over each row, we do:
for row in stream.iter():
    print(row) # [value1, value2, ...]

The stream.iter() method returns each row as a list of values. If you prefer, you can call stream.iter(keyed=True) instead, which returns a dictionary with the column names as keys. Either way, this method keeps only a single row in memory at a time, so it can handle large files without consuming too much memory.
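For example, if the stream was created with headers=1 and the CSV has name and age columns, keyed iteration looks roughly like this sketch:

for row in stream.iter(keyed=True):
    print(row) # e.g. {'name': 'Alex', 'age': 21}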
If you want to read the entire file, use stream.read(). It accepts the same
arguments as stream.iter(), but returns all rows at once.
stream.reset()
rows = stream.read()

Notice that we called stream.reset() before reading the rows. This is because
internally, tabulator only keeps a pointer to its current location in the file.
If we didn't reset this pointer, we would read starting from where we stopped.
For example, if we ran stream.read() again, we would get an empty list, as
the internal file pointer is at the end of the file (because we've already read
it all). Depending on the file location, it might be necessary to download the
file again to rewind (e.g. when the file was loaded from the web).
After we're done, close the stream with:
stream.close()

The entire example looks like:
import tabulator

stream = tabulator.Stream('data.csv')
try:
    stream.open()
except tabulator.TabulatorException as e:
    pass # Handle exception

for row in stream.iter():
    print(row) # [value1, value2, ...]

stream.reset() # Rewind internal file pointer
rows = stream.read()
stream.close()

It could be rewritten to use Python's context manager interface as:
import tabulator

try:
    with tabulator.Stream('data.csv') as stream:
        for row in stream.iter():
            print(row)

        stream.reset()
        rows = stream.read()
except tabulator.TabulatorException as e:
    pass

This is the preferred way, as Python closes the stream automatically, even if an exception is thrown along the way.
The full API documentation is available as docstrings in the Stream source code.
In this section, we'll go through the different options supported by the Stream class.
By default, tabulator assumes that all file rows are values (i.e. there is no header row).
with Stream([['name', 'age'], ['Alex', 21]]) as stream:
    stream.headers # None
    stream.read() # [['name', 'age'], ['Alex', 21]]

If you have a header row, you can use the headers argument with its row number (starting from 1).
# Integer
with Stream([['name', 'age'], ['Alex', 21]], headers=1) as stream:
    stream.headers # ['name', 'age']
    stream.read() # [['Alex', 21]]

You can also pass a list of strings to define the headers explicitly:
with Stream([['Alex', 21]], headers=['name', 'age']) as stream:
    stream.headers # ['name', 'age']
    stream.read() # [['Alex', 21]]

Tabulator also supports multiline headers for the xls and xlsx formats.
with Stream('data.xlsx', headers=[1, 3], fill_merged_cells=True) as stream:
    stream.headers # ['header from row 1-3']
    stream.read() # [['value1', 'value2', 'value3']]

You can specify the file encoding (e.g. utf-8 and latin1) via the encoding argument.
with Stream(source, encoding='latin1') as stream:
    stream.read()

If this argument isn't set, Tabulator will try to infer it from the data. If you
get a UnicodeDecodeError while loading a file, try setting the encoding to
utf-8.
Tabulator supports both ZIP and GZIP compression methods. By default it'll infer the compression from the file name:

with Stream('http://example.com/data.csv.zip') as stream:
    stream.read()

You can also set it explicitly:

with Stream('data.csv.ext', compression='gz') as stream:
    stream.read()

- filename: Filename in the ZIP archive to process (defaults to the first file); see the example below
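For example, to read a specific member of a ZIP archive that contains several files, you might pass the filename option. This is a minimal sketch, assuming filename is passed as a keyword argument to Stream (the archive and member names are hypothetical):

from tabulator import Stream

# Pick a specific file inside the ZIP archive instead of the first one.
with Stream('archive.zip', filename='table2.csv', headers=1) as stream:
    print(stream.read())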
The Stream class raises tabulator.exceptions.FormatError if it detects HTML contents. This helps avoid the relatively common mistake of trying to load a CSV file from inside an HTML page, for example on GitHub.
You can disable this behaviour using the allow_html option:
with Stream(source_with_html, allow_html=True) as stream:
    stream.read() # no exception on open

To detect the file's headers, and to run other checks like validating that the file doesn't contain HTML, Tabulator reads a sample of rows during the stream.open() call. This data is available via the stream.sample property. The number of rows used can be defined via the sample_size parameter (defaults to 100).
with Stream(two_rows_source, sample_size=1) as stream:
    stream.sample # only first row
    stream.read() # first and second rows

You can disable this by setting sample_size to zero. This way, no data will be read on stream.open().
Tabulator needs to read a part of the file to infer its encoding. The bytes_sample_size argument controls how many bytes will be read for this detection (defaults to 10000).
source = 'data/special/latin1.csv'
with Stream(source) as stream:
    stream.encoding # 'iso8859-2'

You can disable this by setting bytes_sample_size to zero, in which case it'll use the machine locale's default encoding.
When ignore_blank_headers is True, tabulator will ignore columns that have blank headers (defaults to False).
# Default behaviour
source = 'text://header1,,header3\nvalue1,value2,value3'
with Stream(source, format='csv', headers=1) as stream:
    stream.headers # ['header1', '', 'header3']
    stream.read(keyed=True) # {'header1': 'value1', '': 'value2', 'header3': 'value3'}

# Ignoring columns with blank headers
source = 'text://header1,,header3\nvalue1,value2,value3'
with Stream(source, format='csv', headers=1, ignore_blank_headers=True) as stream:
    stream.headers # ['header1', 'header3']
    stream.read(keyed=True) # {'header1': 'value1', 'header3': 'value3'}

When force_strings is True, all rows' values will be converted to strings (defaults to False). None values will be converted to empty strings.
# Default behaviour
with Stream([['string', 1, datetime.datetime(2017, 12, 1, 17, 00)]]) as stream:
    stream.read() # [['string', 1, datetime.datetime(2017, 12, 1, 17, 00)]]

# Forcing rows' values as strings
with Stream([['string', 1, datetime.datetime(2017, 12, 1, 17, 00)]], force_strings=True) as stream:
    stream.read() # [['string', '1', '2017-12-01 17:00:00']]

When force_parse is True, tabulator doesn't raise an exception when it fails to parse a malformed row, but simply returns an empty row. Otherwise, tabulator raises tabulator.exceptions.SourceError when a row can't be parsed. Defaults to False.
# Default behaviour
with Stream([[1], 'bad', [3]]) as stream:
    stream.read() # raises tabulator.exceptions.SourceError

# With force_parse
with Stream([[1], 'bad', [3]], force_parse=True) as stream:
    stream.read() # [[1], [], [3]]

The skip_rows argument takes a list of row numbers and/or strings to skip. If an entry is a string, all rows that begin with it will be skipped (e.g. '#' and '//'). If it's the empty string, all rows that begin with an empty column will be skipped.
source = [['John', 1], ['Alex', 2], ['#Sam', 3], ['Mike', 4], ['John', 5]]
with Stream(source, skip_rows=[1, 2, -1, '#']) as stream:
    stream.read() # [['Mike', 4]]

The post_parse argument takes a list of functions that can filter or transform rows after they are parsed. These functions receive the extended rows, i.e. tuples containing the row number, the headers list, and the row values list. They then process the rows, and yield or discard them, modified or not.
def skip_odd_rows(extended_rows):
    for row_number, headers, row in extended_rows:
        if not row_number % 2:
            yield (row_number, headers, row)

def multiply_by_two(extended_rows):
    for row_number, headers, row in extended_rows:
        doubled_row = list(map(lambda value: value * 2, row))
        yield (row_number, headers, doubled_row)

rows = [
    [1],
    [2],
    [3],
    [4],
]

with Stream(rows, post_parse=[skip_odd_rows, multiply_by_two]) as stream:
    stream.read() # [[4], [8]]

These functions are applied in order, as a simple data pipeline. In the example above, multiply_by_two only sees the rows yielded by skip_odd_rows.
The methods stream.iter() and stream.read() accept the keyed and
extended flag arguments to modify how the rows are returned.
By default, every row is returned as a list of its cell values:
with Stream([['name', 'age'], ['Alex', 21]], headers=1) as stream:
    stream.read() # [['Alex', 21]]

With keyed=True, the rows are returned as dictionaries, mapping the column names to their values in the row:
with Stream([['name', 'age'], ['Alex', 21]], headers=1) as stream:
    stream.read(keyed=True) # [{'name': 'Alex', 'age': 21}]

And with extended=True, the rows are returned as tuples of (row_number, headers, row), where row_number is the current row number (starting from 1), headers is a list with the header names, and row is a list with the row values:
with Stream([['name', 'age'], ['Alex', 21]], headers=1) as stream:
    stream.read(extended=True) # (1, ['name', 'age'], ['Alex', 21])

The default scheme is file, a file in the local filesystem.
stream = Stream('data.csv')

In Python 2, tabulator can't stream remote data sources because of a limitation in the underlying libraries; the whole data source will be loaded into memory. In Python 3 there is no such problem and remote files are streamed.
stream = Stream('https://example.com/data.csv')

- http_session: a requests.Session object. Read more in the requests docs. See the sketch below.
- http_stream: Enables or disables HTTP streaming, when possible (enabled by default). Disable it if you'd like to preload the whole file into memory.
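As a sketch of how these options could be used, you might pass a custom session, for example to send extra HTTP headers (the header value below is purely illustrative):

import requests
from tabulator import Stream

# A custom session gives you control over HTTP details such as headers or auth.
session = requests.Session()
session.headers.update({'User-Agent': 'my-data-pipeline'})

with Stream('https://example.com/data.csv', headers=1, http_session=session) as stream:
    print(stream.read())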
The source is a file-like Python object.
with open('data.csv') as fp:
    stream = Stream(fp)

The source is a string containing the tabular data. Both scheme and format must be set explicitly, as it's not possible to infer them.
stream = Stream(
    'name,age\nJohn, 21\n',
    scheme='text',
    format='csv'
)

In this section, we'll describe the supported file formats, and their respective configuration options and operations. Some formats only support read operations, while others support both reading and writing.

stream = Stream('data.csv', delimiter=',')

The csv format supports all options from the Python CSV library. Check its documentation for more information.
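For instance, a semicolon-delimited file that quotes values with single quotes could be read roughly like this (the file name and dialect options are illustrative):

from tabulator import Stream

# Dialect options such as delimiter and quotechar are forwarded to the CSV parser.
with Stream('data.csv', headers=1, delimiter=';', quotechar="'") as stream:
    print(stream.read())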
Tabulator is unable to stream xls files, so the entire file is loaded into memory. Streaming is supported for xlsx files.
stream = Stream('data.xls', sheet=1)

- sheet: Sheet name or number (starting from 1)
- fill_merged_cells: if True, it will unmerge and fill all merged cells with the visible value. With this option enabled, the parser can't stream data and loads the whole document into memory.
- preserve_formatting: if True, it will try to preserve the text formatting of numeric and temporal cells, returning them as strings that match how they look in the spreadsheet (EXPERIMENTAL)
This format is not included in the package by default. To use it, please install tabulator with the ods extra:

$ pip install tabulator[ods]
Source should be a valid Open Office document.
stream = Stream('data.ods', sheet=1)

- sheet: Sheet name or number (starting from 1)
A publicly-accessible Google Spreadsheet.
stream = Stream('https://docs.google.com/spreadsheets/d/<id>?usp=sharing')
stream = Stream('https://docs.google.com/spreadsheets/d/<id>/edit#gid=<gid>')

Any database URL supported by sqlalchemy.
stream = Stream('postgresql://name:pass@host:5432/database', table='data')

- table (required): Database table name
- order_by: SQL expression for row ordering (e.g. name DESC); see the example below
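For example, reading a table in a deterministic order could look like this sketch (the connection string and table name are placeholders):

from tabulator import Stream

# Rows are returned ordered by the SQL expression passed via order_by.
with Stream('postgresql://name:pass@host:5432/database',
            table='data', order_by='name DESC') as stream:
    print(stream.read())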
This format is not included in the package by default. You can enable it by installing tabulator using pip install tabulator[datapackage].

stream = Stream('datapackage.json', resource=1)

- resource: Resource name or index (starting from 0)
Either a list of lists, or a list of dicts mapping the column names to their respective values.
stream = Stream([['name', 'age'], ['John', 21], ['Alex', 33]])
stream = Stream([{'name': 'John', 'age': 21}, {'name': 'Alex', 'age': 33}])

A JSON document containing a list of lists, or a list of dicts mapping the column names to their respective values (see the inline format for an example).

stream = Stream('data.json', property='key1.key2')

- property: JSON path to the property containing the tabular data. For example, considering the JSON {"response": {"data": [...]}}, the property should be set to response.data.
stream = Stream('data.ndjson')

stream = Stream('data.tsv')

Tabulator is written with extensibility in mind, allowing you to add support for new tabular file formats, schemes (e.g. ssh), and writers (e.g. MongoDB). There are three components that allow this:
- Loaders: Load a stream from some location (e.g. ssh)
- Parsers: Parse a stream of tabular data in some format (e.g. xls)
- Writers: Write tabular data to some destination (e.g. MongoDB)
In this section, we'll see how to write custom classes to extend any of these components.
You can add support for a new scheme (e.g. ssh) by creating a custom loader.
Custom loaders are implemented by inheriting from the Loader class, and
implementing its methods. This loader can then be used by Stream to load data
by passing it via the custom_loaders={'scheme': CustomLoader} argument.
The skeleton of a custom loader looks like:
from tabulator import Loader

class CustomLoader(Loader):
    options = []

    def __init__(self, bytes_sample_size, **options):
        pass

    def load(self, source, mode='t', encoding=None):
        # load logic
        pass

with Stream(source, custom_loaders={'custom': CustomLoader}) as stream:
    stream.read()

You can see examples of how the loaders are implemented by looking in the tabulator.loaders module.
You can add support for a new file format by creating a custom parser. Similarly
to custom loaders, custom parsers are implemented by inheriting from the
Parser class, and implementing its methods. This parser can then be used by
Stream to parse data by passing it via the custom_parsers={'format': CustomParser} argument.
The skeleton of a custom parser looks like:
from tabulator import Parser

class CustomParser(Parser):
    options = []

    def __init__(self, loader, force_parse, **options):
        self.__loader = loader

    def open(self, source, encoding=None):
        # open logic
        pass

    def close(self):
        # close logic
        pass

    def reset(self):
        # reset logic
        pass

    @property
    def closed(self):
        return False

    @property
    def extended_rows(self):
        # extended rows logic
        pass

with Stream(source, custom_parsers={'custom': CustomParser}) as stream:
    stream.read()

You can see examples of how parsers are implemented by looking in the tabulator.parsers module.
You can add support for writing files in a specific format by creating a custom writer. Custom writers are implemented by inheriting from the base Writer class, and implementing its methods. This writer can then be used by Stream to write data via the custom_writers={'format': CustomWriter} argument.
The skeleton of a custom writer looks like:
from tabulator import Writer

class CustomWriter(Writer):
    options = []

    def __init__(self, **options):
        pass

    def write(self, source, target, headers=None, encoding=None):
        # write logic
        pass

with Stream(source, custom_writers={'custom': CustomWriter}) as stream:
    stream.save(target)

You can see examples of how writers are implemented by looking in the tabulator.writers module.
You can check if a source can be loaded by tabulator using the validate function.
from tabulator import validate, exceptions

try:
    tabular = validate('data.csv')
except exceptions.SchemeError:
    # The file scheme isn't supported
    pass
except exceptions.FormatError:
    # The file format isn't supported
    pass

All the exceptions thrown by tabulator inherit from tabulator.exceptions.TabulatorException, so you can use it as a way to catch any tabulator exception. You can learn about the other exceptions thrown by looking into the tabulator.exceptions module.
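For instance, a catch-all along these lines works for any tabulator failure (the error handling shown is just a sketch):

from tabulator import Stream, exceptions

try:
    with Stream('data.csv', headers=1) as stream:
        rows = stream.read()
except exceptions.TabulatorException as exception:
    # Any scheme, format, encoding or parsing error ends up here.
    print('Failed to read the file:', exception)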
The API reference is written as docstrings in the tabulator classes. A good place to start is the Stream class, which manages all loading and parsing of data files.
This project follows the Open Knowledge International coding standards.
We recommend using virtualenv to isolate this project from the rest of the packages on your machine.
To install the project and its development dependencies, run:
$ make install

To run the tests, use:

$ make test

Only breaking and the most important changes are described here. The full changelog and documentation for all released versions can be found in the nicely formatted commit history.
- Implemented experimental preserve_formatting for xlsx
- Added support for specifying filename in zip source
Updated behaviour:
- For the ods format, the boolean, integer and datetime native types are now detected
Updated behaviour:
- For the xls format, the boolean, integer and datetime native types are now detected
Updated behaviour:
- Added support for Python 3.7
New API added:
- skip_rows support for an empty string to skip rows with an empty first column
New API added:
- Format will be extracted from URLs like http://example.com?format=csv
Updated behaviour:
- Now xls booleans will be parsed as booleans, not integers
New API added:
- The skip_rows argument now supports negative numbers to skip rows starting from the end
Updated behaviour:
- Instead of raising an exception, a UserWarning is emitted if an option isn't recognized
New API added:
- Added the http_session argument for the http/https format (it uses requests now)
- Added support for multiline headers: the headers argument accepts ranges like [1, 3]
New API added:
- Added support for compressed files, i.e. zip and gz, on Python 3
- The Stream constructor now accepts a compression argument
- The http/https scheme now accepts a http_stream flag
Improved behaviour:
- The headers argument allows setting the order for keyed sources and cherry-picking values
New API added:
- The XLS/XLSX/ODS formats support sheet names passed via the sheet argument
- The Stream constructor accepts an ignore_blank_headers option
Improved behaviour:
- Rebased the datapackage format on the datapackage@1 library
New API added:
- The source argument for the Stream constructor can be a pathlib.Path
New API added:
- The bytes_sample_size argument for the Stream constructor
Improved behaviour:
- Updated encoding name to a canonical form
New API added:
- stream.scheme
- stream.format
- stream.encoding
Promoted provisional API to stable API:
- Loader (custom loaders)
- Parser (custom parsers)
- Writer (custom writers)
- validate
Improved behaviour:
- Autodetect common CSV delimiters
New API added:
- Added the fill_merged_cells option to the xls/xlsx formats
New API added:
- Published the Loader/Parser/Writer API
- Added the Stream argument force_strings
- Added the Stream argument force_parse
- Added the Stream argument custom_writers
Deprecated API removal:
- Removed topen and Table; use Stream instead
- Removed the Stream arguments loader/parser_options; use **options instead
Provisional API changed:
- Updated the Loader/Parser/Writer API; please use an updated version
Provisional API added:
- Unofficial support for the Stream arguments custom_loaders/parsers