If you often find yourself processing CSV files using python, you will quickly notice that csv.DictReader, while more comfortable, remains much slower than csv.reader:
# To read a 1.5G CSV file:
csv.reader: 24s
csv.DictReader: 84s
casanova.reader: 25s
casanova is therefore an attempt to stick to csv.reader performance while still keeping a comfortable interface that remains aware of headers (even duplicate ones, something that csv.DictReader is incapable of) etc.
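To make the trade-off concrete, here is a minimal sketch using only the standard library. It is not casanova's code: the sample data and both helper names are made up for the illustration. The first helper resolves the column position once and then indexes plain list rows, while the second builds a dict for every row.

```python
import csv
import io

# Hypothetical sample data for the illustration
SAMPLE = "name,surname\nJohn,Moran\nMary,Albert\n"


def surnames_with_reader(text):
    """Resolve the column position once, then index plain list rows."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    surname_pos = header.index("surname")  # cached outside the loop
    return [row[surname_pos] for row in reader]


def surnames_with_dictreader(text):
    """Same result, but a dict is allocated for every row."""
    reader = csv.DictReader(io.StringIO(text))
    return [row["surname"] for row in reader]


print(surnames_with_reader(SAMPLE))
print(surnames_with_dictreader(SAMPLE))
```

Both helpers return the same values; only the per-row cost differs.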
casanova is thus a good fit for you if you need to:
- Stream large CSV files without running out of memory
- Enrich the same CSV files by outputting a similar file, all while adding, filtering and editing cells
- Have the possibility to resume said enrichment if your process exited
- Resume even if your output does not have the same order as the input
casanova also packs exotic utilities able to read csv files in reverse (without loading the whole file into memory and in regular O(n) time), so you can, for instance, fetch useful information at the end of a file to restart some aborted process.
Finally, casanova can be used as a command line tool able to evaluate python expressions (that can be parallelized if required) for each row of a given CSV file to produce typical results such as adding a column based on others etc.
The command line tool documentation can be found here.
For more generic tasks that don't require python evaluation, we recommend the very performant xsv tool instead, or our own fork of the tool.
You can install casanova with pip using the following command:
pip install casanova
If you want to be able to feed CSV files from the web to casanova readers & enrichers, you will also need to install at least urllib3 and optionally certifi (if you want secure SSL). Note that a lot of python packages, including the popular requests library, already depend on those two, so it is likely you already have them installed anyway:
# Installing them explicitly
pip install urllib3 certifi
# Installing casanova with those implicitly
pip install casanova[http]
- reader
- reverse_reader
- headers
- writer
- enricher
- indexed_enricher
- batch_enricher
- resumers
- count
- last_cell
- set_defaults
- xsv selection mini DSL
Straightforward CSV reader yielding list rows but giving some information about potential headers and their positions.
import casanova
with open('./people.csv') as f:
    # Creating a reader
    reader = casanova.reader(f)
    # Getting header information
    reader.fieldnames
    >>> ['name', 'surname']
    reader.headers
    >>> Headers(name=0, surname=1)
    name_pos = reader.headers.name
    name_pos = reader.headers['name']
    'name' in reader.headers
    >>> True
    # Iterating over the rows
    for row in reader:
        name = row[name_pos] # it's better to cache your pos outside the loop
        name = row[reader.headers.name] # this works, but is slower
    # Interested in a single column?
    for name in reader.cells('name'):
        print(name)
    # Need also the current row when iterating on cells?
    for row, name in reader.cells('name', with_rows=True):
        print(row, name)
    # Want to iterate over records
    # NOTE: this has a performance cost
    for name, surname in reader.records('name', 'surname'):
        print(name, surname)
    for record in reader.records(['name', 'age']):
        print(record[0])
    for record in reader.records({'name': 'name', 'age': 1}):
        print(record['age'])
    # No headers? No problem.
    reader = casanova.reader(f, no_headers=True)
# Note that you can also create a reader from a path
with casanova.reader('./people.csv') as reader:
    ...
# And if you need exotic encodings
with casanova.reader('./people.csv', encoding='latin1') as reader:
    ...
# The reader will also handle gzipped files out of the box
with casanova.reader('./people.csv.gz') as reader:
    ...
# If you have `urllib3` installed, casanova is also able to stream
# remote CSV files out of the box
with casanova.reader('https://mydomain.fr/some-file.csv') as reader:
    ...
# The reader will also accept iterables of rows
rows = [['name', 'surname'], ['John', 'Moran']]
reader = casanova.reader(rows)
# And you can of course use the typical dialect-related kwargs
reader = casanova.reader('./french-semicolons.csv', delimiter=';')
# Readers can also be closed if you want to avoid context managers
reader.close()
Arguments
- input_file str or Path or file or Iterable[list[str]]: input file given to the reader. Can be a path that will be opened for you by the reader, a file handle or even an arbitrary iterable of list rows.
- no_headers bool, optional [False]: set to True if input_file has no headers.
- encoding str, optional [utf-8]: encoding to use to open the file if input_file is a path.
- dialect str or csv.Dialect, optional: CSV dialect for the reader to use. Check python standard csv module documentation for more info.
- quotechar str, optional: quote character used by CSV parser.
- delimiter str, optional: delimiter character used by CSV parser.
- prebuffer_bytes int, optional: number of bytes of input file to prebuffer in attempt to get a total number of lines ahead of time.
- total int, optional: total number of lines to expect in file, if you already know it ahead of time. If given, the reader won't prebuffer data even if prebuffer_bytes was set.
- multiplex casanova.Multiplexer, optional: multiplexer to use. See the Multiplexing section for more information.
- strip_null_bytes_on_read bool, optional [False]: before python 3.11, the csv module will raise when attempting to read a CSV file containing null bytes. If set to True, the reader will strip null bytes on the fly while parsing rows.
- reverse bool, optional [False]: whether to read the file in reverse (except for the header of course).
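As an illustration of what stripping null bytes on the fly can look like, here is a minimal standard-library sketch. It is an assumption about the general technique, not casanova's implementation, and the helper names `strip_null_bytes` and `read_rows` are hypothetical.

```python
import csv
import io


def strip_null_bytes(lines):
    """Yield each line with NUL characters removed."""
    for line in lines:
        yield line.replace("\0", "")


def read_rows(lines):
    # csv.reader accepts any iterable of strings, so the stripping
    # generator can sit between the file and the parser.
    return list(csv.reader(strip_null_bytes(lines)))


dirty = io.StringIO("name,surname\nJo\0hn,Moran\n")
print(read_rows(dirty))
```

Because the generator is lazy, the file is still streamed line by line rather than loaded in memory.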
Properties
- total int, optional: total number of lines in the file, if known through prebuffering or through the total kwarg.
- headers casanova.Headers, optional: CSV file headers if no_headers=False.
- empty bool: whether the given file was empty.
- fieldnames list[str], optional: list representing the CSV file headers if no_headers=False.
- row_len int: expected number of items per row.
Methods
- rows: returns an iterator over the reader rows. Same as iterating over the reader directly.
- cells: takes the name of a column or its position and returns an iterator over values of the given column. Can be given with_rows=True if you want to iterate over row, value tuples instead.
- enumerate: resuming-safe enumeration over rows yielding index, row tuples. Takes an optional start kwarg like builtin enumerate.
- enumerate_cells: resuming-safe enumeration over cells yielding index, cell or index, row, cell if given with_rows=True. Takes an optional start kwarg like builtin enumerate.
- wrap: method taking a list row and returning a RowWrapper object to wrap it.
- close: cleans up the reader resources manually when not using the dedicated context manager. It is usually only useful when the reader was given a path and not an already opened file handle.
Multiplexing
Sometimes, one column of your CSV file might contain multiple values, separated by an arbitrary separator character such as |.
In this case, it might be desirable to "multiplex" the file by making a reader emit one copy of the line with each of the values contained by a cell.
To do so, casanova exposes a special Multiplexer object you can give to any reader like so:
import casanova
# Most simple case: a column named "colors", separated by "|"
reader = casanova.reader(
    input_file,
    multiplex=casanova.Multiplexer('colors')
)
# Customizing the separator:
reader = casanova.reader(
    input_file,
    multiplex=casanova.Multiplexer('colors', separator='$')
)
# Renaming the column on the fly:
reader = casanova.reader(
    input_file,
    multiplex=casanova.Multiplexer('colors', new_column='color')
)
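To picture what multiplexing does to the rows, here is a minimal sketch in plain python. It only illustrates the idea described above and is not casanova's Multiplexer: the `multiplex` function is a hypothetical helper, and details such as the handling of empty cells may differ in the real implementation.

```python
def multiplex(rows, pos, separator="|"):
    """Yield one copy of each row per value found in the cell at `pos`."""
    for row in rows:
        for value in row[pos].split(separator):
            copy = list(row)
            copy[pos] = value
            yield copy


rows = [["John", "blue|red"], ["Mary", "green"]]
for row in multiplex(rows, 1):
    print(row)
```

A row whose cell holds two values is emitted twice, once per value, while other rows pass through unchanged.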
A reverse CSV reader might sound silly, but it can be useful in some scenarios, especially when you need to read the last line from an output file in constant time, without reading the whole thing first.
It is mostly used by casanova resumers and it is unlikely you will need to use it on your own.
import casanova
# people.csv looks like this
# name,surname
# John,Doe
# Mary,Albert
# Quentin,Gold
with open('./people.csv', 'rb') as f:
    reader = casanova.reverse_reader(f)
    reader.fieldnames
    >>> ['name', 'surname']
    next(reader)
    >>> ['Quentin', 'Gold']
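For intuition about how the last line of a file can be read without scanning it from the start, here is a minimal standard-library sketch. It is not casanova's reverse_reader: the `last_csv_row` helper is hypothetical, and it assumes that no cell contains a raw newline, a case a real reverse reader has to handle.

```python
import csv
import os
import tempfile


def last_csv_row(path, chunk_size=4096):
    """Return the last row of a CSV file by reading chunks from the end.

    Assumption: no cell contains a raw newline.
    Returns None if the file is empty.
    """
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        buffer = b""
        # Walk backwards until the buffer holds a complete last line.
        while position > 0:
            step = min(chunk_size, position)
            position -= step
            f.seek(position)
            buffer = f.read(step) + buffer
            if b"\n" in buffer.rstrip(b"\r\n"):
                break
    text = buffer.rstrip(b"\r\n").split(b"\n")[-1].decode("utf-8").rstrip("\r")
    if not text:
        return None
    return next(csv.reader([text]))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "people.csv")
    with open(path, "w", newline="") as f:
        f.write("name,surname\nJohn,Doe\nMary,Albert\nQuentin,Gold\n")
    # A tiny chunk size forces several backward reads
    print(last_csv_row(path, chunk_size=8))
```

Only the tail of the file is read, so the cost depends on the length of the last line rather than on the size of the file. Note that this sketch returns the header if the file contains nothing else.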
A class representing the headers of a CSV file. It is useful to find the row position of some columns and perform complex selection.
import casanova
# Headers can be instantiated thusly
headers = casanova.headers(['name', 'surname', 'age'])
# But you will usually use a reader or an enricher's one:
headers = casanova.reader(input_file).headers
# Accessing a column through attributes
headers.surname
>>> 1
# Accessing a column by indexing:
headers['surname']
>>> 1
# Getting a column
headers.get('surname')
>>> 1
headers.get('not-found')
>>> None
# Getting a duplicated column name
casanova.headers(['surname', 'name', 'name'])['name', 1]
>>> 2
casanova.headers(['surname', 'name', 'name']).get('name', index=1)
>>> 2
# Asking if a column exists:
'name' in headers
>>> True
# Retrieving fieldnames:
headers.fieldnames
>>> ['name', 'surname', 'age']
# Iterating over headers
for col in headers:
    print(col)
# Counting columns:
len(headers)
>>> 3
# Retrieving the nth header:
headers.nth(1)
>>> 'surname'
# Wrapping a row
headers.wrap(['John', 'Matthews', '45'])
>>> RowWrapper(name='John', surname='Matthews', age='45')
# Selecting some columns (by name and/or index):
headers.select(['name', 2])
>>> [0, 2]
# Selecting using xsv mini DSL:
headers.select('name,age')
>>> [0, 2]
headers.select('!name')
>>> [1, 2]
For more info about xsv mini DSL, check this part of the documentation.
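To show why duplicate column names are not a problem once positions are tracked per name, here is a small plain-python sketch. It is not the casanova.headers class: `index_headers` and `get_position` are hypothetical helpers that merely mimic the lookups shown above.

```python
from collections import defaultdict


def index_headers(fieldnames):
    """Map each column name to the list of positions where it appears."""
    positions = defaultdict(list)
    for pos, name in enumerate(fieldnames):
        positions[name].append(pos)
    return dict(positions)


def get_position(positions, name, index=0):
    """Return the position of the index-th column called `name`, or None."""
    found = positions.get(name, [])
    return found[index] if index < len(found) else None


positions = index_headers(["surname", "name", "name"])
print(get_position(positions, "name"))
print(get_position(positions, "name", index=1))
print(get_position(positions, "not-found"))
```

A dict keyed by name alone, as csv.DictReader uses for its rows, would keep only one of the two "name" columns.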
casanova also exports a csv writer. It can automatically write headers when needed and is able to resume some tasks.
import casanova
with open('output.csv', 'w') as f:
    writer = casanova.writer(f, fieldnames=['name', 'surname'])
    writer.writerow(['John', 'Davis'])
    # If you want to write headers yourself:
    writer = casanova.writer(f, fieldnames=['name', 'surname'], write_header=False)
    writer.writeheader()
Arguments
- output_file file or casanova.Resumer: target file.
- fieldnames Iterable[str], optional: column names.
- strip_null_bytes_on_write bool, optional [False]: whether to strip null bytes when writing rows. Note that on python 3.10, there is a bug making csv.writer raise an error when attempting to write a row containing a null byte.
- dialect csv.Dialect or str, optional: dialect to use to write CSV.
- delimiter str, optional: CSV delimiter.
- quotechar str, optional: CSV quoting character.
- quoting csv.QUOTE_*, optional: CSV quoting strategy.
- escapechar str, optional: CSV escaping character.
- lineterminator str, optional: CSV line terminator.
- write_header bool, optional [True]: whether to automatically write header if required (takes resuming into account).
Properties
- headers casanova.Headers, optional: CSV file headers if fieldnames were provided.
- fieldnames list[str], optional: provided fieldnames.
Resuming
A casanova.writer is able to resume through a LastCellResumer.
casanova enrichers are basically a smart combination of both a reader and a writer.
An enricher can be used to transform a given CSV file. This means you can transform its values on the fly, select some columns to keep from input and add new ones very easily.
Note that enrichers inherit from both casanova.reader and casanova.writer and therefore keep both their properties and methods.
import casanova
with open('./people.csv') as input_file, \
     open('./enriched-people.csv', 'w') as output_file:
    enricher = casanova.enricher(input_file, output_file)
    # The enricher inherits from casanova.reader
    enricher.fieldnames
    >>> ['name', 'surname']
    # You can iterate over its rows
    name_pos = enricher.headers.name
    for row in enricher:
        # Editing a cell, so that everyone is called John now
        row[name_pos] = 'John'
        enricher.writerow(row)
    # Want to add columns?
    enricher = casanova.enricher(input_file, output_file, add=['age', 'hair'])
    for row in enricher:
        enricher.writerow(row, ['34', 'blond'])
    # Want to keep only some columns from input?
    enricher = casanova.enricher(input_file, output_file, add=['age'], select=['surname'])
    for row in enricher:
        enricher.writerow(row, ['45'])
    # Want to select columns to keep using xsv mini dsl?
    enricher = casanova.enricher(input_file, output_file, select='!1-4')
    # You can of course still use #.cells etc.
    for row, name in enricher.cells('name', with_rows=True):
        print(row, name)
Arguments
- input_file file or str: file object to read or path to open.
- output_file file or Resumer: file object to write.
- no_headers bool, optional [False]: set to True if input_file has no headers.
- encoding str, optional [utf-8]: encoding to use to open the file if input_file is a path.
- add Iterable[str|int], optional: names of columns to add to output.
- select Iterable[str|int]|str, optional: selection of columns to keep from input. Can be an iterable of column names and/or column positions or a selection string written in xsv mini DSL.
- dialect str or csv.Dialect, optional: CSV dialect for the reader to use. Check python standard csv module documentation for more info.
- quotechar str, optional: quote character used by CSV parser.
- delimiter str, optional: delimiter character used by CSV parser.
- prebuffer_bytes int, optional: number of bytes of input file to prebuffer in attempt to get a total number of lines ahead of time.
- total int, optional: total number of lines to expect in file, if you already know it ahead of time. If given, the reader won't prebuffer data even if prebuffer_bytes was set.
- multiplex casanova.Multiplexer, optional: multiplexer to use. See the reader's Multiplexing section for more information.
- reverse bool, optional [False]: whether to read the file in reverse (except for the header of course).
- strip_null_bytes_on_read bool, optional [False]: before python 3.11, the csv module will raise when attempting to read a CSV file containing null bytes. If set to True, the reader will strip null bytes on the fly while parsing rows.
- strip_null_bytes_on_write bool, optional [False]: whether to strip null bytes when writing rows. Note that on python 3.10, there is a bug making csv.writer raise an error when attempting to write a row containing a null byte.
- writer_dialect csv.Dialect or str, optional: dialect to use to write CSV.
- writer_delimiter str, optional: CSV delimiter for writer.
- writer_quotechar str, optional: CSV quoting character for writer.
- writer_quoting csv.QUOTE_*, optional: CSV quoting strategy for writer.
- writer_escapechar str, optional: CSV escaping character for writer.
- writer_lineterminator str, optional: CSV line terminator for writer.
- write_header bool, optional [True]: whether to automatically write header if required (takes resuming into account).
Properties
- total int, optional: total number of lines in the file, if known through prebuffering or through the total kwarg.
- headers casanova.Headers, optional: CSV file headers if no_headers=False.
- empty bool: whether the given file was empty.
- fieldnames list[str], optional: list representing the CSV file headers if no_headers=False.
- row_len int: expected number of items per row.
- output_headers casanova.Headers, optional: output CSV headers if no_headers=False.
- output_fieldnames list[str], optional: list representing the output CSV headers if no_headers=False.
- added_count int: number of columns added to the output.
Resuming
A casanova.enricher is able to resume through a RowCountResumer or a LastCellComparisonResumer.
Sometimes, you might want to process multiple input rows concurrently. This can mean that you will emit rows in an arbitrary order, different from the input one.
This is fine, of course, but if you still want to be able to resume an aborted process efficiently (using the IndexedResumer), your output will need specific additions for it to work, namely a column containing the index of an output row in the original input.
casanova.indexed_enricher makes it simpler by providing a tailored writerow method and iterators that always provide the index of a row safely.
Note that such resuming is only possible if one row in the input will produce exactly one row in the output.
import casanova
with open('./people.csv') as f, \
     open('./enriched-people.csv', 'w') as of:
    enricher = casanova.indexed_enricher(f, of, add=['age', 'hair'])
    for index, row in enricher:
        enricher.writerow(index, row, ['67', 'blond'])
    for index, value in enricher.cells('name'):
        ...
    for index, row, value in enricher.cells('name', with_rows=True):
        ...
Arguments
Everything from casanova.enricher plus:
- index_column str, optional [index]: name of the automatically added index column.
Resuming
A casanova.indexed_enricher is able to resume through an IndexedResumer.
Sometimes, you might want to process a CSV file and paginate API calls per row. This means that each row of your input file should produce multiple new lines, which will be written in batch each time one call from the API returns.
Sometimes, the pagination might be quite long (think collecting the Twitter followers of a very popular account), and it would not be a good idea to accumulate all the results for a single row before flushing them to file atomically because if something goes wrong, you will lose a lot of work.
But if you still want to be able to resume if the process is aborted, you will need to add some things to your output: namely, a column containing optional "cursor" data to resume your API calls and an "end" symbol indicating we finished the current input row.
import casanova
with open('./twitter-users.csv') as input_file, \
     casanova.BatchResumer('./output.csv') as output_file:
    enricher = casanova.batch_enricher(input_file, output_file)
    for row in enricher:
        for results, next_cursor in paginate_api_calls(row):
            # NOTE: if we reached the end, next_cursor is None
            enricher.writebatch(row, results, next_cursor)
Arguments
Everything from casanova.enricher plus:
- cursor_column str, optional [cursor]: name of the cursor column to add.
- end_symbol str, optional [end]: unambiguous (from cursor) end symbol to mark end of input row processing.
Resuming
A casanova.batch_enricher is able to resume through a BatchResumer.
Through handy Resumer classes, casanova lets its enrichers and writers resume an aborted process.
Those classes must be used as a wrapper to open the output file and can assess whether resuming is actually useful or not for you.
All resumers act like file handles, can be used as a context manager using the with keyword and can be manually closed using the close method if required.
Finally, know that resumers should work perfectly well with multiplexing.
The RowCountResumer works by counting the number of lines of the output and skipping that many lines from the input.
It can only work in 1-to-1 scenarios where you only emit a single row per input row.
It works in O(2n) => O(n) time and O(1) memory, n being the number of already processed rows.
It is only supported by casanova.enricher.
import casanova
with open('input.csv') as input_file, \
     casanova.RowCountResumer('output.csv') as resumer:
    # Want to know if we can resume?
    resumer.can_resume()
    # Want to know how many rows were already done?
    resumer.already_done_count()
    # Giving the resumer to an enricher as if it was the output file
    enricher = casanova.enricher(input_file, resumer)
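The row-counting strategy can be sketched with the standard library alone. This is only an illustration of the principle under stated assumptions (exactly one output row per input row, a header in both files), not how RowCountResumer is written; `count_data_rows` and `resume_enrichment` are hypothetical names.

```python
import csv
import os
import tempfile
from itertools import islice


def count_data_rows(path):
    """Count rows already present in the output, header excluded."""
    if not os.path.exists(path):
        return 0
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header, if any
        return sum(1 for _ in reader)


def resume_enrichment(input_path, output_path, transform):
    """Skip the input rows already processed, then append the rest."""
    done = count_data_rows(output_path)
    is_new = not os.path.exists(output_path) or os.path.getsize(output_path) == 0
    with open(input_path, newline="") as inp, \
            open(output_path, "a", newline="") as out:
        reader = csv.reader(inp)
        writer = csv.writer(out)
        header = next(reader)
        if is_new:
            writer.writerow(header)
        # islice consumes and discards the first `done` rows
        for row in islice(reader, done, None):
            writer.writerow(transform(row))
    return done


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.csv")
    dst = os.path.join(tmp, "output.csv")
    with open(src, "w", newline="") as f:
        f.write("name\njohn\nmary\nquentin\n")
    # Simulate an aborted run that only wrote the first row
    with open(dst, "w", newline="") as f:
        f.write("name\nJOHN\n")
    skipped = resume_enrichment(src, dst, lambda row: [row[0].upper()])
    with open(dst, newline="") as f:
        print(skipped, list(csv.reader(f)))
```

The output is read once to count and the input is read once to skip, and nothing besides a counter is kept in memory.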
casanova exports an indexed resumer that allows rows to be processed concurrently and emitted in a different order.
In this precise case, counting the rows is not enough and we need to be smarter.
One way to proceed is to leverage the index column added by the indexed enricher to compute a set of already processed rows while reading the output. Then we can just skip the input rows whose indices are in this set.
The issue here is that this consumes up to O(n) memory, which is prohibitive in some use cases.
To make sure this still can be done while consuming very little memory, casanova uses an exotic data structure we named a "contiguous range set".
This means we can resume operation in O(n + log(h) * n) => O(log(h) * n) time and O(log(h)) memory, n being the number of already processed rows and h being the size of the largest hole in the sorted indices of those same rows. Note that most of the time h << n since the output is mostly sorted (albeit not at a local level).
You can read more about this data structure in this blog post.
Note finally that this resumer can only work in 1-to-1 scenarios where you only emit a single row per input row.
It is supported by casanova.indexed_enricher only.
import casanova
with open('input.csv') as input_file, \
     casanova.IndexedResumer('output.csv') as resumer:
    # Want to know if we can resume?
    resumer.can_resume()
    # Want to know how many rows were already done?
    resumer.already_done_count()
    # Giving the resumer to an enricher as if it was the output file
    enricher = casanova.indexed_enricher(input_file, resumer)
# If you want to use casanova ContiguousRangeSet for whatever reason
from casanova import ContiguousRangeSet
todo...
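To give an idea of why storing ranges instead of individual indices saves memory, here is a minimal sketch of a set of integers kept as sorted, disjoint intervals. It is an illustration only: the `RangeSet` class below is hypothetical, it is not casanova's ContiguousRangeSet, and neither its API nor its complexity guarantees should be assumed to match.

```python
from bisect import bisect_right


class RangeSet:
    """Set of integers stored as sorted, disjoint, inclusive intervals."""

    def __init__(self):
        self.starts = []
        self.ends = []

    def __contains__(self, value):
        i = bisect_right(self.starts, value) - 1
        return i >= 0 and value <= self.ends[i]

    def add(self, value):
        if value in self:
            return
        # Index of the first interval starting after `value`
        i = bisect_right(self.starts, value)
        touches_left = i > 0 and self.ends[i - 1] == value - 1
        touches_right = i < len(self.starts) and self.starts[i] == value + 1
        if touches_left and touches_right:
            # `value` bridges two intervals: merge them into one
            self.ends[i - 1] = self.ends[i]
            del self.starts[i]
            del self.ends[i]
        elif touches_left:
            self.ends[i - 1] = value
        elif touches_right:
            self.starts[i] = value
        else:
            self.starts.insert(i, value)
            self.ends.insert(i, value)

    def interval_count(self):
        return len(self.starts)


done = RangeSet()
for index in [0, 1, 3, 2, 7]:
    done.add(index)
print(done.interval_count(), 2 in done, 5 in done)
```

When indices arrive mostly in order, neighbouring intervals keep merging, so the number of stored intervals stays small compared to the number of indices seen.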
Sometimes you might write an output CSV file while performing some paginated action. Said action could be aborted and you might want to resume it where you left off.
The LastCellResumer therefore enables you to resume writing a CSV file by reading its output's last row using a casanova.reverse_reader and extracting the value you need to resume in constant time and memory.
It is only supported by casanova.writer.
import casanova
with casanova.LastCellResumer('output.csv', value_column='user_id') as resumer:
    # Giving the resumer to a writer as if it was the output file
    writer = casanova.writer(resumer)
    # Extracting last relevant value if any, so we can properly resume
    last_value = resumer.get_state()
In some scenarios, it is possible to resume the operation of an enricher if you can know what was the last value of some column emitted in the output.
Fortunately, using casanova.reverse_reader, one can read the last line of a CSV file in constant time.
As such the LastCellComparisonResumer enables you to resume the work of an enricher in O(n) time and O(1) memory, with n being the number of already done lines that you must quickly skip when repositioning yourself in the input.
Note that it only works when the enricher emits a single line per line in the input and when the considered column value is unique across the input file.
It is only supported by casanova.enricher.
import casanova
with open('input.csv') as input_file, \
     casanova.LastCellComparisonResumer('output.csv', value_column='user_id') as resumer:
    # Giving the resumer to an enricher as if it was the output file
    enricher = casanova.enricher(input_file, resumer)
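The repositioning step described above can be sketched as follows, assuming the last value written to the output is already known. This is a plain-python illustration rather than the resumer's actual code, and `rows_after` is a hypothetical helper.

```python
import csv
import io


def rows_after(rows, column_pos, last_value):
    """Yield only the rows that come after the one holding `last_value`.

    Assumption: values in the column are unique across the input.
    If `last_value` is None, nothing was done yet and all rows are kept.
    """
    iterator = iter(rows)
    if last_value is not None:
        for row in iterator:
            if row[column_pos] == last_value:
                break
    yield from iterator


reader = csv.reader(io.StringIO("user_id,name\n1,a\n2,b\n3,c\n"))
header = next(reader)
user_id_pos = header.index("user_id")
print(list(rows_after(reader, user_id_pos, "2")))
```

Skipped rows are read and discarded one at a time, which is why memory stays constant while time grows with the number of rows already done.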
casanova exposes a helper function that one can use to quickly count the number of lines in a CSV file.
import casanova
count = casanova.count('./people.csv')
# You can also stop reading the file if you go beyond a number of rows
count = casanova.count('./people.csv', max_rows=100)
>>> None # if the file has more than 100 rows
>>> 34 # else the actual count
# Any additional kwarg will be passed to the underlying reader as-is
count = casanova.count('./people.csv', delimiter=';')
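The early-stopping behaviour of max_rows can be illustrated with a short standard-library sketch. It is not the implementation of casanova.count: `count_rows` is a hypothetical helper, and it assumes that the header line is not counted.

```python
import csv
import io
from itertools import islice


def count_rows(lines, max_rows=None):
    """Count CSV data rows (header excluded); None if above `max_rows`."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header
    if max_rows is None:
        return sum(1 for _ in reader)
    # Read at most max_rows + 1 rows so we can stop early
    seen = sum(1 for _ in islice(reader, max_rows + 1))
    return None if seen > max_rows else seen


sample = "name\nJohn\nMary\nQuentin\n"
print(count_rows(io.StringIO(sample)))
print(count_rows(io.StringIO(sample), max_rows=2))
```

Reading one row past the limit is enough to know the file is too long, so the rest of the file is never parsed.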
casanova exposes a helper function using a reverse_reader to read only the last cell value from a given column of a CSV file.
import casanova
last_cell = casanova.last_cell('./people.csv', column='name')
>>> 'Quentin'
# Will return None if the file is empty
last_cell = casanova.last_cell('./empty.csv', column='name')
>>> None
# Any additional kwarg will be passed to the underlying reader as-is
last_cell = casanova.last_cell('./people.csv', column='name', delimiter=';')
casanova.set_defaults lets you edit global defaults for casanova:
import casanova
casanova.set_defaults(strip_null_bytes_on_read=True)
# As a context manager:
with casanova.temporary_defaults(strip_null_bytes_on_read=True):
    ...
Arguments
- strip_null_bytes_on_read bool, optional [False]: should readers and enrichers strip null bytes on read?
- strip_null_bytes_on_write bool, optional [False]: should writers and enrichers strip null bytes on write?
- prebuffer_bytes int, optional: default prebuffer bytes for readers and enrichers.
xsv, a command line tool written in Rust to handle csv files, uses a clever mini DSL to let users specify column selections.
casanova has a working python implementation of this mini DSL that can be used by the headers.select method and the enrichers' select kwarg.
Here is the gist of it (copied right from xsv documentation itself):
Select one column by name:
* name
Select one column by index (1-based):
* 2
Select the first and fourth columns:
* 1,4
Select the first 4 columns (by index and by name):
* 1-4
* Header1-Header4
Ignore the first 2 columns (by range and by omission):
* 3-
* '!1-2'
Select the third column named 'Foo':
* 'Foo[2]'
Re-order and duplicate columns arbitrarily:
* 3-1,Header3-Header1,Header1,Foo[2],Header1
Quote column names that conflict with selector syntax:
* '"Date - Opening","Date - Actual Closing"'
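To make the rules above more tangible, here is a minimal sketch resolving a small subset of the DSL to 0-based positions. It is not casanova's parser: the `select` function is a hypothetical helper that only supports comma-separated names, 1-based indices, forward numeric ranges and a leading "!", and it ignores quoting, the 'Foo[2]' syntax, name ranges and reversed ranges.

```python
def select(fieldnames, selection):
    """Resolve a subset of the selection DSL to 0-based positions.

    Supported: names, 1-based indices, numeric ranges such as "1-3"
    or "2-", and a leading "!" negating the whole selection.
    """
    negate = selection.startswith("!")
    if negate:
        selection = selection[1:]

    positions = []
    for item in selection.split(","):
        start, dash, end = item.partition("-")
        if dash and start.isdigit() and (end.isdigit() or end == ""):
            # Numeric range, open-ended if nothing follows the dash
            last = int(end) if end else len(fieldnames)
            positions.extend(range(int(start) - 1, last))
        elif item.isdigit():
            positions.append(int(item) - 1)
        else:
            positions.append(fieldnames.index(item))

    if negate:
        return [i for i in range(len(fieldnames)) if i not in positions]
    return positions


fieldnames = ["name", "surname", "age"]
print(select(fieldnames, "name,age"))
print(select(fieldnames, "!name"))
print(select(fieldnames, "2-"))
```

Indices are 1-based in the selection string but the returned positions are 0-based, so they can be used directly to index list rows.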