Working with Large CSV Files
I need to parse relatively large (tens to hundreds of GB) CSV files. These files are too large to fit in memory, but not really large enough for "big data" - Hadoop, etc. So, I prefer to process them on a single machine.
Problem Description
The CSV format is relatively simple:
- pipe-delimited
- no quotes
- no escapes
The above means that one row occupies a single line, and the delimiter never occurs inside a cell value.
The tasks I need to solve are, in approximate order of increasing difficulty:
- Histogram of row size (how many columns each row has)
- For each column:
- Number and ratio of non-empty values
- The maximum, minimum and mean lengths of the values
- Number of unique values (this is the hard one)
The machine this needs to run on only has Py2, but I'll try to keep this as version-agnostic as possible. I'll be using a proprietary sample data file:
bash-3.2$ ls -lh sampledata.csv
-rw-r--r--+ 1 misha staff 362M Sep 24 22:09 sampledata.csv
It's around 400MB, has 97 columns and close to 700M rows.
Questions
- What is the bottleneck? Is it I/O or parsing?
- What is the fastest way to parse CSV?
- Does it matter if you're reading a file from disk or from a pipe?
- Is it faster to work with bytes instead of Unicode?
- Does the version of Python (2 or 3) make a difference?
Answers
What is the Bottleneck?
Let's start with a "dumb" parser that splits the input into lines then columns and see how well it does:
bash-3.2$ pv sampledata.csv | kernprof -v -l read.py --reader dumb
362MiB 0:00:27 [13.2MiB/s] [==================================================>] 100%
Counter({98: 699182})
[0, 474061, 474069, 474061, 233726, 639752, 43879, 43879, 272219, 0, 232697, 352034, 506889, 419834, 238963, 253763, 587626, 0, 267186, 267186, 270990, 435364, 206037, 206037, 458097, 582415, 582415, 582415, 582415, 582415, 582415, 582415, 582415, 0, 523037, 0, 652521, 525829, 528151, 590963, 650090, 309059, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 583518, 583518, 0, 0, 0, 0, 632934, 632934, 0, 0, 0, 0, 403372, 403372, 0, 0, 0, 0, 682333, 682333, 147462, 179818, 146352, 215427, 166945, 351125, 335831, 201459, 681185, 0, 9, 192561, 192562, 609841, 664372, 664676, 657087, 657113, 471408, 471411, 464545, 570827, 570827, 535957, 535957]
Wrote profile results to read.py.lprof
Timer unit: 1e-06 s
Total time: 4.89751 s
File: read.py
Function: dumb_reader at line 11
Line # Hits Time Per Hit % Time Line Contents
==============================================================
11 @profile
12 def dumb_reader(fin, **kwargs):
13 1 3 3.0 0.0 delimiter = kwargs.get('delimiter', ',')
14 699184 641110 0.9 13.1 for line in fin:
15 699183 661942 0.9 13.5 stripped = line.rstrip('\n')
16 699183 3314306 4.7 67.7 split = stripped.split(delimiter)
17 699183 280150 0.4 5.7 yield split
The above output suggests that we spend the majority of our time parsing (67.7%), hinting that we may have a CPU bottleneck. Indeed, if we look at the CPU usage of our process while it's running, it's close to 100%. Unless our dumb parser is somehow extremely defective, we can conclude that we have a CPU bottleneck in this particular case.
What is the Fastest Way to Parse CSV?
We have several options:
- We could roll our own, like we did above.
- The standard library has a csv module.
- Numpy and Pandas also have their own CSV readers
- Any others?
Someone has compared a variety of options on StackExchange. Their conclusions were that numpy was the clear winner, followed by pandas. Unfortunately, np.fromfile requires a very specific CSV format incompatible with our requirements, and the more robust np.loadtxt is notoriously slow.
So, let's try pandas:
def pandas_reader(fin, **kwargs):
delimiter = kwargs.get('delimiter', ',')
names = fin.readline().rstrip('\n').split(delimiter)
data_types = {name: str for name in names}
data = pd.read_csv(fin, delimiter=delimiter, header=None, names=names, dtype=data_types,
quoting=csv.QUOTE_NONE, escapechar=None, engine='c')
for index, series in data.iterrows():
yield series.tolist()
This yields the entire thing into memory, which is something we want to avoid, but we can work on that in the future. Let's time it:
bash-3.2$ time pv sampledata.csv | python read.py --reader pandas
362MiB 0:00:10 [33.9MiB/s] [==================================================>] 100%
Counter({98: 699181})
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
real 1m40.432s
user 1m36.533s
sys 0m2.762s
That's rather slow... Pandas was quick to gobble up the file (10 seconds) but took a long time to process it (1min 40s). (Did we do something wrong?)
Let's try the standard library's parser:
bash-3.2$ time pv sampledata.csv | python read.py --reader stdlib
362MiB 0:00:27 [13.4MiB/s] [==================================================>] 100%
Counter({98: 699182})
[0, 474061, 474069, 474061, 233726, 639752, 43879, 43879, 272219, 0, 232697, 352034, 506889, 419834, 238963, 253763, 587626, 0, 267186, 267186, 270990, 435364, 206037, 206037, 458097, 582415, 582415, 582415, 582415, 582415, 582415, 582415, 582415, 0, 523037, 0, 652521, 525829, 528151, 590963, 650090, 309059, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 583518, 583518, 0, 0, 0, 0, 632934, 632934, 0, 0, 0, 0, 403372, 403372, 0, 0, 0, 0, 682333, 682333, 147462, 179818, 146352, 215427, 166945, 351125, 335831, 201459, 681185, 0, 9, 192561, 192562, 609841, 664372, 664676, 657087, 657113, 471408, 471411, 464545, 570827, 570827, 535957, 535957]
real 0m27.086s
user 0m26.214s
sys 0m0.944s
Our dumb parser:
bash-3.2$ time pv sampledata.csv | python read.py --reader dumb
362MiB 0:00:23 [15.3MiB/s] [==================================================>] 100%
Counter({98: 699182})
[0, 474061, 474069, 474061, 233726, 639752, 43879, 43879, 272219, 0, 232697, 352034, 506889, 419834, 238963, 253763, 587626, 0, 267186, 267186, 270990, 435364, 206037, 206037, 458097, 582415, 582415, 582415, 582415, 582415, 582415, 582415, 582415, 0, 523037, 0, 652521, 525829, 528151, 590963, 650090, 309059, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 583518, 583518, 0, 0, 0, 0, 632934, 632934, 0, 0, 0, 0, 403372, 403372, 0, 0, 0, 0, 682333, 682333, 147462, 179818, 146352, 215427, 166945, 351125, 335831, 201459, 681185, 0, 9, 192561, 192562, 609841, 664372, 664676, 657087, 657113, 471408, 471411, 464545, 570827, 570827, 535957, 535957]
real 0m23.671s
user 0m23.294s
sys 0m0.788s
So the results so far are:
- Our home-brewed parser: 23.7s
- Standard library: 27.0s
- Pandas: 1min 40s
Our simple but dumb parser ended up being the fastest, followed closely by the standard library's. It's not really a fair comparison, because the standard library's parser is much more robust - it handles quoting, escape characters, multi-line rows, etc. Nevertheless, for our limited purposes, the simple parser will do fine.
Does It Matter If You're Reading from A File Or A Pipe?
bash-3.2$ time python read.py --reader dumb --file sampledata.csv
Counter({98: 699182})
[0, 474061, 474069, 474061, 233726, 639752, 43879, 43879, 272219, 0, 232697, 352034, 506889, 419834, 238963, 253763, 587626, 0, 267186, 267186, 270990, 435364, 206037, 206037, 458097, 582415, 582415, 582415, 582415, 582415, 582415, 582415, 582415, 0, 523037, 0, 652521, 525829, 528151, 590963, 650090, 309059, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 583518, 583518, 0, 0, 0, 0, 632934, 632934, 0, 0, 0, 0, 403372, 403372, 0, 0, 0, 0, 682333, 682333, 147462, 179818, 146352, 215427, 166945, 351125, 335831, 201459, 681185, 0, 9, 192561, 192562, 609841, 664372, 664676, 657087, 657113, 471408, 471411, 464545, 570827, 570827, 535957, 535957]
real 0m24.319s
user 0m23.855s
sys 0m0.333s
This is about the same time as with the pipe, so the answer to this question is no. There are advantages and disadvantages to both options:
- When reading from a pipe you can read output from other processes, but can't seek around.
- When reading from a file, you can seek around it, but you're stuck with that file being on disk.
The best option is to handle both, if you can, and let someone else decide what's best for them.
Is It Faster To Work With Bytes or Unicode?
CSV is typically a text format. However, in our particular case, we may get away with treating it as binary, because our separator is a pipe, which has the same value (character code) regardless of whether it's encoded as e.g. UTF-8. If we go down this path, then we'll be calculating the byte length of the values, not the character length. The alternative is to decode the binary data as UTF-8 prior to CSV parsing:
def dumb_unicode(fin, **kwargs):
delimiter = kwargs.get('delimiter', ',')
for line in fin:
decoded = line.decode('utf-8')
stripped = decoded.rstrip(u'\n')
split = stripped.split(delimiter)
yield split
but this comes at a price:
bash-3.2$ time pv sampledata.csv | python read.py --reader dumb_unicode
362MiB 0:00:27 [ 13MiB/s] [==================================================>] 100%
Counter({98: 699182})
[0, 474061, 474069, 474061, 233726, 639752, 43879, 43879, 272219, 0, 232697, 352034, 506889, 419834, 238963, 253763, 587626, 0, 267186, 267186, 270990, 435364, 206037, 206037, 458097, 582415, 582415, 582415, 582415, 582415, 582415, 582415, 582415, 0, 523037, 0, 652521, 525829, 528151, 590963, 650090, 309059, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 583518, 583518, 0, 0, 0, 0, 632934, 632934, 0, 0, 0, 0, 403372, 403372, 0, 0, 0, 0, 682333, 682333, 147462, 179818, 146352, 215427, 166945, 351125, 335831, 201459, 681185, 0, 9, 192561, 192562, 609841, 664372, 664676, 657087, 657113, 471408, 471411, 464545, 570827, 570827, 535957, 535957]
real 0m27.874s
user 0m27.449s
sys 0m0.843s
This price amounts to approx. a 10% increase in processing time.
Does the Version of Python (2 or 3) Make a Difference?
If we were using the standard library's parser, then yes, it'd make a difference because that parser decodes binary data first (bytes to Unicode), which can cost. If we were using our own parser, then it wouldn't matter much, as long as we make sure there are no implicit conversions between bytes and Unicode, because Python 3 doesn't like those.
Halfway Summary
- What is the bottleneck? CPU.
- What is the fastest way to parse CSV? Write our own simple parser.
- Does it matter if you're reading a file from disk or from a pipe? No.
- Is it faster to work with bytes instead of Unicode? Yes.
- Does the version of Python (2 or 3) make a difference? Maybe.
Can You Make It Go Faster?
When parsing CSV, we have a CPU-bound problem. Even when Python (CPython, to be more precise) runs on a multi-core machine, because of the GIL, each Python process can only use a single core. So if we can let our other cores join the party, processing should happen faster. Our processing consists of the following steps:
- Read bytes (I/O bound)
- Parse CSV (CPU bound)
- Count non-empty values, etc. (CPU bound)
The bottleneck is steps 2 and 3, so offloading them to multiple processes makes sense:
Bytes -> Reader -> Processor 1 -> Collator
-> Processor 2 ->
-> ...
-> Processor N ->
Let's see how this implementation goes:
bash-3.2$ time pv sampledata.csv | python multiread.py
362MiB 0:00:20 [17.4MiB/s] [==================================================>] 100%
Counter({98: 699182})
[0, 474061, 474069, 474061, 233726, 639752, 43879, 43879, 272219, 0, 232697, 352034, 506889, 419834, 238963, 253763, 587626, 0, 267186, 267186, 270990, 435364, 206037, 206037, 458097, 582415, 582415, 582415, 582415, 582415, 582415, 582415, 582415, 0, 523037, 0, 652521, 525829, 528151, 590963, 650090, 309059, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 583518, 583518, 0, 0, 0, 0, 632934, 632934, 0, 0, 0, 0, 403372, 403372, 0, 0, 0, 0, 682333, 682333, 147462, 179818, 146352, 215427, 166945, 351125, 335831, 201459, 681185, 0, 9, 192561, 192562, 609841, 664372, 664676, 657087, 657113, 471408, 471411, 464545, 570827, 570827, 535957, 535957]
real 0m21.673s
user 1m3.031s
sys 0m13.219s
This kept all of our 4 cores busy: 100% use during the running time of the program. However, it actually doesn't buy us that much: only 1-2s, which is a modest increase. This is because multiprocessing also comes at the price of additional I/O overhead between the subprocesses.
Is it worth it? In this particular case, no, not really. However, if we were doing some more CPU-intensive processing, then the benefit of using additional CPU cores would outweight the cost of I/O overhead. Let's make our processor more feature-complete, and keep a track of the maximum, minimum and average lengths.
bash-3.2$ time pv sampledata.csv | python multiread.py > /dev/null
362MiB 0:00:57 [ 6.3MiB/s] [==================================================>] 100%
real 1m0.263s
user 3m29.807s
sys 0m9.886s
bash-3.2$ time pv sampledata.csv | python read.py --reader dumb > /dev/null
362MiB 0:01:33 [3.89MiB/s] [==================================================>] 100%
real 1m33.120s
user 1m31.859s
sys 0m1.486s
We've slightly increased the computational complexity of our processing. The difference in execution time is now much more visible: on our 4-core laptop, the multiprocessing version is 1.5 times faster. I expect that as the processing gets more CPU-hungry (there is still more to do), this lead will continue to increase.
Can You Make It Go Even Faster?
Let's have a look what's taking up the most time:
bash-3.2$ time pv sampledata.csv | kernprof -v -l multiread.py
... snip ...
Wrote profile results to multiread.py.lprof
Timer unit: 1e-06 s
Total time: 375.052 s
File: multiread.py
Function: read at line 17
Line # Hits Time Per Hit % Time Line Contents
==============================================================
17 @profile
18 def read(line_queue, header, result_queue):
19 1 29 29.0 0.0 counter = collections.Counter()
20 99 119 1.2 0.0 fill_count = [0 for _ in header]
21 99 103 1.0 0.0 max_len = [0 for _ in header]
22 99 120 1.2 0.0 min_len = [sys.maxint for _ in header]
23 99 73 0.7 0.0 sum_len = [0 for _ in header]
24 699183 431777 0.6 0.1 while True:
25 699183 4638839 6.6 1.2 line = line_queue.get()
26 699183 528746 0.8 0.1 if line is _SENTINEL:
27 1 1 1.0 0.0 break
28 699182 5794840 8.3 1.5 row = parse_line(line)
29 699182 558756 0.8 0.1 row_len = len(row)
30 699182 1169373 1.7 0.3 counter[row_len] += 1
31 699182 574065 0.8 0.2 if row_len != len(header):
32 continue
33 69219018 46546899 0.7 12.4 for j, column in enumerate(row):
34 68519836 47239480 0.7 12.6 col_len = len(column)
35 68519836 67689088 1.0 18.0 max_len[j] = max(max_len[j], col_len)
36 68519836 67689050 1.0 18.0 min_len[j] = min(min_len[j], col_len)
37 68519836 56056394 0.8 14.9 sum_len[j] += col_len
38 68519836 44426354 0.6 11.8 if col_len > 0:
39 38433446 31707478 0.8 8.5 fill_count[j] += 1
40 1 445 445.0 0.0 result_queue.put((counter, fill_count, max_len, min_len, sum_len))
real 10m34.739s
user 10m25.871s
sys 0m4.919s
We're spending a lot of time in the inner loop. Is there any way we can speed this up? One idea is to look at batching up the max, min and sum operations. This would require keeping column lengths in memory.
bash-3.2$ time pv sampledata.csv | kernprof -v -l multiread.py
... snip...
Wrote profile results to multiread.py.lprof
Timer unit: 1e-06 s
Total time: 190.55 s
File: multiread.py
Function: read at line 17
Line # Hits Time Per Hit % Time Line Contents
==============================================================
17 @profile
18 def read(line_queue, header, result_queue):
19 1 41 41.0 0.0 counter = collections.Counter()
20 99 159 1.6 0.0 column_lengths = [list() for _ in header]
21 699183 672771 1.0 0.4 while True:
22 699183 8128471 11.6 4.3 line = line_queue.get()
23 699183 852492 1.2 0.4 if line is _SENTINEL:
24 1 0 0.0 0.0 break
25 699182 9638751 13.8 5.1 row = parse_line(line)
26 699182 888174 1.3 0.5 row_len = len(row)
27 699182 2236587 3.2 1.2 counter[row_len] += 1
28 699182 860060 1.2 0.5 if row_len != len(header):
29 continue
30 69219018 66830359 1.0 35.1 for j, column in enumerate(row):
31 68519836 92805650 1.4 48.7 column_lengths[j].append(len(column))
32
33 1 1 1.0 0.0 fill_count, min_len, max_len, sum_len = zip(
34 99 7632206 77093.0 4.0 *[(l.count(0), min(l), max(l), sum(l)) for l in column_lengths]
35 )
36 1 3872 3872.0 0.0 result_queue.put((counter, fill_count, max_len, min_len, sum_len))
real 5m9.481s
user 4m21.393s
sys 0m6.887s
This is significant. We've nearly halved our execution time, at the expense of storing all the column lengths in memory. We keep these benefits when we stop profiling and go back to using multiple cores:
bash-3.2$ time python multiread.py < sampledata.csv > /dev/null
real 0m30.520s
user 1m27.877s
sys 0m13.709s
But what about the price we paid? We're keeping the length of each column in memory. We have hundreds of columns (1e2), and potentially hundreds of millions (1e8) of rows. This means we'll be keeping tens of billions (1e10) of integers in memory. Python integers are a whopping 24 bytes, so we could need trillions (1e12) of bytes. This is a pretty rough estimate, as it doesn't take into account some cool things like integer interning. But it still sounds little bit more than what we have available, so... what do we do next?
But Wait, How Do You Know This Thing Still Works?
We've been doing something very naughty: writing code without writing tests. It's time we redeem ourselves and write some:
bash-3.2$ py.test test.py -q
..
2 passed in 0.33 seconds
This way, we know that our refactorings don't break anything down the road. As a bonus, the tests also caught several bugs in multiread.py.
Memory Profiling
We can use pympler to tell us the true size of our column length lists:
from pympler.asizeof import asizeof
logging.info('num_rows: %r asizeof(column_lengths): %.2f MB',
sum(counter.values()), asizeof(column_lengths) / 1024**2)
This costs time to calculate, but it's worth knowing at the moment:
bash-3.2$ time pv sampledata.csv | python multiread.py > /dev/null
362MiB 0:00:25 [14.3MiB/s] [==================================================>] 100%
INFO:root:num_rows: 174431 asizeof(column_lengths): 139.21 MB
INFO:root:num_rows: 174025 asizeof(column_lengths): 139.20 MB
INFO:root:num_rows: 175471 asizeof(column_lengths): 139.21 MB
INFO:root:num_rows: 175255 asizeof(column_lengths): 139.20 MB
real 1m18.618s
user 4m33.919s
sys 0m13.292s
Wow, at this rate, we'll be paying 1GB of memory for each 1M rows. Let's see if we can cut that down a bit.
We don't really need to keep a list of all the length of each columns. A tally (length, number of columns) will do. We can implement that easily using a collections.Counter and observe a significant drop in memory usage:
bash-3.2$ time pv sampledata.csv | python multiread.py > /dev/null
362MiB 0:00:38 [9.32MiB/s] [==================================================>] 100%
INFO:root:num_rows: 174287 asizeof(column_lengths): 0.34 MB
INFO:root:num_rows: 175428 asizeof(column_lengths): 0.34 MB
INFO:root:num_rows: 175052 asizeof(column_lengths): 0.37 MB
INFO:root:num_rows: 174415 asizeof(column_lengths): 0.34 MB
real 0m40.787s
user 2m10.863s
sys 0m11.074s
This is after making sure our tests still pass, of course :)
However, once we disable pympler, we'll find that our code runs slower than before, when everything was in memory:
bash-3.2$ time python multiread.py < sampledata.csv > /dev/null
real 0m36.809s
user 2m4.985s
sys 0m9.345s
If we profile our read function again, we'll see why:
bash-3.2$ time pv sampledata.csv | kernprof -v -l multiread.py
... snip...
Line # Hits Time Per Hit % Time Line Contents
==============================================================
29 @profile
30 def read(line_queue, header, result_queue):
31 1 18 18.0 0.0 logging.debug('args: %r', locals())
32 1 15 15.0 0.0 counter = collections.Counter()
33 99 596 6.0 0.0 column_lengths = [collections.Counter() for _ in header]
34 699183 475593 0.7 0.3 while True:
35 699183 2239169 3.2 1.4 line = line_queue.get()
36 699183 539042 0.8 0.3 if line is _SENTINEL:
37 1 0 0.0 0.0 break
38 699182 5573801 8.0 3.5 row = parse_line(line)
39 699182 3890694 5.6 2.5 logging.debug('row: %r', row)
40 699182 567693 0.8 0.4 row_len = len(row)
41 699182 905227 1.3 0.6 counter[row_len] += 1
42 699182 597672 0.9 0.4 if row_len != len(header):
43 continue
44 69219018 49692721 0.7 31.6 for j, column in enumerate(row):
45 68519836 92626690 1.4 59.0 column_lengths[j][len(column)] += 1
... snip...
Updating the counters one by one is sub-optimal. Perhaps if we cached a few rows in memory before updating our counters, things'd be a bit faster?
Let's write a new class to abstract away our counter-updating details:
class BufferingCounter(object):
def __init__(self, header, maxbufsize=10000):
self._header = header
self._maxbufsize = maxbufsize
self._counters = [collections.Counter() for _ in self._header]
self._buffer = [list() for _ in self._header]
self._bufsize = 0
def add_row(self, row):
for j, column in enumerate(row):
self._buffer[j].append(len(column))
self._bufsize += 1
if self._bufsize % self._maxbufsize == 0:
self.flush_buffer()
def flush_buffer(self):
for j, values in enumerate(self._buffer):
self._counters[j].update(values)
self._buffer = [list() for _ in self._header]
self._bufsize = 0
Looks good on paper, but the results aren't what we expected - it's actually slower than before:
bash-3.2$ time python multiread.py < sampledata.csv > /dev/null
real 0m52.460s
user 2m45.889s
sys 0m12.671s
The profiler tells us why:
bash-3.2$ time pv sampledata.csv | kernprof -v -l multiread.py
... snip...
Line # Hits Time Per Hit % Time Line Contents
==============================================================
38 @profile
39 def add_row(self, row):
40 69219018 34960103 0.5 21.4 for j, column in enumerate(row):
41 68519836 56867368 0.8 34.8 self._buffer[j].append(len(column))
42 699182 544912 0.8 0.3 self._bufsize += 1
43
44 699182 609252 0.9 0.4 if self._bufsize % self._maxbufsize == 0:
45 69 70347319 1019526.4 43.1 self.flush_buffer()
... snip...
Flushing the buffer and updating the underlying collections.Counter takes a surprising amount of time.
It doesn't look like this approach will work, so we're stuck with updating the Counters one-by-one :(
But What about the Hard Problem?
If you recall, the harder problem to solve was: calculate the exact number of unique values for each column.
In theory, assuming we've isolated the values for a single column, this kind of task is easy to achieve:
num_unique = len(set(all_values_in_column))
Unfortunately, this requires loading the entire column (or at least all the unique items) into memory. In our case, this number is much larger than the amount of memory we have, so we have to look at other methods.
Assuming our list is sorted, then we can calculate the number of unique items without keeping the whole thing in memory:
def count_unique(values):
num_unique = 1
prev_value = next(values)
for value in values:
if value != prev_value:
num_unique += 1
prev_value = value
return num_unique
Assuming a large list can be sorted without violating memory limits is actually not that bold an assumption. In fact, GNU sort does that sort of thing all the time: it writes temporary results to disk, and then merges them when necessary. Of course, we could implement the same thing in Python, but why bother? We could just pipe our column values into GNU sort, and read the sorted values back. So our pipeline for a single column looks like:
extract sort count_unique
CSV ----------> column ----------> sorted columns ----------> num_unique
That works for one column, but what about multiple columns? We have several options:
- Repeat the above process (extract, sort, count_unique) for each column
- Extract all columns to individual files, then run each file through sort_unique
- Anything I've missed?
The first option is attractive because it's simple, but it requires running the I/O-expensive extract step for each column. How does the cost of the extract step compare to the sort step? Let's find out.
Extract only:
bash-3.2$ for col in 1 2 3 4 5; do time python extract.py $col < sampledata.csv > /dev/null; done
real 0m5.017s
user 0m4.832s
sys 0m0.171s
real 0m5.205s
user 0m5.016s
sys 0m0.176s
real 0m5.279s
user 0m5.086s
sys 0m0.182s
Extract then sort:
bash-3.2$ for col in 1 2 3; do time bash -c "python extract.py $col < sampledata.csv | LC_ALL=C sort" > /dev/null; done
real 0m5.283s
user 0m5.091s
sys 0m0.201s
real 0m5.259s
user 0m5.065s
sys 0m0.202s
real 0m5.238s
user 0m5.047s
sys 0m0.196s
I/O is significantly more expensive that sorting: the former takes seconds, the latter takes hundreds of milliseconds. If we have a hundred columns, the extraction alone can take nearly 10 minutes, which is way too long to wait for a file with under a million rows. Let's try the second option: extract all columns to individual files, then run each file through sort and then count_unique.
Extract:
(bigcsv)bash-3.2$ time python split.py < sampledata.csv | head
gitignore/col-0.txt
gitignore/col-1.txt
gitignore/col-2.txt
gitignore/col-3.txt
gitignore/col-4.txt
gitignore/col-5.txt
... snip ...
real 0m59.243s
user 0m57.020s
sys 0m1.183s
Sort and count_unique:
bash-3.2$ time for f in gitignore/*.txt; do echo -n "$f "; LC_ALL=C sort $f | python count_unique.py; done
gitignore/col-0.txt 699182
gitignore/col-1.txt 218086
gitignore/col-10.txt 153250
gitignore/col-11.txt 38320
gitignore/col-12.txt 538
... snip...
real 0m37.239s
user 0m40.164s
sys 0m2.635s
So extracting our 97 columns took 1 min; sorting and count_unique took 40s. That's already far better than the first option we looked at, which would have taken close to 10 min.
If we wanted to speed things up, we could apply some of the tricks from above, as well as:
- Profile split.py
- Performs the sorts in parallel
If we profile, we get this:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
11 @profile
12 def split(fin, open_file=open_file):
13 1 10 10.0 0.0 reader = csv.reader(fin, delimiter='|')
14 1 110 110.0 0.0 header = next(reader)
15 99 215 2.2 0.0 paths = ['gitignore/col-%d.txt' % col_num for col_num, col_name in enumerate(header)]
16 99 1947987 19676.6 9.9 fouts = [open(path, 'wb') for path in paths]
17 100000 919215 9.2 4.7 for row in reader:
18 9899901 5419047 0.5 27.6 for col_num, col_value in enumerate(row):
19 9799902 11272701 1.2 57.5 fouts[col_num].write(col_value + b'\n')
20 99 200 2.0 0.0 for fout in fouts:
21 98 61326 625.8 0.3 fout.close()
22 99 128 1.3 0.0 for path in paths:
23 98 491 5.0 0.0 print(path)
The main cost here is the I/O of writing to the temporary files: it takes up over half of the execution time. It's actually so expensive that it would make the first approach attractive for cases where the column number is small.
So here we're dealing with an I/O bottleneck. We're writing to a hundred or so files synchronously, and paying the price for it. What if we could write to the files asynchronously? One way to do that, if we were using Python 3.x, would be asynchronous I/O (we might get to that later). In our world, we don't have that, but we have the next best thing: threads.
Threads effectively allow you to do more non-CPU-bound tasks in a shorter amount of wall time. They are lightweight than multiprocesses, so you can create many more of them. Because they do not bypass the limitations of the GIL, so they are best applied to cases like I/O bottlenecks.
The strategy is simple:
def writer_thread(queue_in, fpath_out, open_):
with open_(fpath_out, 'wb') as fout:
lines = True
while lines is not SENTINEL:
lines = queue_in.get()
if lines is not SENTINEL:
fout.write(b'\n'.join(lines) + b'\n')
queue_in.task_done()
We start a separate thread for each column, and that thread writes to its own file. We batch lines together for efficiency: otherwise, the overhead of working with some many queues and items becomes too high. Let's time it:
bash-3.2$ time python multisplit.py < sampledata.csv
real 0m27.865s
user 0m26.215s
sys 0m1.969s
This is almost twice as fast as the single-threaded solution. Another interesting result is that our thread-based solution chews through this file around 20s faster than the multiprocessing solution does its calculations. This is despite the fact that the former does a lot of I/O.
Putting It All Together
Let's combine our solutions, i.e. do:
- Histogram of row size (how many columns each row has)
- For each column:
- Number and ratio of non-empty values
- The maximum, minimum and mean lengths of the values
- Number of unique values (this is the hard one)
We can do the histogram as part of split - it's trivial. For the rest, we need a script that reads a sorted column and outputs the results.
Now that our output is sorted, we can easily run-length encode it. This is good because the number of runs automatically gives us the number of unique values. Furthermore, the number of runs is guaranteed to be less than or equal to the number of values, so working with runs is more efficient and convenient. Once we have our runs and run lengths, calculating the above is trivial. It's also relatively quick:
bash-3.2$ time for f in gitignore/col-*.txt; do LC_ALL=C sort $f | python summarize.py > /dev/null; done
real 1m7.055s
user 1m10.568s
sys 0m2.700s
It took 1min 7s seconds to sort and summarize all of our hundred or so columns. Taking into account the 30 seconds it took to split the original file, we can expect to process the file in under 2 min. Let's see if we can do better: start by profiling summarize.py:
Total time: 4.71134 s
File: summarize.py
Function: read_column at line 22
Line # Hits Time Per Hit % Time Line Contents
==============================================================
22 @profile
23 def read_column(iterator):
24 1 3 3.0 0.0 num_values = 0
25 1 1 1.0 0.0 num_uniques = 0
26 1 0 0.0 0.0 num_empty = 0
27 1 1 1.0 0.0 max_len = 0
28 1 0 0.0 0.0 min_len = sys.maxint
29 1 1 1.0 0.0 sum_len = 0
30
31 218087 3644832 16.7 77.4 for run_value, run_length in run_length_encode(line.rstrip(b'\n') for line in iterator):
32 218086 142615 0.7 3.0 if run_value == BLANK:
33 1 1 1.0 0.0 num_empty = run_length
34 218086 137396 0.6 2.9 num_values += run_length
35 218086 134973 0.6 2.9 num_uniques += 1
36 218086 143855 0.7 3.1 val_len = len(run_value)
37 218086 178101 0.8 3.8 max_len = max(max_len, val_len)
38 218086 170990 0.8 3.6 min_len = min(min_len, val_len)
39 218086 158565 0.7 3.4 sum_len += val_len * run_length
40
41 1 1 1.0 0.0 return {
42 1 0 0.0 0.0 'num_values': num_values,
43 1 1 1.0 0.0 'num_fills': num_values - num_empty,
44 1 1 1.0 0.0 'fill_ratio': (num_values - num_empty) / num_values,
45 1 0 0.0 0.0 'max_len': max_len,
46 1 1 1.0 0.0 'min_len': min_len,
47 1 2 2.0 0.0 'avg_len': sum_len / num_values,
48 1 1 1.0 0.0 'num_uniques': num_uniques,
49 }
Most of the time gets spent in the run_length_encode function. Drilling down:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
9 @profile
10 def run_length_encode(iterator):
11 1 2449072 2449072.0 60.2 run_value, run_length = next(iterator), 1
12 699182 667812 1.0 16.4 for value in iterator:
13 699181 297065 0.4 7.3 if value < run_value:
14 raise ValueError('unsorted iterator')
15 699181 280812 0.4 6.9 elif value != run_value:
16 218085 82619 0.4 2.0 yield run_value, run_length
17 218085 96326 0.4 2.4 run_value, run_length = value, 1
18 else:
19 481096 192319 0.4 4.7 run_length += 1
20 1 0 0.0 0.0 yield run_value, run_length
Not much we can do here. Most of the time is spent reading the iterator (in our case, it's a file). The first hit is particularly bad: it's likely because of buffering. Fortunately, this is another I/O bound problem, and we can solve it using threads. Before we do that, we have to tend to another problem: when to sort?
- Sort all columns first, and then run them through summarize.py on multiple threads
- Sort each column and pipe it to summarize.py directly
Let's look at the first option: sorting everything first.
bash-3.2$ time for f in gitignore/col*.txt; do LC_ALL=C sort $f > $f.sorted; done
real 9m17.598s
user 9m6.205s
sys 0m4.773s
Yikes! gsort is slightly faster because it uses parallelization, unlike the default sort on MacOS, but still relatively slow:
bash-3.2$ time for f in gitignore/col*.txt; do LC_ALL=C gsort $f > $f.sorted; done
real 3m40.958s
user 10m33.544s
sys 0m8.451s
Why does sorting take so long? Is it because we're writing to a file? Piping to summarize was lightning-fast.
Let's try the second option with some pseudo-code:
split
for each column:
sort column | summarize
collect summaries
clean up
There's a Python module for helping us out with the sort and pipe part. It's appropriately called pipes. We use it like this:
def sort_and_summarize(path):
template = pipes.Template()
template.append('LC_ALL=C sort', '--')
with template.open(path, 'r') as fin:
result = summarize(fin)
return result
The LC_ALL=C forces a sort by native byte values, as opposed to locale-specific characters. This gives the same result as sorting within Python.
Since we've seen that sorting and summarizing is CPU-intensive, we can spread the work across multiple cores. The result:
bash-3.2$ time python summarize.py gitignore/col-*.txt > /dev/null
real 0m36.105s
user 2m10.014s
sys 0m2.899s
Now all we need is a script that does the following:
split
for each column:
sort column | summarize
collect summaries
clean up
Let's time our end result:
bash-3.2$ time python bigcsv.py < sampledata.csv > /dev/null
real 1m20.490s
user 3m5.546s
sys 0m5.448s
1 minute and 20 seconds. It's good enough for me!
Summary
- More than one way to parse CSV - the best method depends on your application
- CSV parsing is CPU bound, but
- Splitting CSV files is I/O bound
- multiprocessing helps work around CPU-bound problems on multi-core machines
- threading helps work around I/O-bound problems
- pympler is helpful for memory profiling
- line_profiler is helpful for CPU usage profiling
- pipes module is helpful for using pipes within your Python programs
- Watch this video for an intro to Python profiling
- Watch this video for an intro to Python 3's asyncio
- Russian speakers: watch this video for a good intro to memory and Python