KWARC / llamapun

common language and mathematics processing algorithms, in Rust

Home Page: https://kwarc.info/systems/llamapun/

Parallel document iterators

dginev opened this issue

I'm on a quest for maximal efficiency in traversing the arXMLiv dataset for scanning tasks on a single multi-core desktop. In my case, I have 32 logical threads to use simultaneously.

The key technique to use in Rust is parallel iterators, which are conveniently made available by the rayon crate.
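
For readers unfamiliar with rayon, here is a tiny illustration (not llamapun code, and the word list is made up) of how its parallel iterators drop in for sequential ones:

```rust
use rayon::prelude::*;

fn main() {
    let words = vec!["parallel", "iterators", "in", "rust"];
    // Swapping `iter()` for `par_iter()` spreads the map/sum across rayon's thread pool.
    let total_chars: usize = words.par_iter().map(|w| w.len()).sum();
    println!("total characters: {}", total_chars);
}
```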

I am working on two upgrades:

  • Add a parallel bridge capability to the document iterators in llamapun::data, so that documents can continue loading from the file system while others are being processed (the aim is to become I/O bound); a rough sketch follows this list.
  • Add a set of parallel processing primitives to the libxml wrapper which will allow certain tasks to also multi-thread on a single Document.
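
As a rough sketch of the first bullet (this is not llamapun's actual API, and the file paths are placeholders), rayon's par_bridge can turn a sequential, I/O-driven iterator into a parallel one, so that one worker pulls the next document from disk while the others process documents already in memory:

```rust
use rayon::prelude::*; // provides ParallelBridge / ParallelIterator

fn main() {
    // Hypothetical list of corpus files; llamapun's real iterator walks the corpus directory.
    let paths = vec!["doc0001.html", "doc0002.html"];

    let total_chars: usize = paths
        .into_iter()
        // Sequential, I/O-bound part: one document is read at a time...
        .filter_map(|p| std::fs::read_to_string(p).ok())
        // ...but par_bridge hands each loaded document off to rayon's worker threads,
        // so processing overlaps with loading the next file.
        .par_bridge()
        .map(|document| document.len()) // stand-in for the real per-document work
        .sum();

    println!("processed {} characters", total_chars);
}
```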

In essence, I would like to keep my high-end CPU constantly at 3200% and to process arXiv as fast as my HDD allows me to read it in. This was largely motivated by #26, which took 75 hours to produce a frequency report over MathML usage in arXiv; that is not terrible, but it ran on a single thread and could certainly be improved. If one could traverse the entire dataset in, say, 5-7 hours, it would feel closer to Rust's performance capabilities.

Rust's libxml wrapper just got 25x faster with my release of 0.2.10, so I'm excited to be working on the llamapun side next.

(Terms and conditions apply: this compares the single-threaded 0.2.9 master with a 32-thread parallel run of 0.2.10.)

I have a first stab at the parallel bridge (commit).

Running on arXMLiv 08.2018, it took 75 minutes to scan the warning subset and 5 minutes for the no_problem subset; the error subset is still running as I write this. This is running the pre-ref word report example (#27).

RAM use was stable at 4.5 GB, and my Threadripper saw a variable CPU load ranging from 24 to 29 threads utilized simultaneously at full load. (The memory use is easy to explain by taking the ~150 MB average footprint of a single libxml parser thread and multiplying by 32.)

This means llamapun will be able to do a basic frequency report over the entire 1.2 million documents of arXiv in 2 hours on a single high-end desktop machine. That's pretty cool!

I still haven't experimented with using a parallel iterator internally to the document tree, essentially spinning up a thread for each individual paragraph. That may help push my load up to 30 threads of constant use, and it is a small change to try after the current run completes.

I am also not fully satisfied with the parallelization primitives: I ended up creating somewhat redundant data structs, and the parallel filesystem walker requires a very specific closure type to work smoothly. But that can be extended and improved down the road, after the first test runs.

[Edit] 124 minutes total, so we officially traversed all of arXiv in roughly 2 hours.

Adding the in-document parallel iteration (over paragraphs) yields another marginal improvement, down to 118 minutes total. I am still unsure whether that should be the default behavior, due to the overhead of reducing hashmaps towards the final result (sketched below), but I'm jotting the result down here for now.

The load was reliably 28 or more threads maxed out, a visible uptick from the 24 in the prior comment. It is a bit of an open question whether the 4 extra threads are doing useful work or churning on .reduce overhead, but so far the net measurements fall in favor of the busier runtime.
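
For context, the fold/reduce pattern in question looks roughly like this (plain strings stand in for the DOM paragraphs the real code walks); the final merge of per-thread hashmaps is the .reduce overhead mentioned above:

```rust
use rayon::prelude::*;
use std::collections::HashMap;

fn main() {
    // Stand-ins for the paragraphs of a single document.
    let paragraphs = vec![
        "we prove the main theorem",
        "the theorem follows from the lemma",
    ];

    let frequencies: HashMap<String, usize> = paragraphs
        .par_iter()
        // Each worker thread folds word counts into its own private HashMap...
        .fold(HashMap::new, |mut counts, paragraph| {
            for word in paragraph.split_whitespace() {
                *counts.entry(word.to_string()).or_insert(0) += 1;
            }
            counts
        })
        // ...and the per-thread maps are then merged pairwise, which is the extra
        // reduction work weighed against the gain from in-document parallelism.
        .reduce(HashMap::new, |mut left, right| {
            for (word, count) in right {
                *left.entry(word).or_insert(0) += count;
            }
            left
        });

    println!("'theorem' occurred {} times", frequencies["theorem"]);
}
```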

Starting a PR at #29, to flesh out the remaining details and merge.

The MathML report example, which took almost exactly 3 days to complete on 1 thread, now finished in just under 3 hours (2h 48min) when using the parallel primitives, running stably in RAM. Also, unlike the ref example, it managed to keep a rather stable load of 30 threads at all times, which is intuitive given that there are many more math elements to parallelize over than paragraphs.

Very exciting.

real	168m4.240s
user	4990m34.044s
sys	31m40.161s

The newly parallel corpus_token_model ran successfully over arXMLiv 08.2018 in 2 hours and 16 minutes, while outputting 100 GB of token model text data.

real	136m8.449s
user	3696m19.292s
sys	27m34.441s

Feeling pretty good about merging here and filing a minor release.

The paragraph dataset extraction also finished in exactly 2 hours, so there seems to be a very stable baseline for iterating over arXiv. It extracted a 26 GB tar file, with metadata:

AMS paragraph model finished in 7205s, gathered: 
332342 documents;
12030596 paragraphs;
16735 discarded paragraphs (long words)

real	120m5.616s
user	3293m39.317s
sys	27m29.583s