for_each and par_map_reduce buffer all blobs into memory
Both `for_each` and `par_map_reduce` collect all the results into a vector before passing any of them to the callback.
Line 71 in 31a48ba
To my understanding, this means that all the data from the pbf file is buffered into memory, but these pbf files can get really big (planet.osm.pbf is 40+ GB). It looks like those blobs are doing file reads, but I'm not sure whether that's only for header info or whether the payloads are read into memory too.
But in any case it seems better to change

```rust
for blob in &blobs {
    match blob.decode() {
```

into:

```rust
for blob in self.blob_iter {
    match blob?.decode() {
```
unless I'm missing something? If you'd like, I can put together a PR with this change and a corresponding change to `par_map_reduce`, though I'll need to dig through the rayon docs a bit to figure that one out.
Another API that might be good here is to return an iterator that does the decoding serially or in parallel, maybe something like `reader.iter()` and `reader.par_iter()`, and then you can use whichever methods std or rayon define on iterators from there.
Ok, I did some more experimentation and it does indeed buffer everything into memory. I ran a small test script on an old 34 GB planet.osm.pbf; the program got up to 7 GB of RAM usage after about 1m30s with no results before I stopped it. But with the changes to `for_each` from the previous comment, I start getting results after about 10 seconds and memory usage stays low.
Thanks, this is a great PR! Using rayon's `par_bridge` (which was introduced in 2018) had been on my to-do list for a long time. I'm glad the planet file can now be parsed in parallel without swapping!
Also, I'm open to your proposed API that returns iterators. That would be way nicer for users. I made two attempts at creating an iterator of elements for `ElementReader`, but failed because I would have had to create a self-referential struct: each `Element` needs a reference to its `PrimitiveBlock`, which would also need to be stored during iteration. But maybe I was missing something obvious?
Could you publish the merged version to crates.io? It should be fully API compatible, so a minor version bump should do.
0.1.16 is published!
Thanks!