b-r-u / osmpbf

A Rust library for reading the OpenStreetMap PBF file format (*.osm.pbf).

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

collect buffering blobs into memory

opened this issue · comments

Both for_each and par_map_reduce collect all the results into a vector before providing any results to the callback.

let blobs = self.blob_iter.collect::<Result<Vec<_>>>()?;

To my understanding, this means that all the data from the pbf file is buffered into memory, but these pbf files can get really big (planet.osm.pbf is ~40+GB). It looks like those blobs are doing file reads but I'm not sure if it's only for header info or it's also reading the payloads into memory too.

But in any case it seems better to change

for blob in &blobs {
    match blob.decode() {

into:

for blob in self.blob_iter {
    match blob?.decode() {

unless I'm missing something? I can put together a PR if you'd like with this change and a change to par_map_reduce which I need to dig through rayon docs a bit to figure out.

Another API that might be good here is to return an iterator that does the decoding serially or in parallel. Maybe something like reader.iter() and reader.par_iter() and then you can use whichever methods the std or rayon defines on iterators from there.

Ok I did some more experimentation and it does indeed seem to try to buffer everything into memory. I ran a small test script on an old 34G planet.osm.pbf and the program got up to using 7G of RAM after about a 1m30s with no results when I stopped the program. But with these changes to for_each in the previous comment I start to get results after about 10 seconds and the memory usage stays low.

Thanks, this is a great PR! Using rayon's par_bridge (which was introduced in 2018) was on my to-do list for a long time. I'm glad the planet file can now be parsed in parallel without swapping!

Also, I'm open to your proposed API that returns iterators. That would be way nicer for users. I made two attempts at creating an iterator of elements for ElementReader but failed because I had to create a self-referential struct. The problem is that each Element needs a reference to its PrimitiveBlock which needs to be stored during iteration. But maybe I was missing something obvious?

Could you publish the merged version to crates? It should be totally API compatible, so a minor version bump should do.

0.1.16 is published!

Thanks!