douweschulte / pdbtbx

A library to open/edit/save (crystallographic) Protein Data Bank (PDB) and mmCIF files in Rust.

Home Page: https://crates.io/crates/pdbtbx

Multithreading support

douweschulte opened this issue · comments

As some operations over iterators can benefit greatly from multithreading, it would be great if the library supported Rayon iterators. Problems could arise in combination with #57 though, so that should be looked into.

Maybe this is worth looking at in this context: https://docs.rs/rayon/1.5.0/rayon/index.html
Although I assume you're aware of this.

Another comment: Since this library is built around iterators over the various structs, it might be a good idea to implement parallel variants of these with rayon, which provides high-level parallel alternatives to the standard iterators.
This should not be too hard to implement (I think) and leaves the choice of whether or not to use them to the user. Since multithreading comes with some overhead, it would require some benchmarking to see whether iterators of the size typically generated from PDB files are large enough to be worth processing in parallel.
What are your thoughts on this?

Those are exactly my thoughts. In this way I hope to provide multithreading to users when, and only if, they want to use it. I will have to figure out a bit more about how exactly to get there at the source code level, but that will be sorted out in due time. If you are interested feel free to open a PR ;-)

I've been wanting to add something to the library, however, I'm still in the process of learning Rust. The project that ended up using this library is the first (and only) nontrivial piece of code I've written in this language so far.
With that said, I'd be glad to help but will probably have to ask a couple of stupid questions along the way. For example, I don't understand how this method (and others like it) works:

    pub fn chains(&self) -> impl DoubleEndedIterator<Item = &Chain> + '_ {
        self.models.iter().flat_map(|a| a.chains())
    }

It is supposed to return a double-ended Iterator with the items being references to a Chain type. But inside the flat_map the method calls itself?

It is no problem for me to help you get started; I would be glad to help other people learn Rust. So if you feel like it, feel free to get started!
The method above works as follows. flat_map can be seen as a dedicated map(|a| ...).flatten() invocation. map works the same as in many other languages: it takes an iterator over items of type A and runs a function on each item to produce a value of type B. In this case type B is an Iterator<Item=&Chain>, obtained by calling the chains method on a single model (this is independent of how that method is implemented on the model struct). flatten then effectively iterates over all iterators in an iterator, turning a type like Iterator<Item=Iterator<Item=A>> into Iterator<Item=A>, which gives the desired Iterator<Item=&Chain> in this case.
I hope my explanation makes some sense. Feel free to ask more questions and to look at the documentation of this method in the Rust docs: https://doc.rust-lang.org/std/iter/trait.Iterator.html#method.flat_map.
If you feel like you want to talk a bit about Rust code or anything related to this PR more directly feel free to email me at the email address provided on my profile page.
Good luck and I look forward to answering more of your questions!
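
To make the explanation above concrete, here is a minimal sketch using only the standard library. The Chain, Model and Pdb structs below are hypothetical, stripped-down stand-ins and not the library's real definitions; chains_map_flatten, example and chain_ids are helper names made up for illustration. It shows that the chains call inside the closure resolves to the method on the model struct (so nothing calls itself), and that flat_map yields the same items as map followed by flatten.

```rust
// Hypothetical stand-ins for the library's structs (assumption: the real
// types carry far more data; only the nesting matters here).
#[derive(Debug, PartialEq)]
struct Chain {
    id: String,
}

struct Model {
    chains: Vec<Chain>,
}

struct Pdb {
    models: Vec<Model>,
}

impl Model {
    // This is the method the closure in `Pdb::chains` resolves to,
    // because `a` in that closure is a `&Model`.
    fn chains(&self) -> impl DoubleEndedIterator<Item = &Chain> + '_ {
        self.chains.iter()
    }
}

impl Pdb {
    // Same method name on a different type: this is not recursion.
    fn chains(&self) -> impl DoubleEndedIterator<Item = &Chain> + '_ {
        self.models.iter().flat_map(|a| a.chains())
    }

    // The same traversal spelled out as `map` followed by `flatten`.
    fn chains_map_flatten(&self) -> impl DoubleEndedIterator<Item = &Chain> + '_ {
        self.models.iter().map(|a| a.chains()).flatten()
    }
}

// Builds a small made-up structure: two models with chains A and B each.
fn example() -> Pdb {
    let model = || Model {
        chains: vec![Chain { id: "A".to_string() }, Chain { id: "B".to_string() }],
    };
    Pdb {
        models: vec![model(), model()],
    }
}

// Collects the chain identifiers in iteration order.
fn chain_ids(pdb: &Pdb) -> Vec<&str> {
    pdb.chains().map(|c| c.id.as_str()).collect()
}

fn main() {
    let pdb = example();
    println!("{:?}", chain_ids(&pdb));
    // Both spellings visit the same chains in the same order.
    assert!(pdb.chains().eq(pdb.chains_map_flatten()));
}
```

Because both methods return a DoubleEndedIterator, calling .rev() on either one walks the chains from the last model backwards.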

Thanks for explaining! What I mostly didn't get was the fact that the chains method that was called in the closure is the one implemented for the model struct. I didn't realize at first because all these methods for all structs have the same name.
I've now implemented parallel versions of these iterator constructor methods for all hierarchies using rayon, but so far I see little difference in preliminary tests. I'll look into it some more.

Closed by #62.