douweschulte / pdbtbx

A library to open/edit/save (crystallographic) Protein Data Bank (PDB) and mmCIF files in Rust.

Home Page:https://crates.io/crates/pdbtbx

Add SASA support

OWissett opened this issue

I have developed a Rust port of the C library, freesasa, which performs protein surface area calculations.

Do you think this could be incorporated into this library? It is already built to support pdbtbx structs.

I am happy to fork this repo and get working on making it more integrated, if people feel this would be a good thing.

Good work! I have not worked with this particular library before, but from a cursory glance at the documentation it seems to me like a nice small scoped calculation which could fit nicely into pdbtbx. If the code you wrote is really big, then maybe it would make more sense as a separate crate. If the code you wrote needs a lot of dependencies I would make it into a separate feature so users can opt out of compiling this feature in when they do not need it. But that is just personal preference.
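
For illustration, opt-in compilation of the kind suggested above could look roughly like this. It is only a sketch: the feature name `sasa` and the function `sasa_backend` are made up for this example and are not part of pdbtbx.

```rust
// Hypothetical sketch. `sasa` is an assumed feature name, not an existing
// pdbtbx feature. The crate's Cargo.toml would declare something like:
//
//   [features]
//   sasa = ["dep:freesasa-sys"]
//
// with `freesasa-sys` listed as an optional dependency.

/// Compiled only when the crate is built with `--features sasa`.
#[cfg(feature = "sasa")]
pub fn sasa_backend() -> &'static str {
    "enabled"
}

/// Fallback compiled when the feature is off, so the SASA code and its
/// dependencies are left out of the build entirely.
#[cfg(not(feature = "sasa"))]
pub fn sasa_backend() -> &'static str {
    "disabled"
}
```

Users who never ask for the feature would then not pay for compiling the C bindings.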

In conclusion, I am open to it and would be happy to get some more details!

Yeah, it is not too big, but I agree: adding it as a feature would be good. Currently, the crate is privately hosted on my research group's GitLab. I will look at integrating it at some point soon.

I am keen to get involved in developing this package, as I prefer writing Rust over Python for most of what we do. The only reason to choose Python at the moment is the number of packages written for it, so this must change!

Also, I should say that it isn't really a port (that was the wrong choice of words); it is FFI bindings to the C library, so it still uses the underlying C library.

You can look at the raw FFI crate freesasa-sys here: https://lib.rs/crates/freesasa-sys

Perfect! Sounds like a good fit for a new feature for pdbtbx. You are welcome to get involved. As I am not working with protein structures on a daily basis anymore, my work on pdbtbx has been sparse recently; I am working on antibody sequencing from serum with mass spec instead (I saw you are a PhD in an antibody design group, nice coincidence). But I am more than willing to review PRs.
(Also a small note: we have a public holiday here and I will be away for a couple of days.)

hey @OWissett, this is something I'm also interested in.

Did you move forward with the implementation here? I'd be happy to collaborate on that.

ping @OWissett 👀

@rvhonorato Hey sorry didn't see this until now. Been busy with PhD stuff.

I'm not working on this right now, so you're welcome to give it a go.

If you go on my repos, I have the freesasa-rs crate on there. It potentially needs some restructuring, but feel free to have a go at implementing it in pdbtbx.

IMO pdbtbx should be kept as a pure PDB and mmCIF parser without additional features, to avoid bloat, especially anything that depends on packages outside of the Rust ecosystem. On a related note, I recently wrote a pure Rust implementation of the Shrake-Rupley algorithm for computing SASA, which can be found here.
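
For readers unfamiliar with Shrake-Rupley, here is a minimal brute-force sketch of the idea: place test points on each atom's probe-expanded sphere and count the ones that do not fall inside any neighbouring expanded sphere. This is not the code from the library mentioned above; the `Sphere` type, the function names, and the golden-spiral point placement are assumptions made for this example, and a real implementation would use a neighbour search instead of the O(n²) loop.

```rust
use std::f64::consts::PI;

/// Hypothetical minimal atom type for this sketch (not a pdbtbx or rust-sasa type).
#[derive(Clone, Copy, Debug)]
pub struct Sphere {
    /// Atom center in Å.
    pub center: [f64; 3],
    /// Van der Waals radius in Å.
    pub radius: f64,
}

/// Roughly evenly spaced points on the unit sphere (golden-spiral construction).
fn unit_sphere_points(n: usize) -> Vec<[f64; 3]> {
    let golden_angle = PI * (3.0 - 5.0_f64.sqrt());
    (0..n)
        .map(|k| {
            let z = 1.0 - 2.0 * (k as f64 + 0.5) / n as f64;
            let r = (1.0 - z * z).sqrt();
            let theta = golden_angle * k as f64;
            [r * theta.cos(), r * theta.sin(), z]
        })
        .collect()
}

/// Per-atom SASA in Å², brute force over all atom pairs.
/// `probe` is the solvent probe radius (1.4 Å is the usual choice for water).
/// `n_points` must be greater than zero.
pub fn shrake_rupley(atoms: &[Sphere], probe: f64, n_points: usize) -> Vec<f64> {
    let points = unit_sphere_points(n_points);
    atoms
        .iter()
        .enumerate()
        .map(|(i, a)| {
            let ri = a.radius + probe;
            // Count test points that are not inside any other expanded sphere.
            let accessible = points
                .iter()
                .filter(|p| {
                    let q = [
                        a.center[0] + ri * p[0],
                        a.center[1] + ri * p[1],
                        a.center[2] + ri * p[2],
                    ];
                    !atoms.iter().enumerate().any(|(j, b)| {
                        let rj = b.radius + probe;
                        let d2: f64 = (0..3).map(|k| (q[k] - b.center[k]).powi(2)).sum();
                        j != i && d2 < rj * rj
                    })
                })
                .count();
            // Accessible fraction of the expanded sphere's surface area.
            4.0 * PI * ri * ri * accessible as f64 / n_points as f64
        })
        .collect()
}
```

An isolated atom gets the full area of its expanded sphere, and an atom completely enclosed by a larger neighbour gets zero.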

I have now released an early alpha version of the freesasa-rs library on crates.io.

@maxall41 I see that your version can only calculate the SASA on a per-atom basis. Do you aim to add the ability to easily get the SASA for residues and chains?

Also, @maxall41, I think it would be fine to add to the library as an optional feature, since the crate already provides more features than simply parsing, such as the R* tree atom search.

It depends on the vision of @douweschulte for this crate.

I think that what is potentially needed in this space, to increase the ease of working with PDBs, is functionality similar to that of biopython in a single crate (with feature flags hiding what you don't need). Maybe we can look at integrating with https://github.com/rust-bio/rust-bio

You can easily calculate the per-residue SASA values by just summing the atom SASA values for each residue. Though I may add this internally just for ease of use.
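
A rough sketch of that summation, assuming the per-atom results come with a chain ID and residue number attached. `AtomSasa` and both functions are hypothetical names for this example, not types from any of the crates discussed here, and insertion codes are ignored for brevity.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-atom result record; the field names are assumptions.
pub struct AtomSasa {
    pub chain_id: String,
    pub residue_number: isize,
    pub sasa: f64,
}

/// Sum per-atom SASA into per-residue totals keyed by (chain, residue number).
pub fn residue_totals(atoms: &[AtomSasa]) -> BTreeMap<(String, isize), f64> {
    let mut totals = BTreeMap::new();
    for a in atoms {
        *totals
            .entry((a.chain_id.clone(), a.residue_number))
            .or_insert(0.0) += a.sasa;
    }
    totals
}

/// Sum per-atom SASA into per-chain totals keyed by chain ID.
pub fn chain_totals(atoms: &[AtomSasa]) -> BTreeMap<String, f64> {
    let mut totals = BTreeMap::new();
    for a in atoms {
        *totals.entry(a.chain_id.clone()).or_insert(0.0) += a.sasa;
    }
    totals
}
```

The same pattern extends to a whole-structure total by summing every atom.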

Note: one thing to consider with the previous approach is that it is not consistent for residues across different proteins, because the number of atoms resolved in the structure for each residue may differ. I haven't seen an implementation that does this differently, though (e.g., see https://github.com/biopython/biopython/blob/master/Bio/PDB/SASA.py). It would also theoretically be more performant to calculate SASA at the residue level instead of the atom level if you only need residue-level values, but I don't think that optimization is really necessary, as my implementation is already quite fast.

I didn't mean performing the calculation at the residue level, but being able to present it easily for overall structures, chains, and residues. So I agree with you that just adding these as methods to your library would be good.

Have you done a speed comparison with the pure C FreeSASA library?

Also, do you support reporting SASA values for polar, apolar, side chain, main chain, etc., like FreeSASA does?

In general, I approve of including more advanced features within the crate. When these include features that might hamper compile times for users, I like to put them behind feature flags (most often this slims down the number of dependencies).

Besides that, @OWissett brought up the option of including it in rust-bio. If they are up for it, I think we could discuss it further. In the best case we make it easy for anyone to work with biological data in Rust, and being part of a larger crate (not to mention having more maintainer potential) could be good.

I was able to compute SASA values for A0A2K5XT84-F1 (AlphaFold) in 40.06225 ms. Doing the same with freesasa took 94 ms. So it seems like my library is a good bit faster than freesasa. Flags used:

[profile.release]
lto = true
codegen-units = 1
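
For anyone wanting to repeat this kind of measurement, a wall-clock timing helper could be sketched as below. This is not how the numbers above were obtained (the thread does not say); `time_once` is a hypothetical helper, and a benchmarking harness with warm-up and repeated runs would give more reliable results.

```rust
use std::time::{Duration, Instant};

/// Run `f` once and return its result together with the elapsed wall-clock time.
/// Hypothetical helper for this sketch; build with `--release` so the
/// profile settings above actually apply.
pub fn time_once<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed())
}
```

Calling it as `time_once(|| compute_sasa(&structure))`, where `compute_sasa` stands in for whichever SASA routine is being measured, returns both the result and a `Duration` that can be printed in milliseconds.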

I also finished implementing the ability to set the desired level (Atom, Residue, Chain, Protein), and you can now use it (v2.0.0 and higher). I'll try to implement separate apolar and polar return values later.

EDIT (Apr 30 2024): rust-sasa now returns separate polar and apolar totals when using the SASALevel::Protein option.
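
To make the polar/apolar idea concrete, here is a rough sketch of splitting per-atom areas into two totals. It does not show rust-sasa's API or its classification rules: `AtomArea`, `PolarSplit`, and `split_polar_apolar` are hypothetical names, and the rule used (carbon counts as apolar, every other element as polar) is a deliberate simplification assumed for this example.

```rust
/// Hypothetical per-atom record: element symbol plus its SASA in Å².
pub struct AtomArea {
    pub element: String,
    pub sasa: f64,
}

/// Polar and apolar totals in Å².
#[derive(Debug, PartialEq)]
pub struct PolarSplit {
    pub polar: f64,
    pub apolar: f64,
}

/// Split a structure's total SASA into polar and apolar parts.
/// Assumption for this sketch: carbon is apolar, all other elements are polar.
pub fn split_polar_apolar(atoms: &[AtomArea]) -> PolarSplit {
    let mut out = PolarSplit { polar: 0.0, apolar: 0.0 };
    for a in atoms {
        if a.element.eq_ignore_ascii_case("C") {
            out.apolar += a.sasa;
        } else {
            out.polar += a.sasa;
        }
    }
    out
}
```

The two totals always add up to the whole-structure SASA, which makes for an easy sanity check.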