douweschulte / pdbtbx

A library to open/edit/save (crystallographic) Protein Data Bank (PDB) and mmCIF files in Rust.

Home Page: https://crates.io/crates/pdbtbx

Spatial query support

douweschulte opened this issue · comments

It would be great if the library supported ways to find atoms close to specified points. The proposal is to build an r*tree of all atoms on request from the user, via fn create_rtree(self: &PDB). For this, Atom should implement Point. Care should be taken to avoid mutability problems, as the rtree will be disentangled from the PDB hierarchy.
https://docs.rs/rstar/0.8.2/rstar/struct.RTree.html
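To make the two queries concrete, here is a rough sketch of what "nearest atom" and "atoms within a distance" mean, written as a plain linear scan over a slice. This is deliberately not an r*tree: it only pins down the expected results, and a tree would return the same answers faster. The Atom struct and all function names below are hypothetical stand-ins, not the library's types.

```rust
// Hypothetical minimal atom; the real Atom in the library has many more fields.
#[derive(Debug, Clone, PartialEq)]
struct Atom {
    serial_number: usize,
    pos: [f64; 3],
}

/// Squared Euclidean distance between two points.
fn distance_squared(a: &[f64; 3], b: &[f64; 3]) -> f64 {
    a.iter().zip(b.iter()).map(|(p, q)| (p - q).powi(2)).sum()
}

/// Closest atom to `point`, found by a linear scan (not an r*tree).
fn nearest_neighbor<'a>(atoms: &'a [Atom], point: &[f64; 3]) -> Option<&'a Atom> {
    atoms.iter().min_by(|a, b| {
        distance_squared(&a.pos, point).total_cmp(&distance_squared(&b.pos, point))
    })
}

/// All atoms at most `radius` away from `point`, again by linear scan.
fn within_distance<'a>(atoms: &'a [Atom], point: &[f64; 3], radius: f64) -> Vec<&'a Atom> {
    atoms
        .iter()
        .filter(|a| distance_squared(&a.pos, point) <= radius * radius)
        .collect()
}

/// Three example atoms used to exercise the queries.
fn example_atoms() -> Vec<Atom> {
    vec![
        Atom { serial_number: 0, pos: [0.0, 0.0, 0.0] },
        Atom { serial_number: 1, pos: [1.0, 1.0, 1.0] },
        Atom { serial_number: 2, pos: [0.0, 1.0, 1.0] },
    ]
}
```

Both queries return shared references into the atom slice, which is also where the mutability concern comes from: as long as those references (or a tree holding them) are alive, the atoms cannot be mutated.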

@DocKDE If you are still working on the example code from #52, you may be interested in this change, in which I built support for rstar trees into the library; that should make your code a bit more readable. If possible, some feedback on the exact names and the ergonomics of using the rtree in this way would be greatly appreciated! As a side note, I submitted a PR on rstar to include tuples as Points (Stoeoef/rstar/pull/57); if that is merged, it should clean up the code a bit more.

// This code can also be found as test 'spatial_lookup' in pdb.rs
let mut model = Model::new(0);
model.add_atom(Atom::new(false, 0, "", 0.0, 0.0, 0.0, 0.0, 0.0, "", 0).unwrap(),"A",(0, None),("MET", None));
model.add_atom(Atom::new(false, 1, "", 1.0, 1.0, 1.0, 0.0, 0.0, "", 0).unwrap(),"A",(0, None),("MET", None));
model.add_atom(Atom::new(false, 2, "", 0.0, 1.0, 1.0, 0.0, 0.0, "", 0).unwrap(),"A",(0, None),("MET", None));
let mut pdb = PDB::new();
pdb.add_model(model);

let tree = pdb.create_rtree();

assert_eq!(tree.size(), 3);
assert_eq!(
    tree.nearest_neighbor(&[1.0, 1.0, 1.0]).unwrap().serial_number(),
    1
);
// Note: rstar's locate_within_distance takes the squared radius; 1.0 squared is still 1.0.
assert_eq!(
    tree.locate_within_distance([1.0, 1.0, 1.0], 1.0).fold(0, |acc, _| acc + 1),
    2
);
let mut neighbors = tree.nearest_neighbor_iter(&pdb.atom(0).unwrap().pos_array());
assert_eq!(neighbors.next().unwrap().serial_number(), 0);
assert_eq!(neighbors.next().unwrap().serial_number(), 2);
assert_eq!(neighbors.next().unwrap().serial_number(), 1);

I think it's a good idea to include support for this, I mean there's a reason why I implemented it in my own code in the first place :)
I had a quick look at the PR and it doesn't look very likely that they will merge it. I was wondering about this myself: is there a reason why the pos() method returns a tuple and not an array? I saw that you added a pos_array() method, which serves as a workaround, but still.
The biggest issue I see with your rtree implementation is that there seems to be no straightforward way to include information about the overarching structs in a PDB. For example, I wanted to include information about the Residue that each Atom returned by an rtree search belongs to, but the Atom structs hold no information about this, so I had to resort to including the Residue itself in the Points of the r*tree. Not sure if this is a weird edge case, but I could see other users having this issue.
Apart from this I think having functions that return nearest neighbors, other atoms within a given distance or a list sorted by distance are great QoL improvements!
If you're interested in my current implementation of this, you can have a look here at ll. 295

Thanks for the comments! I will think about it a little more to try to include Residue information, and make some examples of how it could be done.

I think I have a solution for the missing PDB hierarchy information. I created a new struct, FullHierarchy, which contains references to the chain, residue, and conformer that contain a single atom, along with the atom itself. In this way the struct gives all the information you need (e.g. whether the atom is part of the backbone) while keeping overhead and mutability problems to a minimum. I also added some other functionality that I saw fit for FullHierarchy, such as an iterator that creates these for all atoms in a chain.
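As a rough sketch of the idea (not the actual code in the library), the struct can be pictured as four shared references bundled together. Every type, field, and method below is a simplified, hypothetical stand-in, and the backbone rule (atom names N, CA, C, O) is an assumption made for the example.

```rust
// Hypothetical, simplified stand-ins; these are not the library's real definitions.
#[derive(Debug)]
struct Atom {
    serial_number: usize,
    name: String,
}

#[derive(Debug)]
struct Conformer {
    name: String,
    atoms: Vec<Atom>,
}

#[derive(Debug)]
struct Residue {
    serial_number: isize,
    conformers: Vec<Conformer>,
}

#[derive(Debug)]
struct Chain {
    id: String,
    residues: Vec<Residue>,
}

/// One atom plus shared references to every level that contains it.
/// Only references are stored, so the struct is cheap to copy.
#[derive(Debug, Clone, Copy)]
struct FullHierarchy<'a> {
    chain: &'a Chain,
    residue: &'a Residue,
    conformer: &'a Conformer,
    atom: &'a Atom,
}

impl<'a> FullHierarchy<'a> {
    /// Assumed backbone rule for this sketch: atom name is N, CA, C, or O.
    fn is_backbone(&self) -> bool {
        matches!(self.atom.name.as_str(), "N" | "CA" | "C" | "O")
    }
}

impl Chain {
    /// Walk residues, conformers, and atoms, yielding one FullHierarchy per atom.
    fn atoms_with_hierarchy(&self) -> impl Iterator<Item = FullHierarchy<'_>> + '_ {
        self.residues.iter().flat_map(move |residue| {
            residue.conformers.iter().flat_map(move |conformer| {
                conformer.atoms.iter().map(move |atom| FullHierarchy {
                    chain: self,
                    residue,
                    conformer,
                    atom,
                })
            })
        })
    }
}

/// Helper to build an atom for the example data.
fn atom(serial_number: usize, name: &str) -> Atom {
    Atom { serial_number, name: name.to_string() }
}

/// A tiny chain with two residues and three atoms.
fn example_chain() -> Chain {
    Chain {
        id: "A".to_string(),
        residues: vec![
            Residue {
                serial_number: 1,
                conformers: vec![Conformer {
                    name: "MET".to_string(),
                    atoms: vec![atom(0, "N"), atom(1, "CB")],
                }],
            },
            Residue {
                serial_number: 2,
                conformers: vec![Conformer {
                    name: "GLY".to_string(),
                    atoms: vec![atom(2, "CA")],
                }],
            },
        ],
    }
}
```

Because only shared references are held, the borrow checker keeps the chain immutable while any of these structs is alive, which is the trade-off against storing copies of the data.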

I created an extra rtree function create_full_hierarchy_rtree which creates an rtree filled with these FullHierarchys but behaving in exactly the same way as the other rtree in regard to the spatial queries.

In response to your question: I tend to use tuples for positions because a tuple has a fixed dimensionality and because it forces the writer of the software to be specific about which dimension they mean. There is no real reason why I could not switch to arrays, except backwards compatibility. But as I think my PR will be merged (eventually, or in the next fork of the rtree lib), I think sticking with pos_array for now is okay.
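A small sketch of the two accessors side by side, using a stand-in Atom rather than the library's own type. The struct layout and the distance_squared helper are hypothetical; only the pos and pos_array names come from this discussion.

```rust
// Hypothetical stand-in; the library's Atom stores far more than a position.
struct Atom {
    x: f64,
    y: f64,
    z: f64,
}

impl Atom {
    /// Tuple form: the arity is fixed at three and callers destructure by name.
    fn pos(&self) -> (f64, f64, f64) {
        (self.x, self.y, self.z)
    }

    /// Array form: the same data in the shape that array-based point APIs expect.
    fn pos_array(&self) -> [f64; 3] {
        let (x, y, z) = self.pos();
        [x, y, z]
    }
}

/// Hypothetical consumer that only accepts arrays, as a tree query would.
fn distance_squared(a: [f64; 3], b: [f64; 3]) -> f64 {
    a.iter().zip(b.iter()).map(|(p, q)| (p - q).powi(2)).sum()
}
```

With the tuple, a caller writes let (x, y, z) = atom.pos(); and cannot mix up an index; with the array, the value can be passed straight to a point-based query.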

If you have some time, could you help me with the following questions: Do you think this solves your problem? Would you want to use this struct more often in your code (and if so, where)? Do you think the name is reasonable?

I have taken a look and played with it a bit. I like the concept and it would definitely enable me to make my code more clear and concise. This goes for both the struct and the implementation of the create_full_hierarchy_rtree method.
If this goes on at the same pace, I can scrub all of my code completely and just do decision trees that decide when to call what function from the library. :)

As to your questions: this does not "solve" any problems for me since I worked around them and found other solutions. However, this struct and its methods eliminate the need for such workarounds in the first place which is always nice.
I have tried and implemented this in two of my methods, namely the ones dealing with finding clashes and contacts in a PDB file and finding residues in a sphere around a given atom (find_contacts and calc_residue_sphere in my nomenclature). I would probably have more use cases but these seemed the most suitable for starters. I found that especially the find_contacts method benefits a great deal from the FullHierarchy as it deals with the RTree.

All in all, I think it's a great addition!
Some comments on the implementation: It would be nice to have an atoms_full_hierarchy method for the PDB struct as well. I mostly work with this struct and such a method would remove a bit of boilerplate.
I'm not sure about the naming. As I see it, the point of the struct is to provide an alternative Atom-like struct in case you need information about the overarching hierarchy it's contained in (correct me if I'm wrong). In that case, including the word "atom" somewhere in the struct and all its methods (where it's included already it seems) seems reasonable. Maybe something like AtomWithHierarchy? And method names could be something like atoms_with_hierarchy and so on.

Thanks a lot for your reply! I really like the name you suggested and will certainly include an atoms_with_hierarchy function on the PDB.