Parsing PDB file without header
DocKDE opened this issue · comments
Okay, so I have what might be construed as a weird edge case. I am only working with the atoms section of PDB files; the header and all other info are of no interest to me. Also, I usually run some kind of preprocessing on them to add missing hydrogens, remove alternative conformations, and so on. I just tried to read such a file and save it again, but it was mangled in the process:
Input file:
ATOM 1 N HIE 1 66.397 49.061 85.017 0.00 1.00 N
ATOM 2 H1 HIE 1 66.306 48.101 84.696 0.00 1.00 H
ATOM 3 H2 HIE 1 67.181 49.491 84.536 0.00 1.00 H
ATOM 4 CA HIE 1 66.603 49.087 86.441 2.00 1.00 C
ATOM 5 HA HIE 1 67.052 50.039 86.723 0.00 1.00 H
ATOM 6 CB HIE 1 65.332 48.876 87.271 0.00 1.00 C
ATOM 7 HB2 HIE 1 64.794 47.999 86.927 0.00 1.00 H
ATOM 8 HB3 HIE 1 65.620 48.701 88.303 0.00 1.00 H
ATOM 9 CG HIE 1 64.499 50.108 87.226 0.00 1.00 C
ATOM 10 ND1 HIE 1 64.139 50.737 86.058 0.00 1.00 N
...
Output file:
ATOM 1 N HIE 1 66.397 49.061 85.017 0.00 1.00 N
TER 1 HIE 1
ATOM 2 H1 HIE 1 66.306 48.101 84.696 0.00 1.00 H
TER 2 HIE 1
ATOM 3 H2 HIE 1 67.181 49.491 84.536 0.00 1.00 H
TER 3 HIE 1
ATOM 4 CA HIE 1 66.603 49.087 86.441 2.00 1.00 C
TER 4 HIE 1
ATOM 5 HA HIE 1 67.052 50.039 86.723 0.00 1.00 H
TER 5 HIE 1
ATOM 6 CB HIE 1 65.332 48.876 87.271 0.00 1.00 C
TER 6 HIE 1
So apparently a TER statement was inserted after each atom. I assume this isn't intended behavior and it probably is a result of me using unusual PDB files. I just wanted to bring this to your attention because I'm not sure if this is something you'd want to support or not.
In principle I wouldn't mind helping out with a pull request but I'm only just learning Rust and your PDB parsing function looks a bit daunting to me :D
However, if I can help (with some directions maybe) I will.
Best!
Thanks for raising the issue, and thanks for the examples; they made it quite easy for me to spot the problem. And I have to agree that the parsing function looks a bit daunting.
According to the PDB definitions there should always be a TER line after each full chain definition, so that made it clear to me that the library parsed your example with every atom in a different chain. This is because you did not provide a chain identifier, which is quite an edge case, but the library could at least save them all in the same chain. So I applied a very small fix so that all atoms without a chain identifier are saved in the same chain (a3eb1bf). I also had to edit some of the save code to make sure all columns are still aligned even if there is no chain name (bd055d2). So the saved PDB will look something like the following.
ATOM 1 N HIE 1 66.397 49.061 85.017 0.00 1.00 N
ATOM 2 H1 HIE 1 66.306 48.101 84.696 0.00 1.00 H
ATOM 3 H2 HIE 1 67.181 49.491 84.536 0.00 1.00 H
ATOM 4 CA HIE 1 66.603 49.087 86.441 2.00 1.00 C
ATOM 5 HA HIE 1 67.052 50.039 86.723 0.00 1.00 H
ATOM 6 CB HIE 1 65.332 48.876 87.271 0.00 1.00 C
ATOM 7 HB2 HIE 1 64.794 47.999 86.927 0.00 1.00 H
ATOM 8 HB3 HIE 1 65.620 48.701 88.303 0.00 1.00 H
ATOM 9 CG HIE 1 64.499 50.108 87.226 0.00 1.00 C
ATOM 10 ND1 HIE 1 64.139 50.737 86.058 0.00 1.00 N
TER 10 HIE 1
END
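To illustrate why a missing chain identifier produced a TER after every atom, here is a minimal sketch using only the standard library. It is not the library's actual code: the `Atom` struct, the `ter_positions` function, and the `merge_missing` flag are hypothetical names made up for this example. The only assumption carried over from the discussion above is that a TER record is written wherever one chain ends.

```rust
// Hypothetical sketch, NOT the library's implementation:
// a TER record is emitted after the last atom of every chain.
#[derive(Debug, Clone, PartialEq)]
struct Atom {
    serial: usize,
    // None when the chain identifier column of the ATOM record is blank.
    chain: Option<char>,
}

/// Returns the serial numbers after which a TER record would be written.
///
/// `merge_missing` decides how atoms without a chain identifier are grouped:
/// - false: each such atom counts as its own chain (the mangled output)
/// - true:  consecutive such atoms share one unnamed chain (the fixed output)
fn ter_positions(atoms: &[Atom], merge_missing: bool) -> Vec<usize> {
    let mut ters = Vec::new();
    for (i, atom) in atoms.iter().enumerate() {
        let same_chain_as_next = match atoms.get(i + 1) {
            // The last atom always closes its chain.
            None => false,
            Some(next) => match (atom.chain, next.chain) {
                (Some(a), Some(b)) => a == b,
                (None, None) => merge_missing,
                _ => false,
            },
        };
        if !same_chain_as_next {
            ters.push(atom.serial);
        }
    }
    ters
}
```

With ten atoms that all lack a chain identifier, `ter_positions(&atoms, false)` yields one entry per atom, while `ter_positions(&atoms, true)` yields only the serial of the last atom, which matches the single `TER 10 HIE 1` line in the output above.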
So the example should work as expected now. These fixes will be included in the next update (which I hope I can create in the next week, but I want to fix #47 first). I would like to stress that having a chain identifier is something other programs could expect, but if you use pdb.renumber() from this library it will be renamed to something sensible (be warned: it will renumber/rename EVERYTHING).
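If renumbering everything is too invasive, one alternative is to fill in only the chain column before handing the file to another program. The following is a rough sketch under stated assumptions, not part of this library: `with_chain_id` is a hypothetical helper, and it assumes ASCII lines in the standard fixed-column PDB layout, where the chain identifier sits in column 22 (index 21 when counting from zero).

```rust
/// Hypothetical helper: writes `chain` into column 22 of ATOM, HETATM and TER
/// lines whose chain column is blank. All other lines are returned unchanged.
/// Assumes fixed-column PDB layout; lines shorter than 22 columns are left alone.
fn with_chain_id(line: &str, chain: char) -> String {
    let is_coordinate_record =
        line.starts_with("ATOM") || line.starts_with("HETATM") || line.starts_with("TER");
    let mut chars: Vec<char> = line.chars().collect();
    // Index 21 is column 22, the chain identifier in the PDB format.
    if is_coordinate_record && chars.len() > 21 && chars[21] == ' ' {
        chars[21] = chain;
    }
    chars.into_iter().collect()
}
```

Mapping this over every line of a file, for example with `text.lines().map(|l| with_chain_id(l, 'A'))`, leaves serial numbers and residue numbers untouched. Note that it only works on column-aligned files; whitespace-collapsed lines like the ones pasted in this thread would need to be re-aligned first.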
On a side note, if you have a somewhat robust way of adding missing hydrogens, I would be interested in adding this to the library, as I expect other people could benefit from it as well.
Great, thanks, will test it when I can.
About chain identifiers: I'm using the PDB files as input for quantum chemical calculations. As such, whatever raw PDB I start with gets preprocessed, and the chain identifier is usually removed anyway.
I add the hydrogens (and remove alternative conformations and such) with AmberTools (https://ambermd.org/AmberTools.php), which is free for academics and quite useful for this task. I'm sure something of the sort could be written in Rust as well, but that is probably a somewhat involved task and outside the scope of my use case.