douweschulte / pdbtbx

A library to open/edit/save (crystallographic) Protein Data Bank (PDB) and mmCIF files in Rust.

Home Page:https://crates.io/crates/pdbtbx

Parsing PDB file without header

DocKDE opened this issue · comments

Okay, so I have what might be construed as a weird edge case. I am only working with the atoms section of PDB files; the header and all other info are of no interest to me. I also usually run some kind of preprocessing on them to add missing hydrogens, remove alternative conformations, and so on. I just tried to read such a file and save it again, but it was mangled in the process:

Input file:

ATOM      1 N    HIE     1      66.397  49.061  85.017  0.00  1.00           N          
ATOM      2 H1   HIE     1      66.306  48.101  84.696  0.00  1.00           H
ATOM      3 H2   HIE     1      67.181  49.491  84.536  0.00  1.00           H
ATOM      4 CA   HIE     1      66.603  49.087  86.441  2.00  1.00           C
ATOM      5 HA   HIE     1      67.052  50.039  86.723  0.00  1.00           H
ATOM      6 CB   HIE     1      65.332  48.876  87.271  0.00  1.00           C
ATOM      7 HB2  HIE     1      64.794  47.999  86.927  0.00  1.00           H
ATOM      8 HB3  HIE     1      65.620  48.701  88.303  0.00  1.00           H
ATOM      9 CG   HIE     1      64.499  50.108  87.226  0.00  1.00           C
ATOM     10 ND1  HIE     1      64.139  50.737  86.058  0.00  1.00           N
...

Output file:

ATOM      1  N   HIE    1      66.397  49.061  85.017  0.00  1.00           N           
TER    1      HIE    1
ATOM      2  H1  HIE    1      66.306  48.101  84.696  0.00  1.00           H
TER    2      HIE    1
ATOM      3  H2  HIE    1      67.181  49.491  84.536  0.00  1.00           H
TER    3      HIE    1
ATOM      4  CA  HIE    1      66.603  49.087  86.441  2.00  1.00           C
TER    4      HIE    1
ATOM      5  HA  HIE    1      67.052  50.039  86.723  0.00  1.00           H
TER    5      HIE    1
ATOM      6  CB  HIE    1      65.332  48.876  87.271  0.00  1.00           C
TER    6      HIE    1

So apparently a TER statement was inserted after each atom. I assume this isn't intended behavior and it probably is a result of me using unusual PDB files. I just wanted to bring this to your attention because I'm not sure if this is something you'd want to support or not.
In principle I wouldn't mind helping out with a pull request but I'm only just learning Rust and your PDB parsing function looks a bit daunting to me :D
However, if I can help (with some directions maybe) I will.
Best!

Thanks for raising the issue, and thanks for the examples; these made it quite easy for me to spot the problem. I have to agree that the parsing function looks a bit daunting.

According to the PDB definitions there should always be a TER line after each full chain definition, which made it clear to me that the library parsed your example with every atom in its own chain. This is because you did not provide a chain identifier, which is quite an edge case, but the library could at least save them all in the same chain. So I applied a very small fix so that all atoms are saved in the same chain (a3eb1bf). I also had to edit some of the save code to make sure all columns stay aligned even if there is no chain name (bd055d2). The saved PDB will now look something like the following.

ATOM      1  N   HIE     1      66.397  49.061  85.017  0.00  1.00           N
ATOM      2  H1  HIE     1      66.306  48.101  84.696  0.00  1.00           H
ATOM      3  H2  HIE     1      67.181  49.491  84.536  0.00  1.00           H
ATOM      4  CA  HIE     1      66.603  49.087  86.441  2.00  1.00           C
ATOM      5  HA  HIE     1      67.052  50.039  86.723  0.00  1.00           H
ATOM      6  CB  HIE     1      65.332  48.876  87.271  0.00  1.00           C
ATOM      7 HB2  HIE     1      64.794  47.999  86.927  0.00  1.00           H
ATOM      8 HB3  HIE     1      65.620  48.701  88.303  0.00  1.00           H
ATOM      9  CG  HIE     1      64.499  50.108  87.226  0.00  1.00           C
ATOM     10 ND1  HIE     1      64.139  50.737  86.058  0.00  1.00           N
TER   10      HIE    1
END
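
To make the chain/TER relationship concrete, here is a minimal standalone sketch. It is not pdbtbx's actual parsing code, and the function names `chain_id` and `count_chains` are hypothetical. It only assumes the fixed-width PDB layout, in which the chain identifier is the single character in column 22, and it treats a blank identifier as one shared, unnamed chain so that only one TER record would follow the atoms.

```rust
/// Extract the chain identifier from a fixed-width ATOM/HETATM record.
/// In the PDB format the chain ID is the single character in column 22
/// (index 21 when counting from zero). Other record types yield `None`.
fn chain_id(line: &str) -> Option<char> {
    let is_atom = line.starts_with("ATOM") || line.starts_with("HETATM");
    if !is_atom {
        return None;
    }
    // A line that is too short and a blank column both mean "no chain ID".
    Some(line.chars().nth(21).unwrap_or(' '))
}

/// Count chains by starting a new chain whenever the chain ID changes.
/// Atoms with a blank chain ID all end up in the same (unnamed) chain.
fn count_chains(lines: &[&str]) -> usize {
    let mut count = 0;
    let mut previous: Option<char> = None;
    for id in lines.iter().filter_map(|l| chain_id(l)) {
        if previous != Some(id) {
            count += 1;
            previous = Some(id);
        }
    }
    count
}

fn main() {
    let lines = [
        "ATOM      1 N    HIE     1      66.397  49.061  85.017  0.00  1.00           N",
        "ATOM      2 H1   HIE     1      66.306  48.101  84.696  0.00  1.00           H",
        "ATOM      3 H2   HIE     1      67.181  49.491  84.536  0.00  1.00           H",
    ];
    // One chain means a single TER record after the last atom.
    println!("chains: {}", count_chains(&lines)); // prints "chains: 1"
}
```

With this grouping, the three header-less atoms count as one chain rather than three, which matches the single TER line in the output above.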

So the example should work as expected now. These fixes will be included in the next update (which I hope to release within the next week, but I want to fix #47 first). I would like to stress that other programs may expect a chain identifier, but if you use pdb.renumber() from this library the chain will be renamed to something sensible (be warned: it will renumber/rename EVERYTHING).
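
If renumbering everything is too invasive, a chain identifier can also be filled in as a plain-text preprocessing step. The sketch below is independent of pdbtbx and is not what pdb.renumber() does. The helper name `with_chain_id` is hypothetical, and it assumes standard fixed-width records where column 22 holds the chain ID.

```rust
/// Return a copy of `line` with `chain` written into column 22 when that
/// column is blank. Records other than ATOM/HETATM, lines that are too
/// short, and lines that already carry a chain ID are returned unchanged.
fn with_chain_id(line: &str, chain: char) -> String {
    let is_atom = line.starts_with("ATOM") || line.starts_with("HETATM");
    let mut chars: Vec<char> = line.chars().collect();
    // Index 21 is column 22 of the fixed-width record.
    if is_atom && chars.len() > 21 && chars[21] == ' ' {
        chars[21] = chain;
    }
    chars.into_iter().collect()
}

fn main() {
    let line = "ATOM      1 N    HIE     1      66.397  49.061  85.017  0.00  1.00           N";
    // Column 22 is blank in the input, so it is filled with 'A'.
    println!("{}", with_chain_id(line, 'A'));
}
```

Because only a single character is overwritten, every other column keeps its position.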

On a side note, if you have a somewhat robust way of adding missing hydrogens, I would be interested in adding it to the library, as I expect other people could benefit from it as well.

Great, thanks, will test it when I can.
About chain identifiers: I'm using the PDB files as input for quantum chemical calculations. As such, whatever raw PDB I start with gets preprocessed and the chain identifier is usually removed anyway.
I do the addition of hydrogens (along with removal of alternative conformations and such) with AmberTools (https://ambermd.org/AmberTools.php), which is free for academics and quite useful for this task. I'm sure something of the sort could be written in Rust as well, but that is probably a somewhat involved task and outside the scope of my use case.