douweschulte / pdbtbx

A library to open/edit/save (crystallographic) Protein Data Bank (PDB) and mmCIF files in Rust.

Home Page: https://crates.io/crates/pdbtbx


Residue serial numbers > 9999

DocKDE opened this issue · comments

This probably never comes up with "regular" PDB files, but when solvating a protein (usually with water) the number of residues frequently exceeds 9999. At first I thought there was a bug in my code, but it seems the pdbtbx open_pdb function is to blame. This is the output I get after opening such a PDB file and printing it back to stdout:

ATOM      1 N    HIM     1      20.872  37.523  43.475  0.00  0.00           N
ATOM      2 H1   HIM     1      21.537  38.222  43.773  0.00  0.00           H
ATOM      3 H2   HIM     1      21.347  36.635  43.400  0.00  0.00           H
ATOM      4 CA   HIM     1      20.442  37.855  42.095  0.00  0.00           C
ATOM      5 HA   HIM     1      20.697  37.031  41.428  0.00  0.00           H
ATOM      6 CB   HIM     1      18.932  38.092  42.024  0.00  0.00           C
ATOM      7 HB2  HIM     1      18.594  38.695  42.867  0.00  0.00           H
ATOM      8 HB3  HIM     1      18.678  38.597  41.092  0.00  0.00           H
ATOM      9 CG   HIM     1      18.135  36.826  42.056  0.00  0.00           C
ATOM     10 ND1  HIM     1      18.100  35.999  43.156  0.00  0.00           N
ATOM     11 CE1  HIM     1      17.336  34.954  42.900  0.00  0.00           C
ATOM     12 HE1  HIM     1      17.192  34.166  43.639  0.00  0.00           H
ATOM     13 NEM  HIM     1      16.857  35.084  41.678  0.00  0.00           N
ATOM     14 CD2  HIM     1      17.352  36.239  41.124  0.00  0.00           C
ATOM     15 HD2  HIM     1      17.072  36.495  40.102  0.00  0.00           H
ATOM     16 C    HIM     1      21.229  39.064  41.640  0.00  0.00           C
ATOM     17 O    HIM     1      21.045  40.168  42.148  0.00  0.00           O
ATOM     18 CME  HIM     1      15.954  34.127  41.011  0.00  0.00           C
ATOM     19 HM1  HIM     1      15.474  34.610  40.161  0.00  0.00           H
ATOM     20 HM2  HIM     1      15.192  33.791  41.715  0.00  0.00           H
ATOM     21 HM3  HIM     1      16.527  33.267  40.662  0.00  0.00           H
ATOM  32543 O    HIM     1      11.318  29.317   3.009  0.00  0.00           O
ATOM  32544 H1   HIM     1      11.670  28.720   2.350  0.00  0.00           H
ATOM  32545 H2   HIM     1      10.434  28.992   3.180  0.00  0.00           H
ATOM     22 N    THR     2      22.144  38.833  40.709  0.00  0.00           N
ATOM     23 H    THR     2      22.246  37.909  40.315  0.00  0.00           H
ATOM     24 CA   THR     2      23.072  39.863  40.292  0.00  0.00           C
ATOM     25 HA   THR     2      22.590  40.827  40.129  0.00  0.00           H
ATOM     26 CB   THR     2      24.177  40.038  41.352  0.00  0.00           C
ATOM     27 HB   THR     2      23.704  40.224  42.316  0.00  0.00           H

I think what happened is that all residues with a serial number > 9999 (all waters in my case) were wrapped around to serial number 1 and then appended to the already existing residue with that serial number (while retaining their atom serial numbers). Would it be possible to fix this?
I know that in cases like this there should be an insertion code or something to distinguish the residues, but unfortunately I have no say in the creation of these files.

How does your input file handle this? Does your input file have numbers bigger than 9999? The range of this number is defined by the PDB standard to be 4 characters (page 180 of v3.30), so it is interesting to see how other software handles this limitation. I would like to stress again that the mmCIF file format does not have any such limitations (and is fully supported by this library).

In my input the residue serial numbers also wrap around at 9999 and start again at 1, but the residues are not reordered: they retain their position in the file along with the appropriate residue names and atom serial numbers. If this formatting were retained by pdbtbx, that would already help me (and also reduce unexpected behaviour in such edge cases).
I realize that the most sensible solution for me would be switching file formats, as mmCIF seems made to address exactly this, but the quantum chemistry code I'm using doesn't support that format yet (which will hopefully change soon).

That makes it clear why pdbtbx gave this result. When reading an atom line, the code internally searches for a residue matching the given serial number and insertion code and adds the atom to that residue. This is in line with the definition of the PDB format (v3.30):

Alphabet letters are commonly used for insertion code. The insertion code is used when two
residues have the same numbering. The combination of residue numbering and insertion code
defines the unique residue.

PDB File Format v. 3.3 - Page 177

So given this definition, the program that created the input is not creating valid PDB files. If you have a way to detect these kinds of problems (maybe with extra user input in a function somewhere) that generates the result you want, for example continuing to count upwards to 10001 instead of wrapping to 1, I would be happy to include that in the library. But for that we need a clear description of the desired behavior, and we need to keep in mind what the next program using the resulting PDB files will make of it.

Possibly I could build something that detects when the serial number wraps around and adds an extra 10000 each time it does. For this to be active I would want the user to explicitly ask for it (something like an extra option when opening). The result will not be valid to save as a PDB file unless the serial number is taken mod 10000 again. So in the best case the resulting file would be saved as mmCIF, to prevent problems like this in the next program opening the file.
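A rough sketch of that wrap detection, assuming residues appear in file order (the helper name and shapes here are illustrative, not the library's API):

```rust
/// Hypothetical sketch of the proposed wrap detection: every time the raw
/// 4-digit residue serial number drops back down after reaching 9999
/// (e.g. 9999 followed by 0), add another 10000 so the internal numbering
/// keeps counting upwards.
fn unwrap_serials(raw: &[isize]) -> Vec<isize> {
    let mut offset = 0;
    let mut prev: Option<isize> = None;
    let mut out = Vec::with_capacity(raw.len());
    for &n in raw {
        // A drop from 9999 (or higher) to a smaller number signals a wrap.
        if let Some(p) = prev {
            if n < p && p >= 9999 {
                offset += 10000;
            }
        }
        out.push(n + offset);
        prev = Some(n);
    }
    out
}

fn main() {
    // 9999 is followed by 0, so the internal count continues at 10000.
    let raw = [9998, 9999, 0, 1, 2];
    assert_eq!(unwrap_serials(&raw), vec![9998, 9999, 10000, 10001, 10002]);
    println!("{:?}", unwrap_serials(&raw));
}
```

Saving such a structure back as a valid PDB would then require taking each serial mod 10000 again, as noted above.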

To be very clear, I do not really like solutions like this, because they go outside the definition of PDB files. But at the same time I understand that many programs use PDB files in ways not supported by the official definition, so I get that it can be a problem worth solving.

I see. Amber does indeed write invalid PDB in that respect; strange that it does this.
While it might be possible to "solve" the issue as you described, this might also break compatibility with other programs parsing the resulting file. I won't request that you do this, since the use seems quite limited; I'll first have a look at whether I can find another solution with actually valid PDB or mmCIF files.

Okay, good luck! If in the end it turns out to be easiest to implement the above solution (or something else) in this library, feel free to reopen this issue (or create a new one, of course).

So it took me a minute to manually add insertion codes and thus circumvent the issue entirely. I think it's fine for this issue to remain closed for the time being :)

Hey there,
I revisited this issue because I've had to deal with more PDB files with more than 9999 residues lately. I thought about it some more and came to a conclusion: it's not ideal to even have such PDB files, but I can't get around them because they are automatically generated by software I need to use. Hence it would be desirable to deal with them programmatically. Manually inserting insertion codes is possible, of course, but it becomes cumbersome after a couple of times and it would be nice not to need it.
What's more, I don't think it's ideal for pdbtbx to handle such files as it currently does, because it basically breaks their whole structure during parsing, so some contingency seems like a good idea.
My suggestion would be to emit a warning when more than 9999 residues are present and not all residues have an insertion code. This is easy to do.
Furthermore, I would like the library to keep the PDB structure intact and not merge different residues with the same serial number. Thus, if one reads in a PDB file and prints it back to a file, it should remain basically unchanged (with duplicate residue numbers), and the user can deal with the situation as they see fit, because I don't think there's one behaviour that covers all possible use cases.
What are your thoughts on this?

Hey there,

I do agree that handling this in code would be far superior. So if we can find a way to reliably tackle this it can certainly be included in this project.

The way pdbtbx currently handles the addition of residues is by scanning the list of residues from the end until it finds the given residue number or a lower number; in the latter case, the residue has not been inserted yet. Such scanning is needed because all atoms from a residue need to be added to the same residue struct. It groups residues with overflowing numbers because it truncates the residue number, following the PDB protocol.
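As a simplified sketch (with a stand-in `Residue` type, not the actual pdbtbx internals), that scan-from-the-end insertion could look like this:

```rust
// Stand-in for the library's residue struct, for illustration only.
#[derive(Debug)]
struct Residue {
    serial_number: isize,
    atoms: Vec<String>,
}

/// Scan the residue list from the back: if a residue with this serial number
/// already exists, add the atom to it; once a lower number is seen, the
/// residue has not been inserted yet, so a new one is pushed.
fn add_atom(residues: &mut Vec<Residue>, serial: isize, atom: &str) {
    for res in residues.iter_mut().rev() {
        if res.serial_number == serial {
            res.atoms.push(atom.to_string());
            return;
        }
        if res.serial_number < serial {
            break; // not inserted yet
        }
    }
    residues.push(Residue {
        serial_number: serial,
        atoms: vec![atom.to_string()],
    });
}

fn main() {
    let mut residues = Vec::new();
    for (serial, atom) in [(1, "N"), (1, "CA"), (2, "N"), (2, "CA")] {
        add_atom(&mut residues, serial, atom);
    }
    assert_eq!(residues.len(), 2);
    assert_eq!(residues[0].atoms, vec!["N", "CA"]);
    println!("{} residues", residues.len());
}
```

This also shows why truncated overflowing numbers get merged: a wrapped serial of 1 would scan back and land in the existing residue 1.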

A way forward I see is adding an extra flag to the parser telling it to use a different PDB 'spec' in which the residue number spans more columns. This does not break any normal PDB files, but would allow your 'incorrect' files to be parsed successfully. Let me know what you think of this. To implement it, an example 'incorrect' PDB file is needed to see how far to extend the residue number field, especially in relation to 'HETATM' records.

Besides, I am happy to see this project is still being used; as my personal use has decreased, I have not worked on it for quite a while. I would be happy to help resolve issues like this. Otherwise, if you want, I can give you more permissions on this project (and the crate on crates.io) to keep working on it.

I have such a file here: https://owncloud.gwdg.de/index.php/s/EHObo5uTQoC0iNe
As you can see, the residues don't actually exceed the predefined columns for the residue number but just overflow and wrap around at 9999. Would you then parse it into a structure that keeps counting after 9999 and extends the residue number column to the right?

Nice of you to offer but I'm not sure if my Rust skills are up to this.

The file you sent me seems to be using 5 characters.

HETATM87441  O   HOH A8299      19.742  -7.202 155.194  0.00  0.00           O
HETATM87442  H1  HOH A8299      19.110  -7.853 155.499  0.00  0.00           H
HETATM87443  H2  HOH A8299      19.579  -7.133 154.253  0.00  0.00           H
HETATM87444  O   HOH A8300     -24.616 -98.167 166.753  0.00  0.00           O
HETATM87445  H1  HOH A8300     -25.525 -98.450 166.854  0.00  0.00           H
HETATM87446  H2  HOH A8300     -24.309 -98.020 167.647  0.00  0.00           H
HETATM87447  O   HOH A8301     175.737   8.792 -22.081  0.00  0.00           O
HETATM87448  H1  HOH A8301     175.631   9.698 -22.371  0.00  0.00           H
HETATM87449  H2  HOH A8301     175.114   8.297 -22.613  0.00  0.00           H
TER87449      HOH A8301
END

According to the PDB definition this is the structure:

COLUMNS       DATA  TYPE     FIELD         DEFINITION
-----------------------------------------------------------------------
 1 - 6        Record name    "HETATM"
 7 - 11       Integer        serial        Atom serial number.
13 - 16       Atom           name          Atom name.
17            Character      altLoc        Alternate location indicator.
18 - 20       Residue name   resName       Residue name.
...
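For illustration, a column-based slice of such a line might look like the following sketch; the column offsets come from the v3.3 definition (spec columns are 1-indexed, Rust slices 0-indexed), while the function itself is hypothetical and not the library's parser:

```rust
/// Illustrative column-based parse of a HETATM line.
fn parse_hetatm(line: &str) -> Option<(isize, String, String, char, isize)> {
    if !line.starts_with("HETATM") {
        return None;
    }
    let serial = line.get(6..11)?.trim().parse().ok()?; // cols  7-11: atom serial
    let name = line.get(12..16)?.trim().to_string(); // cols 13-16: atom name
    let res_name = line.get(17..20)?.trim().to_string(); // cols 18-20: residue name
    let chain = line.get(21..22)?.chars().next()?; // col     22: chain ID
    let res_seq = line.get(22..26)?.trim().parse().ok()?; // cols 23-26: residue seq
    Some((serial, name, res_name, chain, res_seq))
}

fn main() {
    let line = "HETATM87441  O   HOH A8299      19.742  -7.202 155.194  0.00  0.00           O";
    let (serial, name, res_name, chain, res_seq) = parse_hetatm(line).unwrap();
    assert_eq!((serial, chain, res_seq), (87441, 'A', 8299));
    assert_eq!((name.as_str(), res_name.as_str()), ("O", "HOH"));
    println!("serial {serial}: {name} in {res_name} {chain}{res_seq}");
}
```

Sliced this way, the quoted lines keep the residue sequence number inside columns 23-26, with the chain ID in column 22 next to it.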

Which makes the PDB you sent perfectly valid. And I thought I fixed this in issue #51. Are you sure it wraps at 9999? If so, maybe you are using an old version, or my fix was not working.
Please let me know!
I will also do some investigating, but not right now.

I don't understand. According to the source you sent me, columns 23-26 are reserved for the residue sequence number, which is 4 digits; a fifth digit would then not be valid. Also, the letter in column 22 is the chain ID, which is the same for all atoms, so it doesn't serve to differentiate between residues.
For that, an insertion code in column 27 would be needed, which is always empty in all of these files.

Sorry, I was looking at the wrong column. I will try to implement the rolling count you proposed.

Okay, I implemented your proposal, but for that I needed to revamp the saving system for PDBs, which leaves me with some errors right now. I need some sleep, so I will continue tomorrow. Very briefly: I needed to build a way to force fields to be of a certain width, either by padding with spaces or by cropping off the extraneous parts. I rewrote all PDB lines to use this new function, but somewhere a couple of spaces are off. ¯\(o.o)/¯
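Such a pad-or-crop helper could be sketched roughly as follows. The name and the choice to keep the rightmost characters when cropping (which amounts to mod 10000 for a 4-digit serial) are assumptions for illustration, not the actual implementation:

```rust
/// Force a field to an exact width: pad short values with spaces on the
/// left (right-justified), crop long values by keeping the rightmost
/// `width` characters (ASCII assumed).
fn fit_field(text: &str, width: usize) -> String {
    if text.len() > width {
        // Crop off the extraneous leading part: "10005" -> "0005".
        text[text.len() - width..].to_string()
    } else {
        // Pad with spaces on the left: "12" -> "  12".
        format!("{:>w$}", text, w = width)
    }
}

fn main() {
    assert_eq!(fit_field("10005", 4), "0005"); // wrapped serial, back in 4 columns
    assert_eq!(fit_field("12", 4), "  12"); // short value, padded
    println!("'{}' '{}'", fit_field("10005", 4), fit_field("12", 4));
}
```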

Thanks for tackling it! Also sleep is more important than coding and coding while tired may not result in something very usable :)

All the previous tests now pass, plus a new test for strict equality between the PDB file you sent me read from the original and the same file saved and read again. This solves this issue (I hope; please let me know if it does not) and any potential future issue with numbers getting too big. As you can see, no additional arguments are needed: all columns/fields now implement this wrapping behavior, but only the residue serial number actually continues counting after 9999 internally. You can also work with these 'impossibly' high residue serial numbers in your own code (see the test).

/// Open a test file containing 87449 atoms (mostly waters), so more than 29000 residues, which leads to residue serial numbers that wrap
#[test]
fn wrapping_residue_number() {
    let (pdb, errors) = pdbtbx::open("example-pdbs/eq.pdb", StrictnessLevel::Strict).unwrap();
    let pdb_errors = save(pdb.clone(), &("dump/eq.pdb"), StrictnessLevel::Loose);
    let (pdb2, _) = pdbtbx::open("dump/eq.pdb", StrictnessLevel::Strict).unwrap();
    print!("{:?}", errors);
    print!("{:?}", pdb_errors);
    // See that the original file is the same as saved and reopened
    assert_eq!(pdb, pdb2);
    // See that it is possible to select atom with 'impossible' residue serial numbers according to the PDB definition
    // These are made by adding 10000 to the residue serial number every time a wrap is detected (9999 followed by 0)
    assert_eq!(
        pdb.residues()
            .find(|r| r.serial_number() == 10005)
            .unwrap()
            .name()
            .unwrap(),
        "HOH"
    );
    assert_eq!(
        pdb.residues()
            .find(|r| r.serial_number() == 20250)
            .unwrap()
            .name()
            .unwrap(),
        "HOH"
    );
}

Awesome! I just checked it out and the behaviour seems to be as hoped/expected.
I noticed one thing though: I had manually edited the file I sent you (because previously pdbtbx couldn't deal with it properly) and added insertion codes to differentiate between the residues.
I read this file with the added insertion codes using the current master branch of pdbtbx, and after saving the PDB to a file again I noticed that the insertion codes were not written.
They are apparently parsed correctly from the input (I checked) but not written back to the output generated with save_pdb. Any idea why that is?

I will take a look.

I added a new test for insertion codes and it seems to be working just fine. The following test passes without any problem (the data is below).

#[test]
fn insertion_codes() {
    let (pdb, errors) =
        pdbtbx::open("example-pdbs/insertion_codes.pdb", StrictnessLevel::Strict).unwrap();
    let pdb_errors = save(
        pdb.clone(),
        &("dump/insertion_codes.pdb"),
        StrictnessLevel::Loose,
    );
    let (pdb2, _) = pdbtbx::open("dump/insertion_codes.pdb", StrictnessLevel::Strict).unwrap();
    print!("{:?}", errors);
    print!("{:?}", pdb_errors);
    // See that the original file is the same as saved and reopened
    assert_eq!(pdb, pdb2);
    assert_eq!(pdb.residues().count(), 2);
    assert_eq!(pdb.residue(0).unwrap().insertion_code().unwrap(), "A");
    assert_eq!(pdb.residue(1).unwrap().insertion_code().unwrap(), "B");
    assert_eq!(pdb2.residues().count(), 2);
    assert_eq!(pdb2.residue(0).unwrap().insertion_code().unwrap(), "A");
    assert_eq!(pdb2.residue(1).unwrap().insertion_code().unwrap(), "B");
}

example-pdbs/insertion_codes.pdb

ATOM      2  CA  HIS A 465A     34.226 -11.294   7.140  1.00  0.00           C  
ATOM      3  C   HIS A 465A     33.549 -10.658   8.347  1.00  0.01           C  
ATOM      3  C   HIS A 465A     33.549 -10.658   8.347  1.00999.99           C  
ATOM      8  CD2 HIS A 465B     35.297  -8.322   5.762  0.00 49.56           C  
ATOM      8  CD2 HIS A 465B     35.297  -8.322   5.762  0.01 49.56           C  
ATOM      8  CD2 HIS A 465B     35.297  -8.322   5.762999.99 49.56           C  

If your problem persists could you share the input file?

I'm so sorry! I just discovered that I had pointed my Cargo.toml at my fork of pdbtbx (which is not yet up to date) and not at your repo. I've changed that now and it seems to work as expected. One thing though:

I don't know if this would be considered an issue, since I also don't know how other programs would parse it, but the residue numbers now seem to be left-justified, whereas before they were right-justified in columns 23-26. Is that intended?

Example:

ATOM  1     N    HIM A1         35.112  43.560  46.027  0.00  1.00          N
ATOM  2     H1   HIM A1         35.066  44.568  46.010  0.00  1.00          H
ATOM  3     H2   HIM A1         35.728  43.317  46.790  0.00  1.00          H
ATOM  4     CA   HIM A1         35.629  43.017  44.786  0.00  1.00          C
ATOM  5     HA   HIM A1         36.026  42.025  45.000  0.00  1.00          H
ATOM  6     CB   HIM A1         34.514  42.858  43.765  0.00  1.00          C
ATOM  7     HB2  HIM A1         34.010  43.819  43.665  0.00  1.00          H
ATOM  8     HB3  HIM A1         34.908  42.740  42.755  0.00  1.00          H
ATOM  9     CG   HIM A1         33.490  41.851  43.952  0.00  1.00          C
ATOM  10    ND1  HIM A1         32.914  41.623  45.152  0.00  1.00          N
ATOM  11    CE1  HIM A1         31.859  40.896  44.806  0.00  1.00          C

Still: Thanks for the awesome and speedy work!

Great, so the code is working and there is a new test for it; that seems like a good outcome ;-).

The fields are indeed now all left-justified, whereas in previous versions they were left-, right-, or center-justified depending on the exact field. That is just because I changed the saving code; it should not impact anything, but if it does, it is not very hard to align every field according to the standard.
Happy to have it fixed and see the project in use!
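For what it's worth, re-aligning a field is a one-character change in Rust's format specifiers, so restoring right-justification should indeed be cheap:

```rust
fn main() {
    let serial = 1;
    // Left-justified, as the new saving code currently emits:
    assert_eq!(format!("{:<4}", serial), "1   ");
    // Right-justified, as columns 23-26 of the PDB spec expect:
    assert_eq!(format!("{:>4}", serial), "   1");
    println!("ok");
}
```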