Residue serial numbers > 9999
DocKDE opened this issue · comments
This probably never comes up when using "regular" PDB files but when solvating a protein (usually with water) it frequently happens that the number of residues exceeds 9999. At first I thought there was a bug in my code but it seems the pdbtbx open_pdb function is to blame. This is the output I get after opening such a PDB file and printing it back to stdout:
ATOM 1 N HIM 1 20.872 37.523 43.475 0.00 0.00 N
ATOM 2 H1 HIM 1 21.537 38.222 43.773 0.00 0.00 H
ATOM 3 H2 HIM 1 21.347 36.635 43.400 0.00 0.00 H
ATOM 4 CA HIM 1 20.442 37.855 42.095 0.00 0.00 C
ATOM 5 HA HIM 1 20.697 37.031 41.428 0.00 0.00 H
ATOM 6 CB HIM 1 18.932 38.092 42.024 0.00 0.00 C
ATOM 7 HB2 HIM 1 18.594 38.695 42.867 0.00 0.00 H
ATOM 8 HB3 HIM 1 18.678 38.597 41.092 0.00 0.00 H
ATOM 9 CG HIM 1 18.135 36.826 42.056 0.00 0.00 C
ATOM 10 ND1 HIM 1 18.100 35.999 43.156 0.00 0.00 N
ATOM 11 CE1 HIM 1 17.336 34.954 42.900 0.00 0.00 C
ATOM 12 HE1 HIM 1 17.192 34.166 43.639 0.00 0.00 H
ATOM 13 NEM HIM 1 16.857 35.084 41.678 0.00 0.00 N
ATOM 14 CD2 HIM 1 17.352 36.239 41.124 0.00 0.00 C
ATOM 15 HD2 HIM 1 17.072 36.495 40.102 0.00 0.00 H
ATOM 16 C HIM 1 21.229 39.064 41.640 0.00 0.00 C
ATOM 17 O HIM 1 21.045 40.168 42.148 0.00 0.00 O
ATOM 18 CME HIM 1 15.954 34.127 41.011 0.00 0.00 C
ATOM 19 HM1 HIM 1 15.474 34.610 40.161 0.00 0.00 H
ATOM 20 HM2 HIM 1 15.192 33.791 41.715 0.00 0.00 H
ATOM 21 HM3 HIM 1 16.527 33.267 40.662 0.00 0.00 H
ATOM 32543 O HIM 1 11.318 29.317 3.009 0.00 0.00 O
ATOM 32544 H1 HIM 1 11.670 28.720 2.350 0.00 0.00 H
ATOM 32545 H2 HIM 1 10.434 28.992 3.180 0.00 0.00 H
ATOM 22 N THR 2 22.144 38.833 40.709 0.00 0.00 N
ATOM 23 H THR 2 22.246 37.909 40.315 0.00 0.00 H
ATOM 24 CA THR 2 23.072 39.863 40.292 0.00 0.00 C
ATOM 25 HA THR 2 22.590 40.827 40.129 0.00 0.00 H
ATOM 26 CB THR 2 24.177 40.038 41.352 0.00 0.00 C
ATOM 27 HB THR 2 23.704 40.224 42.316 0.00 0.00 H
I think what happened is that all residues with serial number > 9999 (all waters in my case) got wrapped around to serial number 1 again and were then appended to the already existing residue with that serial number (retaining the atom serial number). Would it be possible to fix this?
I know that in cases like this there should be an insertion code or something to distinguish residues but unfortunately I have no say in the creation of these files.
How does your input file handle this? Does your input file have numbers bigger than 9999? The range of this number is defined by the PDB standard to be 4 characters (page 180 of v3.30), so it is interesting to see how other software handles this limitation. I would like to stress again that the mmCIF file format does not have any such limitation (and is fully supported by this library).
In my input the residue serial numbers also wrap around 9999 and start at 1 again after that but the residues are not reordered. They retain their position in the file and the appropriate residue names and atom serial numbers. If this formatting would be retained by pdbtbx, that would already help me (and also reduce unexpected behaviour in such edge cases).
I realize that the most sensible solution here for me would be switching file formats as mmCIFs seem to be made to address something like this but the quantum chemistry code I'm using doesn't seem to support the format yet (which will hopefully change soon).
That makes it clear why PDBTBX gave this result. In reading an atom line the code internally searches for a residue matching the given serial number and insertion code and adds the atom to this residue, this is in line with the definition of the PDB format (v3.30).
Alphabet letters are commonly used for insertion code. The insertion code is used when two
residues have the same numbering. The combination of residue numbering and insertion code
defines the unique residue.
PDB File Format v. 3.3 - Page 177
So given this definition, the program that created the input is not creating valid PDB files. If you have a way to detect these kinds of problems (maybe with extra user input in a function somewhere) that generates the result you want, for example by continuing to count upwards so that the residue after 9999 becomes 10001 instead of 1, I would be happy to include that in the library. But for that we need a clear description of the behavior we want, and we need to keep in mind what the next program using the resulting PDB files will make of it.
Possibly I could build something which detects when the serial number wraps around and adds an extra factor 10000 each time it does this. For this to be active I would want a user to explicitly ask for this (something like an extra option in opening). The result of this will not be valid to save as a PDB file unless the serial number is taken mod 10000 again. So in the best case the resulting file would be saved as a mmCIF file, to prevent problems like this in the next program opening the file.
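The rolling count described above can be sketched as follows. This is only an illustration under stated assumptions, not pdbtbx's implementation: the function names `unwrap_residue_numbers` and `wrap_for_pdb` are hypothetical, the input is assumed to be the residue number of every atom line in file order, and a wrap is assumed whenever 9999 is directly followed by a smaller number.

```rust
/// Hypothetical helper (not part of pdbtbx): turn residue numbers that wrap
/// around at 9999 back into a continuous count. `raw` holds the residue
/// number of each atom line, in file order.
fn unwrap_residue_numbers(raw: &[isize]) -> Vec<isize> {
    let mut offset = 0;
    let mut previous: Option<isize> = None;
    let mut result = Vec::with_capacity(raw.len());
    for &number in raw {
        // Assumption: a wrap happened when 9999 is directly followed by a
        // smaller number. Each wrap adds another 10000 to all later numbers.
        if let Some(prev) = previous {
            if prev == 9999 && number < prev {
                offset += 10_000;
            }
        }
        result.push(number + offset);
        previous = Some(number);
    }
    result
}

/// Hypothetical inverse: reduce a continuous number to the four digits that
/// fit in the residue number field when saving as PDB again.
fn wrap_for_pdb(number: isize) -> isize {
    number.rem_euclid(10_000)
}
```

With this sketch the sequence 9999, 1, 2 becomes 9999, 10001, 10002, and `wrap_for_pdb(10001)` gives back 1, which is why the continuous numbers can only be kept as-is in a format without the four-character limit, such as mmCIF.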
To be very clear, I do not really like solutions like this, because this goes outside the definition of PDB files, but at the same time I understand that many programs use PDB files in a way not supported by the official definition, so I get that it can be a problem worth solving.
I see. Amber does indeed write invalid PDB in that respect, strange that it does this.
While it might be possible to "solve" the issue as you described, this might also break compatibility with other programs parsing the resulting file. I won't request you do that since the use seems quite limited, I'll have a look if I can find another solution with actually valid PDB files or mmCIF files first.
Okay, good luck! If in the end it turns out to be easiest to have the above solution (or something else) implemented in this library, feel free to reopen this issue (or create a new one of course).
So it took me a minute to manually add insertion codes and thus circumvent the issue entirely. I think it's fine for this issue to remain closed for the time being :)
Hey there,
I revisited this issue because I had to deal with more PDB files with more than 9999 residues lately. I thought about it some more and came to a conclusion: It's not ideal to even have such PDB files but I'm not getting around them because they are automatically generated by software I need to use. Hence it would be desirable to find a way to deal with this programmatically. Manually inserting insertion codes is, of course, possible but it becomes a bit cumbersome after doing it a couple of times and it would be nice to not need that.
What's more, I think it is not ideal for pdbtbx to deal with such files as it currently does because it will basically break the whole structure of it during the parsing process so I think some contingency is a good idea.
My suggestions would be to emit a warning when more than 9999 residues are present and not all residues have an insertion code. This is easy to do.
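The check suggested here could look roughly like the sketch below. The function name `should_warn_about_wrapping` and the representation of a residue as a `(number, insertion code)` pair are assumptions made for illustration; they are not taken from pdbtbx.

```rust
/// Hypothetical check (not part of pdbtbx): returns true when a structure has
/// more than 9999 residues and at least one of them lacks an insertion code,
/// which means residue numbers alone cannot be unique in a PDB file.
fn should_warn_about_wrapping(residues: &[(isize, Option<char>)]) -> bool {
    residues.len() > 9999
        && residues
            .iter()
            .any(|(_, insertion_code)| insertion_code.is_none())
}
```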
Furthermore, I would like the library to keep the PDB structure intact and not add different residues with the same serial number together. Thus, if one reads in a PDB file and prints it back to a file, it should remain basically unchanged (with duplicate residue numbers) and the user can deal with the situation as they see fit because I don't think there's one behaviour that covers all possible use cases.
What are your thoughts on this?
Hey there,
I do agree that handling this in code would be far superior. So if we can find a way to reliably tackle this it can certainly be included in this project.
The way pdbtbx now handles the addition of residues is by scanning the list of residues from the end until it finds its residue number or a lower number; in the latter case it means that this residue is not inserted yet. Such a scanning feature is needed because all atoms from a residue need to be added to the same residue struct. It groups residues with overflowing numbers because it truncates the residue number following the pdb protocol.
A way forward I see is the addition of an extra flag to the parser telling it to use a different pdb 'spec' in which the residue number spans more columns. This does not break any normal pdb files, but would allow your 'incorrect' files to be parsed successfully. Let me know what you think of this. To implement this an example 'incorrect' pdb file is needed to see how far to extend the residue number field, especially in relation to 'HETATM' fields.
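A minimal sketch of such a backwards scan, to show why a wrapped number ends up in an old residue. The `Residue` struct and `add_atom` function here are simplified stand-ins invented for this example, not pdbtbx's actual types, and insertion codes are left out.

```rust
/// Simplified stand-in for a residue: its number and the serial numbers of
/// the atoms collected in it.
#[derive(Debug)]
struct Residue {
    number: isize,
    atom_serials: Vec<usize>,
}

/// Add an atom to the residue with the given number, scanning from the end.
/// The scan stops at a matching number (atom is appended there) or at a
/// lower number (residue not present yet, so a new one is created).
fn add_atom(residues: &mut Vec<Residue>, number: isize, atom_serial: usize) {
    for residue in residues.iter_mut().rev() {
        if residue.number == number {
            residue.atom_serials.push(atom_serial);
            return;
        }
        if residue.number < number {
            break;
        }
    }
    residues.push(Residue {
        number,
        atom_serials: vec![atom_serial],
    });
}
```

If the residues 1, 2 and 3 exist and a later atom again carries residue number 1, the scan walks back past 3 and 2, finds the first residue, and appends the atom there. This is the same pattern as the water atoms 32543 to 32545 showing up inside residue HIM 1 in the output at the top of this issue.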
Besides I am happy to see this project is still used, as my personal use has decreased I have not worked on it anymore for quite a while. I would be happy to help resolve issues like this. Otherwise if you want I can give you more permissions in this project (and the crate on crates.io) to keep working on it.
I have such a file here: https://owncloud.gwdg.de/index.php/s/EHObo5uTQoC0iNe
As you can see the residues don't actually exceed the predefined columns for the residue number but just overflow and wrap around at 9999. Would you then parse it into a structure that keeps counting after 9999 and extends the residue number column to the right?
Nice of you to offer but I'm not sure if my Rust skills are up to this.
The file you sent me seems to be using 5 characters.
HETATM87441 O HOH A8299 19.742 -7.202 155.194 0.00 0.00 O
HETATM87442 H1 HOH A8299 19.110 -7.853 155.499 0.00 0.00 H
HETATM87443 H2 HOH A8299 19.579 -7.133 154.253 0.00 0.00 H
HETATM87444 O HOH A8300 -24.616 -98.167 166.753 0.00 0.00 O
HETATM87445 H1 HOH A8300 -25.525 -98.450 166.854 0.00 0.00 H
HETATM87446 H2 HOH A8300 -24.309 -98.020 167.647 0.00 0.00 H
HETATM87447 O HOH A8301 175.737 8.792 -22.081 0.00 0.00 O
HETATM87448 H1 HOH A8301 175.631 9.698 -22.371 0.00 0.00 H
HETATM87449 H2 HOH A8301 175.114 8.297 -22.613 0.00 0.00 H
TER87449 HOH A8301
END
According to the pdb definition this is the structure:
COLUMNS DATA TYPE FIELD DEFINITION
-----------------------------------------------------------------------
1 - 6 Record name "HETATM"
7 - 11 Integer serial Atom serial number.
13 - 16 Atom name Atom name.
17 Character altLoc Alternate location indicator.
18 - 20 Residue name resName Residue name.
...
Which makes the pdb you sent perfectly valid. And I thought I fixed this in issue #51. Are you sure it wraps at 9999? If so maybe you use an old version or my fix was not working.
Please let me know!
I will also do some investigations, but not right now.
I don't understand. According to the source you sent me, columns 23-26 are reserved for the residue sequence number which is 4 digits. A fifth would then not be valid. Also, the letter in column 22 is the chain ID and is the same for all atoms so it doesn't serve to differentiate between residues.
For that an insertion code in column 27 would be needed which is always empty in all of these files.
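To make the column layout concrete, here is a small sketch that reads the three fields discussed here from an ATOM/HETATM line: chain ID (column 22), residue sequence number (columns 23-26) and insertion code (column 27). The `ResidueId` struct and `parse_residue_id` function are hypothetical names for this illustration, and the line is assumed to be plain ASCII.

```rust
/// Hypothetical container for the fields that identify a residue.
#[derive(Debug, PartialEq)]
struct ResidueId {
    chain: char,
    number: isize,
    insertion_code: Option<char>,
}

/// Read the residue identification from a fixed-column ATOM/HETATM line.
/// PDB columns are 1-based and inclusive; Rust string ranges are 0-based and
/// exclusive at the end, hence the shifted indices.
fn parse_residue_id(line: &str) -> Option<ResidueId> {
    let chain = line.get(21..22)?.chars().next()?; // column 22
    let number = line.get(22..26)?.trim().parse::<isize>().ok()?; // columns 23-26
    let insertion_code = line
        .get(26..27) // column 27
        .and_then(|s| s.chars().next())
        .filter(|c| *c != ' ');
    Some(ResidueId {
        chain,
        number,
        insertion_code,
    })
}
```

Read this way, `A8299` in the file above is chain `A` followed by the four-digit residue number 8299, with no insertion code.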
Sorry, I was looking at the wrong column. I will try to implement the rolling count you proposed.
Okay I implemented your proposal, only for that I needed to revamp the saving system for PDBs which leaves me with some errors right now. But I need some sleep so will continue tomorrow I think. Very short: I needed to build a way to force fields to be of a certain width, either by padding with spaces or cropping off the extraneous parts. I rewrote all PDB lines to use this new function, but somewhere a couple of spaces are off. ‾(o.o)/‾
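Forcing a field to a fixed width by padding or cropping can be sketched like this. The `Align` enum and `fit_width` function are hypothetical names for illustration and do not reflect how pdbtbx implements it.

```rust
/// Which side of the field the text is placed on.
#[derive(Clone, Copy)]
enum Align {
    Left,
    Right,
}

/// Hypothetical helper: return `text` in exactly `width` characters, cropping
/// anything beyond the width and padding the rest with spaces.
fn fit_width(text: &str, width: usize, align: Align) -> String {
    // Keep at most `width` characters, counted from the start of the text.
    let cropped: String = text.chars().take(width).collect();
    match align {
        Align::Left => format!("{:<width$}", cropped, width = width),
        Align::Right => format!("{:>width$}", cropped, width = width),
    }
}
```

For example, `fit_width("1", 4, Align::Right)` yields a residue number right-justified in a four-character field, while `Align::Left` puts the same text at the start of the field.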
Thanks for tackling it! Also sleep is more important than coding and coding while tired may not result in something very usable :)
All the previous tests pass, and the added test below, which checks strict equality between the PDB file you sent me as read from the original file and the same structure after saving and reading it again, now passes as well. This solves this issue (I hope, please let me know if it does not) and any potential future issue with numbers getting too big. As you can see no additional arguments are needed: all columns/fields now implement this wrapping behavior, but only the residue serial number will actually continue counting after 9999 internally. You can also work with these 'impossibly' high residue serial numbers in your code (see test).
/// Open a test file containing 87449 atoms, so with more than 29000 residues, which leads to residue serial numbers that are wrapped
#[test]
fn wrapping_residue_number() {
    let (pdb, errors) = pdbtbx::open("example-pdbs/eq.pdb", StrictnessLevel::Strict).unwrap();
    let pdb_errors = save(pdb.clone(), &("dump/eq.pdb"), StrictnessLevel::Loose);
    let (pdb2, _) = pdbtbx::open("dump/eq.pdb", StrictnessLevel::Strict).unwrap();
    print!("{:?}", errors);
    print!("{:?}", pdb_errors);
    // See that the original file is the same as saved and reopened
    assert_eq!(pdb, pdb2);
    // See that it is possible to select atom with 'impossible' residue serial numbers according to the PDB definition
    // These are made by adding 10000 to the residue serial number every time a wrap is detected (9999 followed by 0)
    assert_eq!(
        pdb.residues()
            .find(|r| r.serial_number() == 10005)
            .unwrap()
            .name()
            .unwrap(),
        "HOH"
    );
    assert_eq!(
        pdb.residues()
            .find(|r| r.serial_number() == 20250)
            .unwrap()
            .name()
            .unwrap(),
        "HOH"
    );
}
Awesome! I just checked it out and the behaviour seems to be as hoped/expected.
I noticed one thing though: I manually edited the file I sent you (because previously pdbtbx couldn't deal with it properly) and added insertion codes to differentiate between the residues.
I read in this file with added insertion codes with the current master branch of pdbtbx and after saving the PDB to a file again, I noticed that the insertion codes were not written.
They are apparently correctly parsed from the input (I checked) but not written back to the output generated with save_pdb. Any idea why that is?
I will take a look.
I added a new test for insertion codes and it seems to be working just fine. The following test passed without any problem (the data is below).
#[test]
fn insertion_codes() {
    let (pdb, errors) =
        pdbtbx::open("example-pdbs/insertion_codes.pdb", StrictnessLevel::Strict).unwrap();
    let pdb_errors = save(
        pdb.clone(),
        &("dump/insertion_codes.pdb"),
        StrictnessLevel::Loose,
    );
    let (pdb2, _) = pdbtbx::open("dump/insertion_codes.pdb", StrictnessLevel::Strict).unwrap();
    print!("{:?}", errors);
    print!("{:?}", pdb_errors);
    // See that the original file is the same as saved and reopened
    assert_eq!(pdb, pdb2);
    assert_eq!(pdb.residues().count(), 2);
    assert_eq!(pdb.residue(0).unwrap().insertion_code().unwrap(), "A");
    assert_eq!(pdb.residue(1).unwrap().insertion_code().unwrap(), "B");
    assert_eq!(pdb2.residues().count(), 2);
    assert_eq!(pdb2.residue(0).unwrap().insertion_code().unwrap(), "A");
    assert_eq!(pdb2.residue(1).unwrap().insertion_code().unwrap(), "B");
}
example-pdbs/insertion_codes.pdb
ATOM 2 CA HIS A 465A 34.226 -11.294 7.140 1.00 0.00 C
ATOM 3 C HIS A 465A 33.549 -10.658 8.347 1.00 0.01 C
ATOM 3 C HIS A 465A 33.549 -10.658 8.347 1.00999.99 C
ATOM 8 CD2 HIS A 465B 35.297 -8.322 5.762 0.00 49.56 C
ATOM 8 CD2 HIS A 465B 35.297 -8.322 5.762 0.01 49.56 C
ATOM 8 CD2 HIS A 465B 35.297 -8.322 5.762999.99 49.56 C
If your problem persists could you share the input file?
I'm so sorry! I just discovered that I pointed my Cargo.toml at my fork of pdbtbx (which is not yet up to date) and not your repo. Changed that now and it seems to work as expected. One thing though:
Don't know if it would be considered an issue because I also don't know how this would be parsed by other programs, but the residue numbers now seem to be left-justified whereas before they would be right-justified in columns 23-26. Is that intended?
Example:
ATOM 1 N HIM A1 35.112 43.560 46.027 0.00 1.00 N
ATOM 2 H1 HIM A1 35.066 44.568 46.010 0.00 1.00 H
ATOM 3 H2 HIM A1 35.728 43.317 46.790 0.00 1.00 H
ATOM 4 CA HIM A1 35.629 43.017 44.786 0.00 1.00 C
ATOM 5 HA HIM A1 36.026 42.025 45.000 0.00 1.00 H
ATOM 6 CB HIM A1 34.514 42.858 43.765 0.00 1.00 C
ATOM 7 HB2 HIM A1 34.010 43.819 43.665 0.00 1.00 H
ATOM 8 HB3 HIM A1 34.908 42.740 42.755 0.00 1.00 H
ATOM 9 CG HIM A1 33.490 41.851 43.952 0.00 1.00 C
ATOM 10 ND1 HIM A1 32.914 41.623 45.152 0.00 1.00 N
ATOM 11 CE1 HIM A1 31.859 40.896 44.806 0.00 1.00 C
Still: Thanks for the awesome and speedy work!
Great, so the code is working and there is a new test for it; seems like a good outcome ;-).
The fields are indeed now all left-justified, where in previous versions they were left-, right-, or center-justified depending on the exact field. That is just because I changed the saving code. It should not impact anything, but if it does it is not very hard to align every field according to the standard.
Happy to have it fixed and see the project in use!