douweschulte / pdbtbx

A library to open/edit/save (crystallographic) Protein Data Bank (PDB) and mmCIF files in Rust.

Home Page: https://crates.io/crates/pdbtbx

Bad handling of PDB files containing mixed upper and lower case chains

OWissett opened this issue

If you have a PDB file with two chains with the same letter ID, but in different case, e.g., B and b, it is not possible to distinguish these chains.

Some way of distinguishing these should be implemented.

I know that strictly speaking PDB files should only have upper case chains, but this isn't always the case.

Look at 7WFF, which contains both lowercase and uppercase letters.
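
To make the problem concrete, here is a minimal sketch of how folding chain IDs to upper case merges `B` and `b`. It uses only the standard library and does not call the pdbtbx API. The helper names `chain_id` and `count_atoms_per_chain` are hypothetical, and the sample records are made up for illustration; the only format fact relied on is that the chain ID sits in column 22 of `ATOM`/`HETATM` records.

```rust
use std::collections::BTreeMap;

/// Extract the chain identifier from an ATOM/HETATM record.
/// In the PDB format the chain ID occupies column 22 (zero-based index 21).
/// Hypothetical helper, not part of pdbtbx.
fn chain_id(line: &str) -> Option<char> {
    if line.starts_with("ATOM") || line.starts_with("HETATM") {
        line.chars().nth(21).filter(|c| !c.is_whitespace())
    } else {
        None
    }
}

/// Count atoms per chain. When `fold_case` is true the IDs are uppercased
/// first, which merges `B` and `b` into a single chain.
fn count_atoms_per_chain(lines: &[&str], fold_case: bool) -> BTreeMap<char, usize> {
    let mut counts = BTreeMap::new();
    for id in lines.iter().filter_map(|l| chain_id(l)) {
        let key = if fold_case { id.to_ascii_uppercase() } else { id };
        *counts.entry(key).or_insert(0) += 1;
    }
    counts
}

fn main() {
    // Made-up records, truncated after the residue number.
    let lines = [
        "ATOM      1  N   MET B   1",
        "ATOM      2  CA  MET B   1",
        "ATOM      3  N   MET b   1",
        "HETATM    4  O   HOH b   2",
        "REMARK   2 not an atom record",
    ];

    // Case-sensitive: two distinct chains, B and b.
    println!("{:?}", count_atoms_per_chain(&lines, false));
    // Case-folded: everything ends up in chain B.
    println!("{:?}", count_atoms_per_chain(&lines, true));
}
```

With case-sensitive keys the two chains stay separate; with folding, all four atoms are attributed to `B`, which is the behaviour described above.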

(I was on holiday, so sorry for the late reply)
I originally implemented it this way because I thought the specification required this behaviour. But if many other programs do not follow this, and even the RCSB ignores it, then we should allow it as well. Upon rereading the specification, I cannot find any rule stating that this should be the behaviour; this is the only rule I found:

Non-blank alphanumerical character is used for chain identifier.

Here are some additional comments explaining what is commonly used:
https://biology.stackexchange.com/questions/82862/why-do-chain-identifiers-in-pdb-have-no-standard-starting-chain-id-type#:~:text=Chain%20IDs%20are%20assigned%20by%20authors%20who%20submit,identifier.%20Usually%2C%20the%20chains%20are%20assigned%20uppercase%20letters.
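
Read literally, the quoted rule only asks for a non-blank alphanumeric character and says nothing about case. A small sketch of a check along those lines follows. `is_valid_chain_id` is a hypothetical name, not a pdbtbx function, and restricting the check to ASCII is my assumption.

```rust
/// Check a chain identifier against the quoted rule: a single non-blank
/// alphanumeric character. Case is deliberately left untouched, so `B` and
/// `b` are both accepted and remain distinct.
/// Assumption: only ASCII letters and digits count as alphanumeric here.
fn is_valid_chain_id(id: char) -> bool {
    id.is_ascii_alphanumeric()
}

fn main() {
    for id in ['B', 'b', '1', ' ', '_'] {
        println!("{:?} -> {}", id, is_valid_chain_id(id));
    }
}
```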

In my experience, it is pretty common to have both B and b, particularly when the asymmetric unit contains two biological assemblies.

That indeed sounds like a reasonable use for them. I will merge the PR once done.