mmCIF parser has column requirements not in line with mmCIF specification
tzok opened this issue
The list of columns required by the parser is the following:
- atom_site.group_PDB
- atom_site.label_atom_id
- atom_site.id
- atom_site.type_symbol
- atom_site.label_comp_id
- atom_site.label_seq_id
- atom_site.label_asym_id
- atom_site.Cartn_x
- atom_site.Cartn_y
- atom_site.Cartn_z
- atom_site.occupancy
- atom_site.B_iso_or_equiv
- atom_site.pdbx_formal_charge
The mmCIF specification for atom_site mentions only these as required:
- atom_site.id
- atom_site.auth_asym_id
- atom_site.label_alt_id
- atom_site.label_asym_id
- atom_site.label_atom_id
- atom_site.label_comp_id
- atom_site.label_entity_id
- atom_site.label_seq_id
- atom_site.type_symbol
In particular, I have a problem with atom_site.pdbx_formal_charge. According to the docs, it is used in about 7.4% of entries in the PDB. Making it a strict requirement in pdbtbx is incorrect IMHO.
Fixing pdbx_formal_charge is essential to me. Making the parser more robust in general by complying with mmCIF is a good thing in the long term anyway.
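The mismatch between the two lists above can be checked mechanically. The following is a minimal sketch, not code from pdbtbx: the constants simply copy the two lists from this issue, and `only_in` is a hypothetical helper name introduced for illustration.

```rust
use std::collections::BTreeSet;

/// Columns the parser required when this issue was opened (copied from the first list).
const PARSER_REQUIRED: [&str; 13] = [
    "atom_site.group_PDB",
    "atom_site.label_atom_id",
    "atom_site.id",
    "atom_site.type_symbol",
    "atom_site.label_comp_id",
    "atom_site.label_seq_id",
    "atom_site.label_asym_id",
    "atom_site.Cartn_x",
    "atom_site.Cartn_y",
    "atom_site.Cartn_z",
    "atom_site.occupancy",
    "atom_site.B_iso_or_equiv",
    "atom_site.pdbx_formal_charge",
];

/// Columns the mmCIF specification lists as required (copied from the second list).
const SPEC_REQUIRED: [&str; 9] = [
    "atom_site.id",
    "atom_site.auth_asym_id",
    "atom_site.label_alt_id",
    "atom_site.label_asym_id",
    "atom_site.label_atom_id",
    "atom_site.label_comp_id",
    "atom_site.label_entity_id",
    "atom_site.label_seq_id",
    "atom_site.type_symbol",
];

/// Hypothetical helper: names that occur in `left` but not in `right`.
fn only_in<'a>(left: &[&'a str], right: &[&'a str]) -> BTreeSet<&'a str> {
    let right: BTreeSet<&str> = right.iter().copied().collect();
    left.iter().copied().filter(|c| !right.contains(c)).collect()
}
```

Calling `only_in(&PARSER_REQUIRED, &SPEC_REQUIRED)` yields the columns the parser demands beyond the specification: group_PDB, the three Cartn columns, occupancy, B_iso_or_equiv and pdbx_formal_charge. Swapping the arguments yields the spec-required columns the parser did not ask for.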
I do agree. I will work on removing the requirements and figuring out sensible ways of dealing with missing data.
I removed the requirements on the following columns:
- atom_site.pdbx_formal_charge
- atom_site.group_PDB
The other columns are (according to your link) always present, although I will think about relaxing the requirements further while refactoring the code.
I additionally removed the requirement for the following columns:
- atom_site.occupancy
- atom_site.B_iso_or_equiv
This leaves the Cartn columns as the only columns required on top of the mmCIF requirements. Those are (according to your link) present in 100% of the files, and it is very sensible to require them if you want to use this library. In the future this requirement could be relaxed to accept any position, fract or Cartn, but that can wait.
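To illustrate what relaxing these requirements can look like, here is a rough sketch of reading one atom_site row in which only the Cartn columns are mandatory. It is not the actual `parse_atoms` code: the `Row` type, the `AtomFields` struct, the helper names and in particular the fallback values (occupancy 1.0, B-factor 0.0, charge 0, non-hetero when group_PDB is absent) are all assumptions made for this example.

```rust
use std::collections::HashMap;
use std::str::FromStr;

/// Hypothetical, simplified row: column name -> raw text value.
type Row<'a> = HashMap<&'a str, &'a str>;

/// Hypothetical subset of the per-atom data discussed in this issue.
#[derive(Debug, PartialEq)]
struct AtomFields {
    x: f64,
    y: f64,
    z: f64,
    occupancy: f64,
    b_factor: f64,
    charge: i8,
    hetero: bool,
}

/// Parse a column that must be present; a missing or malformed value is an error.
fn required<T: FromStr>(row: &Row<'_>, column: &str) -> Result<T, String> {
    let text = row
        .get(column)
        .ok_or_else(|| format!("missing required column {column}"))?;
    text.parse()
        .map_err(|_| format!("invalid value {text:?} in column {column}"))
}

/// Parse a column that may be absent; absent, "." and "?" fall back to `default`.
fn optional<T: FromStr>(row: &Row<'_>, column: &str, default: T) -> Result<T, String> {
    match row.get(column).copied() {
        None | Some(".") | Some("?") => Ok(default),
        Some(text) => text
            .parse()
            .map_err(|_| format!("invalid value {text:?} in column {column}")),
    }
}

fn parse_fields(row: &Row<'_>) -> Result<AtomFields, String> {
    Ok(AtomFields {
        // Still required on top of the mmCIF-mandated columns.
        x: required(row, "atom_site.Cartn_x")?,
        y: required(row, "atom_site.Cartn_y")?,
        z: required(row, "atom_site.Cartn_z")?,
        // Assumed defaults; the library may choose different ones.
        occupancy: optional(row, "atom_site.occupancy", 1.0)?,
        b_factor: optional(row, "atom_site.B_iso_or_equiv", 0.0)?,
        charge: optional(row, "atom_site.pdbx_formal_charge", 0)?,
        // Assumption: a row without group_PDB is treated as a regular ATOM record.
        hetero: matches!(row.get("atom_site.group_PDB").copied(), Some("HETATM")),
    })
}
```

With this shape, a file that omits pdbx_formal_charge still parses, while a row without coordinates is rejected with an error naming the missing column.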
Thanks for the issue, this gave me a push to do some nice refactorings in the parse_atoms function and you pointed me to a nice docs site that I never found before.