douweschulte / pdbtbx

A library to open/edit/save (crystallographic) Protein Data Bank (PDB) and mmCIF files in Rust.

Home Page: https://crates.io/crates/pdbtbx

mmCIF parser has column requirements not in line with mmCIF specification

tzok opened this issue

The list of columns required by the parser is the following:

  • atom_site.group_PDB
  • atom_site.label_atom_id
  • atom_site.id
  • atom_site.type_symbol
  • atom_site.label_comp_id
  • atom_site.label_seq_id
  • atom_site.label_asym_id
  • atom_site.Cartn_x
  • atom_site.Cartn_y
  • atom_site.Cartn_z
  • atom_site.occupancy
  • atom_site.B_iso_or_equiv
  • atom_site.pdbx_formal_charge

The mmCIF specification for atom_site mentions only these as required:

  • atom_site.id
  • atom_site.auth_asym_id
  • atom_site.label_alt_id
  • atom_site.label_asym_id
  • atom_site.label_atom_id
  • atom_site.label_comp_id
  • atom_site.label_entity_id
  • atom_site.label_seq_id
  • atom_site.type_symbol
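The mismatch between the two lists can be checked mechanically. The sketch below is only an illustration built from the column names quoted in this issue; the constant and function names are hypothetical and are not part of pdbtbx.

```rust
use std::collections::BTreeSet;

/// Columns the parser required at the time of this report (copied from the issue text).
const PARSER_REQUIRED: [&str; 13] = [
    "atom_site.group_PDB",
    "atom_site.label_atom_id",
    "atom_site.id",
    "atom_site.type_symbol",
    "atom_site.label_comp_id",
    "atom_site.label_seq_id",
    "atom_site.label_asym_id",
    "atom_site.Cartn_x",
    "atom_site.Cartn_y",
    "atom_site.Cartn_z",
    "atom_site.occupancy",
    "atom_site.B_iso_or_equiv",
    "atom_site.pdbx_formal_charge",
];

/// Columns the mmCIF specification lists as required (copied from the issue text).
const SPEC_REQUIRED: [&str; 9] = [
    "atom_site.id",
    "atom_site.auth_asym_id",
    "atom_site.label_alt_id",
    "atom_site.label_asym_id",
    "atom_site.label_atom_id",
    "atom_site.label_comp_id",
    "atom_site.label_entity_id",
    "atom_site.label_seq_id",
    "atom_site.type_symbol",
];

/// Columns the parser demands although the specification does not require them.
fn extra_requirements() -> BTreeSet<&'static str> {
    let spec: BTreeSet<&'static str> = SPEC_REQUIRED.iter().copied().collect();
    PARSER_REQUIRED
        .iter()
        .copied()
        .filter(|column| !spec.contains(column))
        .collect()
}
```

Calling `extra_requirements()` yields the seven columns that go beyond the specification: group_PDB, the three Cartn columns, occupancy, B_iso_or_equiv and pdbx_formal_charge.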

In particular, I have a problem with atom_site.pdbx_formal_charge. According to the docs, it is used in only about 7.4% of PDB entries, so making it a strict requirement in pdbtbx is incorrect IMHO.

Fixing pdbx_formal_charge is essential to me. Making the parser more robust in general by complying with the mmCIF specification is a good thing in the long term anyway.

I agree. I will work on removing the requirements and figuring out sensible ways of dealing with missing data.

I removed the requirements on the following columns:

  • atom_site.pdbx_formal_charge
  • atom_site.group_PDB

The other columns are (according to your link) always present, although I will think about relaxing the requirements further while refactoring the code.

I additionally removed the requirement for the following columns:

  • atom_site.occupancy
  • atom_site.B_iso_or_equiv

This leaves the Cartn columns as the only columns required on top of the mmCIF requirements. Those are (according to your link) present in 100% of the files, and it is very sensible to have them defined if you want to use this library. In the future this requirement could be replaced by requiring any position, either fract or Cartn, but that can wait.
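For readers who want to see what "required Cartn columns, optional everything else" can look like, here is a minimal sketch. It is not the actual parse_atoms code in pdbtbx: `AtomRow`, `get`, `required`, `optional` and `parse_row` are hypothetical names, the row is modelled as a plain map from column name to raw text, and the fallback values (occupancy 1.0, B-factor 0.0, charge 0, non-hetero atom) are assumptions chosen for the example. The CIF placeholders "." and "?" are treated the same as an absent column.

```rust
use std::collections::HashMap;
use std::fmt::Display;
use std::str::FromStr;

/// Hypothetical, simplified atom record (not the pdbtbx `Atom` type).
#[derive(Debug, PartialEq)]
struct AtomRow {
    x: f64,
    y: f64,
    z: f64,
    occupancy: f64,
    b_factor: f64,
    charge: i32,
    hetero: bool,
}

/// Looks up a column, treating the CIF placeholders "." and "?" as absent.
fn get<'a>(row: &HashMap<&str, &'a str>, column: &str) -> Option<&'a str> {
    row.get(column).copied().filter(|v| *v != "." && *v != "?")
}

/// Parses a column that must be present, reporting which column failed.
fn required(row: &HashMap<&str, &str>, column: &str) -> Result<f64, String> {
    get(row, column)
        .ok_or_else(|| format!("missing required column {column}"))?
        .parse::<f64>()
        .map_err(|e| format!("invalid value in {column}: {e}"))
}

/// Parses a column if present, otherwise falls back to `default`.
fn optional<T>(row: &HashMap<&str, &str>, column: &str, default: T) -> Result<T, String>
where
    T: FromStr,
    T::Err: Display,
{
    match get(row, column) {
        Some(v) => v
            .parse::<T>()
            .map_err(|e| format!("invalid value in {column}: {e}")),
        None => Ok(default),
    }
}

/// Builds an atom from one row; only the Cartn columns are mandatory here.
fn parse_row(row: &HashMap<&str, &str>) -> Result<AtomRow, String> {
    Ok(AtomRow {
        x: required(row, "atom_site.Cartn_x")?,
        y: required(row, "atom_site.Cartn_y")?,
        z: required(row, "atom_site.Cartn_z")?,
        // Assumed defaults for columns that are no longer required.
        occupancy: optional(row, "atom_site.occupancy", 1.0)?,
        b_factor: optional(row, "atom_site.B_iso_or_equiv", 0.0)?,
        charge: optional(row, "atom_site.pdbx_formal_charge", 0)?,
        // Without group_PDB the atom is assumed to be a regular (non-HETATM) atom.
        hetero: get(row, "atom_site.group_PDB") == Some("HETATM"),
    })
}
```

With this shape, a row that only carries the three Cartn values still parses, while a row missing Cartn_x produces an error naming the column. Accepting fract coordinates as an alternative would additionally need the unit cell to convert them, which is why it is left out of this sketch.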

Thanks for the issue, this gave me a push to do some nice refactorings in the parse_atoms function and you pointed me to a nice docs site that I never found before.