PDBeurope / ccdutils

A set of Python tools, built on RDKit, for working with PDB chemical component definitions for small molecules taken from the wwPDB Chemical Component Dictionary.

Home Page: https://pdbeurope.github.io/ccdutils/

Invalid CIF for missing values in _pdbe_chem_comp_substructure.substructure_inchis

osmart opened this issue

Recent changes (made after 11th March) to ccdutils have caused the PDBe chemical component definition CIF files for porphyrin-containing components to become unreadable by programs that use the PDBeCIF parser. Please see the comment below for a minimal test program that shows the problem.

Using HEM as an example: ftp://ftp.ebi.ac.uk/pub/databases/msd/pdbechem_v2/H/HEM/HEM.cif

We have tracked down the problem to the _pdbe_chem_comp_substructure category. In HEM.cif this now reads:

loop_
_pdbe_chem_comp_substructure.comp_id
_pdbe_chem_comp_substructure.substructure_name
_pdbe_chem_comp_substructure.id
_pdbe_chem_comp_substructure.substructure_type
_pdbe_chem_comp_substructure.substructure_smiles
_pdbe_chem_comp_substructure.substructure_inchis
_pdbe_chem_comp_substructure.substructure_inchikeys
HEM MurckoScaffold S1 scaffold 'C1=CC2=[N+]3C1=Cc1ccc4n1[Fe@SP3-2]31n3c(ccc3=CC3=[N+]1C(=C4)C=C3)=C2' 'InChI=1S/C20H12N4.Fe/c1-2-14-10-16-5-6-18(23-16)12-20-8-7-19(24-20)11-17-4-3-15(22-17)9-13(1)21-14;/h1-12H;/q-2;+2' XVFTZEQSXCJEIQ-UHFFFAOYSA-N
HEM porphin-like F1 fragment C1~C~C2~C~C3~C~C~C(~C~C4~C~C~C(~C~C5~C~C~C(~C~C~1~N~2)~N~5)~N~4)~N~3 '' ''
HEM pyrrole F2 fragment 'c1cc[nH]c1' InChI=1S/C4H5N/c1-2-4-5-3-1/h1-5H KAESVJOAVNADME-UHFFFAOYSA-N

The HEM.cif downloaded on 6th Feb 2023 had:

loop_
_pdbe_chem_comp_substructure.comp_id                      
_pdbe_chem_comp_substructure.substructure_name            
_pdbe_chem_comp_substructure.id                           
_pdbe_chem_comp_substructure.substructure_type            
_pdbe_chem_comp_substructure.substructure_smiles          
_pdbe_chem_comp_substructure.substructure_inchis          
_pdbe_chem_comp_substructure.substructure_inchikeys       
HEM MurckoScaffold S1 scaffold     C1=CC2=[N+]3C1=Cc1ccc4n1[Fe-2]31n3c(ccc3=CC3=[N+]1C(=C4)C=C3)=C2 InChI=1S/C20H12N4.Fe/c1-2-14-10-16-5-6-18(23-16)12-20-8-7-19(24-20)11-17-4-3-15(22-17)9-13(1)21-14;/h1-12H;/q-2;+2 XVFTZEQSXCJEIQ-UHFFFAOYSA-N
HEM   porphin-like F1 fragment C1~C~C2~C~C3~C~C~C(~C~C4~C~C~C(~C~C5~C~C~C(~C~C~1~N~2)~N~5)~N~4)~N~3                                                                                                                  .                           .
HEM        pyrrole F2 fragment                                                           c1cc[nH]c1                                                                                  InChI=1S/C4H5N/c1-2-4-5-3-1/h1-5H KAESVJOAVNADME-UHFFFAOYSA-N

The problem is in the line with substructure_name 'porphin-like', which has no substructure_inchis. The file now writes two single quotes '' for each missing value. This is not valid CIF: a missing value should be written as either . or ?
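As a rough illustration of how such rows can be spotted, the sketch below splits a loop row into values and reports which columns hold an empty quoted string. The name find_empty_quoted_columns is made up for this example, and shlex only approximates CIF tokenisation (CIF quoting rules differ from shell rules), so treat it as a quick check rather than a CIF parser.

```python
import shlex

# Column names of the _pdbe_chem_comp_substructure loop, in file order.
TAGS = [
    "comp_id",
    "substructure_name",
    "id",
    "substructure_type",
    "substructure_smiles",
    "substructure_inchis",
    "substructure_inchikeys",
]


def find_empty_quoted_columns(row, tags=TAGS):
    """Return the column names whose value in `row` is an empty quoted string.

    Hypothetical helper: shlex.split keeps '' as an empty token, which is
    enough to flag the rows discussed in this issue.
    """
    values = shlex.split(row)
    return [tag for tag, value in zip(tags, values) if value == ""]


# The porphin-like row from the current HEM.cif.
bad_row = (
    "HEM porphin-like F1 fragment "
    "C1~C~C2~C~C3~C~C~C(~C~C4~C~C~C(~C~C5~C~C~C(~C~C~1~N~2)~N~5)~N~4)~N~3 '' ''"
)
print(find_empty_quoted_columns(bad_row))
```

For the porphin-like row this reports the substructure_inchis and substructure_inchikeys columns; for the pyrrole row it reports nothing.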

Demonstration of the problem

This program shows how the PDBeCIF parser now crashes on reading HEM.cif:

"""
demonstrate_problem.py

demonstrate problem in _pdbe_chem_comp_substructure
records for HEM.cif.

uses cif files:
GOL.cif downloaded 6 April 2023 ftp://ftp.ebi.ac.uk/pub/databases/msd/pdbechem_v2/G/GOL/GOL.cif
HEM.cif downloaded 6 April 2023 ftp://ftp.ebi.ac.uk/pub/databases/msd/pdbechem_v2/H/HEM/HEM.cif
old_HEM.cif downloaded 12 Feb 2023
"""
from pdbecif.mmcif_io import CifFileReader

for in_file in 'GOL.cif', 'old_HEM.cif', 'HEM.cif':
    print(f'\ntest {in_file}:')
    cif_parser = CifFileReader(input='data', preserve_order=True)
    cif_contents = cif_parser.read(in_file, output='cif_dictionary')
    print(f'   success cif file has keys: {cif_contents.keys()}')

Running the program:

$ python demonstrate_problem.py 

test GOL.cif:
   success cif file has keys: odict_keys(['GOL'])

test old_HEM.cif:
   success cif file has keys: odict_keys(['HEM'])

test HEM.cif:
Traceback (most recent call last):
  File "demonstrate_problem.py", line 17, in <module>
    cif_contents = cif_parser.read(in_file, output='cif_dictionary')
  File "/Users/osmart/opt/anaconda3/envs/grade2_no_csd_api/lib/python3.7/site-packages/pdbecif/mmcif_io.py", line 258, in read
    onlyCategories=only,
  File "/Users/osmart/opt/anaconda3/envs/grade2_no_csd_api/lib/python3.7/site-packages/pdbecif/mmcif_tools.py", line 95, in parse
    file_path, ignoreCategories, preserve_token_order, onlyCategories
  File "/Users/osmart/opt/anaconda3/envs/grade2_no_csd_api/lib/python3.7/site-packages/pdbecif/mmcif_tools.py", line 187, in _parseFile
    raise MMCIFWrapperSyntaxError(category)
pdbecif.mmcif_tools.MMCIFWrapperSyntaxError: More items than values for category '_pdbe_chem_comp_substructure'!

This shows that there is now a problem in the _pdbe_chem_comp_substructure category of HEM.cif.
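Until the files are regenerated, one possible stop-gap is to patch a local copy before handing it to the parser. The sketch below is an assumption, not part of ccdutils or PDBeCIF: replace_empty_quoted_values is a hypothetical helper that swaps any stand-alone '' token for ? and leaves semicolon-delimited text fields alone. It has only been checked against rows shaped like the porphin-like one.

```python
import re

# A '' token that stands on its own, i.e. has whitespace (or the line
# start/end) on both sides.
EMPTY_QUOTED = re.compile(r"(?<!\S)''(?!\S)")


def replace_empty_quoted_values(cif_text):
    """Return `cif_text` with stand-alone '' values replaced by ?.

    Hypothetical workaround: lines inside semicolon-delimited text fields
    are copied through unchanged.
    """
    patched = []
    in_text_field = False
    for line in cif_text.splitlines():
        if line.startswith(";"):
            # A leading semicolon opens or closes a multi-line text field.
            in_text_field = not in_text_field
        elif not in_text_field:
            line = EMPTY_QUOTED.sub("?", line)
        patched.append(line)
    return "\n".join(patched) + "\n"


sample = (
    "HEM porphin-like F1 fragment C1~C~C2 '' ''\n"
    "HEM pyrrole F2 fragment 'c1cc[nH]c1' InChI=1S/C4H5N KAESVJOAVNADME-UHFFFAOYSA-N\n"
)
print(replace_empty_quoted_values(sample), end="")
```

The patched text could then be written to a temporary file and read with CifFileReader as in demonstrate_problem.py.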

old_HEM.cif.txt

Pull request that fixes the problem

Please see #11

The fix catches errors raised while finding the InChI for a fragment and sets the InChI to None, which Gemmi then writes as ?
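The idea behind the fix can be sketched in a few lines. This is not the code in the pull request: safe_inchi and to_cif_value are hypothetical names, compute_inchi stands in for whatever function generates the InChI, and to_cif_value only imitates the "None is written as ?" behaviour described above.

```python
from typing import Callable, Optional


def safe_inchi(compute_inchi: Callable[[object], str], mol: object) -> Optional[str]:
    """Return the InChI for `mol`, or None if it cannot be generated.

    `compute_inchi` is a placeholder for the real InChI generator.
    """
    try:
        inchi = compute_inchi(mol)
    except Exception:
        # Generation failed, e.g. for a query-like fragment.
        return None
    # An empty result is treated the same way as a failure.
    return inchi or None


def to_cif_value(value: Optional[str]) -> str:
    """Imitate a CIF writer that emits ? for a missing value."""
    return "?" if value is None or value == "" else value


def always_fails(mol):
    raise ValueError("no InChI for this fragment")


print(to_cif_value(safe_inchi(always_fails, "porphin-like")))
print(to_cif_value(safe_inchi(lambda mol: "InChI=1S/C4H5N/c1-2-4-5-3-1/h1-5H", "pyrrole")))
```

Keeping the missing value as None until the file is written means the writer, rather than each caller, decides how it is represented in the CIF.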

Demonstrate fix works for HEM.cif

This Python jiffy:

"""
demonstrate_hem_fix.py 

read in problematic HEM.cif
find the fragments
and write out a new HEM_fix.cif to show the fix
"""
from pdbeccdutils.core import ccd_reader
from pdbeccdutils.core.fragment_library import FragmentLibrary
from pdbeccdutils.core import ccd_writer


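# read the component definition from the problematic HEM.cif
component = ccd_reader.read_pdb_cif_file('HEM.cif').component
# search the component for fragments from the fragment library
fragment_library = FragmentLibrary()
matches = component.library_search(fragment_library)
# write the component, with the fragments found, to a new CIF file
ccd_writer.write_molecule('HEM_fix.cif', component)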

Running it:

(ccdutils) osmart@macbook exchange_97 % python demonstrate_hem_fix.py                        
[18:55:24] ERROR: Unrecognized bond type: 0

WARNING:root:Computed conformer for HEM does not exist.
(ccdutils) osmart@macbook exchange_97 % grep -A 3 "pdbe_chem_comp_substructure\." HEM_fix.cif
_pdbe_chem_comp_substructure.comp_id
_pdbe_chem_comp_substructure.substructure_name
_pdbe_chem_comp_substructure.id
_pdbe_chem_comp_substructure.substructure_type
_pdbe_chem_comp_substructure.substructure_smiles
_pdbe_chem_comp_substructure.substructure_inchis
_pdbe_chem_comp_substructure.substructure_inchikeys
HEM porphin-like F1 fragment C1~C~C2~C~C3~C~C~C(~C~C4~C~C~C(~C~C5~C~C~C(~C~C~1~N~2)~N~5)~N~4)~N~3 ? ?
HEM pyrrole F2 fragment 'c1cc[nH]c1' InChI=1S/C4H5N/c1-2-4-5-3-1/h1-5H KAESVJOAVNADME-UHFFFAOYSA-N
#
(ccdutils) osmart@macbook exchange_97 % 

Notice that the porphin-like line now ends in ? ? for the two missing CIF items 😄