Pretrained chembl model requires old rdkit
roselightheart opened this issue · comments
To use the model checkpoint trained on ChEMBL, you need to be on rdkit=2019.03.4, which isn't mentioned in the README. If you're on a newer version, you'll get a KeyError when the model tries to look up SMILES in its vocabulary. I know this repo is sparsely maintained, so I'm mostly leaving this as a search term for anyone else who wants to use that checkpoint in the future.
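For anyone who wants to fail fast instead of hitting the KeyError halfway through preprocessing, here is a minimal sketch of a version check. The helper names (`installed_rdkit_version`, `rdkit_matches`) are hypothetical and not part of the repo; the only assumptions are that rdkit exposes its version as `rdkit.__version__` and that 2019.03.4 is the version the checkpoint needs, as reported above.

```python
from typing import Optional

# Version reported in this thread as working with the ChEMBL checkpoint.
REQUIRED_RDKIT = "2019.03.4"


def installed_rdkit_version() -> Optional[str]:
    """Return the installed rdkit version string, or None if rdkit is missing."""
    try:
        import rdkit
    except ImportError:
        return None
    return getattr(rdkit, "__version__", None)


def rdkit_matches(installed: Optional[str], required: str = REQUIRED_RDKIT) -> bool:
    """True only when the installed version is exactly the required one."""
    return installed is not None and installed.strip() == required


if __name__ == "__main__":
    found = installed_rdkit_version()
    if rdkit_matches(found):
        print(f"rdkit {found} matches the version the checkpoint was built with")
    else:
        print(f"rdkit {found!r} found, expected {REQUIRED_RDKIT}; "
              "vocabulary lookups may raise KeyError")
```

Running this before `preprocess.py` gives a clear message up front rather than a traceback from inside a worker process.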
Getting the same issue - here's the exact error message for others' reference:
python preprocess.py --train data/chembl/all.txt --vocab data/chembl/vocab.txt --ncpu 16 --mode single
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/marcase/.conda/envs/conda-test/lib/python3.10/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/marcase/.conda/envs/conda-test/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/home/marcase/hgraph2graph/preprocess.py", line 19, in tensorize
x = MolGraph.tensorize(mol_batch, vocab, common_atom_vocab)
File "/home/marcase/hgraph2graph/hgraph/mol_graph.py", line 153, in tensorize
tree_tensors, tree_batchG = MolGraph.tensorize_graph([x.mol_tree for x in mol_batch], vocab)
File "/home/marcase/hgraph2graph/hgraph/mol_graph.py", line 194, in tensorize_graph
fnode[v] = vocab[attr]
File "/home/marcase/hgraph2graph/hgraph/vocab.py", line 43, in __getitem__
return self.hmap[x[0]], self.vmap[x]
KeyError: 'C1=NN=CN1'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/marcase/hgraph2graph/preprocess.py", line 106, in <module>
all_data = pool.map(func, batches)
File "/home/marcase/.conda/envs/conda-test/lib/python3.10/multiprocessing/pool.py", line 364, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/home/marcase/.conda/envs/conda-test/lib/python3.10/multiprocessing/pool.py", line 771, in get
raise self._value
KeyError: 'C1=NN=CN1'
Found a super easy solution to this problem: generate a fresh vocab from the dataset rather than using the one provided. I think an rdkit update changed how some of the SMILES strings are generated, particularly for aromatic groups (this was mentioned in another issue thread).
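If you want to see how far the provided vocab has drifted before replacing it, a rough sketch like the one below can list the entries a freshly generated vocab contains that the shipped one lacks. It assumes each vocab line holds whitespace-separated SMILES fields and that the first field is the key that failed in the traceback (`self.hmap[x[0]]`); the function names are hypothetical, and the fresh vocab is assumed to come from the repo's own vocabulary extraction script (check the README for the exact command).

```python
from typing import Iterable, List, Tuple

Entry = Tuple[str, ...]


def parse_vocab(lines: Iterable[str]) -> List[Entry]:
    """Split each non-empty line into a tuple of whitespace-separated fields."""
    entries = []
    for line in lines:
        fields = tuple(line.strip("\r\n ").split())
        if fields:
            entries.append(fields)
    return entries


def missing_first_column(fresh: List[Entry], shipped: List[Entry]) -> List[str]:
    """Return sorted first-column SMILES present in `fresh` but not in `shipped`."""
    known = {entry[0] for entry in shipped}
    return sorted({entry[0] for entry in fresh if entry[0] not in known})


if __name__ == "__main__":
    # Tiny in-memory example; in practice, read both vocab files with open().
    shipped = parse_vocab(["C1=CC=CC=C1 C1=CC=[CH:1]C=C1"])
    fresh = parse_vocab([
        "C1=CC=CC=C1 C1=CC=[CH:1]C=C1",
        "C1=NN=CN1 C1=NN=[CH:1]N1",
    ])
    for smiles in missing_first_column(fresh, shipped):
        print("not in shipped vocab:", smiles)
```

One caveat I have not verified: a regenerated vocab will probably differ in size and ordering from the shipped one, so it may only be suitable for training from scratch rather than for loading the pretrained checkpoint.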