Generation example not working

Question

cristianregep opened this issue 4 years ago

I downloaded the package and, from the generation folder, ran the suggested commands:
python get_vocab.py --min_frequency 100 --ncpu 8 < ../data/polymers/all.txt > ../data/polymers/vocab.txt
python preprocess.py --train ../data/polymers/train.txt --vocab data/polymers/vocab.txt --ncpu 8

I get the following error:
"""
Traceback (most recent call last):
  File "/home/cristian/anaconda3/envs/hgraph/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/cristian/anaconda3/envs/hgraph/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "preprocess.py", line 19, in tensorize
    x = MolGraph.tensorize(mol_batch, vocab, common_atom_vocab)
  File "/home/cristian/Work/hgraph2graph/generation/poly_hgraph/mol_graph.py", line 168, in tensorize
    tree_tensors, tree_batchG = MolGraph.tensorize_graph([x.mol_tree for x in mol_batch], vocab)
  File "/home/cristian/Work/hgraph2graph/generation/poly_hgraph/mol_graph.py", line 209, in tensorize_graph
    fnode[v] = vocab[attr]
  File "/home/cristian/Work/hgraph2graph/generation/poly_hgraph/vocab.py", line 43, in __getitem__
    return self.hmap[x[0]], self.vmap[x]
KeyError: ('C1=CSC=N1', 'N1=[CH:2]S[CH:2]=[CH:1]1')
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "preprocess.py", line 49, in <module>
    all_data = pool.map(func, batches)
  File "/home/cristian/anaconda3/envs/hgraph/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/cristian/anaconda3/envs/hgraph/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
KeyError: ('C1=CSC=N1', 'N1=[CH:2]S[CH:2]=[CH:1]1')

Cristian Regep · Answer 1 · Tue Apr 21 2020 18:29:18 GMT+0800 (China Standard Time)

I traced the issue to preprocess.py, which loads the motifs from the vocab file instead of loading the original motifs that passed the min_frequency threshold in get_vocab.py:
MolGraph.load_fragments([x[0] for x in vocab])

I worked around this by saving the original fragments to a separate file after running get_vocab.py and then loading them in preprocess.py. I think the molecules are not being split in the same way because the starting fragments differ.
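The workaround described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `save_fragments`, `load_fragments_file`, the file name `fragments.txt`, and the one-SMILES-per-line format are all assumptions made for the example, and the fragment strings are placeholders.

```python
import os
import tempfile


def save_fragments(fragments, path):
    """Write one fragment SMILES per line (hypothetical helper).

    Duplicates are dropped and the output is sorted so that repeated
    runs produce the same file.
    """
    with open(path, "w") as f:
        for smiles in sorted(set(fragments)):
            f.write(smiles + "\n")


def load_fragments_file(path):
    """Read the fragment list back, skipping blank lines (hypothetical helper)."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


# Placeholder fragments standing in for the motifs that passed min_frequency.
kept = ["C1=CSC=N1", "C1=CC=CC=C1", "C1=CSC=N1"]

path = os.path.join(tempfile.mkdtemp(), "fragments.txt")
save_fragments(kept, path)          # would run at the end of get_vocab.py
restored = load_fragments_file(path)  # would run at the start of preprocess.py
```

In preprocess.py the restored list would then take the place of `[x[0] for x in vocab]` in the `MolGraph.load_fragments(...)` call, assuming that call accepts any list of SMILES strings.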

Wengong Jin · Answer 2 · Fri Apr 24 2020 02:57:21 GMT+0800 (China Standard Time)

Hi,

I fixed this issue and now it should be able to run. Thank you!

HayeonLee · Answer 3 · Thu Jun 25 2020 15:23:00 GMT+0800 (China Standard Time)

Hi, when I tried to run the generation example, I got a similar error, shown below.
Could you check this error? @wengong-jin

code:
python preprocess.py --train ../data/polymers/train.txt --vocab ../data/polymers/inter_vocab.txt --ncpu 8

error:
Traceback (most recent call last):
  File "preprocess.py", line 48, in <module>
    all_data = pool.map(func, batches)
  File "/st2/hayeon/anaconda3/envs/metasamp/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/st2/hayeon/anaconda3/envs/metasamp/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
KeyError: ('CN1C(=O)C2=C3C(=C(F)C=C4C(=O)N(C)C(=O)C(=C43)C(F)=C2)C1=O', 'CN1C(=O)C2=CC(F)=C3C(=O)N(C)C(=O)C4=C3C2=C(C1=O)C(F)=[CH:1]4')

Wengong Jin · Answer 4 · Thu Jun 25 2020 23:15:22 GMT+0800 (China Standard Time)

Hi,

I tried running the same command and there was no error. I suggest running get_vocab.py and checking whether its output differs from data/polymers/inter_vocab.txt. If they differ (I would be surprised), please try rerunning preprocess.py and see if it succeeds.
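One way to make that comparison is a small script along these lines. It is only a sketch: `read_vocab_lines` and `compare_vocab` are hypothetical helpers, and treating each vocab line as an unordered, whitespace-normalized entry is an assumption about the file format. The demo uses in-memory placeholder entries rather than the real files.

```python
def read_vocab_lines(path):
    """Return the set of non-blank lines, with whitespace normalized."""
    with open(path) as f:
        return {" ".join(line.split()) for line in f if line.strip()}


def compare_vocab(lines_a, lines_b):
    """Return (entries only in a, entries only in b), each sorted."""
    return sorted(lines_a - lines_b), sorted(lines_b - lines_a)


# Placeholder entries; with real files one would call, for example,
# read_vocab_lines("new_vocab.txt") and read_vocab_lines("inter_vocab.txt").
fresh = {"A A1", "B B1", "C C1"}
shipped = {"A A1", "B B2", "C C1"}

only_fresh, only_shipped = compare_vocab(fresh, shipped)
print("only in fresh output:", only_fresh)
print("only in shipped file:", only_shipped)
```

Two empty lists would mean the regenerated vocab matches the shipped one, in which case the cause of the KeyError lies elsewhere.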

Nikhil Mittal · Answer 5 · Thu Aug 20 2020 17:18:33 GMT+0800 (China Standard Time)

Hi, I had the same problem. I found that the key being looked up in the vocab differs from the entries that are actually available. In my case the SMILES representations differed in the double bond: C(O) versus C(=O).

Could you tell me how this can be resolved?
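To narrow down this kind of mismatch, a small diagnostic like the one below may help. It is a sketch under assumptions: the vocab is taken to be a list of (motif, motif-with-attachment-labels) pairs, `diagnose` and `variants_for` are hypothetical helpers, and the single vocab entry is an illustrative placeholder. Since the traceback shows `self.hmap[x[0]], self.vmap[x]` and the KeyError reports the whole tuple, the failing lookup is probably the pair rather than the motif alone.

```python
def diagnose(pairs, key):
    """Classify why a (motif, labeled_motif) key is missing from the vocab."""
    if key in set(pairs):
        return "present"
    if key[0] not in {p[0] for p in pairs}:
        return "motif missing"
    return "motif present, labeled variant missing"


def variants_for(pairs, motif):
    """List the labeled variants the vocab does contain for a motif."""
    return sorted(p[1] for p in pairs if p[0] == motif)


# Illustrative placeholder vocab entry, not taken from the real vocab file.
pairs = [("C1=CSC=N1", "C1=CS[CH:1]=N1")]

# Key copied from the KeyError in the first traceback.
key = ("C1=CSC=N1", "N1=[CH:2]S[CH:2]=[CH:1]1")

status = diagnose(pairs, key)
print(status)
print(variants_for(pairs, key[0]))
```

Comparing the missing key against the variants that do exist for the same motif shows whether the difference is in atom ordering, bond notation, or attachment labels.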