wengong-jin / hgraph2graph

Hierarchical Generation of Molecular Graphs using Structural Motifs

vocab size of data/chembl/vocab.txt is 5623, but the vocab.txt regenerated with get_vocab.py has 5625 entries

bsaldivaremc2 opened this issue

When following the instructions in the README.md, neither of the commands shown seems to work out of the box.
So far I have added py_modules=['hgraph'] in setup.py and added ",clearAromaticFlags=True)" to a call in the chemutils.py file.
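For reference, a minimal sketch of those two edits (the exact locations are assumptions from my checkout; in particular I am assuming the chemutils.py call being patched is RDKit's Kekulize):

    # setup.py: declare the hgraph code so `pip install .` picks it up
    # (py_modules as stated above; depending on the layout, packages=['hgraph'] may be the more usual choice)
    from setuptools import setup
    setup(name='hgraph2graph', py_modules=['hgraph'])

    # hgraph/chemutils.py: assuming the patched call is RDKit's Kekulize
    from rdkit import Chem
    mol = Chem.MolFromSmiles('c1ccccc1')
    Chem.Kekulize(mol, clearAromaticFlags=True)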

Sampling from the checkpoint does not work:
python generate.py --vocab data/chembl/vocab.txt --model ckpt/chembl-pretrained/model.ckpt --nsample 1000

So I tried to reproduce the vocab with:
python get_vocab.py --ncpu 16 < data/chembl/all.txt > new_vocab.txt
It works, but new_vocab.txt has 5625 lines while data/chembl/vocab.txt has 5623, and there are multiple differences between them, not just two.
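One quick way to see exactly which entries differ is a line-wise set difference between the two files, for example:

    # compare the shipped vocab with the regenerated one, line by line
    with open('data/chembl/vocab.txt') as f:
        old = {line.strip() for line in f if line.strip()}
    with open('new_vocab.txt') as f:
        new = {line.strip() for line in f if line.strip()}
    print(len(old), 'shipped entries,', len(new), 'regenerated entries')
    print('only in regenerated vocab:', len(new - old))
    print('only in shipped vocab:', len(old - new))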

Do you have any way to sample from the checkpoint without issues?
Also, why am I getting a different vocab from the same data/chembl/all.txt file? Is there some random operation involved? I left all random seeds as they are in the scripts.

I have the same problem. Did you solve it?

I did not solve it, but I am skipping some functionality to make it work with the provided pre-trained model and vocabulary.
I noticed that when anchor_smiles in decoder.decode (hgraph/decoder.py) contains more than one element, there is an error.
So I restricted it to a single anchor by adding if len(anchor_smiles)>1: continue right after the call to get_assm_cands, before the existing empty-candidate check:

    inter_cands, anchor_smiles, attach_points = graph_batch.get_assm_cands(fa_cluster, fa_used, ismiles)
    if len(anchor_smiles) > 1: continue
    if len(inter_cands) == 0:

I probably solved the problem.
It works for the first 900 million samples you generate.
Instead of the original vocab, use this one: https://github.com/bsaldivaremc2/hgraph2graph/blob/master/data/chembl/recovered_vocab_2000.txt
python generate.py --vocab data/chembl/recovered_vocab_2000.txt --model ckpt/chembl-pretrained/model.ckpt --nsample 1000
I captured all the motifs that were causing the problem and included them in the original vocab list, replacing the 27 least-used motif pairs.
Details of the files here: https://github.com/bsaldivaremc2/hgraph2graph/tree/master/data/chembl
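For anyone who wants to build a patched vocab like this themselves, the idea can be sketched roughly as below. This is a hypothetical illustration, not the exact script behind recovered_vocab_2000.txt: problem_motifs.txt (the vocab lines that crash generation) and motif_counts.txt (usage counts per vocab line, tab-separated) are placeholder inputs you would have to produce yourself.

    # vocab_patch_sketch.py -- illustration only
    with open('data/chembl/vocab.txt') as f:
        vocab = [line.rstrip('\n') for line in f if line.strip()]
    with open('problem_motifs.txt') as f:
        missing = [m for m in (line.rstrip('\n') for line in f) if m and m not in set(vocab)]
    counts = {}
    with open('motif_counts.txt') as f:
        for line in f:
            count, motif = line.rstrip('\n').split('\t', 1)
            counts[motif] = int(count)

    # Overwrite the least-used existing entries in place, so the file keeps the
    # same number of lines (and the same order for every other entry) and the
    # pretrained checkpoint's vocab-sized layers can still be loaded.
    slots = sorted(range(len(vocab)), key=lambda i: counts.get(vocab[i], 0))[:len(missing)]
    for i, motif in zip(slots, missing):
        vocab[i] = motif

    with open('recovered_vocab.txt', 'w') as f:
        f.write('\n'.join(vocab) + '\n')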