uta-smile / RetroXpert


How to use the trained model to predict the custom dataset?

CHANG-Shaole opened this issue · comments

Hi, Chaochao and your research team.
Recently I trained both the EGAT and RGN models.
Now I want to run predictions on my own dataset to see how the models really perform.
I didn't find a similar question in the issues list, which is why I am opening this one.
For now, I have some reactions (such as OB(O)C1=C(OC)C=C(C(F)(F)F)C=C1.O=C(N1)NN=C(Br)C1=O>>O=C(N1)NN=C(C2=CC=C(C(F)(F)F)C=C2OC)C1=O). In a real synthetic route, I think we only need to provide the product O=C(N1)NN=C(C2=CC=C(C(F)(F)F)C=C2OC)C1=O.
So I want to know how to configure the data pipeline, use the trained models, and obtain a predicted result.
Anyone is welcome to answer the question and discuss.
Thanks in advance!

Hi, CHANG-Shaole.
Have you figured out how this is done? If not, maybe we can communicate and work through it together.

Hi! Thank you for your reply.
I haven't completed it yet, but I have some ideas. We need to rewrite some of the code to handle this case.
We need to drop the reactant processing from the dataset generation and focus on processing the products.
I am rewriting the code, but it isn't finished yet.
Do you have any ideas about this?

Same idea: we need to focus on the product side. First we extract the features and pass them through EGAT so it predicts the disconnection. Once we have the product and the synthons, we feed them to the RGN (for the RGN part, I think there is not much to do).

UPDATE:
So for EGAT, the model input in the training phase is x_graph (which is constructed from product_adj and product_bond_features)
and x_atom (which is product_atom_features).
I managed to write code that extracts these two inputs for a single product molecule [CH3:1][C:2]([CH3:3])([CH3:4])[O:5][C:6](=[O:7])[NH:8][NH:9][C:10](=[O:11])[c:12]1[cH:13][cH:14][c:15]([Br:16])[cH:17][c:18]1[Cl:19]
and this is the output:
Graph(num_nodes=19, num_edges=57, ndata_schemes={} edata_schemes={'w': Scheme(shape=(12,), dtype=torch.bool)})
Now from here I am not sure how to use the EGAT checkpoint to predict the synthons. If I find out I'll let you know; I'd also like you to give me feedback if my approach is wrong.
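For the checkpoint step, a minimal sketch of the usual PyTorch inference pattern may help. Everything here is an assumption for illustration: `GATNet` is a stand-in for the real EGAT class, the commented-out checkpoint path is hypothetical, and the input is a dummy tensor matching the 57 edges x 12 edge features printed above.

```python
# Hedged sketch of loading a trained checkpoint and running inference in
# PyTorch. GATNet, the checkpoint path, and the inputs are placeholders --
# substitute the actual EGAT model and saved file from the repo.
import torch
import torch.nn as nn

class GATNet(nn.Module):              # stand-in for the real EGAT model
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(12, 1)

    def forward(self, edge_feats):
        # one score per edge; sigmoid turns it into a disconnection probability
        return torch.sigmoid(self.layer(edge_feats)).squeeze(-1)

model = GATNet()
# state = torch.load("checkpoints/EGAT_best.pt", map_location="cpu")  # real run
# model.load_state_dict(state)
model.eval()                          # disable dropout / batch-norm updates

with torch.no_grad():                 # no gradients needed at inference time
    edge_feats = torch.zeros(57, 12)  # 57 edges x 12 features, as printed above
    scores = model(edge_feats)
print(scores.shape)                   # torch.Size([57])
```

The key points are `model.eval()` and `torch.no_grad()`; without them dropout stays active and memory is wasted on gradient tracking.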


Sorry for the late reply. To run inference, you first need to prepare a molecule graph and predict the bond disconnection using EGAT. Once you have the disconnection, edit the product graph to generate an intermediate molecule graph called a synthon. Then convert the synthon into SMILES. The last step is to feed the SMILES sequence into the RGN to obtain the reactant predictions. Hope this helps.
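The graph-editing step can be sketched without any chemistry toolkit: represent the product as a set of bonds, delete the predicted disconnection bond, and take the connected components as the synthons. This is only a toy illustration (real code would operate on RDKit molecules and regenerate SMILES); the atom indices and the chosen bond below are made up.

```python
# Toy sketch: break one bond in a molecular graph and return the resulting
# connected components (the "synthons"). Assumes atoms are integer indices.
from collections import defaultdict

def break_bond(edges, bond):
    """Remove one bond (u, v) and return connected components as sorted lists."""
    adj = defaultdict(set)
    for u, v in edges:
        if {u, v} != set(bond):      # drop the disconnected bond
            adj[u].add(v)
            adj[v].add(u)
    atoms = {a for e in edges for a in e}
    seen, components = set(), []
    for start in sorted(atoms):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:                 # iterative DFS over the remaining bonds
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        components.append(sorted(comp))
    return components

# Toy 5-atom chain 0-1-2-3-4; breaking bond (1, 2) yields two synthons.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(break_bond(edges, (1, 2)))     # [[0, 1], [2, 3, 4]]
```

After this step, each component would be converted back to a SMILES fragment and joined with `.` before being fed to the RGN.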

I am not sure how to concatenate the pattern features with product_atom_features. Any idea?

Hi, @youcefBouraoui
I'm not entirely clear on your problem.
But I remember that after extracting the pattern features, we need to concatenate them into the integrated data files for EGAT training/prediction. Use extract_semi_template_pattern.py.
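The concatenation itself is just appending the per-atom pattern indicators to the per-atom base features, row by row. A minimal sketch with plain lists (the feature sizes below are invented; in the repo this happens on the arrays produced by extract_semi_template_pattern.py):

```python
# Hedged sketch: append semi-template pattern features to atom features.
# Both inputs must have one row per atom; sizes here are made up.
def concat_features(atom_feats, pattern_feats):
    assert len(atom_feats) == len(pattern_feats), "one row per atom required"
    return [a + p for a, p in zip(atom_feats, pattern_feats)]

atom_feats = [[0.1, 0.2], [0.3, 0.4]]    # 2 atoms x 2 base features
pattern_feats = [[1, 0, 0], [0, 1, 0]]   # 2 atoms x 3 pattern indicators
combined = concat_features(atom_feats, pattern_feats)
print(len(combined), len(combined[0]))   # 2 5
```

With NumPy arrays the equivalent would be a concatenation along the feature axis (`axis=1`).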


Hi, @chaoyan1037
Thank you very much for the reply! I tried to use EGAT to predict the disconnection. (Please note the input only includes the product SMILES.)
I believe this should be the disconnection probability predicted by EGAT (the second line should be the ground truth and the third line the predicted result):

0 True
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 7.748306163346541e-11 7.748306163346541e-11 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
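Assuming, as the printout above suggests, that a score near 0 marks the bond to break, turning these per-edge scores into a bond selection could be as simple as thresholding (the threshold value 0.5 is an assumption, not taken from the repo):

```python
# Pick predicted disconnection edges from per-edge scores.
# Assumption based on the printout above: scores near 0 mark broken bonds,
# so any edge below the threshold is selected. Returned values are edge indices.
def disconnected_edges(scores, threshold=0.5):
    return [i for i, s in enumerate(scores) if s < threshold]

# Same shape as the 59-score printout above: two near-zero entries.
scores = [1.0] * 35 + [7.748306163346541e-11] * 2 + [1.0] * 22
print(disconnected_edges(scores))   # [35, 36]
```

The two selected indices would then be mapped back to the corresponding bond in the product graph before the synthon-editing step.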

Besides, if I only input the product SMILES, there should be no ground truth (after I changed some of the code).
Now that I have the disconnection prediction results, how can I convert them into synthons for the RGN predictions?
By the way, do you have an inference program whose input only includes the product SMILES?

Anyone is welcome to discuss.

Hi, @chaoyan1037 Sorry for bothering you again.

There is another problem with canonicalize_products.py.

I tried an rxn_smiles without atom-map numbers:
OB(O)C1=C(OC)C=C(C(F)(F)F)C=C1.O=C(N1)NN=C(Br)C1=O>>O=C(N1)NN=C(C2=CC=C(C(F)(F)F)C=C2OC)C1=O
After running canonicalize_products.py, I got:
COc1cc(C(F)(F)F)ccc1B(O)O.O=c1[nH]nc(Br)c(=O)[nH]1>>[CH3:1][O:2][c:3]1[cH:4][c:5]([C:6]([F:7])([F:8])[F:9])[cH:10][cH:11][c:12]1-[c:13]1[n:14][nH:15][c:16](=[O:17])[nH:18][c:19]1=[O:20]

I am not sure whether this format meets the requirements for generating the train/test dataset (only the product SMILES has atom-map numbers).
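As a quick sanity check of that output, a small script can verify which side of the reaction SMILES carries atom-map numbers (the `:n` suffixes inside brackets). This only checks the text format; it is not a substitute for the repo's own preprocessing:

```python
# Check which side of a reaction SMILES is atom-mapped.
# An atom-map number looks like ':12' just before a closing bracket, e.g. [CH3:1].
import re

MAP_RE = re.compile(r":\d+\]")

def map_status(rxn_smiles):
    reactants, product = rxn_smiles.split(">>")
    return bool(MAP_RE.search(reactants)), bool(MAP_RE.search(product))

rxn = ("COc1cc(C(F)(F)F)ccc1B(O)O.O=c1[nH]nc(Br)c(=O)[nH]1>>"
       "[CH3:1][O:2][c:3]1[cH:4][c:5]([C:6]([F:7])([F:8])[F:9])[cH:10]"
       "[cH:11][c:12]1-[c:13]1[n:14][nH:15][c:16](=[O:17])[nH:18][c:19]1=[O:20]")
print(map_status(rxn))   # (False, True): only the product is atom-mapped
```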

Hi, @CHANG-Shaole, any update? Did you manage to create the inference script? Also, did you train the model on the USPTO-full dataset? If so, could you provide me with the checkpoint file? Thanks.