Script to generate canonicalized_csv files
linminhtoo opened this issue
Hi Chaochao and team,
Excellent work on RetroXpert! I want to train the model on my own processed version of the USPTO-50k dataset. I downloaded the 3 files from the schneider50k folder and removed certain reactions for my own project, e.g. duplicates and any rxn_smi that occurs in both train and test: https://www.dropbox.com/sh/6ideflxcakrak10/AADTbFBC0F8ax55-z-EDgrIza?dl=0
However, I am not sure exactly what is involved in producing the same kind of 'canonicalized_csv' files. Could you please elaborate on the original source of your USPTO-50k & USPTO_FULL data, and upload the script that generates the canonicalized_csv files from that original data?
Thank you very much and happy new year.
EDIT:
I believe this is a trivial step, so for now I am using the simple function below. Still, I would like to confirm with the authors whether any other preprocessing steps are involved.
from rdkit import Chem

def canonicalize_rxn_smi(
    rxn_smi: str,
):
    # Canonicalize each side of the reaction separately (second argument is isomericSmiles=True)
    prod_smi = rxn_smi.split('>>')[-1]
    prod_smi_canon = Chem.MolToSmiles(Chem.MolFromSmiles(prod_smi), True)
    rcts_smi = rxn_smi.split('>>')[0]
    rcts_smi_canon = Chem.MolToSmiles(Chem.MolFromSmiles(rcts_smi), True)
    return rcts_smi_canon + '>>' + prod_smi_canon
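For context, the filtering mentioned above (dropping duplicates and reactions shared between train and test) could be sketched roughly as follows. This is only an illustration: the helper names `dedupe` and `drop_overlap` are hypothetical, the toy reaction strings are made up, and it assumes the reaction SMILES were already canonicalized so that identical reactions compare equal as strings. Removing the overlap from the train side rather than the test side is also an assumption.

```python
from typing import Iterable, List, Set


def dedupe(rxn_smiles: Iterable[str]) -> List[str]:
    """Drop exact duplicate reaction SMILES, keeping first occurrences in order."""
    seen: Set[str] = set()
    unique: List[str] = []
    for rxn_smi in rxn_smiles:
        if rxn_smi not in seen:
            seen.add(rxn_smi)
            unique.append(rxn_smi)
    return unique


def drop_overlap(train: Iterable[str], test: Iterable[str]) -> List[str]:
    """Deduplicate train, then remove any reaction that also appears in test."""
    test_set = set(test)
    return [rxn_smi for rxn_smi in dedupe(train) if rxn_smi not in test_set]


# Toy data (made up for illustration, not taken from USPTO-50k)
train = ["CCO.CC(=O)O>>CC(=O)OCC", "CCO.CC(=O)O>>CC(=O)OCC", "CCBr.[OH-]>>CCO"]
test = ["CCBr.[OH-]>>CCO"]
clean_train = drop_overlap(train, test)
print(clean_train)
```

String comparison only catches reactions whose SMILES are written identically, which is why canonicalizing first matters.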
Hi,
Here is how we preprocess the product SMILES.
https://github.com/uta-smile/RetroXpert/blob/main/canonicalize_products.py
We have also made an important update to our method. Please refer to the README for more details.