uta-smile / RetroXpert

Script to generate canonicalized_csv files

linminhtoo opened this issue

Hi Chaochao and team,

Excellent work on RetroXpert! I want to train the model on my own processed version of the USPTO-50k dataset. I downloaded the 3 files from the schneider50k folder and removed certain reactions for my own project, e.g. duplicates and identical rxn_smi that occur in both train & test: https://www.dropbox.com/sh/6ideflxcakrak10/AADTbFBC0F8ax55-z-EDgrIza?dl=0
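
For reference, the filtering described above could look roughly like the following minimal sketch. It assumes the reactions are held as plain lists of reaction SMILES strings, that "duplicate" means an exact string match, and that overlapping reactions are dropped from the train split rather than from test; the helper names and the example reactions are made up for illustration and are not taken from the dataset.

```python
from typing import Iterable, List


def drop_duplicates(rxn_smis: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each reaction SMILES, preserving order."""
    return list(dict.fromkeys(rxn_smis))


def remove_overlap(train: List[str], test: List[str]) -> List[str]:
    """Return the train reactions that do not also appear in test."""
    test_set = set(test)
    return [rxn for rxn in train if rxn not in test_set]


# Illustrative reaction SMILES only (assumption: not real USPTO-50k entries).
train = ["CCO>>CC=O", "CCO>>CC=O", "CC(=O)O.CO>>CC(=O)OC"]
test = ["CC(=O)O.CO>>CC(=O)OC", "CCN>>CC#N"]

train_clean = remove_overlap(drop_duplicates(train), drop_duplicates(test))
print(train_clean)  # ['CCO>>CC=O']
```

Because the comparison is on raw strings, two different SMILES spellings of the same reaction would not be caught unless both are canonicalized first.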

However, I am not sure what exactly is involved in producing the same kind of 'canonicalized_csv' files. Could you please elaborate on the original source of your USPTO-50k & USPTO_FULL data, and upload the script that generates the canonicalized_csv files from that original data?

Thank you very much and happy new year.

EDIT:
I believe this is a trivial step, so I am using the simple function below. Still, I would like to confirm with the authors whether any other preprocessing steps are involved.

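from rdkit import Chem

def canonicalize_rxn_smi(rxn_smi: str) -> str:
    """Canonicalize the reactant and product sides of a 'reactants>>product' SMILES."""
    # Reactants are everything before '>>', the product is everything after it.
    rcts_smi = rxn_smi.split('>>')[0]
    prod_smi = rxn_smi.split('>>')[-1]
    # isomericSmiles=True keeps stereochemistry in the canonical output.
    rcts_smi_canon = Chem.MolToSmiles(Chem.MolFromSmiles(rcts_smi), isomericSmiles=True)
    prod_smi_canon = Chem.MolToSmiles(Chem.MolFromSmiles(prod_smi), isomericSmiles=True)
    return rcts_smi_canon + '>>' + prod_smi_canon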
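
One caveat with the function above: `Chem.MolFromSmiles` returns `None` when it cannot parse a SMILES string, and passing that `None` on to `Chem.MolToSmiles` raises an error. A more defensive variant might look like the sketch below. The names `rdkit_canonical` and `canonicalize_rxn_smi_safe` and the `canon` parameter are hypothetical, added here only so the canonicalizer can be swapped out; this is not the authors' preprocessing.

```python
from typing import Callable, Optional


def rdkit_canonical(smi: str) -> Optional[str]:
    """Canonical SMILES via RDKit, or None if the input cannot be parsed."""
    from rdkit import Chem  # imported lazily so the module loads without RDKit

    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol, isomericSmiles=True)


def canonicalize_rxn_smi_safe(
    rxn_smi: str,
    canon: Callable[[str], Optional[str]] = rdkit_canonical,  # hypothetical hook
) -> Optional[str]:
    """Canonicalize 'reactants>>product'; return None for malformed input."""
    parts = rxn_smi.split('>>')
    if len(parts) != 2:
        # Not of the form 'reactants>>product' (e.g. a reagent field is present).
        return None
    rcts_canon, prod_canon = canon(parts[0]), canon(parts[1])
    if rcts_canon is None or prod_canon is None:
        return None
    return rcts_canon + '>>' + prod_canon
```

Rows for which this returns `None` can then be logged or dropped explicitly instead of crashing the whole preprocessing run.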
commented

Hi,

Here is how we preprocess the product SMILES.

https://github.com/uta-smile/RetroXpert/blob/main/canonicalize_products.py

commented

We have made an important update to our method. Please refer to the README for more details.