Script to generate canonicalized_csv files

Question

linminhtoo opened this issue 4 years ago

Hi Chaochao and team,

Excellent work on RetroXpert! I want to train the model on my own processed version of the USPTO-50k dataset. I downloaded the 3 files from the schneider50k folder and removed certain reactions for my own project (e.g. duplicates, and identical rxn_smi that occur in both train & test): https://www.dropbox.com/sh/6ideflxcakrak10/AADTbFBC0F8ax55-z-EDgrIza?dl=0
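
For readers who want to reproduce this kind of cleaning, the filtering described above could be sketched roughly as follows. This is only an illustration, not the script actually used in this issue or by the RetroXpert authors: the helper name `dedupe_splits` and the split keys `train`, `valid` and `test` are assumptions, and reactions are compared as exact strings, so two differently written SMILES for the same reaction would not be recognized as duplicates.

```python
from typing import Dict, List


def dedupe_splits(splits: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Hypothetical helper: drop repeated reaction SMILES within each split,
    then drop from 'train' any reaction that also occurs in 'valid' or 'test'."""
    cleaned = {}
    for name, rxns in splits.items():
        # dict.fromkeys keeps the first occurrence and preserves the order.
        cleaned[name] = list(dict.fromkeys(rxns))

    # Reactions in the held-out splits must not leak into the training split.
    held_out = set(cleaned.get("valid", [])) | set(cleaned.get("test", []))
    cleaned["train"] = [r for r in cleaned.get("train", []) if r not in held_out]
    return cleaned


# Toy reaction strings standing in for real rxn_smi entries.
example = dedupe_splits({
    "train": ["A.B>>C", "A.B>>C", "D>>E", "F>>G"],
    "valid": ["H>>I"],
    "test": ["D>>E", "D>>E"],
})
print(example["train"])  # ['A.B>>C', 'F>>G']
```

Removing the overlap from the training split rather than from the test split is a design choice made here for illustration; it leaves the held-out sets unchanged.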

However, I am not sure exactly what is involved in producing the same kind of 'canonicalized_csv' files. Could you please elaborate on the original source of your USPTO-50k & USPTO_FULL data, and upload the script that generates the canonicalized_csv files from that original data?

Thank you very much and happy new year.

EDIT:
I believe this is a trivial step, and I am currently using the simple function below. Still, I would like to confirm with the authors whether any other preprocessing steps are involved.

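from rdkit import Chem

def canonicalize_rxn_smi(rxn_smi: str) -> str:
    # Canonicalize the product side (everything after '>>').
    prod_smi = rxn_smi.split('>>')[-1]
    prod_smi_canon = Chem.MolToSmiles(Chem.MolFromSmiles(prod_smi), isomericSmiles=True)
    # Canonicalize the reactant side (everything before '>>').
    rcts_smi = rxn_smi.split('>>')[0]
    rcts_smi_canon = Chem.MolToSmiles(Chem.MolFromSmiles(rcts_smi), isomericSmiles=True)
    return rcts_smi_canon + '>>' + prod_smi_canon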

Allen · Answer 1 · Wed Jan 06 2021 00:17:42 GMT+0800 (China Standard Time)

Hi,

Here is how we preprocess the product SMILES.

https://github.com/uta-smile/RetroXpert/blob/main/canonicalize_products.py

Allen · Answer 2 · Thu Apr 15 2021 10:15:41 GMT+0800 (China Standard Time)

We have made an important update to our method. Please refer to the README for more details.