MolScribe

This is the repository for MolScribe, an image-to-graph model that translates a molecular image to its chemical structure. Try our demo on HuggingFace!

If you use MolScribe in your research, please cite our paper.

@article{
    MolScribe,
    title = {{MolScribe}: Robust Molecular Structure Recognition with Image-to-Graph Generation},
    author = {Yujie Qian and Jiang Guo and Zhengkai Tu and Zhening Li and Connor W. Coley and Regina Barzilay},
    journal = {Journal of Chemical Information and Modeling},
    publisher = {American Chemical Society ({ACS})},
    doi = {10.1021/acs.jcim.2c01480},
    year = 2023,
}

Please check out our work on reaction diagram parsing: RxnScribe (paper, code, demo)!

Quick Start

Installation

Option 1: Install MolScribe with pip

pip install MolScribe

Option 2: Install from source by running the following commands, which install the package and its dependencies

git clone git@github.com:thomas0809/MolScribe.git
cd MolScribe
python setup.py install

Example

Download the MolScribe checkpoint from HuggingFace Hub and predict molecular structures:

import torch
from molscribe import MolScribe
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download('yujieq/MolScribe', 'swin_base_char_aux_1m.pth')

model = MolScribe(ckpt_path, device=torch.device('cpu'))
output = model.predict_image_file('assets/example.png', return_atoms_bonds=True, return_confidence=True)

The output is a dictionary with the following format:

{
    'smiles': 'Fc1ccc(-c2cc(-c3ccccc3)n(-c3ccccc3)c2)cc1',
    'molfile': '***', 
    'confidence': 0.9175,
    'atoms': [{'atom_symbol': '[Ph]', 'x': 0.5714, 'y': 0.9523, 'confidence': 0.9127}, ... ],
    'bonds': [{'bond_type': 'single', 'endpoint_atoms': [0, 1], 'confidence': 0.9999}, ... ]
}
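As a minimal sketch of how the returned dictionary might be consumed, the snippet below lists the atoms and bonds whose confidence falls below a cutoff. The sample values, the 0.95 threshold, and the helper name low_confidence_items are illustrative assumptions and are not part of MolScribe.

```python
from typing import Any, Dict, List

# Hypothetical result with the same keys as the dictionary shown above;
# the values are made up for illustration and were not produced by the model.
output: Dict[str, Any] = {
    "smiles": "CCO",
    "confidence": 0.91,
    "atoms": [
        {"atom_symbol": "C", "x": 0.10, "y": 0.50, "confidence": 0.99},
        {"atom_symbol": "C", "x": 0.40, "y": 0.50, "confidence": 0.97},
        {"atom_symbol": "O", "x": 0.70, "y": 0.50, "confidence": 0.62},
    ],
    "bonds": [
        {"bond_type": "single", "endpoint_atoms": [0, 1], "confidence": 0.99},
        {"bond_type": "single", "endpoint_atoms": [1, 2], "confidence": 0.80},
    ],
}


def low_confidence_items(items: List[Dict[str, Any]], threshold: float) -> List[int]:
    """Return the indices of atoms or bonds whose confidence is below threshold."""
    return [i for i, item in enumerate(items) if item["confidence"] < threshold]


THRESHOLD = 0.95  # assumed cutoff; choose one that suits your data
weak_atoms = low_confidence_items(output["atoms"], THRESHOLD)
weak_bonds = low_confidence_items(output["bonds"], THRESHOLD)

for i in weak_atoms:
    atom = output["atoms"][i]
    print(f"atom {i} ({atom['atom_symbol']}) confidence {atom['confidence']:.2f}")
for i in weak_bonds:
    bond = output["bonds"][i]
    print(f"bond {i} {bond['endpoint_atoms']} confidence {bond['confidence']:.2f}")
```

Flagging low-confidence atoms and bonds this way can help decide which predictions deserve manual review.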

Please refer to molscribe/interface.py and notebook/predict.ipynb for details and other available APIs.

For development or reproducing the experiments, please follow the instructions below.

Experiments

Requirements

Install the required packages:

pip install -r requirements.txt

Data

For training or evaluation, please download the corresponding datasets to data/.

Training data:

Datasets	Description
USPTO Download	Downloaded from USPTO, Grant Red Book.
PubChem Download	Molecules are downloaded from PubChem, and images are dynamically rendered during training.

Benchmarks:

Category	Datasets	Description
Synthetic Download	Indigo, ChemDraw	Images are rendered by Indigo and ChemDraw.
Realistic Download	CLEF, UOB, USPTO, Staker, ACS	CLEF, UOB, and USPTO are downloaded from https://github.com/Kohulan/OCSR_Review. Staker is downloaded from https://drive.google.com/drive/folders/16OjPwQ7bQ486VhdX4DWpfYzRsTGgJkSu. ACS is a new dataset that we collected ourselves.
Perturbed Download	CLEF, UOB, USPTO, Staker	Downloaded from https://github.com/bayer-science-for-a-better-life/Img2Mol/

Model

Our model checkpoints can be downloaded from Dropbox or HuggingFace Hub.

Model architecture:

Encoder: Swin Transformer, Swin-B.
Decoder: Transformer, 6 layers, hidden_size=256, attn_heads=8.
Input size: 384x384
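
For a rough sense of the sizes these settings imply, the sketch below works through the arithmetic. The input size, hidden size, and head count come from the list above; the overall downsampling factor of 32 is an assumption based on the standard Swin Transformer design and is not confirmed here for MolScribe's encoder.

```python
# Values stated in the architecture list above.
INPUT_SIZE = 384
DECODER_HIDDEN_SIZE = 256
DECODER_ATTN_HEADS = 8

# Assumption: the encoder's final stage downsamples the image by a factor of 32,
# as in the standard Swin Transformer design.
SWIN_TOTAL_STRIDE = 32

grid = INPUT_SIZE // SWIN_TOTAL_STRIDE  # side length of the final feature map
num_encoder_tokens = grid * grid  # positions the decoder could attend to
head_dim = DECODER_HIDDEN_SIZE // DECODER_ATTN_HEADS  # width of each attention head

print(grid, num_encoder_tokens, head_dim)
```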

Download the model checkpoint to reproduce our experiments:

mkdir -p ckpts
wget -P ckpts https://huggingface.co/yujieq/MolScribe/resolve/main/swin_base_char_aux_1m680k.pth

Prediction

python predict.py --model_path ckpts/swin_base_char_aux_1m680k.pth --image_path assets/example.png

The MolScribe prediction interface is in molscribe/interface.py. See the Python script predict.py or the Jupyter notebook notebook/predict.ipynb for example usage.

Evaluate MolScribe

bash scripts/eval_uspto_joint_chartok_1m680k.sh

The script uses one GPU and a batch size of 64 by default. If more GPUs are available, update NUM_GPUS_PER_NODE and BATCH_SIZE for faster evaluation.

Train MolScribe

bash scripts/train_uspto_joint_chartok_1m680k.sh

The script uses four GPUs and a batch size of 256 by default. Training the model takes about one day on four A100 GPUs. During training, we use a modified version of Indigo (included in molscribe/indigo/).

Evaluation Script

We provide a standalone evaluation script, evaluate.py. Example usage:

python evaluate.py \
    --gold_file data/real/acs.csv \
    --pred_file output/uspto/swin_base_char_aux_1m680k/prediction_acs.csv \
    --pred_field post_SMILES

Predictions should be saved in a CSV file with two columns: image_id for the index (which must match the gold file) and SMILES for the predicted SMILES. If the prediction column has a different name, specify it with --pred_field.
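
A prediction file in this layout could be written with Python's standard csv module, as in the sketch below. The ids, SMILES strings, and output path are hypothetical placeholders; only the column names image_id and SMILES come from the description above.

```python
import csv
import os
import tempfile

# Hypothetical predictions keyed by image_id; the ids must match the gold file.
predictions = {"0": "CCO", "1": "c1ccccc1"}

# A temporary directory keeps the sketch self-contained; use your own path in practice.
pred_path = os.path.join(tempfile.mkdtemp(), "prediction.csv")

with open(pred_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["image_id", "SMILES"])
    writer.writeheader()
    for image_id, smiles in predictions.items():
        writer.writerow({"image_id": image_id, "SMILES": smiles})

# Read the file back to confirm the layout.
with open(pred_path, newline="") as f:
    rows = list(csv.DictReader(f))
print(rows)
```

With the default column name, a file like this would be passed via --pred_file without needing --pred_field.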

The result contains three scores:

canon_smiles: our main metric, exact matching accuracy.
graph: graph exact matching accuracy, ignoring tetrahedral chirality.
chiral: exact matching accuracy on chiral molecules.
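
To illustrate what exact matching accuracy on canonical SMILES means, here is a minimal sketch. It is not the implementation in evaluate.py; the function names are hypothetical, and the treatment of unparseable strings as errors is an assumption. The RDKit-based canonicalizer is imported lazily, and the demo swaps in a trivial stand-in so the example runs without RDKit installed.

```python
from typing import Callable, Optional, Sequence


def rdkit_canonical(smiles: str) -> Optional[str]:
    """Canonicalize a SMILES string with RDKit; return None if it cannot be parsed.

    RDKit is imported lazily so the rest of the sketch runs without it.
    """
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def exact_match_accuracy(
    gold: Sequence[str],
    pred: Sequence[str],
    canonicalize: Callable[[str], Optional[str]] = rdkit_canonical,
) -> float:
    """Fraction of predictions whose canonical form equals the gold canonical form."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    correct = 0
    for g, p in zip(gold, pred):
        g_canon, p_canon = canonicalize(g), canonicalize(p)
        # Assumption: unparseable gold or predicted strings count as errors.
        if g_canon is not None and g_canon == p_canon:
            correct += 1
    return correct / len(gold)


# Demo with a trivial stand-in canonicalizer (whitespace stripping only).
score = exact_match_accuracy(["CCO", "CC"], ["CCO ", "CN"], canonicalize=str.strip)
print(score)
```

Canonicalizing both sides before comparing means that two different but equivalent SMILES strings for the same molecule still count as a match.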
