DECIMER 1.0: Deep Learning for Chemical Image Recognition using Transformers

Abstract

The DECIMER 1.0 [8] (Deep lEarning for Chemical ImagE Recognition) project [1] was launched to address the optical chemical structure recognition (OCSR) problem with the latest computational intelligence methods and to provide an automated open-source software solution.

The original implementation of DECIMER [1] trains slowly on a GPU once the dataset grows beyond 1 million images. A common way to shorten training is to adapt the training script to run on multiple GPUs. Instead, we implemented our code to run on Google's machine learning hardware, the TPU (Tensor Processing Unit) [2].

Method and model changes

DECIMER now uses EfficientNet-B3 [3], [4] for image feature extraction and a transformer model [5] for predicting the SMILES.
The SMILES [6] are encoded to SELFIES [7] during training and prediction.

Changes in the training method

We converted our datasets into TFRecord files, a binary file format that TPUs can read much faster. These files can also be used to train on GPUs. Using TFRecords speeds up training by removing the bottleneck of reading many individual files from hard disk.
We moved our data to Google Cloud Storage buckets, an efficient storage solution in the Google Cloud environment that any Google Cloud VM can access easily and quickly. (To get the highest speed, the cloud storage and the VM should be in the same region.)
We adopted the TensorFlow data pipeline to load all TFRecord files to the TPUs from the Google Cloud Storage buckets.
We modified the main training code to work on TPUs using the TPU strategy introduced in TensorFlow 2.0.
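
The benefit of packing many samples into one binary file can be illustrated with a minimal, standard-library sketch. This is not the TFRecord format itself (real TFRecord files also store checksums, and the project writes them with TensorFlow); the function names, the `shard-000.bin` file name and the sample payloads are assumptions made for illustration only.

```python
import struct
import tempfile
from pathlib import Path


def write_records(path, payloads):
    """Write each payload to one file as: 8-byte little-endian length + raw bytes."""
    with open(path, "wb") as fh:
        for payload in payloads:
            fh.write(struct.pack("<Q", len(payload)))
            fh.write(payload)


def read_records(path):
    """Yield the payloads back in order with a single sequential pass over the file."""
    with open(path, "rb") as fh:
        while True:
            header = fh.read(8)
            if not header:  # end of file
                break
            (length,) = struct.unpack("<Q", header)
            yield fh.read(length)


# Hypothetical stand-ins for encoded image bytes.
images = [b"image-0", b"image-1-longer", b""]

with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "shard-000.bin"  # hypothetical shard name
    write_records(shard, images)
    restored = list(read_records(shard))
```

Reading one large file sequentially avoids opening and seeking to thousands of small image files, which is the bottleneck the TFRecord conversion is meant to remove.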

Documentation

Currently, we are working on improving the documentation.

Datasets

The datasets are available in SMILES and SELFIES format and can be downloaded from Zenodo. To generate the images from the SMILES, run:

$ java -cp cdk-2.3.jar:. Smilesdepictor filtered_SMILES.txt

The image augmentations can be generated using the Python imgaug package.

Usage:

How to re-train the models

1. Generate the image data and SMILES data using the provided Java files. Input files should be in SMILES format.

# Filter only the compounds that fit DECIMER Ruleset.
$ java -cp cdk-2.3.jar:. Pubchemfilter Input_SMILES.txt

# Generate images and save them into folders.
$ java -cp cdk-2.3.jar:. Smilesdepictor filtered_SMILES.txt

2. Generate SELFIES and split them.

$ python3 Smiles2SELFIES.py Generated_SMILES.txt

# Use the sed command on Linux to split the SELFIES into tokens at the square brackets.
$ sed -i 's/\]\[/\] \[/g' Generated_SELFIES.txt
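
For systems without sed, the same substitution can be sketched in Python. The helper names below are hypothetical and not part of the repository; the regular expression mirrors the sed pattern above.

```python
import re


def split_selfies(selfies):
    """Insert a space between adjacent SELFIES tokens, turning '][' into '] ['."""
    return re.sub(r"\]\[", "] [", selfies)


def selfies_tokens(selfies):
    """Return the individual bracketed tokens as a list."""
    return split_selfies(selfies).split()


example = "[C][C][=O]"  # a short SELFIES-style string used for illustration
spaced = split_selfies(example)
tokens = selfies_tokens(example)
```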

3. Create TFRecords.

# Use the Create_tokenizer.py to create tokens and the file paths for image files. The input will be the Generated_SELFIES.txt file.
# This generates multiple files with tokenized SELFIES and Image paths. Also, this generates the final tokenizer.pkl and max_length.pkl, which can be used later for training.

# Use the Create_TFrecord_From_Vectors.py to generate TF records. 
$ python3 Create_TFrecord_From_Vectors.py 1
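
To make the tokenizer step concrete, here is a minimal sketch of what building a vocabulary and a maximum sequence length from space-separated SELFIES could look like. It is not the code in Create_tokenizer.py, which may use a different tokenizer class and pickle layout; the special tokens (`<pad>`, `<start>`, `<end>`), the function names and the plain-dictionary vocabulary are assumptions.

```python
import pickle
import tempfile
from pathlib import Path


def build_vocab(tokenized_lines, specials=("<pad>", "<start>", "<end>")):
    """Map every token to an integer index, reserving the first indices for specials."""
    vocab = {tok: idx for idx, tok in enumerate(specials)}
    for line in tokenized_lines:
        for tok in line.split():
            vocab.setdefault(tok, len(vocab))
    return vocab


def max_sequence_length(tokenized_lines):
    """Longest token sequence, plus 2 for the assumed <start> and <end> markers."""
    return max(len(line.split()) for line in tokenized_lines) + 2


# Two hypothetical lines as they would appear after the sed step.
lines = ["[C] [C] [=O]", "[C] [O]"]
vocab = build_vocab(lines)
max_length = max_sequence_length(lines)

# Persist both objects, then reload one to show the round trip.
with tempfile.TemporaryDirectory() as tmp:
    for name, obj in (("tokenizer.pkl", vocab), ("max_length.pkl", max_length)):
        with open(Path(tmp) / name, "wb") as fh:
            pickle.dump(obj, fh)
    with open(Path(tmp) / "max_length.pkl", "rb") as fh:
        reloaded_max_length = pickle.load(fh)
```

The maximum length is what allows every token sequence to be padded to the same shape before it is written to a record file.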

4. Move the TFRecords to Google Cloud Storage.

$ gsutil -m cp -r path/to/tfrecords/ path/to/cloud/storage

5. Train on Google Cloud TPUs.

Create a VM and a TPU node in the same region as your Google Cloud Storage bucket, then modify the TFRecord, tokenizer.pkl and max_length.pkl paths.

Change the TPU node name.

Once the TPU is ready on your Virtual machine console, execute: python3 TPU_Trainer_Image2Smiles_transformer.py

How to use DECIMER?

We suggest using DECIMER inside a Conda environment, which makes the dependencies easy to install.

Conda can be downloaded as part of the Anaconda or the Miniconda platforms (Python 3.7). We recommend installing miniconda3. Using Linux, you can get it with:

$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh

Instructions

$ sudo apt update
$ sudo apt install default-jdk # Only needed if Java is not already installed

Python Package Installation

Install the latest code from GitHub with:

$ pip install git+https://github.com/Kohulan/DECIMER-Image_Transformer.git

Install in development mode with:

$ git clone https://github.com/Kohulan/DECIMER-Image_Transformer.git decimer
$ cd decimer/
$ pip install -e .

Where -e means "editable" mode.

Install from PyPI with:

$ pip install decimer

Install tensorflow==2.3.0 if you do not have an Nvidia GPU (for example, on macOS).

CLI Usage

The Python package automatically installs the decimer command-line tool.

$ decimer --help  # Use for help

When you run the program for the first time, the models are downloaded automatically (note: total size is ~1 GB). You can also download the models manually. Example commands:

$ decimer --model Canonical --image Sample_Images/caffeine.png       # Predict SMILES for a single image.
$ decimer --model Isomeric --dir Sample_Images         # Predict SMILES for all the images inside a folder.

DECIMER selects the Canonical model by default, but you can choose any of the following models.

Available Models:

Canonical: Model trained on images depicted using canonical SMILES
Isomeric: Model trained on images depicted using isomeric SMILES, which includes stereochemical information + ions
Augmented: Model trained on images depicted using isomeric SMILES with augmentations
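
The command lines shown above can also be assembled from Python, for example to run predictions from a script. The sketch below only wraps the decimer command with the `--model`, `--image` and `--dir` flags shown earlier; the function names are hypothetical, and the format of the tool's output is not assumed, so the raw text is returned unchanged.

```python
import subprocess

AVAILABLE_MODELS = ("Canonical", "Isomeric", "Augmented")


def build_command(model="Canonical", image=None, directory=None):
    """Build the argument list for the decimer CLI without running it."""
    if model not in AVAILABLE_MODELS:
        raise ValueError(f"Unknown model {model!r}; choose from {AVAILABLE_MODELS}")
    if (image is None) == (directory is None):
        raise ValueError("Pass exactly one of 'image' or 'directory'")
    cmd = ["decimer", "--model", model]
    cmd += ["--image", image] if image is not None else ["--dir", directory]
    return cmd


def run_decimer(**kwargs):
    """Run the CLI and return its standard output as plain text."""
    result = subprocess.run(
        build_command(**kwargs), capture_output=True, text=True, check=True
    )
    return result.stdout


single = build_command(image="Sample_Images/caffeine.png")
folder = build_command(model="Isomeric", directory="Sample_Images")
```

Calling `run_decimer(image="Sample_Images/caffeine.png")` requires the decimer package to be installed and, on first use, downloads the models.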

License:

This project is licensed under the MIT License. See the LICENSE file for details.

Citation

Rajan, Kohulan; Zielesny, Achim; Steinbeck, Christoph (2021): DECIMER 1.0: Deep Learning for Chemical Image Recognition using Transformers. ChemRxiv. Preprint. https://doi.org/10.26434/chemrxiv.14479287.v1

References

1. Rajan K, Zielesny A, Steinbeck C (2020) DECIMER: towards deep learning for chemical image recognition. J Cheminform 12:65. https://doi.org/10.1186/s13321-020-00469-w
2. Norrie T, Patil N, Yoon DH, Kurian G, Li S, Laudon J, Young C, Jouppi N, Patterson D (2021) The Design Process for Google's Training Chips: TPUv2 and TPUv3. IEEE Micro 41:56–63
3. Tan M, Le Q (2019) EfficientNet: Rethinking model scaling for convolutional neural networks. International Conference on Machine Learning. PMLR, pp 6105–6114
4. Xie Q, Luong M-T, Hovy E, Le QV (2020) Self-training with noisy student improves ImageNet classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp 10687–10698
5. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention Is All You Need. arXiv [cs.CL]
6. Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 28:31–36
7. Krenn M, Häse F, Nigam A, Friederich P, Aspuru-Guzik A (2020) Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. Mach Learn: Sci Technol 1:045024
8. Rajan K, Zielesny A, Steinbeck C (2021) DECIMER 1.0: Deep Learning for Chemical Image Recognition using Transformers. ChemRxiv. Preprint. https://doi.org/10.26434/chemrxiv.14479287.v1
