robmacc / vaemols

Variational Autoencoder for Molecules

A variational autoencoder for molecules, implemented in TensorFlow.

Dependencies

  1. RDKit

conda install -c rdkit rdkit

  2. TensorFlow

cpu-version

pip install tensorflow

gpu-version

pip install tensorflow-gpu

Preprocessing

1. Data

The ChEMBL 24 database was used as the source of SMILES data.

SMILES strings are padded with spaces to max_len (default 120), and strings longer than max_len are discarded. The remaining strings are labeled character by character, giving max_len integer labels per string.
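The padding and labeling described above can be sketched roughly as follows. This is a minimal illustration, not the code in preprocess.py: the function name `encode_smiles` and the choice to build the vocabulary from the padded strings are assumptions.

```python
# Hypothetical helper; the real preprocess.py may be organized differently.
MAX_LEN = 120  # default max_len stated above


def encode_smiles(smiles_list, max_len=MAX_LEN):
    """Drop over-long SMILES, pad the rest with spaces, and label each character."""
    kept = [s for s in smiles_list if len(s) <= max_len]
    padded = [s.ljust(max_len) for s in kept]
    # Build the vocabulary from the padded strings so the space used
    # for padding gets its own integer label.
    chars = sorted(set("".join(padded)))
    char_to_int = {c: i for i, c in enumerate(chars)}
    int_to_char = {i: c for c, i in char_to_int.items()}
    labels = [[char_to_int[c] for c in s] for s in padded]
    return labels, char_to_int, int_to_char


labels, char_to_int, int_to_char = encode_smiles(["CCO", "c1ccccc1", "C" * 200])
# Map the first row of labels back to text and strip the padding.
decoded = "".join(int_to_char[i] for i in labels[0]).rstrip()
```

Here the 200-character string is discarded, and each remaining string becomes a list of exactly max_len integers.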

2. preprocess.py

Performs the following steps:

  1. Downloads chembl_24_1_chemreps.txt.gz
  2. Preprocesses the SMILES strings
  3. Saves the processed data as NumPy arrays

The NumPy arrays contain the training data, the testing data, and the dictionaries for character <-> label (integer) conversion.
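One way the saving step could look is sketched below. The file layout, the key names (`x_train`, `x_test`, `chars`), the 10% test split, and the helper names are all assumptions for illustration; the arrays written by preprocess.py may be named and split differently. The download step is left out.

```python
import os
import tempfile

import numpy as np


def save_dataset(path, labels, char_to_int, test_fraction=0.1, seed=0):
    """Shuffle, split into train/test, and store everything in one .npz file."""
    data = np.asarray(labels, dtype=np.int32)
    order = np.random.default_rng(seed).permutation(len(data))
    n_test = int(len(data) * test_fraction)
    test, train = data[order[:n_test]], data[order[n_test:]]
    # Store the characters in label order; this assumes the labels are 0..n-1,
    # so both dictionaries can be rebuilt from this single array.
    chars = np.array(sorted(char_to_int, key=char_to_int.get))
    np.savez(path, x_train=train, x_test=test, chars=chars)


def load_dataset(path):
    """Return train data, test data, and the two lookup dictionaries."""
    with np.load(path) as f:
        chars = [str(c) for c in f["chars"]]
        x_train, x_test = f["x_train"], f["x_test"]
    char_to_int = {c: i for i, c in enumerate(chars)}
    int_to_char = dict(enumerate(chars))
    return x_train, x_test, char_to_int, int_to_char


path = os.path.join(tempfile.mkdtemp(), "smiles_data.npz")
save_dataset(path, [[0, 1, 2], [2, 1, 0], [1, 1, 1], [0, 0, 0]],
             {" ": 0, "C": 1, "O": 2}, test_fraction=0.25)
x_train, x_test, c2i, i2c = load_dataset(path)
```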

Training

1. Model

The model consists of a CNN encoder and a CuDNNGRU decoder and is defined in vae.py.
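This README does not describe how the encoder and decoder are connected in vae.py. In the standard VAE formulation, the encoder outputs a mean and a log-variance per latent dimension, a latent vector is sampled with the reparameterization trick, and a KL term pulls the latent distribution towards a unit Gaussian. The NumPy sketch below shows only that generic step; the function names and the latent size of 8 are assumptions and are not taken from vae.py.

```python
import numpy as np


def sample_latent(z_mean, z_log_var, rng):
    """Reparameterization trick: z = mean + std * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(z_mean.shape)
    return z_mean + np.exp(0.5 * z_log_var) * eps


def kl_divergence(z_mean, z_log_var):
    """KL(N(mean, var) || N(0, I)), summed over latent dimensions, one value per example."""
    return -0.5 * np.sum(1.0 + z_log_var - z_mean ** 2 - np.exp(z_log_var), axis=-1)


rng = np.random.default_rng(0)
z_mean = np.zeros((4, 8))      # batch of 4, latent size 8 (arbitrary)
z_log_var = np.zeros((4, 8))   # log-variance 0 means unit variance
z = sample_latent(z_mean, z_log_var, rng)
kl = kl_divergence(z_mean, z_log_var)
```

With zero mean and unit variance the encoder output already matches the prior, so the KL term is zero for every example.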

2. train.py

Performs the following steps:

  1. Loads the preprocessed data
  2. Trains with fit_generator using DataGenerator
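As a rough idea of what a batch generator for this kind of training might do, the sketch below turns integer-labeled SMILES into one-hot batches on demand. It is a hypothetical stand-in, not the repository's DataGenerator: the class name, the one-hot encoding, and the use of the input as its own target are assumptions. It follows the `__len__` / `__getitem__` protocol that Keras `Sequence` objects use, so a real version would typically subclass `tf.keras.utils.Sequence`.

```python
import numpy as np


class SmilesBatchGenerator:
    """Hypothetical stand-in for DataGenerator: one-hot batches from integer labels."""

    def __init__(self, labels, n_chars, batch_size=32):
        self.labels = np.asarray(labels)
        self.n_chars = n_chars
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch, counting a final partial batch.
        return int(np.ceil(len(self.labels) / self.batch_size))

    def __getitem__(self, idx):
        batch = self.labels[idx * self.batch_size:(idx + 1) * self.batch_size]
        # Indexing an identity matrix by label gives one-hot vectors.
        one_hot = np.eye(self.n_chars, dtype=np.float32)[batch]
        # An autoencoder reconstructs its input, so input and target coincide.
        return one_hot, one_hot


gen = SmilesBatchGenerator(np.array([[0, 1, 2], [2, 2, 0], [1, 0, 0]]),
                           n_chars=3, batch_size=2)
x, y = gen[0]
```

Building the one-hot arrays per batch keeps memory use low, since only the compact integer labels are held for the whole dataset.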

Notebooks

The notebooks are meant to be used once training is done.

This notebook helps to get variational structures when given a SMILES string.

This notebook helps visualize the learned latent space using a plot or TensorBoard.

[Image: TensorBoard visualization example]

This notebook helps find the top_k most similar molecules, measured by Euclidean distance in latent space.
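The nearest-neighbor lookup itself is simple once every molecule has a latent vector. A minimal NumPy sketch, assuming the latent vectors are already stacked in an array (the function name `top_k_similar` and the toy 2-D vectors are illustrative, not taken from the notebook):

```python
import numpy as np


def top_k_similar(query, latents, k=5):
    """Return indices and distances of the k latent vectors closest to query."""
    dists = np.linalg.norm(latents - query, axis=1)  # Euclidean distance per row
    order = np.argsort(dists)[:k]                    # smallest distances first
    return order, dists[order]


latents = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]])
idx, dist = top_k_similar(np.array([0.9, 0.0]), latents, k=2)
```

If the query molecule is itself part of `latents`, it comes back first with distance zero, so it may be worth asking for k + 1 results and dropping the first.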

License: MIT License

