TransformMolecules

Machine learning transformer model for generative chemistry. Code written by Emma P. Tysinger and Anton V. Sinitskiy. Paper to be published.

Dependencies

  • Install mmpdb before running preparing_data.sh. For quick installation, run pip install mmpdb.
  • RDKit is also required for preparing_data.sh and predictions.sh. RDKit is available at http://rdkit.org/.
  • Install OpenNMT-py before running training_model.sh and predictions.sh. For installation run:
git clone https://github.com/OpenNMT/OpenNMT-py.git
cd OpenNMT-py
python setup.py install
  • Install selfies before running preparing_data.sh and predictions.sh. For quick installation, run pip install selfies.

Usage

TransformMolecules contains 5 modules:

  • Preparing Data: Generates pairs of molecules and converts molecule representations to SELFIES
  • Dataset Split: Splits the data into training/validation/test sets
  • Training Model: Trains a transformer model on the data
  • Predictions: Generates molecule predictions, converts the molecule representations back to SMILES, and evaluates the predictions
  • Generating Plots: Generates plots of scaffold scores, molecular property histograms, and training curves

Preparing Data

First, create a directory $DATA_DIR that will store all files related to the input dataset:

mkdir -p $DATA_DIR

Prepare the MMPDB input file and remove all stereochemistry and salts. The input to ./scripts/mmpdb_prep.py (e.g., ChEMBL data) must be a CSV file with a header (referred to below as /PATH/data.csv and $DATA) that contains at least three columns: the SMILES representations of the compounds ($SMI_COL), the compound IDs ($ID_COL), and the year of experimentation for each compound (year). $DATA_ID refers to the name of the dataset.

python ./scripts/mmpdb_prep.py --in $DATA --out $DATA_DIR/${DATA_ID}_mmpdb_input.csv --smiles $SMI_COL --ids $ID_COL
python ./scripts/clear_stereochemistry.py --in $DATA_DIR/${DATA_ID}_mmpdb_input.csv --out $DATA_DIR/${DATA_ID}_mmpdb_input_nostereo.csv
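
For reference, the expected input layout can be sketched as follows; the column names and values are illustrative only (pass your actual column names via $SMI_COL and $ID_COL):

# Minimal, hypothetical sketch of the expected input CSV (illustrative values only).
import pandas as pd

toy = pd.DataFrame({
    "canonical_smiles": ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1"],  # $SMI_COL
    "chembl_id": ["CHEMBL_A", "CHEMBL_B", "CHEMBL_C"],               # $ID_COL (placeholder IDs)
    "year": [1995, 2001, 2010],                                      # year of experimentation
})
toy.to_csv("data.csv", index=False)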

Pair structurally similar molecules using MMPDB. Note that mmpdb fragment can take hours to run.

mmpdb fragment --delimiter comma --has-header $DATA_DIR/${DATA_ID}_mmpdb_input_nostereo.csv -o $DATA_DIR/${DATA_ID}.fragments
mmpdb index $DATA_DIR/${DATA_ID}.fragments -o  $DATA_DIR/${DATA_ID}_pairs.csv --out 'csv'
python ./scripts/parsing_pairs.py --in $DATA_DIR/${DATA_ID}_pairs.csv --out $DATA_DIR/${DATA_ID}_pairs_parsed.csv

Count the number of pairs representing each SMIRKS transformation; these counts will later be used to filter the data.

python ./scripts/counting_smirks.py --in $DATA_DIR/${DATA_ID}_pairs_parsed.csv --out $DATA_DIR/${DATA_ID}_counted.csv
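
The counting step amounts to a per-SMIRKS tally. A minimal sketch of the idea (not the repo script; the 'smirks' column name is an assumption):

# Hedged sketch of the counting step; assumes the parsed pairs CSV has a 'smirks' column.
import pandas as pd

pairs = pd.read_csv("test_dataset_pairs_parsed.csv")
counts = (pairs.groupby("smirks").size()
               .reset_index(name="count")
               .sort_values("count", ascending=False))
counts.to_csv("test_dataset_counted.csv", index=False)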

Filter the paired dataset by excluding all SMIRKS with a count below a defined threshold ($EXCLUDE) and randomly sampling a constant number of pairs ($SAMPLE_SIZE) from each of the remaining SMIRKS.

python ./scripts/filtering_data.py --in $DATA_DIR/${DATA_ID}_pairs_parsed.csv --all $DATA --smirks $DATA_DIR/${DATA_ID}_counted.csv --out $DATA_DIR/${DATA_ID}_filtered.csv --size $SAMPLE_SIZE --exclude $EXCLUDE
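
In outline, the filtering logic looks like the following sketch, written under the assumption that both CSVs share a 'smirks' column; it is not the actual filtering_data.py implementation:

# Hedged sketch of the filtering logic: drop rare SMIRKS and sample a fixed
# number of pairs per remaining SMIRKS (column names are assumptions).
import pandas as pd

EXCLUDE, SAMPLE_SIZE = 2, 3
pairs = pd.read_csv("test_dataset_pairs_parsed.csv")
counts = pd.read_csv("test_dataset_counted.csv")

keep = counts.loc[counts["count"] >= EXCLUDE, "smirks"]
kept_pairs = pairs[pairs["smirks"].isin(keep)]
sampled = (kept_pairs.groupby("smirks", group_keys=False)
                     .apply(lambda g: g.sample(min(SAMPLE_SIZE, len(g)), random_state=0)))
sampled.to_csv("test_dataset_filtered.csv", index=False)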

It is strongly recommended that you submit this job to a queue using a Slurm script, because it may take up to a day to complete. An example Slurm script can be found in slurm_scripts/preparing_data.sh.

sbatch --export=DATA_ID=test_dataset,DATA_DIR=/PATH/test_dataset,DATA=/PATH/data.csv,SMI_COL='canonical_smiles',ID_COL='chembl_id',SAMPLE_SIZE=3,EXCLUDE=2 ./slurm_scripts/preparing_data.sh

Dataset Split

Run the following commands, where $DATA_DIR is the directory with the paired molecules and $RUN_DIR is the directory to create for storing all data related to a single dataset split. For the year thresholds: all pairs in which both molecules were discovered before $TRAIN_YEAR go into the training set, pairs in which at least one molecule was discovered later than $TEST_YEAR go into the test set, and all other pairs go into the validation set. $TIMESTAMPS must be a CSV file with a SMILES column and a years column. If the --augment flag is used, then for every pair (mol1, mol2) added to the training set, the reciprocal pair (mol2, mol1) is also added.

mkdir -p $RUN_DIR

## Dataset split
python ./scripts/dataset_split.py --in $DATA_DIR/${DATA_ID}_filtered.csv --timestamps $TIMESTAMPS --out $RUN_DIR --year_train $TRAIN_YEAR --year_test $TEST_YEAR --augment
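
The year-based split rule and the --augment behavior described above can be summarized with the following sketch (the file name and the column names 'year1', 'year2', 'mol1', and 'mol2' are assumptions, not the script's actual schema):

# Hedged sketch of the temporal split rule described above.
import pandas as pd

TRAIN_YEAR, TEST_YEAR = 2009, 2014
# Hypothetical merged table of pairs with discovery years
# (the real script merges $TIMESTAMPS itself).
pairs = pd.read_csv("pairs_with_years.csv")

train = pairs[(pairs["year1"] < TRAIN_YEAR) & (pairs["year2"] < TRAIN_YEAR)]
test  = pairs[(pairs["year1"] > TEST_YEAR) | (pairs["year2"] > TEST_YEAR)]
val   = pairs.drop(train.index).drop(test.index)

# --augment: also add the reciprocal pair (mol2, mol1) to the training set.
augmented_train = pd.concat([train,
                             train.rename(columns={"mol1": "mol2", "mol2": "mol1"})])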

It is strongly recommended that you submit this job to a queue using a Slurm script, because it may take up to 5 hours to run. An example Slurm script can be found in slurm_scripts/dataset_split.sh.

sbatch --export=DATA_ID=test_dataset,DATA_DIR=/PATH/to/paired_dataset,TIMESTAMPS=/PATH/data.csv,RUN_DIR=/PATH/to/training_dataset,TRAIN_YEAR_CUTOFF=2009,VAL_YEAR_CUTOFF=2014 ./slurm_scripts/dataset_split.sh

Training Model

Install OpenNMT-py as described above (section Dependencies).

First, create a directory $MODEL_DIR that will store the trained models:

mkdir -p $MODEL_DIR

Next, calculate the size of the training set, the number of steps per epoch, the total number of training steps, and how often to save models, based on $BATCH_SIZE, $TRAIN_EPOCHS, and $SAVE_EPOCHS.

export TRAIN_SIZE=$(cat $RUN_DIR/src-train.txt | wc -l)
export EPOCH_STEPS=$(($TRAIN_SIZE/$BATCH_SIZE))
export TRAIN_STEPS=$(($EPOCH_STEPS*$TRAIN_EPOCHS))
export SAVE_STEPS=$(($EPOCH_STEPS*$SAVE_EPOCHS))
export VALID_STEPS=$(($TRAIN_STEPS+1))

Build the config YAML file with the training parameters, where $RUN_DIR is the directory storing all data related to a single dataset split.

cat << EOF > $RUN_DIR/config.yaml
## Where the vocab(s) will be written
src_vocab: $RUN_DIR/vocab.src
tgt_vocab: $RUN_DIR/vocab.tgt
# Corpus opts:
data:
    corpus_1:
        path_src: $RUN_DIR/src-train.txt
        path_tgt: $RUN_DIR/tgt-train.txt
    valid:
        path_src: $RUN_DIR/src-val.txt
        path_tgt: $RUN_DIR/tgt-val.txt
EOF

Build the vocabulary and start training the transformer model.

onmt_build_vocab -config $RUN_DIR/config.yaml -save_data $RUN_DIR/data -n_samples -1
onmt_train -config $RUN_DIR/config.yaml -save_model $MODEL_DIR/$MODEL_ID -train_steps $TRAIN_STEPS -valid_steps $VALID_STEPS -save_checkpoint_steps $SAVE_STEPS -batch_size $BATCH_SIZE -world_size 1 -gpu_ranks 0 

Finally, rename the saved models to a more intuitive, epoch-based naming scheme.

python ./scripts/renaming_models.py --models $MODEL_DIR --batch_size $BATCH_SIZE --train_size $TRAIN_SIZE
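
The renaming converts OpenNMT's step-based checkpoint names into the epoch-based names used in the Predictions section below. A minimal sketch of that idea, assuming OpenNMT's default '<name>_step_<N>.pt' naming; it is not the actual renaming_models.py script, and the values are illustrative:

# Hedged sketch: convert step-based checkpoint names into epoch-based names.
import os, re

MODEL_DIR, BATCH_SIZE, TRAIN_SIZE = "/PATH/model/name_of_run", 100, 300  # illustrative values
steps_per_epoch = TRAIN_SIZE // BATCH_SIZE

for fname in os.listdir(MODEL_DIR):
    m = re.match(r"(.+)_step_(\d+)\.pt$", fname)
    if m:
        epoch = int(m.group(2)) // steps_per_epoch
        new_name = f"{m.group(1)}_epoch_{epoch}.pt"
        os.rename(os.path.join(MODEL_DIR, fname), os.path.join(MODEL_DIR, new_name))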

It is strongly recommended that you submit this job to a queue using a Slurm script, because it may take multiple days depending on the dataset size. An example Slurm script can be found in slurm_scripts/training_model.sh.

sbatch --export=RUN_DIR=/PATH/to/training_dataset,MODEL_ID=test_dataset,MODEL_DIR=/PATH/model/name_of_run,TRAIN_EPOCHS=50,SAVE_EPOCHS=5,BATCH_SIZE=100 ./slurm_scripts/training_model.sh

To resume training from a checkpoint model run the following command, where $MODEL_PATH is the path to the checkpoint model.

onmt_train -config $RUN_DIR/config.yaml -save_model $MODEL_DIR/$MODEL_ID -train_steps $TRAIN_STEPS -valid_steps $TRAIN_STEPS -save_checkpoint_steps $SAVE_STEPS -batch_size $BATCH_SIZE -world_size 1 -gpu_ranks 0 -train_from $MODEL_PATH -reset_optim all

To get perplexity scores for data other than validation data, run the following command and look at the GOLD ppl score in the error file:

sbatch --export=MODEL_ID=test_dataset,MODEL_DIR=/PATH/model/name_of_run,SRC_DATA=/PATH/to/input_dataset,TGT_DATA=/PATH/to/true_dataset,EPOCH_NUM=10,OUTPUT_DIR=/PATH/to/predictions ./slurm_scripts/model_scoring.sh

Predictions

Run the following commands to create a txt file with all unique validation molecules and to generate new structure predictions from them, using the trained model $MODEL_ID at epoch $EPOCH_NUM. $RUN_DIR is the directory storing all data related to a single dataset split, $MODEL_DIR is the directory storing the trained model, and $OUTPUT_DIR is the directory where predictions will be saved.

mkdir -p $OUTPUT_DIR
cat $RUN_DIR/src-val.txt $RUN_DIR/tgt-val.txt | sort | uniq > $RUN_DIR/val-unique.txt
if [ ! -s $OUTPUT_DIR/pred_selfies_epoch_${EPOCH_NUM}.txt ]; then
    onmt_translate --model $MODEL_DIR/${MODEL_ID}_epoch_${EPOCH_NUM}.pt --src $RUN_DIR/val-unique.txt --output $OUTPUT_DIR/pred_selfies_epoch_${EPOCH_NUM}.txt --replace_unk --seed 1 --gpu 0
fi

Convert the SELFIES of the generated molecules to SMILES and get the scaffolds of all input validation and generated molecules.

python ./scripts/selfies_to_smiles.py --in1 $OUTPUT_DIR/pred_selfies_epoch_${EPOCH_NUM}.txt --in2 $RUN_DIR/val-unique.txt --out $OUTPUT_DIR/pred_smiles_epoch_${EPOCH_NUM}.csv
python ./scripts/scaffolding.py --in $OUTPUT_DIR/pred_smiles_epoch_${EPOCH_NUM}.csv --out $OUTPUT_DIR/pred_smiles_epoch_${EPOCH_NUM}.csv --column1 'structure' --column2 'id'
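
For orientation, the conversion and scaffolding steps roughly correspond to the following sketch using the selfies and RDKit packages; this is a simplified stand-in for the repo scripts, and joining OpenNMT's space-separated output tokens is an assumption:

# Hedged sketch: decode SELFIES back to SMILES and extract Bemis-Murcko scaffolds.
import selfies as sf
from rdkit.Chem.Scaffolds import MurckoScaffold

with open("pred_selfies_epoch_10.txt") as f:
    # OpenNMT writes space-separated tokens; joining them here is an assumption.
    selfies_strings = ["".join(line.split()) for line in f]

smiles = [sf.decoder(s) for s in selfies_strings]
scaffolds = [MurckoScaffold.MurckoScaffoldSmiles(smiles=smi) for smi in smiles]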

Score the generated molecules based on the number of scaffold changes, the number of R-group changes, the number of unique scaffolds, and the number of new scaffolds. $METRICS_TABLE is the CSV file to which the scores will be added; if it does not exist yet, it will be created.

## Score model predictions
python ./scripts/scoring.py --in $OUTPUT_DIR/pred_smiles_epoch_${EPOCH_NUM}.csv --metrics_table $METRICS_TABLE --training_data $RUN_DIR/train.csv --model ${MODEL_ID}_epoch_${EPOCH_NUM} --change_count --scaffolds

It is strongly recommended that you submit this job to a queue using a Slurm script, because it may take up to a day depending on the dataset size. An example Slurm script can be found in slurm_scripts/predictions.sh.

sbatch --export=RUN_DIR=/PATH/to/training_dataset,MODEL_ID=test_dataset,MODEL_DIR=/PATH/model/name_of_run,OUTPUT_DIR=/PATH/to/predictions,EPOCH_NUM=10,METRICS_TABLE=/PATH/model_scores.csv ./slurm_scripts/predictions.sh

An additional Slurm script can be run to determine the new SMIRKS predicted by the model, where $THRESHOLD is the SMIRKS-count threshold above which PNGs of the SMIRKS will be created and $PNG_DEST is an existing directory (or one that will be created) in which to save the SMIRKS PNGs:

sbatch --export=EPOCH_NUM=10,OUTPUT_DIR=/PATH/to/predictions,DATA_DIR=/PATH/to/paired_dataset,DATA_ID=test_dataset,THRESHOLD=2,PNG_DEST=/DIRECTORY/for/pngs ./slurm_scripts/new_smirks.sh

Generating Plots

Scaffolding scores

To generate line plots of the scaffold scores over multiple epochs, run the following commands. Specify which runs in the metrics_table.csv to plot with $SUBSET, a string identifier contained in the model names.

mkdir -p $PLOT_DIR
python ./scripts/generating_plots.py --metrics_table $METRICS_TABLE --out $PLOT_DIR --subset $SUBSET --type scores

Molecular Property Histograms

To generate histograms comparing the molecular properties of the generated molecules with those of the model's training input molecules, first compute the molecular properties of the generated and input molecules with the following commands:

python ./scripts/molecular_properties.py --in $DATA_DIR/${DATA_ID}_mmpdb_input_nostereo.csv --out $DATA_DIR/${DATA_ID}_molecular_properties.csv --smi_col $SMI_COL  
python ./scripts/molecular_properties.py --in $OUTPUT_DIR/pred_smiles_epoch_${EPOCH_NUM}.csv --out $OUTPUT_DIR/pred_smiles_epoch_${EPOCH_NUM}.csv --smi_col 'structure'
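
As a rough illustration of what such a properties table can contain, here is a hedged RDKit sketch; the actual set of descriptors computed by molecular_properties.py is not shown here, so this particular selection is an assumption:

# Hedged sketch of per-molecule descriptors that a properties table might hold.
from rdkit import Chem
from rdkit.Chem import Descriptors

def basic_properties(smi):
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return None
    return {
        "mol_wt": Descriptors.MolWt(mol),
        "logp": Descriptors.MolLogP(mol),
        "tpsa": Descriptors.TPSA(mol),
        "hbd": Descriptors.NumHDonors(mol),
        "hba": Descriptors.NumHAcceptors(mol),
    }

print(basic_properties("CC(=O)Nc1ccc(O)cc1"))  # illustrative SMILES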

Next, to generate the histograms, run the following commands:

mkdir -p $PLOT_DIR
python ./scripts/generating_plots.py --in1 $OUTPUT_DIR/pred_smiles_epoch_${EPOCH_NUM}.csv --in2 $DATA_DIR/${DATA_ID}_molecular_properties.csv --out $PLOT_DIR --type molecular_properties

Training Curves and Scaffolding Scores

To generate stacked plots with perplexity and scaffolding scores, first parse the training log and info files to produce CSV files with the perplexity and accuracy scores. $IN_FILE is a txt file with information about the model, including the split name and filter used, and $OUTDEST is the directory in which to save all files generated by this script.

mkdir -p $OUTDEST
python ./scripts/training_curves.py --in $IN_FILE --out $OUTDEST/training_model_info_${MODEL_NUM}.csv --parse_type info

export NAME=validation_accuracy
grep 'Validation accuracy' $ERR_FILE > $OUTDEST/${NAME}_${MODEL_NUM}.err
python ./scripts/training_curves.py --in $OUTDEST/${NAME}_${MODEL_NUM}.err --out $OUTDEST/${NAME}_${MODEL_NUM}.csv --name $NAME --parse_type val
    
export NAME=validation_perplexity
grep 'Validation perplexity' $ERR_FILE > $OUTDEST/${NAME}_${MODEL_NUM}.err
python ./scripts/training_curves.py --in $OUTDEST/${NAME}_${MODEL_NUM}.err --out $OUTDEST/${NAME}_${MODEL_NUM}.csv --name $NAME --parse_type val
 
export NAME=training
grep 'Start training loop and validate' $ERR_FILE > $OUTDEST/${NAME}_${MODEL_NUM}.err
grep 'acc:' $ERR_FILE >> $OUTDEST/${NAME}_${MODEL_NUM}.err
python ./scripts/training_curves.py --in $OUTDEST/${NAME}_${MODEL_NUM}.err --out $OUTDEST/${NAME}_${MODEL_NUM}.csv --name $NAME --outpng $OUTDEST --parse_type train
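
The grep-plus-parse steps above boil down to extracting numeric values from matching log lines. A minimal sketch of that idea (the exact wording of OpenNMT log lines and the step numbering used here are assumptions):

# Hedged sketch: pull 'Validation perplexity: <value>' entries out of a training log.
import re
import pandas as pd

values = []
with open("training_log.txt") as f:
    for line in f:
        m = re.search(r"Validation perplexity:\s*([\d.]+)", line)
        if m:
            values.append(float(m.group(1)))

pd.DataFrame({"checkpoint": range(1, len(values) + 1),
              "validation_perplexity": values}).to_csv("validation_perplexity.csv", index=False)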

Finally, run the following commands to generate the stacked plots. Make sure the $METRICS_TABLE CSV file contains the scores for the split and filter being used.

mkdir -p $PLOTDEST
python ./scripts/training_curves.py --val_acc $OUTDEST/validation_accuracy_${MODEL_NUM}.csv --val_ppl $OUTDEST/validation_perplexity_${MODEL_NUM}.csv --train $OUTDEST/training_${MODEL_NUM}.csv --info $OUTDEST/training_model_info_${MODEL_NUM}.csv --metrics $METRICS_TABLE --outpng $PLOTDEST --epoch_cutoff $EPOCH_CUTOFF --parse_type plot

The following Slurm script can be run to automate all of these steps, where $MODEL_NUM is the job ID from training the model:

sbatch --export=IN_FILE=/PATH/to/model_info.txt,ERR_FILE=/PATH/to/training_log.txt,OUTDEST=/PATH/to/output_directory,MODEL_NUM=model_num,METRICS_TABLE=/PATH/to/metrics.csv,PLOTDEST=directory_to_save_plots,EPOCH_CUTOFF=32 ./slurm_scripts/training_curves.sh

Held-Out Target-Specific Data

Scripts related to filtering datasets for target-specific data require a dataset of molecules with a column tid referring to the ID of each molecule's target, as well as another dataset (referred to as /PATH/to/target_id_dataset) mapping target IDs (tid) to their chembl_id. This target_id_dataset can be downloaded from chembl_dataset.zip and is titled target_information.csv. To generate the datasets with no target-specific data, run the filtering script with additional inputs, including the name of the target ($TARGET):

python ./scripts/filtering_data.py --in $DATA_DIR/${DATA_ID}_pairs_parsed.csv --all $DATA --smirks $DATA_DIR/${DATA_ID}_counted.csv --out $DATA_DIR/${DATA_ID}_filtered_${TARGET}.csv --size $SAMPLE_SIZE --exclude $EXCLUDE --target 'target chembl_id' --tid /PATH/to/target_id_dataset

Run the Dataset Split and Training Model modules the same way as described earlier, using $DATA_DIR/${DATA_ID}_filtered_${TARGET}.csv. Next, split the target-specific data temporally and by activity.

To generate images of the experimental molecules for a specific target, the most similar generated molecule (calculated with Tanimoto similarity), and the input molecule for each generated molecule, run the following command, where $SMI_COL is the name of the column with SMILES representations in --in1 and $PNG_DEST is an existing directory (or one that will be created) in which to save the molecule PNGs.

python ./scripts/tanimoto_target_specific.py --in1 /PATH/to/target_molecules --in2 /PATH/to/generated_molecules --smi_col $SMI_COL --out $OUTPUT_DIR/top_generated_per_experimental.csv --png_dest $PNG_DEST --generate_png
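
The "most similar generated molecule" step is a standard Tanimoto nearest-neighbor search on fingerprints. A hedged sketch of that calculation with RDKit Morgan fingerprints (the fingerprint type and parameters are assumptions, and the SMILES are illustrative):

# Hedged sketch: for each experimental molecule, find the generated molecule
# with the highest Tanimoto similarity on Morgan fingerprints.
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

def fingerprint(smi):
    return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smi), 2, nBits=2048)

experimental = ["CC(=O)Nc1ccc(O)cc1"]           # illustrative SMILES only
generated    = ["CCO", "CC(=O)Nc1ccc(OC)cc1"]

gen_fps = [fingerprint(s) for s in generated]
for smi in experimental:
    sims = DataStructs.BulkTanimotoSimilarity(fingerprint(smi), gen_fps)
    best = max(range(len(sims)), key=sims.__getitem__)
    print(smi, "->", generated[best], f"(Tanimoto {sims[best]:.2f})")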

License

Apache License 2.0

