Mod-QSAR
A modular inverse QSAR pipeline
Overview
Built and tested on WSL using Ubuntu 22.04.1 LTS. It has also been tested on Ubuntu without WSL and works there as well.
Built using SmilesEnumerator, StringGA, Python (3.10.6) and Julia (1.8.1).
The pipeline first takes in a series of PubChem or CHEMBL .csv files that contain SMILES strings associated with a biotarget.
The pipeline then filters that dataset so that it contains an equal number of active and inactive compounds.
The pipeline then augments the dataset by enumerating over SMILES strings and generates a vocabulary of the tokens used in those .csv files.
After this vocabulary is generated, the strings are converted into a series of one-hot encoded arrays and dumped into a .npy file.
For each biotarget given in the initial preprocessing phase, a QSAR model is trained to determine whether a compound is active or inactive, and the models are saved as .h5 files.
The accuracy of these models can be increased by predicting on a series of augmented strings and then taking the average prediction on those strings.
The optimal number of augmented strings is calculated by finding the carrying capacity of a differential equation and finding where the model's prediction accuracy reaches that carrying capacity as the number of augmentations is increased.
After the augmentation hyperparameter is optimized, a genetic algorithm mutates a set of existing starting chemicals to optimize a series of drug-likeness measures and the given QSAR models.
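The averaging step described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: `averaged_prediction`, `toy_predict` and the 0.5 decision threshold are all hypothetical names and values, and `toy_predict` merely stands in for a trained model.

```python
from statistics import mean
from typing import Callable, Sequence


def averaged_prediction(
    variants: Sequence[str],
    predict: Callable[[str], float],
) -> float:
    """Average a model's activity probability over equivalent SMILES strings."""
    if not variants:
        raise ValueError("need at least one SMILES variant")
    return mean(predict(s) for s in variants)


def toy_predict(smiles: str) -> float:
    # Hypothetical stand-in for a trained model: NOT a real QSAR prediction.
    return min(1.0, len(smiles) / 10)


# Three equivalent SMILES spellings of ethanol.
variants = ["CCO", "OCC", "C(O)C"]
score = averaged_prediction(variants, toy_predict)
# Assumption: 0.5 is used as the active/inactive cut-off.
label = "Active" if score >= 0.5 else "Inactive"
```

Because every variant describes the same molecule, averaging reduces the model's sensitivity to how a particular string happens to be written.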
How To
(Running this project within a virtual environment is recommended)
Download dependencies using:
bash ./initialize.sh
(Note that the first run may take a little while since it needs to compile the necessary files but all subsequent runs should be faster)
Note: the vocabulary file (examples use vocab.csv) must be a .csv file in the following format:
tokens |
---|
C |
F |
- |
3 |
... |
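To show how a vocabulary in this format relates to the one-hot arrays, here is a small sketch using only the standard library. It assumes the file is a plain single-column CSV with a `tokens` header and that each character is one token; the real tokenizer may treat multi-character symbols differently. `load_vocab` and `one_hot` are hypothetical helper names.

```python
import csv
import io

# Stand-in for the contents of vocab.csv (a real file would be opened instead).
VOCAB_CSV = "tokens\nC\nF\n-\n3\n"


def load_vocab(handle) -> dict[str, int]:
    """Map each token in the 'tokens' column to its row index."""
    reader = csv.DictReader(handle)
    return {row["tokens"]: i for i, row in enumerate(reader)}


def one_hot(tokens: list[str], vocab: dict[str, int]) -> list[list[int]]:
    """Encode a token sequence as one row per token, one column per vocab entry."""
    encoded = []
    for token in tokens:
        row = [0] * len(vocab)
        row[vocab[token]] = 1  # raises KeyError for tokens missing from the vocab
        encoded.append(row)
    return encoded


vocab = load_vocab(io.StringIO(VOCAB_CSV))
# Assumption: one character per token; the real tokenizer may differ.
matrix = one_hot(list("CF-C"), vocab)
```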
Preprocess datasets using:
cd preprocessor
bash ./preprocessor.sh -f dataset1.csv -f dataset2.csv -t tag1 -t tag2 -n 10 -v vocab.csv -s true
- -f : PubChem bioassay .csv file, multiple can be specified
- -t : String specifying a tag to add to the preprocessed dataset
- -n : Positive integer representing the number of augmentations to add
- -v : (Optional) filename of vocabulary file, defaults to vocab.csv
- -s : (Optional) boolean as to whether or not to use a sysimage when running Julia component
To use a non-PubChem dataset, use a .csv file in the following format:
PUBCHEM_EXT_DATASOURCE_SMILES | PUBCHEM_ACTIVITY_OUTCOME |
---|---|
CC1=NN(C2=NC(=O)N(C(=O)C2=N1)C)C | Active |
CC1=C2C(=NN1)C(=S)NC(=O)N2 | Inactive |
... | ... |
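Given rows in this format, the balancing of active and inactive compounds could be sketched as below. The pipeline's actual strategy is not documented here, so this assumes simple random downsampling of the larger class; `balance` is a hypothetical helper and the rows are placeholders, not real assay results.

```python
import csv
import io
import random

SMILES_COL = "PUBCHEM_EXT_DATASOURCE_SMILES"
OUTCOME_COL = "PUBCHEM_ACTIVITY_OUTCOME"


def balance(rows: list[dict], seed: int = 0) -> list[dict]:
    """Downsample the larger class so both outcomes have the same count."""
    active = [r for r in rows if r[OUTCOME_COL] == "Active"]
    inactive = [r for r in rows if r[OUTCOME_COL] == "Inactive"]
    n = min(len(active), len(inactive))
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(active, n) + rng.sample(inactive, n)


# Placeholder SMILES for illustration only; the labels are not real assay results.
raw = io.StringIO(
    f"{SMILES_COL},{OUTCOME_COL}\n"
    "CCO,Active\n"
    "CCN,Inactive\n"
    "CCC,Inactive\n"
    "CCF,Inactive\n"
)
balanced = balance(list(csv.DictReader(raw)))
```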
Generate a default vocab file with default symbols:
cd preprocessor
python default_vocab_generator.py vocab.csv
- First argument : Filename of vocabulary file
Add another dataset using a previously generated vocab.csv:
cd preprocessor
bash ./add_dataset.sh -f dataset1.csv -t tag1 -n 10 -m 196 -o true -v vocab.csv -s true
- -f : PubChem bioassay .csv file, multiple can be specified
- -t : String specifying a tag to add to the preprocessed dataset; the number of -t and -f arguments must be the same
- -n : Positive integer representing the number of augmentations to add
- -m : Maximum length in tokens; any samples that are longer are removed from the dataset
- -o : Boolean representing whether to ignore or override tokens not found in initial vocabulary
- -v : (Optional) filename representing vocabulary file to use (defaults to vocab.csv)
- -s : (Optional) boolean as to whether or not to use a sysimage when running Julia component
Curate datasets using CHEMBL:
cd preprocessor
python3 chembl_dataset_generator.py dataset_args.json -a aggregate_args.json -f true -n 10 -v vocab.csv -s true
- First argument : .json file that specifies the target and threshold for activity
- -a : Aggregate datasets into singular files
- -f : Boolean that states whether to generate .npy files in same manner as preprocessor.sh script
- -n : Integer greater than or equal to 0 representing the number of augmentations to include in preprocessor.sh script
- -v : (Optional) filename of vocabulary file, defaults to vocab.csv
- -s : (Optional) boolean as to whether or not to use a sysimage when running Julia component
First .json file arguments:
- Filename
  - target_chembl_id : Valid target ID from CHEMBL database
  - activity_type : Valid type of activity from CHEMBL database (IC50 or EC50, for example)
  - tag : String representing tag to use in preprocessor.sh script
  - min : Float minimum threshold for being considered active
  - max : Float maximum threshold for being considered active
Example dataset_args.json:
{
    "serotonin_antagonist.csv" : {
        "target_chembl_id" : "CHEMBL224",
        "activity_type" : "IC50",
        "tag" : "sero",
        "min" : 0,
        "max" : 100
    },
    "d2_antagonist.csv" : {
        "target_chembl_id" : "CHEMBL217",
        "activity_type" : "IC50",
        "tag": "d2",
        "min" : 0,
        "max" : 100
    },
    "d3_antagonist.csv" : {
        "target_chembl_id" : "CHEMBL234",
        "activity_type" : "IC50",
        "tag" : "d3",
        "min" : 0,
        "max" : 100
    }
}
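A file like this can be sanity-checked before running the generator. The sketch below is not part of the project: `validate_dataset_args` and `is_active` are hypothetical helpers, and treating both thresholds as inclusive is an assumption.

```python
import json

REQUIRED_KEYS = {"target_chembl_id", "activity_type", "tag", "min", "max"}


def validate_dataset_args(args: dict) -> None:
    """Raise ValueError if any entry is missing a key or has min > max."""
    for filename, spec in args.items():
        missing = REQUIRED_KEYS - spec.keys()
        if missing:
            raise ValueError(f"{filename}: missing keys {sorted(missing)}")
        if spec["min"] > spec["max"]:
            raise ValueError(f"{filename}: min must not exceed max")


def is_active(value: float, spec: dict) -> bool:
    # Assumption: both thresholds are inclusive.
    return spec["min"] <= value <= spec["max"]


args = json.loads("""
{
  "d2_antagonist.csv": {
    "target_chembl_id": "CHEMBL217",
    "activity_type": "IC50",
    "tag": "d2",
    "min": 0,
    "max": 100
  }
}
""")
validate_dataset_args(args)
spec = args["d2_antagonist.csv"]
```

With this entry, `is_active(50, spec)` is true and `is_active(250, spec)` is false.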
Example aggregate_args.json:
{
    "dopamine_antagonist.csv" : {
        "tag" : "dopa",
        "filenames": ["d2_antagonist.csv", "d3_antagonist.csv"]
    }
}
Add another dataset to a previously generated vocab.csv using CHEMBL (must be a singular dataset with or without aggregation):
cd preprocessor
python3 ./chembl_add_dataset.py dataset_args.json -a aggregate_args.json -n 10 -m 196 -o true -v vocab.csv -s true
- First argument : .json file that specifies the target and threshold for activity
- -a : Aggregate datasets into singular files
- -n : Positive integer representing the number of augmentations to add
- -m : Maximum length in tokens; any samples that are longer are removed from the dataset
- -o : Boolean representing whether to ignore or override tokens not found in initial vocabulary
- -v : (Optional) filename of vocabulary to use (defaults to vocab.csv)
- -s : (Optional) boolean as to whether or not to use a sysimage when running Julia component
(See above examples for dataset_args.json and aggregate_args.json)
Train QSAR model:
cd predictor
python3 train_keras_rnn.py X.npy Y.npy 100 name testX.npy testY.npy
- First argument : A .npy file containing the tokenized X features
- Second argument : A .npy file containing the labels
- Third argument : Number of epochs to train for (positive integer)
- Fourth argument : Name (tag) to add to the model and model history outfiles
- Fifth argument : (Optional) name of file to dump the features of the given dataset into for optimize_n.sh
- Sixth argument : (Optional) name of file to dump the labels of the given dataset into for optimize_n.sh
(Optional) Optimize accuracy post training with additional augmentations:
cd predictor
bash ./optimize_n.sh -x testX.npy -y testY.npy -m rnn_model.h5 -v ../preprocessor/vocab.csv -s 10 -a 2 -b 11 -i 2
- -x : .npy file containing tokenized X features
- -y : .npy file containing labels
- -m : Path of model to optimize
- -v : (Optional) filename of vocabulary to use (defaults to ../preprocessor/vocab.csv)
- -s : (Optional) positive integer, program will sample 1 in -s entries in -x and -y to evaluate
- -a : (Optional) positive integer, minimum of the augmentation number range, defaults to 2
- -b : (Optional) positive integer, maximum of the augmentation number range, defaults to 11
- -i : (Optional) positive integer, increment of the augmentation number range, defaults to 2
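As a rough illustration of how the -a, -b and -i options span a range of augmentation counts, and how one might pick a count once accuracy levels off, consider the sketch below. It is not the script's implementation: it assumes the upper bound is exclusive (with the defaults, an inclusive bound gives the same counts), and it replaces the carrying-capacity fit with a simpler "within tolerance of the best" heuristic. Both function names and the `tolerance` parameter are hypothetical.

```python
def augmentation_counts(start: int = 2, stop: int = 11, step: int = 2) -> list[int]:
    """Candidate augmentation counts, mirroring the -a, -b and -i options."""
    # Assumption: the upper bound is exclusive, as in Python's range().
    return list(range(start, stop, step))


def smallest_n_at_plateau(
    accuracy_by_n: dict[int, float],
    tolerance: float = 0.005,
) -> int:
    """Return the smallest n whose accuracy is within `tolerance` of the best."""
    best = max(accuracy_by_n.values())
    return min(n for n, acc in accuracy_by_n.items() if best - acc <= tolerance)


counts = augmentation_counts()
```

In use, each value in `counts` would be evaluated on the sampled test entries, and the resulting accuracies passed to `smallest_n_at_plateau`.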
Generate chemicals:
cd inverse_qsar
python3 inverse_qsar_cli.py args.json chemicals_file.csv fitness_scores.csv
- First argument : A .json file containing all arguments to be used while generating
  - population_size : Initial size of chemicals pool (positive integer)
  - mating_pool_size : Size of genetic mating pool (positive integer)
  - generations : Number of iterations (positive integer)
  - mutation_rate : Chance that a chemical will be changed randomly during training (float between 0 and 1)
  - seed : Random seed (null or an integer)
  - average_size : Average size of chemical (positive integer or float)
  - size_stdev : Standard deviation of chemical size (positive integer or float)
  - string_type : Type of string formatting (recommended to use SMILES as input or specify a filepath to a file in the same format as vocab.csv)
  - scoring_function : Type of scoring function to use
  - strict : Whether to completely eliminate molecules based on a weight threshold (boolean)
  - augment : List containing a boolean for whether to augment data for the model scoring function and a positive integer for how many times to augment
  - max_len : Maximum token length of molecules (positive integer, or null to grab max_len automatically)
  - max_score : Score to stop at (float)
  - prune_population : Trim size of population (boolean)
  - target : Target value for scoring functions to optimize (list of values between 0 and 1)
  - weight : Weight to apply to each output of scoring function (list of floats)
  - file_name : Pre-existing file of molecules to draw initial population from (filepath)
  - vocab : Pre-existing file containing the vocabulary mapping (filepath)
- Second argument : A .csv to dump the molecules into
- Third argument : (Optional) .csv file to dump fitness scores after training
Example args.json:
{
    "population_size" : 100,
    "mating_pool_size" : 100,
    "generations" : 20,
    "mutation_rate" : 0.05,
    "seed" : null,
    "average_size" : 375.0,
    "size_stdev" : 100.0,
    "string_type" : "SMILES",
    "scoring_function" : ["dopa_rnn_model.h5", "sero_rnn_model.h5", "custom_lipinski", "pains", "limit_rings"],
    "strict": true,
    "threads": 2,
    "augment": [true, 5],
    "max_len": 196,
    "max_score" : 1.0,
    "prune_population" : true,
    "target" : [1, 0, 1, 0, 1, 1, 1],
    "weight" : [1, 1, 1, 1, 1, 1, 1],
    "file_name" : "cl_f.smi",
    "vocab" : "../preprocessor/vocab.csv"
}
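The documented constraints on these arguments can be checked before a long run. The following is a minimal sketch with a hypothetical `check_args` helper; the rule that `target` and `weight` must have equal length is an assumption drawn from the example, not something the CLI is known to enforce.

```python
def check_args(args: dict) -> list[str]:
    """Return a list of problems found in a parsed args.json (empty if none)."""
    problems = []
    for key in ("population_size", "mating_pool_size", "generations"):
        value = args.get(key)
        if not isinstance(value, int) or isinstance(value, bool) or value <= 0:
            problems.append(f"{key} must be a positive integer")
    rate = args.get("mutation_rate")
    if not isinstance(rate, (int, float)) or not 0 <= rate <= 1:
        problems.append("mutation_rate must be between 0 and 1")
    seed = args.get("seed")
    if seed is not None and not isinstance(seed, int):
        problems.append("seed must be null or an integer")
    targets = args.get("target", [])
    if any(not 0 <= t <= 1 for t in targets):
        problems.append("target values must lie between 0 and 1")
    # Assumption: one weight per target value, as in the example above.
    if len(targets) != len(args.get("weight", [])):
        problems.append("target and weight must have the same length")
    return problems
```

Loading the file with `json.load` and passing the result to `check_args` returns an empty list when nothing is wrong.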
Postprocess generated .csv of molecular candidates into image files and check to see if any are already known:
cd inverse_qsar
python3 postprocessor.py ./generated_drugs/images files.csv names.csv
- First argument : Directory to write images to
- Second argument : .csv file listing the .csv files generated by inverse_qsar_cli.py to be aggregated (also works with a single entry)
- Third argument : (Optional) name of the file to write any chemicals that matched a known database to
files.csv should be in the following format:
files |
---|
chemicals_1.csv |
chemicals_2.csv |
... |
Scoring Functions
- model : Uses the QSAR model as a scoring function
- lipinski : Uses Lipinski's Rule of Five (1 if true, 0 otherwise)
- qed : Uses QED drug-likeness measure (1 if true, 0 otherwise)
- ghose : Uses Ghose drug-likeness measure (1 if true, 0 otherwise)
- limit_rings : Returns 0 if molecule has carbon rings larger than 6 atoms
- pains_filter : Filters out PAINS (1 if has PAINS substructures, 0 otherwise)
- custom_lipinski : Uses a custom weighted version of Lipinski's Rule of Five (1 if true, 0 otherwise)
- bbb_permeable : Checks if molecule is blood brain barrier permeable using the BOILED-Egg method (1 if true, 0 otherwise)
- gastro_absorption : Checks if molecule has high gastrointestinal absorption using the BOILED-Egg method (1 if true, 0 otherwise)
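To make the "1 if true, 0 otherwise" convention concrete, here is a sketch of a Rule of Five check written against precomputed descriptors. It is not the project's implementation: `Descriptors`, `lipinski_score` and the `max_violations` parameter are hypothetical, and the sketch uses the standard published thresholds (molecular weight at most 500, logP at most 5, at most 5 hydrogen-bond donors, at most 10 acceptors).

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed descriptors; computing them from SMILES is out of scope here."""
    mol_weight: float       # molecular weight in daltons
    log_p: float            # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_score(d: Descriptors, max_violations: int = 0) -> float:
    """Return 1.0 if the molecule passes the Rule of Five, else 0.0."""
    violations = sum([
        d.mol_weight > 500,
        d.log_p > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ])
    return 1.0 if violations <= max_violations else 0.0
```

Setting `max_violations=1` gives the commonly used relaxed form that tolerates a single violation.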
Custom Scoring Functions
Use the format "./filepath/to/python_file.py:function_name" as an element in the list passed to the scoring_function argument, where the : delimits the filepath from the function name. The custom function must return a float. Custom functions must be specified last.
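A sketch of what such a custom function and its lookup might look like is given below. How the project actually resolves the string is not shown here, so `load_scoring_function` is a hypothetical loader built on the standard `importlib` machinery, and the assumption that the scorer receives a SMILES string is exactly that, an assumption. `carbon_fraction` and `my_scores.py` are made-up names.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_scoring_function(spec: str):
    """Split 'path/to/file.py:function_name' and import the named function."""
    path, _, name = spec.rpartition(":")  # split on the last colon
    module_spec = importlib.util.spec_from_file_location(Path(path).stem, path)
    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)
    return getattr(module, name)


# Hypothetical custom scorer, written to a temporary file for the demo.
# Assumption: the function receives a SMILES string; it must return a float.
SOURCE = '''
def carbon_fraction(smiles):
    return smiles.count("C") / len(smiles) if smiles else 0.0
'''

with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "my_scores.py"
    script.write_text(SOURCE)
    score_fn = load_scoring_function(f"{script}:carbon_fraction")
    score = score_fn("CCO")
```

Splitting on the last colon keeps paths that themselves contain a colon intact.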
Todo
- Change preprocessing step to append to .npy file as loop progresses
- Debug information
- Add hyperparameter optimization
- Add Flux model integration