The dataset

Unzip the data

The data has been compressed with 7-Zip. It can be extracted with p7zip on Linux or 7-Zip on Windows.

Description

The benchmark data will be placed in the dataset subdirectories of the SingleAssay and MultiAssay directories. There are 1000 files corresponding to the 1000 repetitions. Each file contains several thousand lines of ChEMBL IDs. On each line, the first ID is the reference molecule, and the other four are molecules at increasing distance (decreasing similarity) from the reference.
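As an illustration of the line layout described above, here is a minimal sketch that splits one line into the reference ID and the four other IDs. It assumes the five IDs on a line are separated by whitespace, which is not stated here, and the function name `parse_series` and the IDs in the example are made up.

```python
def parse_series(line):
    """Split one dataset line into (reference_id, [four other IDs]).

    ASSUMPTION: the five IDs on a line are whitespace-separated.
    """
    ids = line.split()
    if len(ids) != 5:
        raise ValueError("expected 5 IDs, found %d: %r" % (len(ids), line))
    # The first ID is the reference; the rest are ordered by
    # increasing distance (decreasing similarity) to it.
    return ids[0], ids[1:]


# Made-up IDs, for illustration only.
example_line = "CHEMBL1 CHEMBL2 CHEMBL3 CHEMBL4 CHEMBL5"
reference, others = parse_series(example_line)
```

Raising an error on lines that do not hold exactly five IDs makes a wrong separator assumption visible immediately rather than producing truncated series.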

How to reproduce the results

Requirements

Python 2.7
NumPy
SciPy
RDKit (2015.09.2)

Optional but needed to generate the graph depictions

dot (provided by GraphViz)

Get ChEMBL

Download ChEMBL20 as an SDF file
Convert it to a SMILES file in which the title field is the numeric portion of the ChEMBL ID (for example, 153534 for CHEMBL153534). The details are left to the reader. Once done, the file should look something like this:

    Cc1cc(cn1C)c2csc(n2)N=C(N)N	153534
    COc1cc(ccc1OC(=O)C23CC4CC(C2)CC(C4)C3)CC=C	265174
    Cc1cccc(c1)N2CCN(CC2)CCCON3C(=O)c4ccccc4C3=O	264472
    c1ccc2c(c1)n(c(=N)s2)CCN3CCC(CC3)c4ccc(cc4)F	405225

Name this file chembl_20.smi and place it in the benchlib directory.
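One possible way to do the conversion is sketched below using RDKit, which is already listed in the requirements. This is not the authors' procedure. It assumes that each SDF record carries its ID in a property named `chembl_id`, that the SMILES and the title are separated by a tab, and that the input file is called `chembl_20.sdf`; all three should be checked against the actual download. The helper names are hypothetical. RDKit will also write its own canonical SMILES, so the strings may differ from the example above even for the same molecules.

```python
import re


def numeric_chembl_id(chembl_id):
    """Return the digits of an ID such as 'CHEMBL153534' as a string."""
    match = re.match(r"^CHEMBL(\d+)$", chembl_id.strip())
    if match is None:
        raise ValueError("not a ChEMBL ID: %r" % chembl_id)
    return match.group(1)


def smi_line(smiles, chembl_id):
    """Format one output line: SMILES, a tab, then the numeric ID."""
    return "%s\t%s" % (smiles, numeric_chembl_id(chembl_id))


def sdf_to_smi(sdf_path, smi_path):
    """Write one SMILES line per readable record in the SDF file."""
    # Imported here so the helpers above can be used without RDKit.
    from rdkit import Chem

    with open(smi_path, "w") as out:
        for mol in Chem.SDMolSupplier(sdf_path):
            # RDKit yields None for records it cannot parse; skip them.
            if mol is None:
                continue
            # ASSUMPTION: the ID is stored in a property named "chembl_id".
            if not mol.HasProp("chembl_id"):
                continue
            line = smi_line(Chem.MolToSmiles(mol), mol.GetProp("chembl_id"))
            out.write(line + "\n")
```

With these assumptions, `sdf_to_smi("chembl_20.sdf", "benchlib/chembl_20.smi")` would produce the file in the expected location.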

Run and analyse the benchmark

    python 1-Similarities.py
    python 2-Correlations.py
    python 3-AnalyseResults.py
    dot SingleAssay/graph.gv -T png > singleassay.png
    dot MultiAssay/graph.gv -T png > multiassay.png

Notes

Running the Python scripts on one CPU may take some time. To speed things up, you may wish to parallelise the main loops. This is left to the reader.
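A generic sketch of one way to parallelise a per-repetition loop with the standard library's `multiprocessing` module follows. It does not reflect the actual structure of the benchmark scripts: `process_repetition` is a hypothetical placeholder for whatever work is done on one repetition, and it assumes the repetitions are independent of each other.

```python
import multiprocessing


def process_repetition(rep_index):
    """Hypothetical stand-in for the work done on one repetition.

    Returns (index, result) so results can be matched to repetitions.
    """
    return rep_index, rep_index * rep_index  # placeholder computation


def run_all(n_repetitions, n_workers=None):
    """Process all repetitions and return a dict of index -> result.

    n_workers=None lets multiprocessing use one process per CPU;
    n_workers=1 runs serially, which is easier to debug.
    """
    indices = range(n_repetitions)
    if n_workers == 1:
        results = [process_repetition(i) for i in indices]
    else:
        pool = multiprocessing.Pool(processes=n_workers)
        try:
            # map() preserves input order and blocks until all are done.
            results = pool.map(process_repetition, indices)
        finally:
            pool.close()
            pool.join()
    return dict(results)
```

When using the pool, call `run_all(1000)` from inside an `if __name__ == "__main__":` block, since `multiprocessing` on Windows re-imports the main module in each worker process.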

About

Structural similarity benchmark, with Docker improvements.

BSD 2-Clause "Simplified" License
