OpenSourceMalaria / Series4_PredictiveModel

Can we Predict Active Compounds in OSM Series 4?

COMPETITION ROUND 2: A Predictive Model for Series 4

edwintse opened this issue · comments

UPDATE: Round 2 has now concluded. Thanks to all who participated! The results announcement can be found here.

OSM will be launching the second round of the predictive modelling competition on August 1st. This will build upon the first round which was run in 2016 (results here). All relevant background can be found in the previous two links and on the Wiki (tab above). Submissions will be allowed up to the end of the day on September 11th.

The aim of this competition is to develop a computational model that predicts new, potent molecules in OSM Series 4.

The target of these molecules is strongly suspected to be PfATP4: so far there has been an essentially perfect correlation between the activity of molecules in this series against the parasite and their activity in an assay measuring ion regulation, which is used as a proxy for activity against PfATP4. PfATP4 is an important target for the development of new drugs for malaria.

We are providing a dataset of actives and inactives. The challenge is to use the data to develop a model that allows us to (better) design compounds in Series 4 that will be active against that target. This competition is part of Open Source Malaria, meaning that everything needs to adhere to the Six Laws.

This round of the competition is funded by the AI3SD+ network. Details of the submitted proposal can be found here (#2). The funding allows us to actually make the molecules that are proposed to be active.

Competition Timeline

  • Competition launch: The competition will run from 01/08/19 to 11/09/19.
  • Paper write-up: This will happen as the competition is being run and will be submitted to the forthcoming special issue of the Beilstein Journal of Organic Chemistry.
  • Judging and results: A panel (to be announced) will evaluate the models against an undisclosed test set to determine the model(s) best able to predict activity of knowns.
  • Synthesis of top compounds: With the best performing model(s) as judged above, the relevant submitters will be asked to suggest new potent Series 4 compounds. These will be synthesised and biologically evaluated to determine the predictive capabilities of the models.

The Competition
OSM will provide:

  • A dataset containing active and inactive compounds against PfATP4 along with their in vitro potencies (here). This list has been updated to include the more recent Pathogen Box results from the Kirk lab that were used as the test set in the last competition.
  • The Master Chemical List which contains activity data for all OSM compounds from Series 1-4.
  • Jeremy Horst's Homology Model built from crystal structures of the closest mammalian homolog (SERCA)
    PfATP4-PNAS2014.pdb.txt
  • Details of the relevant mutations known to be associated with resistance.

Submission Rules:

  • Entries may either be submitted directly to GitHub (uploaded in the Submitted Models folder in the Code tab above) or be uploaded onto an ELN and a link posted in this repository.
  • Entrants can work individually or in teams (no limit to team size).
  • Entrants must work openly during the competition. This doesn't necessarily mean that inputs have to be logged in real time (although that is strongly encouraged), but entries that have not openly deposited working data on a regular basis prior to the deadline will not be accepted.
    Open Electronic Notebooks (ELN) such as Labtrove or LabArchives can be useful places to post data and work collaboratively. For example, Ho Leung Ng's ELN can be viewed and commented on here. Please note that LabTrove authors are not alerted when a comment is added to an entry so GitHub is a useful place to tag others.
  • Entrants must agree to their work's incorporation into a future OSM journal publication(s).
  • Competition winner(s) will be authors on any relevant future paper(s).
  • Any valid* entry will at least be acknowledged on any relevant future paper(s), and a significant contribution may lead to authorship.

How will entries be assessed?
There is a relatively high confidence level that PfATP4 is the molecular target for Series 4 (i.e. compounds that are potent in vitro show disruption of ion regulation in the PfATP4 assay). Therefore, for this round of the competition, we will be focussing on the prediction of active Series 4 compounds (rather than the prediction of any active compounds vs PfATP4) since the two should correlate.

  • For the final submission, entrants will predict the potencies of an undisclosed set of Series 4 compounds (to be provided at a later date)
  • A judging panel (to be announced) will evaluate these predictions in comparison with experimental data to determine the winner(s)

What's the prize?
Two prizes will be awarded, one for a private sector entry and one for a public sector entry.
...also the opportunity to contribute to our understanding of a new class of antimalarials
...and authorship on a resulting peer-reviewed publication arising from the OSM consortium

*A 'valid' entry is one that stands up to the rigour expected from published in silico models. Judges are entitled to use discretion in the case of unconventional entrants, for example those from people with no formal training such as high school students.

Comments and questions can go below. The above rules/guidance will be periodically updated.

Interesting. Our team has started reviewing the previous runs, datasets, etc. We hope to have some promising models.

Very interested in participating, and iterating on the last competition. Is there a formal definition for the core of Series 4? Curious to know where we can enumerate and where we can't.

@spadavec I think it would be best to stick to the triazolopyrazine core with substituents in the northwest and northeast positions (e.g. MMV897698 as a simple example) considering the better potencies that we typically get with those.

Was there ever a full formal writeup for the first round? I see at #538 that it was delayed due to data embargoes and such, hopefully those have passed.

@jsilter There hasn't been yet, but I am in the process of writing it up on the wiki in this repo so check back there soon. At the same time I am also drafting up this info for the paper (I'll create a new issue about this shortly).

Not clear exactly what we should leave in the submitted models folder.
Would a prediction for missing values of each compound already in the sheet suffice?
Or does it have to be a binary capable of taking a new compound SMILES and outputting the predicted activity?

By working openly, does this mean I can just place my data etc. in a repository e.g.
https://github.com/BenedictIrwin/OSM
and update that as I make progress?

@BenedictIrwin Hi, I have added some details about what will be required for submission to the original post above, but it is more the latter. In short, all entrants will be provided with the molecular identifiers (e.g. SMILES) for a set of existing Series 4 compounds (where we have not revealed the experimental potencies) and you will be required to predict the potencies for these compounds.

Yes, working openly means that at any stage, if someone wants to see the progress you've made, they can easily look at your work on an ELN or on GitHub. Feel free to place your data/working in a repository (either this one or your own) and update/provide links as you make progress.

Hi, I'm trying to get my head around the provided activity data:

  1. All data is in Google Sheet 'Ion Regulation Data for OSM Competition' (http://tinyurl.com/OSM-Series4CompData)? If this is the case what is the relevance of the Master Chemical List (https://docs.google.com/spreadsheets/d/1Rvy6OiM291d1GN_cyT6eSw_C3lSuJ1jaR7AJa8hgGsc/edit#gid=510297618)?

Re the data in Sheet 'Ion Regulation Data for OSM Competition':
2. The red/brown highlights indicate missing data and/or structures, i.e. entries that can be ignored?
3. What is the relevance of the column 'Ion Regulation Activity'? If relevant, what to do with missing data?
4. Rows 608-835 and 960-1278 do not contain activity data, should these be ignored, treated as prediction set, other?
5. Is the data in rows 836-959 any different from the data in rows 2-607? Why is it separate, since the first block is sorted by activity?
6. Do you have a cut-off when to classify a compound as 'active', something like Potency vs Parasite (uMol) <= 1 uM?
7. There is no test set provided as yet? If we generate models that we can't share because they run on a proprietary platform (which will most likely be our case), how is model performance compared between entries?

Willem

@wvanhoorn I'll try to answer these as best as I can.

  1. The compounds in the "Ion Regulation Data for OSM Competition" sheet have associated PfATP4 data (i.e. do they have ion regulation activity or not). However, this list contains non-OSM compounds as well. The "Master Chemical List" is the complete list of OSM compounds from Series 1, 3 and 4 with in vitro potencies (n.b. any compound from Series 1 is also known to be inactive against PfATP4). Round 1 of the competition was more focussed on the prediction of active compounds against PfATP4 (not limited to Series 4). For this round, we are looking for predictions for the activities of Series 4 compounds specifically so you can use the Master Chemical List to train your models.
  2. Yes, those entries can be ignored.
  3. Ion regulation activity indicates whether or not it is active in the PfATP4 assay (1 means the compound shows ion regulation activity, 0 means it doesn't). In the case of Series 4, we see correlation between PfATP4 activity and in vitro potency, so any OSM compound in the list should be relatively potent. Any OSM compound without a number in this column can be found in the Master List and can be used for training the predictions.
  4. The compounds in these rows are from the MMV Malaria Box and Pathogen Box and haven't been evaluated against the parasite. Considering that these compounds are all structurally different from Series 4 compounds, I'm not sure how helpful they will be for developing a model to predict the activities of Series 4 compounds specifically, so perhaps it's better to ignore them?
  5. No difference. The data in rows 836-959 were just added more recently and haven't been sorted.
  6. Generally, our compounds are classified as active if they are <1 uM, weakly active between 1-2.5 uM, and inactive >2.5 uM.
  7. Yes, the final test set will be provided at a later date. It's understandable that the model itself won't be able to be shared. We are not focused as much on the actual method as on the accuracy of the prediction. Each submission will need to provide the predicted potencies for this test set. By comparing these predictions with the experimental data for the test set, we can determine which models perform the best. The submitters of the best model(s) will then be asked to propose new active compounds, which will then be synthesised and tested.

Let me know if you have any further questions
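
For anyone translating the cut-offs in point 6 into code, here is a minimal sketch (the function name and class labels are purely illustrative, not part of the official data):

def classify_potency(potency_um):
    """Map a Pfal potency in uM to the activity classes described above."""
    if potency_um < 1.0:
        return 'active'
    elif potency_um <= 2.5:
        return 'weakly active'
    else:
        return 'inactive'

print(classify_potency(0.45))  # active
print(classify_potency(1.8))   # weakly active
print(classify_potency(10.0))  # inactive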

Hi,

I'm in the process of creating a dataset containing two fields, "SMILES" and "Active/Inactive" status. If I ignore all records where "Smiles" is missing or "Ion Regulation Activity" is neither 0 nor 1, I get 576 "clean" compounds (510 inactive and 66 active).

Taking into consideration @edwintse comments, may I apply the following rule to records where "Ion Regulation Activity" is missing but "Potency vs Parasite (uMol)" is available?

Rule:

if "Potency vs Parasite (uMol)" < 1:
      "Ion Regulation Activity"  = 1;
else:
     "Ion Regulation Activity"  = 0;

Nick

@mmgalushka This rule could only be applied to OSM Series 4 compounds since we know there is correlation between ion regulation activity and in vitro potency. I don't think this could be accurately applied to the other compounds from the Ion Regulation sheet since there are lots of different structural classes of compounds for which we don't know if there is any correlation.
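
As a rough pandas sketch of applying that rule to Series 4 rows only (the "Series" column name is an assumption here and may differ in the actual sheet; the other two headers come from the competition sheet):

import pandas as pd

# Sketch: fill missing "Ion Regulation Activity" for Series 4 rows only,
# using the <1 uM potency rule discussed above.
df = pd.read_csv('Ion Regulation Data for OSM Competition.csv')
potency = pd.to_numeric(df['Potency vs Parasite (uMol)'], errors='coerce')

is_s4 = df['Series'] == 4                      # assumed column name
missing = df['Ion Regulation Activity'].isna()

df.loc[is_s4 & missing & (potency < 1.0), 'Ion Regulation Activity'] = 1
df.loc[is_s4 & missing & (potency >= 1.0), 'Ion Regulation Activity'] = 0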

Will the activity of the test molecules be measured as their activity in the ion regulation assay, the Pfal EC50 assay, or both? Do we have known tolerances or errors for either assay?

@spadavec The test compounds will have been measured in the Pfal IC50 assay only. I'm actually not sure about the specifics of the assay tolerances/errors, but I know that the Pfal IC50 assay uses Mefloquine as the standard control with an acceptable pIC50 range of 7.5-7 if that helps.

I'm not from a biochem background and am a little bit lost in the domain-specific terminology. I'm posting the script which I'm using to clean the dataset.

I exported "Ion Regulation Data for OSM Competition" file in TSV format and applied the following script:

# Column indices in the exported TSV
Ion_Regulation_Activity = 2
Smiles = 4

with open('datasets/Ion Regulation Data for OSM Competition.csv', mode='w') as w:
    with open('datasets/Ion Regulation Data for OSM Competition.tsv') as r:

        for record in r.readlines():
            fields = record.split('\t')

            activity_value = fields[Ion_Regulation_Activity].strip()
            smiles_value = fields[Smiles].strip()

            # keep only rows with a SMILES and a 0/1 ion regulation label
            if len(smiles_value) > 0 and activity_value in ['0', '1']:
                w.write(smiles_value + ',' + activity_value + '\n')

The output is a CSV file with two columns, "SMILES" and "ACTIVITY". I got 851 compounds in total, of which 66 are active.

Am I on the right track? Do I need to consider something else?

I am still confused, and it seems that I am not the only one. This runs the risk of becoming a data interpretation/cleaning competition instead of a data modeling competition. Could we therefore first settle on a single file with all relevant data and no irrelevant data (for instance, only Series 4 compounds if the aim is to predict only Series 4 compounds), so that we all depart from the same starting point? And provide a specific description of what needs to be modeled. I initially thought the aim was to model 'Potency vs Parasite (uMol)'; now it seems it should be 'Ion Regulation Activity', but I am still not sure.

I 100% agree with @wvanhoorn. It would be beneficial for all teams to have a single file with only the samples relevant to this competition, containing the input feature(s) and a target feature.

Hi all,
Apologies if there has been any confusion. To clarify, the aim of this competition is to predict the Pfal IC50 potencies of Series 4 compounds that are active against PfATP4. This is slightly different to the aim of Round 1 where the aim was more broadly to predict any active compounds against PfATP4.

Both spreadsheets are supposed to be complementary. The idea behind the two are as follows:

Ion regulation spreadsheet
We highly suspect PfATP4 to be the target for the Series 4 compounds (potent Series 4 compounds show activity in the ion regulation assay; this is indicated by a 1 in the ion regulation activity column) but the structure of the target protein has not been solved. This means that we don't know what key interactions our compounds are making with the target. The ion regulation spreadsheet contains all known compounds (from many different chemotypes) that have been experimentally evaluated against PfATP4. All of this structural information (along with the provided homology model and relevant mutations) can be used to aid in discerning any key interactions that might be taking place, and therefore be used to predict new potent Series 4 compounds that exploit these interactions.

Master Chemical List
This list contains all OSM compounds from Series 1-4 with in vitro potencies with the additional knowledge that Series 1 does not target PfATP4 (i.e. 0 for ion regulation activity). As we are specifically looking for predictions on compounds with a triazolopyrazine core, the changes in structural features between Series 4 compounds and their associated in vitro potencies can be used to develop and refine your models.

n.b. The models will be evaluated for their ability to predict the potencies of a test set that consists of Series 4 compounds only.

With that in mind, you are free to use as much or as little of the provided data that you think will best achieve this goal. I believe that by providing all the data, all aspects can be considered when developing the models.

Let me make the following statements regarding my model specifically.

My model takes only one feature (SMILES) as an input. According to your comments, am I right to say that we are trying to predict the potencies defined in the "Ion Regulation Data for OSM Competition" file under the field "Potency vs Parasite (uMol)"? If this is true, our model should predict real values.

To summarize the above, we need to build a regression model which predicts "Potency vs Parasite (uMol)" from a compound's "SMILES". Do I make the right conclusion?

PS: I understand that some potency values can be sourced from "Master Chemical List" file, but at this stage, I just want to concentrate on "Ion Regulation Data for OSM Competition" file.

@mmgalushka Yes, that's correct. Totally fine to just concentrate on the one file at this stage.

Thanks a lot @edwintse!

I used the following Python script to extract records:

# Column indices in the exported TSV
Potency_vs_Parasite = 1
Smiles = 4

with open('datasets/Ion Regulation Data for OSM Competition - Malaria Molecules.csv', mode='w') as w:
    with open('datasets/Ion Regulation Data for OSM Competition - Malaria Molecules.tsv') as r:

        for record in r.readlines():
            fields = record.split('\t')

            potency_value = fields[Potency_vs_Parasite].strip()
            smiles_value = fields[Smiles].strip()

            # keep only rows with a SMILES and a numeric potency
            if len(smiles_value) > 0 and len(potency_value) > 0:
                try:
                    float(potency_value)  # make sure this is a real value
                    w.write(smiles_value + ',' + potency_value + '\n')
                except ValueError:
                    continue

I got the following file:

There are many records whose potency is exactly 10 or 50, while the majority of records are between 0 and 8.0. Are these "10" and "50" values correct?

Compounds with potency values of 10/50, or with a potency qualifier of '>', can be treated as inactive. This means that the IC50 values were greater than the maximum concentration tested in the assay.
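
As a sketch of how that could be handled during cleaning (the column names below are hypothetical, just to illustrate the idea):

import pandas as pd

# Sketch with hypothetical column names: censored measurements (qualifier '>'
# or capped values of 10/50 uM) are labelled inactive rather than treated as
# exact potencies.
df = pd.DataFrame({
    'smiles': ['CCO', 'c1ccccc1'],          # dummy structures
    'potency_um': [0.45, 50.0],
    'potency_qualifier': ['', '>'],
})

censored = (df['potency_qualifier'] == '>') | df['potency_um'].isin([10, 50])
df['active'] = ((df['potency_um'] < 1.0) & ~censored).astype(int)
print(df)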

@edwintse Thanks for all of the clarification! Just as a follow-up: if you consider only S4 compounds that have enough data to contribute to a regression model (e.g. have potency and SMILES strings), there are only ~130 compounds, which is definitely on the low side for an accurate model (typically this number needs to be closer to ~500 for pIC50 predictions to have an error of ~1 log unit, which is getting close to on par with the errors in wet measurements of IC50/EC50 values). If we expand the criterion for acceptance to be over/under 1 uM (i.e. just a classification job), the accuracy and results should be much better across the board. Has that been considered at all for this?

@spadavec Are these ~130 S4 compounds from the ion regulation spreadsheet that have both potency vs parasite and ion regulation activity, or just potency vs parasite and not ion regulation activity? There should be close to 350 S4 compounds that have potency vs parasite data (all on the Master Chemical List, but you have to filter out intermediate structures). Still, this is lower than the desired number of compounds.

It sounds reasonable to me to expand the criterion if that will provide better accuracy/results for the model.

In one of the previous posts, @edwintse wrote:

Generally, our compounds are classified as active if they are <1 uM, weakly active between 1-2.5 uM, and inactive >2.5 uM.

It was regarding "Potency vs Parasite" field in the "'Ion Regulation Data for OSM Competition" file.

at the same time another quote:

Ion regulation activity indicates whether or not it is active in the PfATP4 assay (1 means the compound shows ion regulation activity, 0 means it doesn't).

It was regarding "Ion Regulation Activity" field in the "'Ion Regulation Data for OSM Competition" file.

I selected 2 examples (however there are a number of similar samples in dataset) from the "Ion Regulation Data for OSM Competition" file:

SMILES | Potency vs Parasite (uMol) | Ion Regulation Activity
CC1CN(CC(C)O1)C(=O)c2sc3ccccc3c2Cl | 0.0255 | 0
FC(F)OC(C=C1)=CC=C1C2=NN=C3C=NC=C(OC(C)C4=CC=C(F)C(F)=C4)N32 | 8.586 | 1

Question:

  1. Why is the first compound, with "Potency vs Parasite (uMol)" << 1, inactive according to "Ion Regulation Activity"?
  2. Why is the second compound, with "Potency vs Parasite (uMol)" >> 1, active according to "Ion Regulation Activity"?

I'm sorry, maybe I completely misunderstand these values. I initially thought there was some relation between "Potency vs Parasite (uMol)" and "Ion Regulation Activity"...

@mmgalushka Not all compounds in the ion regulation data spreadsheet will show correlation between the two assays.

The first compound CC1CN(CC(C)O1)C(=O)c2sc3ccccc3c2Cl was part of the MMV Malaria Box (indicated by M in the Ion Regulation Test Set Column). This box contained 400 potent antimalarial compounds with many different structures that could be used for further development. Only 28 of the 400 compounds were found to have any ion regulation activity. The remaining 372 can be thought of as having a different MoA to PfATP4.

A similar thing can be said for the MMV Pathogen Box compounds (indicated by P in the Ion Regulation Test Set Column). Of the 400 compounds (~120 antimalarials; the rest are for other indications), only 11 were found to have ion regulation activity.

Regarding the S4 compound FC(F)OC(C=C1)=CC=C1C2=NN=C3C=NC=C(OC(C)C4=CC=C(F)C(F)=C4)N32, there is high confidence that there is correlation between potency vs parasite and ion regulation activity, however there were 3 outlier compounds when this batch was evaluated where this relationship didn't hold true (the 3 compounds are shown here).

So essentially, for OSM S4 compounds in general, you can consider this relationship to be correct.

Thanks a lot, @edwintse for this clarification!

I'm still trying to understand the relation between "Potency vs Parasite" and "Ion Regulation Activity". I collected the following stats based on the "Ion Regulation Data for OSM Competition" file:

Potency vs Parasite (uMol) | Ion Regulation "Inactive" | Ion Regulation "Active"
<= 1 uM | 278 | 40
> 1 uM and <= 2.5 uM | 124 | 17
> 2.5 uM | 27 | 2

PS: I only considered records where both the "Potency vs Parasite" and "Ion Regulation Activity" values are available and valid.

I'm struggling to understand how, even if we can create the "perfect model" to predict "Potency vs Parasite", it can help us to predict "Ion Regulation Activity".

What I'm trying to say is that even in the interval (uMol) <= 1 uM we still have the challenge of classifying Ion Regulation Activity correctly...

Keep in mind, the predictions that we are seeking are targeted at OSM Series 4 compounds. We are not so much interested in predicting ion regulation activity for Series 4; rather, we want to predict the potency vs parasite, since we can assume that Series 4 compounds that are active against the parasite will be active in the ion regulation assay as well.

Thanks a lot @edwintse! You are absolutely right. I just mixed up again that dataset contains different experiments. I got it now!

@mmgalushka @edwintse
I have taken the file earlier created by @mmgalushka, calculated the InChIKey from the SMILES, joined the OSM master list on InChIKey and kept records where series = 4. This leaves 194 compounds from Series 4 that have a potency value. Do we now finally have the training set for the competition?

https://docs.google.com/spreadsheets/d/1ReZz-_I90YYtiyEJucgj_i_6ckMQ3Rr1ocDvfPsgkOw/edit?usp=sharing

The posts below seem to contain the prediction set (assuming they are all series 4). Any more to follow?
OpenSourceMalaria/Series4#73 (comment)
OpenSourceMalaria/Series4#71 (comment)

Yes, those compounds will be used as the prediction set. It seems unlikely that any more will be added, considering they would need to be synthesised and sent for testing before the judging occurs. I'll keep you updated if that changes.

@edwintse
Could you please also confirm the training set I assembled earlier is correct (with the proviso it only contains series 4 compounds, if people want to include other series the set from @mmgalushka should be used)?

Yes, that list looks good to me

Thanks, @wvanhoorn for sharing the dataset!

I found the following duplicates:

  1. FC1=C(F)C=CC(C(OC)COC2=CN=CC3=NN=C(C4=CC=C(C#N)C=C4)N32)=C1 (9 duplicates)
  2. FC(F)OC(C=C1)=CC=C1C2=NN=C3C=NC=C(OCC(C4=CC=CC=C4)CO)N32 (2 duplicates)

Question 1: What potency is correct for the first compound?

FC1=C(F)C=CC(C(OC)COC2=CN=CC3=NN=C(C4=CC=C(C#N)C=C4)N32)=C1 -> 0.1105 or 0.207

Question 2: Are these the same compounds?

The differences are only in upper/lower case, but that might be significant from a chemistry point of view.

OC(COc2cncc3nnc(C1CCCCC1)n23)c4ccccc4
OC(COc2cncc3nnc(c1ccccc1)n23)c4ccccc4

c1ncc2n(c1Oc1cc3c(cc1)CCCC3)c(nn2)c1ccc(cc1)OC(F)F
c1ncc2n(c1Oc1cc3c(cc1)cccc3)c(nn2)c1ccc(cc1)OC(F)F

c1(ccccc1)CCCc1cncc2n1c(nn2)c1ccc(cc1)OC(F)F
c1(ccccc1)CCCC1CNCc2n1c(nn2)c1ccc(cc1)OC(F)F

C1NCc2n(C1CCO)c(nn2)c1ccc(cc1)OC(F)F
c1ncc2n(c1CCO)c(nn2)c1ccc(cc1)OC(F)F

@mmgalushka
Oops. Yes, the duplicates were generated by me during the join; I assumed the master list was deduplicated, but it's not. I will redo it a bit more carefully and share.

Re 1: a factor of two in measured IC50 or EC50 is normal. In the master list you can find the raw data and see that values easily differ by a factor of two or more. The single data points in the competition list are actually averages of multiple measurements in different labs.

@edwintse: this opens another question: it seems that IC50 and Ki data have been averaged even if they are from the same lab. For instance the first entry OSM-A-1:
Pfal IC50 (Guy) = 3.05
Pfal (K1) IC50 (Guy) = 4.379
PfaI EC50 uMol (Mean) = 3.7145 (which is the average of the above two numbers)
In my understanding, IC50 and Ki are two different measurements (relative vs absolute) that can be derived from a single experiment; averaging these two does not make sense, since it is averaging the same single observation interpreted differently.

And another issue: OSM-S-35 has duplicate entries for a single assay, at least that is how I interpret the semi-colons. However, in the spreadsheet this equates to a string, not a numerical value and these are all ignored in the final average:
Pfal IC50 (GSK) = 0.036; 0.012
Pfal IC50 (Avery) = 0.026; 0.038
Pfal IC50 (Ralph) = 0.011
PfaI EC50 uMol (Mean) = 0.011 (average of the above 5 numbers = 0.0246)

Looks like there is some more data cleaning to do (which normally is 90% of the effort of building a model so not too bad so far).

Re 2: these structures are different! Lower case represents aromatic atoms, upper case aliphatic.
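
A quick way to confirm that two such SMILES describe different molecules is to canonicalise them, e.g. with RDKit (a sketch using the second pair above):

from rdkit import Chem

# The first string has an aliphatic (uppercase) fused ring, i.e. a tetralin;
# the second has an aromatic (lowercase) fused ring, i.e. a naphthalene.
pair = [
    'c1ncc2n(c1Oc1cc3c(cc1)CCCC3)c(nn2)c1ccc(cc1)OC(F)F',
    'c1ncc2n(c1Oc1cc3c(cc1)cccc3)c(nn2)c1ccc(cc1)OC(F)F',
]
canonical = [Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in pair]
print(canonical[0] == canonical[1])  # expected: False, different compounds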

My hopefully last attempt: I have taken the master list as starting point since that seems to contain the original data. All work was done on a snapshot downloaded today (20 Aug 2019).

  1. The columns 'PfaI EC50 uMol (Mean) Qualifier' and 'PfaI EC50 uMol (Mean)' were removed
  2. Rows without Smiles were removed, as well as rows without Pfal data. The latter means that at least one of the remaining columns starting with 'Pfal' had to contain a value.
  3. The molecular structures were normalised: salts stripped, canonical tautomer calculated, charges normalised, etc.
  4. Rows were merged by (recalculated) InChIKey.
  5. Activity data was pivoted into columns 'Assay', 'Value' and 'Qualifier'. Activity values that were not IC50s, such as '100% at 40 micromolar', were removed, as well as values that did not make sense, like '0'. The original Pfal columns were left in place so that it can be seen where each data point comes from. The file was split on the three new columns so that 1 row = 1 value. During this process all other columns were copied, so there is redundancy. I leave it to each individual if and how they want to average multiple values for a single compound.
  6. Series annotation was done again since not all compounds claimed to be from series 4 contained the 'triazolopyrazine core with substituents in the northwest and northeast positions' mentioned before, see #1 (comment). When the original series annotation was '4' but the compound contains another core (or does not have two substituents in the right positions), the Series annotation is overwritten as 'not4'. Note that all series are still there, leaving it open whether or not to include data from other series.

Result is Master Chemical List - annotated
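
For anyone wanting to reproduce something similar, steps 3-4 could be roughly sketched with RDKit as follows (this is an approximation of the procedure described above, not the exact code used for the annotated list):

from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def normalise(smiles):
    """Roughly: strip salts, pick a canonical tautomer, neutralise charges,
    and return a canonical SMILES plus the InChIKey used for merging."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None, None
    mol = rdMolStandardize.LargestFragmentChooser().choose(mol)    # salt strip
    mol = rdMolStandardize.TautomerEnumerator().Canonicalize(mol)  # tautomer
    mol = rdMolStandardize.Uncharger().uncharge(mol)               # charges
    return Chem.MolToSmiles(mol), Chem.MolToInchiKey(mol)

smi, key = normalise('FC(F)Oc1ccc(-c2nnc3cncc(OCCOc4ccccc4)n23)cc1')
print(smi, key)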

@wvanhoorn To answer your previous question, both are IC50 measurements but against different strains of the parasite. Most IC50 values are against the NF54 or 3D7 strain. The Pfal (K1) IC50 (Guy) is against the K1 strain, which is multi-drug resistant. A description of the column titles can be found here. You would be correct in that averaging these two measurements is not accurate, since they are from different strains. I think the PfaI EC50 uMol (Mean) column was just meant to average the relevant potencies.

Yes, those are duplicate entries from the same assay which should be included in the average but aren't.

@wvanhoorn Thanks a lot for cleaning this data! It is much easier to use them.

@edwintse This is a general question about building the regression model. I understand that ideally we would like a model which correctly predicts the potency value in any range. However, if the predicted potency value is greater than 1 uM, does it have any practical meaning in your research? For example, does it make any difference if the predicted potency of a compound is >10 or >25 or >50...?

I guess it would be a lot less useful to be able to predict inactive compounds accurately, so we wouldn't really distinguish between >10, >25 and >50 as being any different.

So would it be correct to say that the accuracy of the regression model above a certain threshold (let's say 10) does not matter?

Yes, that sounds reasonable

@edwintse In the provided dataset there are many records where potency values are ">10" or ">25"... How are you going to evaluate the predictions of the submitted models if your experimental results show ">10" or ">25"... outcomes?

For example, one (submitted) model predicted potency value 12, another model predicted 20 and your experimental result showed ">10". Which model is more accurate?

Maybe we should introduce a potency threshold after which the predicted results should be treated the same...

@mmgalushka As you mentioned before, the ability to differentiate between inactive compounds is not entirely necessary. For instance, if one model predicted a compound to have a potency of 12 uM while another predicted it to be 20 uM, the actual experimental results would depend on the max concentration that the assay was run in (i.e. if the compound gets made and tested, it would return a result of >10 uM regardless).

The accuracy of the models will therefore be determined based on the ability to predict active compounds, say <2.5 uM.

The assay that we currently use has a max concentration of >25 uM, however I would say that much of this upper range is not terribly useful. So perhaps an upper threshold of >10 uM would suffice.

Thanks, @edwintse, for the clarification! Am I right to say that the submission result would look like this:

Compound | Potency
c1ncc2n(c1Oc1cc3c(cc1)cccc3)c(nn2)c1ccc(cc1)OC(F)F | 0.023
CCOc1ccc(cc1OCC)c2nonc2NC(=O)c3cccc(C)c3 | 0.453
CCOC(=O)c1ccc2nc(cc(O)c2c1)c3ccccc3 | >2.5
CCOC(=O)C1=CN(CC)c2cc(N3CCCCC3)c(F)cc2C1=O | 1.23
[O-]N+c1ccc(C=NNc2nc3ccccc3[nH]2)cc1 | >2.5
... | ...

Note: these are dummy potency values, used just as an example.

So we indicate the potency value up to 2.5, and everything above it is just indicated as ">2.5".
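
A small sketch of writing a submission file in that shape (the threshold handling and file name here are assumptions, not official requirements):

import csv

# Dummy predictions: numeric potencies up to the 2.5 uM threshold are written
# as numbers, anything above is written as '>2.5'.
predictions = {
    'c1ncc2n(c1Oc1cc3c(cc1)cccc3)c(nn2)c1ccc(cc1)OC(F)F': 0.023,
    'CCOC(=O)c1ccc2nc(cc(O)c2c1)c3ccccc3': 4.7,
}

with open('submission.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['SMILES', 'Predicted potency (uM)'])
    for smiles, potency in predictions.items():
        value = f'{potency:.3f}' if potency <= 2.5 else '>2.5'
        writer.writerow([smiles, value])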

@edwintse You replied in this post that the compounds for prediction will most likely come from two sources.

Do you have the final set of compounds which we need to predict? Or will it be provided later?

I will finalise the test set of compounds for the competition early next week and will post it here.

Maybe this data would be useful to someone. I tried to visualize "training" compounds in Series 4 together with "target" compounds announced in #71 and #73 for my research (see visualizations below).

Note: In order to do this visualization, I standardized each SMILES (using MolVS) and converted it into a "fingerprint" (using a variational autoencoder trained on ChEMBL v23).
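
For reference, the MolVS standardisation step looks roughly like this; the variational autoencoder fingerprint itself isn't shown, so a standard Morgan fingerprint is used below purely as a stand-in:

from molvs import standardize_smiles
from rdkit import Chem
from rdkit.Chem import AllChem

# Standardise one of the prediction-set SMILES, then compute a fingerprint.
# (The actual work used a ChEMBL-trained VAE fingerprint, not Morgan bits.)
smi = standardize_smiles('Fc1ccc(CCOc2cncc3nnc(-c4ccc(C(F)(F)F)nc4)n23)cc1F')
mol = Chem.MolFromSmiles(smi)
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
print(fp.GetNumOnBits())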

Visualization of the S4 compounds together with the compounds from #71 is here

# Smiles
0 Fc1ccc(CCOc2cncc3nnc(-c4ccc(C(F)(F)F)nc4)n23)cc1F
1 Clc1ccccc1CCOc1cncc2nnc(-c3ccncc3)n12
2 FC(F)Oc1ccc(-c2nnc3cncc(OCCC4COC4)n23)cc1
3 OCC(COc1cncc2nnc(-c3ccc(OC(F)F)cc3)n12)c1ccccc1OCc1ccccc1
4 OCc1ccc(COc2cncc3nnc(-c4ccc(OC(F)F)cc4)n23)cc1
5 c1ccc(CCOc2cncc3nnc(C4CCNCC4)n23)cc1
6 COc1ccc(CCOc2cncc3nnc(-c4ccc(OC(F)F)cc4)n23)cc1
7 COc1ccc(CCNc2cncc3nnc(-c4ccc(OC(F)F)cc4)n23)cc1OC
8 OCC(COc1cncc2nnc(-c3ccc4cc[nH]c4c3)n12)c1ccccc1
9 OC@Hc1ccccc1
10 FC(F)Oc1ccc(-c2nnc3cncc(SCCc4ccccc4)n23)cc1
11 O=C(O)C(COc1cncc2nnc(-c3ccc(OC(F)F)cc3)n12)c1ccccc1
12 FC(F)Oc1ccc(-c2nnc3cncc(OCCOc4ccccc4)n23)cc1
13 O=C(Nc1cccc(Cl)c1)c1cncc2nnc(-c3cccnc3)n12
14 COc1ccc(-c2nnc3cncc(OCCc4ccccc4)n23)cc1

Visualization of the S4 compounds together with the compounds from #73 is here

# Smiles
0 Fc1ccc(CCOc2cncc3nnc(-c4ccc(C(F)(F)F)nc4)n23)cc1F
1 Clc1ccccc1CCOc1cncc2nnc(-c3ccncc3)n12
2 FC(F)Oc1ccc(-c2nnc3cncc(OCCC4COC4)n23)cc1
3 OCC(COc1cncc2nnc(-c3ccc(OC(F)F)cc3)n12)c1ccccc1OCc1ccccc1
4 OCc1ccc(COc2cncc3nnc(-c4ccc(OC(F)F)cc4)n23)cc1
5 c1ccc(CCOc2cncc3nnc(C4CCNCC4)n23)cc1
6 COc1ccc(CCOc2cncc3nnc(-c4ccc(OC(F)F)cc4)n23)cc1
7 COc1ccc(CCNc2cncc3nnc(-c4ccc(OC(F)F)cc4)n23)cc1OC
8 OCC(COc1cncc2nnc(-c3ccc4cc[nH]c4c3)n12)c1ccccc1
9 OC@Hc1ccccc1
10 FC(F)Oc1ccc(-c2nnc3cncc(SCCc4ccccc4)n23)cc1
11 O=C(O)C(COc1cncc2nnc(-c3ccc(OC(F)F)cc3)n12)c1ccccc1
12 FC(F)Oc1ccc(-c2nnc3cncc(OCCOc4ccccc4)n23)cc1
13 O=C(Nc1cccc(Cl)c1)c1cncc2nnc(-c3cccnc3)n12
14 COc1ccc(-c2nnc3cncc(OCCc4ccccc4)n23)cc1
15 Fc1ccc(-c2nnc3cncc(OCCc4ccccc4)n23)cc1
16 O=C(COc1cncc2nnc(-c3ccc(OC(F)F)cc3)n12)c1ccccc1
17 CCN(CC)C(COc1cncc2nnc(-c3ccc(OC(F)F)cc3)n12)c1ccc(F)c(F)c1
18 O=C(Nc1ccnc(C(F)(F)F)c1)c1cncc2nnc(C34C5C6C3C3C4C5C63I)n12
19 O=C(c1ccc(-c2nnc3cncc(OCCc4ccc(F)c(F)c4)n23)cc1)N1CCOCC1
20 CN(C)c1ccc(C(O)(CO)COc2cncc3nnc(-c4ccc(OC(F)F)cc4)n23)cc1
21 OCC(COc1cncc2nnc(-c3ccc(OC(F)F)cc3)n12)c1ccc(O)cc1
22 Nc1ccc(C(CO)COc2cncc3nnc(-c4ccc(OC(F)F)cc4)n23)cc1

I have a question re compounds OSM-S-418, OSM-S-424 and OSM-S-564. I think they all contain a carborane, but the connectivity is odd since it consists of a single large ring of boron/carbon atoms. In contrast, MMV1794644 in the prediction set contains a carborane in the expected cluster form. Might the carborane of OSM-S-564 be the same as the one in MMV1794644? If they are the same, the representation should be the same.

Apologies for the late reply. They are all carboranes and should be represented most accurately in their cluster forms. The appropriate SMILES for these compounds are as follows:

OSM-S-418: FC(F)OC(C=C1)=CC=C1C2=NN=C3C=NC=C(OCCC4567[BH]89%10[CH]%11%124[BH]8%13%14[BH]%11%15%16[BH]%13%17%18[BH]%149%19[BH]%105%20[BH]%21%226[BH]%17%15([BH]%22%12%167)[BH]%18%19%20%21)N32

OSM-S-424: FC(F)OC(C=C1)=CC=C1C2=NN=C3C=NC=C(OCCC456[BH]78[BH]49([BH]%10%118[BH]%129%13%14)[BH]%15%145[BH]%16%17%13[BH]%18%10%12[BH]7%19%11[H-][BH]%19%18%16[CH]%17%156)N32.[Cs+]

OSM-S-564:
FC(F)OC(C=C1)=CC=C1C2=NN=C3C=NC=C(OCCC4567[BH]89%10[BH]%11%124[BH]8%13%14[CH]%11%15%16[BH]%13%17%18[BH]%149%19[BH]%105%20[BH]%21%226[BH]%17%15([BH]%22%12%167)[BH]%18%19%20%21)N32

I have just posted the final test set compounds for the competition in a new issue (here).

HOWEVER, please be aware that there is a high chance that the competition deadline will be extended past September 11th

I will update as soon as I find out the exact details.

I'll probably need an extension. Please post the new deadline date when available.

Hi all. Based on the on- and offline conversations I've had, August holidays got in the way of many people who wanted to submit models. I think we ought to extend, to prevent people from missing out. I recommend extending to the end of the month. Any other/better suggestions?
I have to say, though, that it's really excellent to see two submitted models already from @mmgalushka and @wvanhoorn . Do we need a neat way to compare the predictions side by side, or is that trivial for people?

@mattodd That's what happened to me. I was planning on working up a model and submitting it by the end of September.

Hi! Not sure if the deadline got postponed.

I might be able to submit some results too by the end of tomorrow (30/09), but if there is another deadline extension, that would be great.

Hi all, based on the requests for extension of the deadline, we will be extending the submission date to 2 weeks from now (i.e. final submissions by end of day 11th Oct 2019). Hopefully this will give everyone enough time to put something together.

I also wanted to just see if we will still be expecting submissions from @spadavec, @giribio and @BenedictIrwin since you had some activity on this issue earlier on. Anything would be great!

@edwintse Sorry, I thought you said we needed to put a binary/executable in the submitted models folder (from my early question), so I stopped work on this as it would have taken too much time for that level of submission. Looking now, it's just submitting a .csv of predictions. Perhaps some clearer direction at the outset next time would have helped me stay on track.

I might be able to throw something quick in, but it won't be the best. Thanks for the nudge

Hi @spadavec - pinging you quickly here about the 2 week extension that @edwintse mentions above. Hoping you might be OK to submit something with this extra time.

Hi everybody,
This week I could finally find the time to work on this. I hope I can reach a decent model before the final submission deadline on the 11th Oct 2019. I found it extremely useful to read all the comments generated so far, and this helped me to get on track quickly. I would especially like to thank @wvanhoorn for generating a cleaned version of the data (file Master Chemical List - annotated). I think this is a great starting point.

Hi everyone,

Just a reminder that the final deadline for the competition is end of day this Friday. If I'm not mistaken, based on the extension we should expect to be getting submissions from @gcincilla, @holeung, @spadavec, @BenedictIrwin, @IamDavyG, @jonjoncardoso, @giribio and @jsilter?

In case you've not seen already, a big thank you to @wvanhoorn who has generated a cleaned version of the Master Chemical List - annotated which can be used for developing your models.

Please submit your predictions of the test set (#4) as a .csv file to this repository. If you are unable to directly upload to the submission folder, you can upload it as a zip file in a comment and tag me so I don't miss it. Let me know if you encounter any problems with submission.

Hi! I have opened a Pull Request (#10) with my submission.

Many thanks to @wvanhoorn for creating the clean version of the training data, it was really helpful!

@jonjoncardoso Thanks for your submission! I've merged it into the master.

I have some submissions at: https://github.com/BenedictIrwin/OSM/tree/master/FinalModels

I didn't use the Master Chemical List for the small-set model; for the Master model I did.

I tried to predict each assay individually because they seem to be under different conditions/ranges and merging them might not be the best strategy. I also predicted the Single shot inhibition and the Ion regulation, hopefully it is looking consistent.

There is a predicted value in the original units and then a low and high error bar for each prediction. Some of them are quite wide as you might expect with the sparse data.

I can provide similar information for the entire Master chemical sheet if the model turns out to be useful, i.e. a prediction (potentially noisy) for every cell.

There might be some optimal strategy in how to combine the different readings.

Hi everybody,
I uploaded my contribution as pull request #11
After having tested several different compound subsets, sampling methods and descriptor combinations, I have to say that this seems an especially challenging modelling problem. This may be due to the intrinsic complexity of the underlying target and/or to noise present in the experimental data. In the end I think we reached a decent model, but its quality is certainly lower than most of the models we are used to working with. As we couldn't reach a reliable regression model predicting the Series 4 compounds' Pfal potency, we opted to develop and validate a classification model. The original Pfal potencies were split into 2 classes:

  • Pfal potency <= 1 uM: active
  • Pfal potency > 1 uM: inactive

If a continuous value is needed to rank the molecules or to evaluate the submitted model, the probability of a compound being active (i.e. the column named “P (Pfal class=active)”), ranging from 0 to 1, can be used for that purpose.
More details are given in the description of the pull request.

Thanks @gcincilla! I've just merged your submission with the master.

Quick question: one of the compounds can't be parsed by RDKit as a valid SMILES string:

OSM-LO-1
FC(F)OC(C=C1)=CC=C1C2=NN=C3C=NC=C(OCCC4567[BH]89%10[BH]%11%124[BH]8%13%14[BH]%11%15%16[CH]%13%17%18[BH]%149%19[BH]%105%20[BH]%21%226[BH]%17%15([BH]%22%12%167)[BH]%18%19%20%21)N32

Is there a valid SMILES string for this?

@spadavec this is a p-carborane-containing compound. These compounds are often hard to interpret, but this string is the most accurate way to represent the compound, so I'm not sure there's an alternative that can be handled by RDKit.
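
One pragmatic option during dataset preparation is to check parseability up front and skip anything RDKit rejects, e.g. (a sketch; the second string is a deliberately truncated placeholder for the carborane SMILES above):

from rdkit import Chem

smiles_list = [
    'COc1ccc(-c2nnc3cncc(OCCc4ccccc4)n23)cc1',   # parses fine
    'FC(F)OC(C=C1)=CC=C1C2=NN=C3C=NC=C(OCC...',  # truncated placeholder, will not parse
]

parsed = [(s, Chem.MolFromSmiles(s)) for s in smiles_list]  # None if unparseable
valid = [s for s, mol in parsed if mol is not None]
skipped = [s for s, mol in parsed if mol is None]
print(f'{len(valid)} valid, {len(skipped)} skipped')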

Hi. I just made a pull request with my submission and description on methodology. I used my own homology model, docking, generated 1-3D features, and then used XGBoost regressor to make my predictions.

Wonderful, thanks @holeung! It's now been merged

@spadavec, yeah, the carboranes broke most of my software. If I remember correctly, I think only the Chemaxon software could handle them. Did anyone find a way to handle them?

@holeung no, I couldn't find a way. I may have been able to figure it out via Open Babel or something like that, but I decided to punt on that specific prediction for a number of reasons.

Thanks for your submission @spadavec!

Hello, I just made my submission as a pull request @edwintse

Hi, I have made my submission as well at this link @edwintse

@spadavec, for your information OSM-LO-1 can be correctly parsed by CDK.
Nevertheless, as I excluded compounds with atom types other than H,C,O,N,S,F,Cl,Br,I (3 were originally present in my modeling set), I skipped the prediction for OSM-LO-1.

Hi, I have also made a submission to the challenge!
Here is the link. @edwintse

Many thanks to @IamDavyG @luiraym and @sladem-tox for your submissions! They have all been received and merged.

Forgot to post the summary of the modeling work; most credit goes to Laksh Aithani (@aced125).