OpenSourceMalaria / Series4_PredictiveModel

Can we Predict Active Compounds in OSM Series 4?

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Suggestions for New Active Compounds

edwintse opened this issue · comments

With the competition (#1) concluded and the winners announced (#18), they have been tasked with coming up with suggestions for new active compounds by using their developed models. These suggestions will be synthesised and sent for testing (hopefully) before Christmas. We can then add this all to the paper (#3) before submission.

In order to give us time to get started:

  • The deadline for first-pass suggestions will be close of business Friday 15th Nov.

  • The final deadline will be close of business Friday 22nd Nov to allow for an additional suggestion.


The structures below have been suggested by the winners to be synthesised. Currently no routes have been found to access the additional suggested compounds (with the exception of the leftmost one)

Untitled Wiley Document-3

Here are is my submission. I have also attached a longer write up and supplementary data file with SMILES and additional generated structures. The first structure (11.mol2) technically passed the selection criteria that I have set but it just looks a bit weird so I added two of what I think are the next best structures. What do you think?
Table
Final_OSMR2_GenP1_Writeup.pdf
OSMR2_GenP1_Supplementary.xlsx

Hi @IamDavyG - thanks, but those look like predictions for Series 1 (https://pubs.acs.org/doi/full/10.1021/acscentsci.6b00086), whereas we need suggestions based on Series 4 (the subject of this competition). Even though we'd like suggestions today (so we can get going in the lab) we're extending the deadline for another week, just to give people more time to come up with further suggestions (e.g. structures that are predicted to be nicely soluble). So hopefully you have time?

Hi,
I'm submitting these suggestions for the first triazolopyrazine based molecule on behalf of Ben Irwin.

The attached PDF contains our predictions for two molecules which ranked highly with our model. Two predictions are included as the best contains CHCHF2 which we think can eliminate to give HF, if this is problematic the second prediction could be used.

We will provide a more structurally distinct prediction before next Friday.

Best,
Alex Wade
Series4_predictions - Sheet1.pdf

Thanks Alex,

Just to clarify, HF is a neurotoxin which may be undesirable in vitro, up to you guys which compound you prefer.

Ben

Thanks @BenedictIrwin @adw62 for the suggestions! Based on the synthetic difficulty (and the possibility of HF generation) I'll get started on the synthesis of your second prediction.

I have since amended my methodology to generate molecules that feature the triazolopyrazine scaffold along with molecules that feature a distinct, but similar core scaffold in accordance with the objectives of this stage.

DND4 features the highest ranked predicted activity out of the generated triazolopyrazine scaffold molecules. FESOL4 features a similar but distinct core scaffold to the representative triazolopyrazine scaffold with better solubility. FESOL7 should be considered in case FESOL4 is not distinct enough, as it features a sulfur substitution into the core scaffold. This molecule was selected using only the secondary model since it is likely outside the applicability domain of the submitted model.
Lastly, while my methodology incorporates solubility to generate molecular structures, there is a trend of decreasing predicted activity with increasing solubility.
OSMR2_GenP2_Generated_Molecules.xlsx

Thanks @IamDavyG! Interesting to see that DND4 is the highest ranked S4 compound since we typically see lower activities with benzylic ethers as the pyrazine substituent. Do you know how the same compound but with an additional methylene group in the ether linker fairs (CC(C)C(C=C1)=CC=C1C2=NN=C3C=NC=C(N23)OCCC4=CN=C(C)C=C4)?

FESOL4 and FESOL7 are both really interesting too. I'll start having a look into the synthesis of those.

Attached are our predictions for some structurally distinct molecules. There are two sets of predictions. The first are predictions made by a recurrent neural network, these score highly but can have many problems and look difficult to synthesise. The second set are curated from the output of the RNN by hand. We hope the second set are more reasonable to make but these score lower. We'll leave it to you to pick a final molecule depending how adventitious you are feeling :)

Series4_predictions.pdf

We should have proposed our 2 molecules today, but unfortunately we had 2 crazy weeks so we did not have time to generate the molecules. We are very sorry about this.
We will pass you the 2 molecules in two weeks, so by December the 6th. The molecules will be of series-4 family and they will have optimized predicted Pfal activity, aqueous solubility and hopefully also CACO-2 cell permeability. We warmly hope that this is acceptable for you.

@gcincilla That should be fine. I probably won't get time to make them before Christmas but we can still include them in the paper as future work compounds.

Thanks for the suggestions @BenedictIrwin @adw62. Some really interesting structures! Based on the compounds I think we would be leaning towards the second set to avoid the potential issues that you mentioned, however.....after having had a look into ways to synthesise one of the three compounds below, I can't see an obvious way to make them. Anyone have any ideas/suggestions?

Untitled Wiley Document-2

I can see if something like ICsynth could be used through StarDrop:

https://www.optibrium.com/community/videos/presentations-webinars/451-syntheticpathways

ICsynth can only find synthisis routes for the our second and third suggestions. This may not be helpful since we have already noted some possible problems with these structures but I have attached the possible sysnthisis routes in case they are helpful.

ICSynth_out

ICSynth_out_1

We used all the activity results from series 4 compounds to train our model. The model was build as previously described by Laksh.
The five most potent compounds from series 4 were submitted to Medchemica MCexpert to generate ideas that increase solubility. Only transformations with specificity 3 or 4 were attempted, i.e. MMP transformations where the atom environment of the attachment point is taken into account 3 or 4 atoms deep.
Resulting compounds from the MedChemica MCexpert were scored using our predictive model. The top eight ranked compounds aim to maintain/improve potency and improve solubility.
EXS_OSM_series4_designs.sdf.gz

Here you find the 2 molecules we propose. We tried to open up new spaces (always inside the series-4) selecting molecules that, a part from active, are predicted as rather soluble (this was not easy) and permeable in Caco-2 cells. We cannot propose molecules very different from series-4 because the predictive model was developed to focus on this family.
If you think these 2 molecules have synthetic accessibility problems, let us known, as we have some backup molecules.

As mentioned earlier, soon we'll release our model through a publicly accessible web-based tool so that everybody can play with it. In this way synthetic and medicinal chemists will be able to explore their own ideas and check what our models say about it. We believe in the integration between human & artificial intelligence.

20191205_molomics_mol_proposal

Hi,

Ben and I have been optimizing our generative RNN method some more since our last suggestions were so hard to synthesise. This compound has a nice solubility so hopefully it's of some use in the future even if there is no time to make it now.

Series4_predictions.pdf

Hi @gcincilla, thanks for the suggestions. I just had a look at the routes to these but unfortunately some of the intermediates are harder to get a hold of than expected. Would you be able to post some of the backup molecules that you have?

On second thought, I may be able to make M295.

Hi,

Ben and I have been optimizing our generative RNN method some more since our last suggestions were so hard to synthesise. This compound has a nice solubility so hopefully it's of some use in the future even if there is no time to make it now.

Series4_predictions.pdf

This might be unstable with the O=C-O-C... group

@gcincilla We were just wondering if you replaced the 4-CN group in M295 with a 4-OCHF2 group, would that change how the compound scores in your prediction?

@edwintse, the molecule you are proposing (i.e. M4625) should be as active as M295 but it seems less soluble according with our prediction. Nevertheless, if we want to assume that risk, we can go forward with it.
On the other hand, an alternative to M4250, can be M2726. With this proposal we try to reduce the size of scaffold keeping the molecule soluble.
Here are the data of these 2 new proposals.

20191210_molomics_proposal

We received the results for the 4 suggested compounds that were ultimately synthesised (#25) and surprisingly, 3/4 were found to be inactive. I have compiled the list of all the S4-based suggestions. Do any look good to you @alintheopen? Based on what we know about the S4 SAR, none look too promising to me.

At this stage we were wondering whether @IamDavyG, @gcincilla, @BenedictIrwin, @adw62, @btatsis had any other suggested compounds that you could provide?

On thing I noticed (I mentioned in my talk), when looking back at the training set, there was a compound that was expected to be active with some confidence. The only measurement that had been made on it was Dundee pfal IC50 ">10".

I had the idea that perhaps the assay had a problem/blip, and the result was not representative? (I don't know how likely that is, would you expect much variation if test were repeated for example?). Unless you can see any particular reason this compound might polymerize or something that would disrupt the assay?

I need to check which of the following two it was, as there was some kind of inconsistency between the master chemical sheet and my dataset.

image

How easy/expensive is it to simply retest such a compound? Is it already available or would it have to be remade? If it's not feasible then no problem, just an idea. If it's very easy to retest, then I can suggest a few training set compounds that might be worth retesting (as not to miss opportunities)

If that's not an option we could try to rebuild the Alchemite model (with these 5 extra data points) and see if anything else comes out. Alex isn't in the office anymore, so it would be a slower process, I'd have to find some spare time.

In terms of already produced solutions we had a more experimental compound from the (second generation) RNN:

image

The overall activities aren't super high for any of these 'confident' suggestions. If you are looking for something 'really active', we'd need to extrapolate a bit further, it might take some further analysis. I found high activity predictions with a larger error bar were more likely to push the boundary on activity, but there is the risk it would go the other way and be less active.

The only other comment I can think of for now: Does the tert-butyl group substitution help open up the SAR? Can we simply pair this with the most active group on the left from the previous known compounds?

If you want the model's input on this process, I could try to enumerate these combinations and run them though to find the best combination?

@BenedictIrwin The compound with the potency of >10 uM corresponds to the right hand structure (the benzylic methoxy one). This was a bit of a complicated one (info about it here), but essentially the inherited compound was claimed to be enantioenriched but there was no data to support this. We remade the racemate and had the enantiomers resolved by chiral HPLC. This enantiomer was inactive while the other was active. It's likely we won't be able to retest the same compound.

The tert-butoxy group does open up the SAR a bit since we wouldn't have immediately thought to make a compound with it so that might be a good approach for enumeration.

@edwintse Thanks for checking that, it's good to know the signal may have been referring to the enantiomer. The descriptors we used for this particular model aren't sensitive to chirality.

I'll look into the t-butyl enumeration for some ideas.

At this stage we were wondering whether @IamDavyG, @gcincilla, @BenedictIrwin, @adw62, @btatsis had any other suggested compounds that you could provide?

@edwintse thank you for the update! I'm happy to know that we're closer to the paper completion. Congratulations!

Of the 5 new molecules (4 suggested and 1 derivative) 3 were correctly predicted by our model (but unfortunately not our own molecule 😢 ).

I have 1 question: at the beginning you asked to each winner to propose 2 compounds but, if I understand correctly, the second molecules that we suggested are not going to be synthesized. Is this right? I'm asking this because our proposed molecule M4625 was our favorite of the 2 we suggested. Do you think the second will result inactive?

We deployed our model in a web-based application as mentioned here so that anybody can exploit the model according to his knowledge, skill and intuition, through a collaborative design approach. Unfortunately still almost nobody actively participated, so there are limited molecular suggestions.
We hope to find time soon to promote the collaborative design for OSM so that we may suggest new molecules. Later on we may also use our generative model to propose new active compounds but first we would like to promote the collaborative design for internal reasons. I hope this is fine with you.

@gcincilla We originally asked for two suggestions, one based on S4 and the second with a structurally distinct core. I think you mentioned that generating a distinct compound wouldn't be that great using you model?

Based on the structure, I would think that M4625 would have some activity. I don't think I have much of the pyrrole starting material that I used for my initial attempts so we'll see. I suppose it doesn't necessarily need to go into the paper straight away.

OK @edwintse, thank you for the info.
Yes, your right, our current model is specific for S4 compounds and it won't work well for molecules very different from S4. In any case we have a confidence associated with each prediction (based on similarity towards S4 molecules), so you can directly see if we think a certain prediction is reliable or not. If a more general prediction model is required, we should develop it using a different strategy respect what we did for current one.

Hi @gcincilla @btatsis @wvanhoorn - can I quickly check something with you? The inactivity of the molecules synthesised above - does that change your models, or the other previous predictions you made? We were thinking about synthesising the Exscientia prediction that is the aniline derivative of the synthesised EGT454-1, and pursuing the pyrrole analogue predicted by Molomics (or a derivative like that). But before we do we wanted to be sure that the strength of the models' support for these structures is not fatally compromised by this admittedly small number of extra data points.

Hi @mattodd ,
Potentially any new data point can change a model and the predictions associated to it. Indeed one advantage of ML modeling is that you can refine your model during the hit-to-lead process, making it more and more precise. Having said that, I would say that adding the new 4 molecules to the model will not do a big difference in terms of new molecules prediction.
In our case our model correctly predicted 3 out of the 5 new molecules (1 was a synthesis derivative). You can check that by yourself accessing the system as explained here and filtering for molecule tagged as “synthesized”.
Summarizing: refining the model could be helpful and this is something we could do in the future, but I would wait to have a larger number of new data points.

Hi everyone,

I'm a new Honours student working with Alice @alintheopen at USYD. My project involves both Chemistry Education as well as organic synthesis. For the organic synthesis part of my project I'm interested in making some compounds for the Open Source Malaria project. I was directed to this thread about a molecular design competition. However, as I’m new here, I'm unsure about what has been done before or is currently a work in progress. I don't want to take anyone else's project and vice-versa. Are there Series 4 (or other) molecules here that I would be able to try and synthesise for my project? Or other molecules within the OSM project that would be helpful if I worked on them? I’m particularly interested at synthesising predicted targets to help to evaluate some of the models and communicating this project to non expert audiences.

Sebastian

Hi @Seb470, welcome aboard! So far the synthesis of these compounds has been a one-man job (me) so it'll be good to have you making some compounds too.

I think the best thing to get started would be to sign up to the web-app created by the Molomics team (@gcincilla) which is here. This will let you have a play around with creating a few new molecules and seeing if it will be predicted as active or not. It's good because in the drawing tool, you are able to make small changes to the molecule and see how that affects the prediction score.

I'm currently working on a couple of compounds and need to make an updated scheme. I should have that up on this issue tomorrow. But for now, have a play around, and maybe for some synthesis next week you could start making some of the OCHF2 core.

Hi @edwintse, thank you very much for the welcome. I've started trying to make molecules in the web-app now, it's very interesting! I haven't been able to make anything on the upper end of the leaderboard yet but we'll see.

re synthesis: Okay thank you for the starting point. I'll try to get some of that made ASAP. Likely next week, depending on how the other part of my project is going.