mims-harvard / TDC

Therapeutics Commons: Artificial Intelligence Foundation for Therapeutic Science

Home Page: https://tdcommons.ai

Same drug-target pair has different affinities in Davis

luoyunan opened this issue

Describe the bug
The Davis dataset is assumed to contain a unique affinity value for each drug-target pair. In TDC, however, some drug-target pairs are duplicated with different affinity values.

To Reproduce

from tdc.multi_pred import DTI
data = DTI('DAVIS', path='./data/TDC')
df = data.get_data()
df = df.drop(columns=['Drug', 'Target'])
df = df[(df['Drug_ID'] == 25243800) & (df['Target_ID'] == 'RET(V804M)')]
print(df)

Expected behavior
Each drug-target pair is expected to have a single Y value. Instead, the code above prints the output below, where four different Y values are recorded for drug 25243800 and target RET(V804M).

        Drug_ID   Target_ID      Y
18196  25243800  RET(V804M)    4.8
18197  25243800  RET(V804M)    4.0
18198  25243800  RET(V804M)  350.0
18199  25243800  RET(V804M)  340.0

Environment:

  • TDC version: 0.3.0
  • davis.tab version on dataverse: 2021-01-09 (UNF:6:x6TTv0Um70rEZT/eL8eCtA==)

Additional context
Compared with the raw data of the Davis et al. paper, it looks like the four affinity values shown above should be assigned to targets RET, RET(M918T), RET(V804L), and RET(V804M), respectively. It seems that all of these target IDs were overwritten by RET(V804M).
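The same check can be run over the whole table rather than one pair at a time. The following is a minimal sketch using pandas on a toy frame that mirrors the four rows reported above; the fifth row (drug 11111111, target ABL1) and the helper name `find_conflicting_pairs` are made up for illustration and are not part of TDC.

```python
import pandas as pd

# Toy frame mirroring the rows reported above; it is NOT loaded from TDC.
# The last row (11111111 / ABL1) is a hypothetical non-duplicated pair.
df = pd.DataFrame({
    "Drug_ID": [25243800, 25243800, 25243800, 25243800, 11111111],
    "Target_ID": ["RET(V804M)"] * 4 + ["ABL1"],
    "Y": [4.8, 4.0, 350.0, 340.0, 10.0],
})


def find_conflicting_pairs(frame):
    """Return one row per (Drug_ID, Target_ID) pair with more than one distinct Y."""
    stats = (
        frame.groupby(["Drug_ID", "Target_ID"])["Y"]
        .agg(n_rows="size", n_distinct="nunique", y_min="min", y_max="max")
        .reset_index()
    )
    return stats[stats["n_distinct"] > 1].reset_index(drop=True)


conflicts = find_conflicting_pairs(df)
print(conflicts)
```

On the real data, the same helper could be applied to the frame returned by `data.get_data()`, assuming it has the `Drug_ID`, `Target_ID`, and `Y` columns shown in the output above.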

Thanks for pointing out the bug. Great catch. I think the issue is that RET(M918T), RET(V804L), and RET(V804M) are three variants of the RET target, so they all share the same target sequence but should have different target IDs. The target sequence-SMILES pairs themselves are therefore still correct; only the target IDs are wrong. We will correct the target IDs for these targets in the next release.

Thanks! But if we use unique IDs (e.g., RET(M918T), RET(V804L), etc.) and the same sequence for those mutants, I think there would still be ambiguity for the ML model? In other words, the same inputs (X) are mapped to different affinity values (y) in the data.

I think there are two potential ways to address the issue:

  1. Use different IDs and their corresponding sequences. For example, for RET(M918T), we change the 918th amino acid from M to T.
  2. For a protein with multiple mutants, only keep the strongest binding affinity value. This is what the KIBA dataset did when integrating the Davis dataset.
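To make option 1 concrete, here is a minimal sketch of applying a point-mutation label such as M918T to a sequence. The helper `apply_point_mutation` is hypothetical, not a TDC function, and the toy sequence is invented; it assumes labels of the form wild-type residue, 1-based position, mutant residue, and rejects anything else.

```python
import re


def apply_point_mutation(sequence, mutation):
    """Apply a label like 'M918T' (wild-type residue, 1-based position, mutant residue)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        # Labels that are not simple substitutions cannot be handled here.
        raise ValueError(f"not a simple point mutation: {mutation!r}")
    wild_type, position, mutant = match.group(1), int(match.group(2)), match.group(3)
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[position - 1] != wild_type:
        # Guards against applying the label to the wrong sequence or isoform.
        raise ValueError(
            f"expected {wild_type} at position {position}, found {sequence[position - 1]}"
        )
    return sequence[: position - 1] + mutant + sequence[position:]


# Invented toy sequence, far shorter than any real kinase.
toy_sequence = "MKVLAT"
mutated = apply_point_mutation(toy_sequence, "V3L")
print(mutated)
```

The wild-type check is the important design choice: it fails loudly when the label and the stored sequence disagree instead of silently producing a wrong mutant.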

Thank you for the suggestion! I think both solutions make lots of sense. I will discuss this with the team and arrive at a final solution. Will keep you posted here!

Hi, we decided to go with option 2, mainly because option 1 involves several gene names with no clear sequence modification.

Reopening this: Haoran points out that BindingDB and KIBA have the same issue. We will discuss whether to (1) keep only the highest binding affinity for each pair, or (2) keep all values and provide a function offering various removal schemes, e.g. remove the highest, retain the mean, etc. It may also be useful to know the variance of the experimental results, to reduce the effect of outliers.
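As an illustration of what such schemes could look like, here is a minimal pandas sketch. The function `harmonize_pairs` is hypothetical and is not TDC's implementation. It assumes Y is a dissociation-constant-like value where a smaller number means stronger binding, so "highest affinity" is taken as the minimum Y; for a score where larger means stronger, the aggregation would need to be flipped. The toy values reuse the four numbers from the report plus one invented pair.

```python
import pandas as pd


def harmonize_pairs(frame, mode="max_affinity"):
    """Collapse duplicated (Drug_ID, Target_ID) rows into one row per pair.

    Assumption: smaller Y means stronger binding, so 'max_affinity' keeps min(Y).
    """
    aggregations = {"max_affinity": "min", "mean": "mean"}
    if mode not in aggregations:
        raise ValueError(f"unsupported mode: {mode!r}")
    out = (
        frame.groupby(["Drug_ID", "Target_ID"])["Y"]
        .agg(Y_out=aggregations[mode], Y_std="std", n_measurements="size")
        .reset_index()
        .rename(columns={"Y_out": "Y"})
    )
    # Y_std is the sample standard deviation; it is NaN for single measurements.
    return out


# Toy data: four conflicting rows for one pair, plus one invented unique pair.
toy = pd.DataFrame({
    "Drug_ID": [25243800] * 4 + [11111111],
    "Target_ID": ["RET(V804M)"] * 4 + ["ABL1"],
    "Y": [4.8, 4.0, 350.0, 340.0, 10.0],
})

strongest = harmonize_pairs(toy, mode="max_affinity")
averaged = harmonize_pairs(toy, mode="mean")
print(strongest)
print(averaged)
```

Keeping the spread (`Y_std`) and the count next to the harmonized value makes it possible to flag pairs whose replicate measurements disagree strongly, rather than discarding that information.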

An update: for DAVIS/KIBA, we have updated the datasets to keep the max affinity for duplicated DTI pairs. For BindingDB, we provide a function that lets users decide how to deal with them. You can now use

from tdc.multi_pred import DTI
data = DTI(name = 'BindingDB_Kd')
data.harmonize_affinities(mode = 'max_affinity')

The currently supported modes are max_affinity and mean.