Same drug-target pair has different affinities in Davis
luoyunan opened this issue · comments
Describe the bug
The Davis dataset is assumed to contain a unique affinity value for a drug-target pair. However, in TDC, there are duplicated drug-target pairs with different affinity values.
To Reproduce
from tdc.multi_pred import DTI
data = DTI('DAVIS', path='./data/TDC')
df = data.get_data()
df = df.drop(columns=['Drug', 'Target'])
df = df[(df['Drug_ID'] == 25243800) & (df['Target_ID'] == 'RET(V804M)')]
print(df)
Expected behavior
The expected output is given below. Different Y values were labeled for drug 25243800 and target RET(V804M).
        Drug_ID   Target_ID      Y
18196  25243800  RET(V804M)    4.8
18197  25243800  RET(V804M)    4.0
18198  25243800  RET(V804M)  350.0
18199  25243800  RET(V804M)  340.0
Environment:
- TDC version: 0.3.0
- davis.tab version on dataverse: 2021-01-09 (UNF:6:x6TTv0Um70rEZT/eL8eCtA==)
Additional context
Compared with the raw data of the Davis et al. paper, it looks like the four affinity values shown above should be assigned to targets RET, RET(M918T), RET(V804L), and RET(V804M), respectively. It seems all of these target IDs were overwritten by RET(V804M).
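To check how widespread the problem is, one could list every drug-target pair that carries more than one distinct Y value. The following is a minimal sketch with pandas, not part of TDC: the helper name `find_conflicting_pairs` is hypothetical, and the toy frame reuses the four rows reported above plus one made-up consistent pair (`Drug_ID` 1, `Target_ID` "TOY"). On the real data, `df` would be the frame returned by `data.get_data()`.

```python
import pandas as pd


def find_conflicting_pairs(df, keys=("Drug_ID", "Target_ID"), value="Y"):
    """Return one row per key pair that has more than one distinct value.

    Hypothetical helper for illustration; not a TDC function.
    """
    stats = (
        df.groupby(list(keys))[value]
        .agg(n_values="nunique", y_min="min", y_max="max")
        .reset_index()
    )
    # Keep only the pairs whose label is ambiguous.
    return stats[stats["n_values"] > 1].reset_index(drop=True)


# Toy data: the four rows from the report plus one made-up consistent pair.
toy = pd.DataFrame(
    {
        "Drug_ID": [25243800, 25243800, 25243800, 25243800, 1],
        "Target_ID": ["RET(V804M)"] * 4 + ["TOY"],
        "Y": [4.8, 4.0, 350.0, 340.0, 10.0],
    }
)

conflicts = find_conflicting_pairs(toy)
print(conflicts)
```

Only the RET(V804M) pair is returned for the toy frame, since the made-up pair has a single Y value.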
Thanks for pointing out the bug. Great catch. I think the issue is that RET(M918T), RET(V804L), and RET(V804M) are three variants of the RET target, so the target sequence is the same for all of them while the target IDs differ. The target sequence-SMILES pairs themselves are therefore still correct, but the target IDs are wrong. We will correct the target IDs of these targets in the next release.
Thanks! But if we use unique IDs (e.g., RET(M918T), RET(V804L), etc.) and the same sequence for those mutants, I think there would still be ambiguity for the ML model? In other words, the same inputs (X) are mapped to different affinity values (y) in the data.
I think there are two potential ways to address the issue:
1. Use different IDs and their corresponding sequence. For example, for RET(M918T), we change the 918th AA from M to T.
2. For a protein with multiple mutants, only keep the strongest binding affinity value. This is what the Kiba dataset did when integrating the Davis dataset.
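The first option can be sketched as follows, assuming target IDs follow the pattern NAME(XnnnY) for a single point mutation, with 1-based residue numbering that matches the stored wild-type sequence. The function name `mutate_sequence`, the regular expression, and the toy sequence are all hypothetical. IDs that do not fit the pattern are rejected rather than guessed.

```python
import re

# Assumed ID pattern: NAME(XnnnY), e.g. RET(M918T) means residue 918 changes M -> T.
_POINT_MUTATION = re.compile(
    r"^(?P<gene>[^()]+)\((?P<wt>[A-Z])(?P<pos>\d+)(?P<mut>[A-Z])\)$"
)


def mutate_sequence(target_id, wild_type_seq):
    """Apply the point mutation encoded in target_id to wild_type_seq.

    Hypothetical helper for illustration; positions are 1-based.
    """
    match = _POINT_MUTATION.match(target_id)
    if match is None:
        raise ValueError(f"{target_id!r} does not encode a single point mutation")
    pos = int(match["pos"])
    wt, mut = match["wt"], match["mut"]
    if pos < 1 or pos > len(wild_type_seq):
        raise ValueError(f"position {pos} is outside the sequence")
    if wild_type_seq[pos - 1] != wt:
        # The numbering or the reference sequence does not match the ID.
        raise ValueError(
            f"expected {wt} at position {pos}, found {wild_type_seq[pos - 1]}"
        )
    return wild_type_seq[: pos - 1] + mut + wild_type_seq[pos:]


# Toy example with a made-up 5-residue sequence.
mutated = mutate_sequence("TOY(M3T)", "ACMDE")
print(mutated)  # ACTDE
```

The residue check guards against a mismatch between the mutation numbering and the reference sequence, which would otherwise produce a wrong sequence without any warning.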
Thank you for the suggestion! I think both solutions make lots of sense. I will discuss this with the team and arrive at a final solution. Will keep you posted here!
Hi, we decided to follow option 2, mainly because under option 1 several target names do not correspond to a clear sequence modification.
Reopening: Haoran points out that there are also issues with BindingDB and KIBA. We will discuss whether to (1) keep only the highest binding affinity pair, or (2) keep all of them and provide a function for various removal schemes, e.g. remove highest, retain mean, etc. It may also be useful to know the variance of the experimental results, to reduce the effect of outliers.
An update: for DAVIS/KIBA, we have updated the datasets to keep the max affinity for duplicated DTI pairs. For BindingDB, we provide a function that lets users decide how to deal with them. You can now use
from tdc.multi_pred import DTI
data = DTI(name = 'BindingDB_Kd')
data.harmonize_affinities(mode = 'max_affinity')
The currently supported modes are max_affinity and mean.
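For readers who want to see what the two modes amount to on a plain DataFrame, here is a rough pandas sketch. It is not TDC's implementation: the function name `collapse_duplicates` is hypothetical, and it assumes Y is a Kd-like quantity where a smaller value means stronger binding, so "max affinity" is taken as the minimum Y. For a pKd-like label, where larger means stronger, the aggregation would be `max` instead.

```python
import pandas as pd


def collapse_duplicates(df, mode="max_affinity", keys=("Drug_ID", "Target_ID"), value="Y"):
    """Collapse duplicated key pairs into one row each.

    Hypothetical helper for illustration; not TDC's harmonize_affinities.
    Assumes smaller Y means stronger binding (Kd-like label).
    """
    if mode == "max_affinity":
        how = "min"  # strongest binder = smallest Kd-like value (assumption)
    elif mode == "mean":
        how = "mean"
    else:
        raise ValueError(f"unsupported mode: {mode!r}")
    # Only the key columns and the aggregated value are kept in the result.
    return df.groupby(list(keys), as_index=False)[value].agg(how)


# The four conflicting rows from the original report.
rows = pd.DataFrame(
    {
        "Drug_ID": [25243800] * 4,
        "Target_ID": ["RET(V804M)"] * 4,
        "Y": [4.8, 4.0, 350.0, 340.0],
    }
)

strongest = collapse_duplicates(rows, mode="max_affinity")
averaged = collapse_duplicates(rows, mode="mean")
print(strongest)
print(averaged)
```

On these four rows the two modes differ a lot (4.0 versus roughly 174.7), which shows why the choice of mode matters when the duplicated measurements are far apart.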