Shen-Lab / CPAC

Cross-Modality and Self-Supervised Protein Embedding for Compound-Protein Affinity and Contact Prediction

Motivation

Computational methods for compound-protein affinity and contact (CPAC) prediction aim to facilitate rational drug discovery by simultaneously predicting the strength and the pattern of compound-protein interactions. Although the desired outputs are highly structure-dependent, the lack of protein structures often forces structure-free methods to rely on protein sequence inputs alone. The scarcity of compound-protein pairs with affinity and contact labels further limits the accuracy and the generalizability of CPAC models.

Results

To overcome the aforementioned challenges of structure naivety and labeled-data scarcity, we introduce, for the first time, cross-modality and self-supervised learning, respectively, for structure-aware and task-relevant protein embedding. Specifically, protein data are available in two modalities, 1D amino-acid sequences and predicted 2D contact maps, which are separately embedded with recurrent and graph neural networks, respectively, as well as jointly embedded with two cross-modality schemes. Furthermore, both protein modalities are pretrained under various self-supervised learning strategies by leveraging massive amounts of unlabeled protein data. Our results indicate that individual protein modalities differ in their strengths in predicting affinities or contacts. Proper cross-modality protein embedding combined with self-supervised learning improves model generalizability when predicting both affinities and contacts for unseen proteins.

Data

Please download the processed data from https://zenodo.org/records/11005446, and extract the archives with:

unzip data.zip
unzip pretrain_data.zip

Experiments

Cross-modality protein embeddings [keras] [pytorch]
Pre-training with MLM and GraphComp
Finetuning

Discussions

For contact prediction training, we use a regression loss (where the ground-truth binary contact matrix is normalized) rather than the standard classification loss. We find that the regression loss provides substantial benefits in precision evaluation (see supplemental results for evidence). One might instead opt for the standard binary cross-entropy loss with a non-normalized, binary ground-truth contact matrix by modifying the code in the lines linked above.
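To make the two options concrete, here is a minimal sketch in plain Python. It is not the repository's training code: the helper names are hypothetical, the normalization is assumed to scale the binary matrix so its entries sum to 1, and the regression loss is assumed to be a sum of squared errors. Check the linked training code for the actual definitions.

```python
import math


def normalize_contacts(contacts):
    """Scale a binary contact matrix so its entries sum to 1.

    Assumption: the README does not specify the normalization used.
    """
    total = sum(sum(row) for row in contacts)
    if total == 0:
        return [[0.0 for _ in row] for row in contacts]
    return [[value / total for value in row] for row in contacts]


def regression_loss(pred, contacts):
    """Sum of squared errors against the normalized target (assumed form)."""
    target = normalize_contacts(contacts)
    return sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )


def bce_loss(pred, contacts, eps=1e-7):
    """Mean binary cross-entropy against the raw, non-normalized binary target."""
    terms = []
    for pred_row, contact_row in zip(pred, contacts):
        for p, c in zip(pred_row, contact_row):
            p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
            terms.append(-(c * math.log(p) + (1 - c) * math.log(1.0 - p)))
    return sum(terms) / len(terms)
```

Under these assumptions the two losses expect predictions on different scales: the regression target sums to 1 over the whole matrix, whereas the cross-entropy target is a per-entry probability of contact.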

Featurization

To process raw data into input formats of CPAC, we follow the same procedure as in DeepRelations (https://pubs.acs.org/doi/full/10.1021/acs.jcim.0c00866#) with the detailed description in its supplement (https://pubs.acs.org/doi/suppl/10.1021/acs.jcim.0c00866/suppl_file/ci0c00866_si_001.pdf). We further provide a utils file for this purpose: https://github.com/Shen-Lab/CPAC/blob/main/featurization_utils.py.

High-level summary of the featurization:

Protein graph: Residues as nodes, with edges determined by the spatial distances between C-alpha atoms

Node feature: One-hot embedding of the residue types, using the mapping dictionary at CPAC/featurization_utils.py (line 2 in bc7fc71):

    token_dict = {'_PAD':0, '_GO':1, '_EOS':2, '_UNK':3, 'A': 4, 'R': 5, 'N': 6, 'D': 7, 'C': 8, 'Q': 9, 'E': 10, 'G': 11, 'H': 12, 'I': 13, 'L': 14, 'K': 15, 'M': 16, 'F': 17, 'P': 18, 'S': 19, 'T': 20, 'W': 21, 'Y': 22, 'V': 23, 'X': 24, 'U': 25, 'O': 26, 'B': 27, 'Z': 28}
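As an illustration of the node featurization, the sketch below one-hot encodes a sequence with the token_dict quoted above. The helper `one_hot_residues` is hypothetical and not part of the repository. Mapping unknown letters to `_UNK` and padding short sequences with `_PAD` are assumptions about how the special tokens are used.

```python
# Mapping copied from featurization_utils.py as quoted above.
token_dict = {'_PAD': 0, '_GO': 1, '_EOS': 2, '_UNK': 3, 'A': 4, 'R': 5,
              'N': 6, 'D': 7, 'C': 8, 'Q': 9, 'E': 10, 'G': 11, 'H': 12,
              'I': 13, 'L': 14, 'K': 15, 'M': 16, 'F': 17, 'P': 18, 'S': 19,
              'T': 20, 'W': 21, 'Y': 22, 'V': 23, 'X': 24, 'U': 25, 'O': 26,
              'B': 27, 'Z': 28}


def one_hot_residues(sequence, max_len):
    """Return a max_len x 29 one-hot matrix (list of lists) for a sequence.

    Hypothetical helper: truncation, _UNK fallback, and _PAD padding are
    assumptions, not confirmed repository behavior.
    """
    vocab_size = len(token_dict)
    # Map each residue letter to its integer id, truncating to max_len.
    ids = [token_dict.get(aa, token_dict['_UNK']) for aa in sequence[:max_len]]
    # Pad short sequences up to max_len.
    ids += [token_dict['_PAD']] * (max_len - len(ids))
    return [[1 if j == i else 0 for j in range(vocab_size)] for i in ids]
```

For example, `one_hot_residues("ARND", 6)` yields six rows of length 29, with the last two rows encoding `_PAD`.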

Edges: If structures are present, the adjacency matrix is constructed with a_ij = 1 if the C-alpha distance between residues i and j is <= 8 angstroms, and a_ij = 0 otherwise. If structures are not available, computational tools can be used to predict the contact map (Section 1.3 in https://pubs.acs.org/doi/suppl/10.1021/acs.jcim.0c00866/suppl_file/ci0c00866_si_001.pdf).
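The distance rule for edges can be sketched as follows. The function `contact_adjacency` is a hypothetical helper written for illustration and is not the repository's implementation. It assumes the C-alpha coordinates are already available as (x, y, z) tuples in angstroms.

```python
import math


def contact_adjacency(ca_coords, cutoff=8.0):
    """Build a binary adjacency matrix from C-alpha coordinates.

    a_ij = 1 if the distance between residues i and j is <= cutoff.
    Under this rule the diagonal is 1, since each residue is at
    distance 0 from itself.
    """
    n = len(ca_coords)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                adj[i][j] = adj[j][i] = 1  # keep the matrix symmetric
    return adj
```

The pairwise loop is quadratic in sequence length, which is adequate for a sketch. A vectorized distance computation would be the usual choice for long proteins.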

Compound graphs: Atoms as nodes; edge featurization is more involved because it must account for chemical topology. A SMILES-to-compound-graph function can be found at CPAC/featurization_utils.py (line 187 in bc7fc71):

    def read_graph(smiles, MAX_size):

Citation

If you use this code for your research, please cite our paper.

@article{10.1093/bioinformatics/btac470,
    author = {You, Yuning and Shen, Yang},
    title = "{Cross-modality and self-supervised protein embedding for compound–protein affinity and contact prediction}",
    journal = {Bioinformatics},
    volume = {38},
    number = {Supplement_2},
    pages = {ii68-ii74},
    year = {2022},
    month = {09},
    issn = {1367-4803},
    doi = {10.1093/bioinformatics/btac470},
    url = {https://doi.org/10.1093/bioinformatics/btac470},
    eprint = {https://academic.oup.com/bioinformatics/article-pdf/38/Supplement\_2/ii68/45884189/btac470.pdf},
}