debajyotidasgupta / Protein-Ligand-Fingerprinting

Deep-ProLiPrint: Protein Ligand Fingerprinting

python3 gen_fingerprint.py 2xni.pdb


Methods to obtain fingerprint for a protein-ligand complex.

Table of Contents
  1. About The Project
  2. Project Details
  3. Getting Started
  4. Usage
  5. License
  6. Contact
  7. Acknowledgments

About The Project

This project implements protein-ligand fingerprinting, a computational method for analyzing the interactions between proteins and small molecules (ligands). It generates a set of features, or "fingerprints", that characterize the chemical and physical properties of both the protein and the ligand. These fingerprints are often used as input to ML algorithms and to measure similarity between complexes. The project was built as part of the course CS61060 Computational Biophysics: Algorithms to Applications at the Indian Institute of Technology, Kharagpur. It implements the following basic fingerprinting methods:

  • Neighbourhood based fingerprinting
    • counts the N, CA, C, O and R atoms of the protein in the neighbourhood of each ligand atom
  • Encoded Neighbourhood based fingerprinting
    • encodes the Neighbourhood based fingerprint with a transformer model to obtain a fixed length fingerprint
  • Kmer based fingerprinting
    • builds a fingerprint from the presence/absence of k-mers
  • MACCS key
    • pre-existing fixed length ligand fingerprinting method
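
The first and third methods can be illustrated with a minimal sketch. This is not the code in fingerprint/neighbour.py or fingerprint/kmer.py: the function names, the input layout and the 5.0 radius are assumptions made for illustration. One simplification is that every non-backbone atom counts towards R here, whereas the project can store each side chain as a single group.

```python
import math
from itertools import product

# Backbone atom names are counted individually; any other protein atom is
# lumped into the side-chain group "R" (a simplification of this sketch).
GROUPS = ("N", "CA", "C", "O", "R")


def neighbour_counts(ligand_coords, protein_atoms, radius=5.0):
    """Return one [N, CA, C, O, R] count row per ligand atom.

    ligand_coords: list of (x, y, z) tuples.
    protein_atoms: list of (atom_name, (x, y, z)) tuples.
    radius: distance cutoff (hypothetical default, same unit as the coordinates).
    """
    rows = []
    for lig_xyz in ligand_coords:
        row = dict.fromkeys(GROUPS, 0)
        for name, prot_xyz in protein_atoms:
            if math.dist(lig_xyz, prot_xyz) <= radius:
                row[name if name in row else "R"] += 1
        rows.append([row[g] for g in GROUPS])
    return rows


def kmer_presence(sequence, alphabet, k):
    """Return a 0/1 vector over all k-mers of `alphabet`, in product order."""
    seen = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
    return [int("".join(p) in seen) for p in product(alphabet, repeat=k)]
```

Flattening the rows returned by neighbour_counts gives a vector of length N*5 for N ligand atoms, so its length varies between complexes. The k-mer vector always has len(alphabet)**k entries.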

Built With

The major frameworks and libraries used to build this project are listed below, along with the other dependencies and add-ons it uses.

  • Python
    • Numpy - The fingerprints are mainly stored as numpy vectors
    • PyTorch - Mainly required for the Transformer model
    • scikit-learn - Used for implementing RandomForestRegressor to predict binding affinity of protein-ligand complexes using our fingerprinting method
    • DeepChem - Used to download pdbbind data for binding affinity of complexes
    • Biopython - Used to parse PDB files
    • RDKit - Used to generate SMILES for ligands
    • Pandas - Used to store the data in a dataframe
    • Matplotlib - Used to plot the graphs
    • Seaborn - Used to plot the graphs
    • tqdm - Used to display progress bars
    • SciPy - Used to calculate the distance between atoms

Project Details

The project has the following file structure:

.
├── binding_affinity_prediction.py
├── data
│   ├── PDB
│   ├── pdbbind_core_df.csv.gz
│   ├── pdb_files
│   └── SMILES
├── fingerprint
│   ├── alphabets.py
│   ├── base.py
│   ├── __init__.py
│   ├── interactions.py
│   ├── kmer.py
│   ├── ligand.py
│   ├── neighbour.py
│   ├── parser.py
│   ├── transformer.py
│   └── utils.py
├── gen_fingerprint.py
├── images
│   └── protein.jpeg
├── LICENSE
├── models
│   └── AutoencoderTransformer_4.pt
├── output
├── README.md
├── requirements.txt
├── similarity.py
└── train.py

The files in this code base and their functionality are described below.

  • fingerprint/parser.py - Contains the classes used to represent a protein-ligand complex as an object after parsing a PDB file

    • Atom - Class to store information about a single atom, such as its name, the residue it belongs to, and its coordinates
    • Protein - Class to store a protein as a sequence of atoms in chains
    • Ligand - Class to store a ligand as a sequence of atoms
    • ProteinLigandSideChainComplex - Class to store a protein-ligand complex as a combination of a Protein object and a Ligand object, where the Protein object stores each side chain as a single atom group rather than storing all of its atoms
    • ProteinLigandComplex - Class to store a protein-ligand complex as a combination of a Protein object and a Ligand object
  • fingerprint/base.py - Contains the implementation of BaseFingerprint, which serves as the base class for the NeighbourFingerprint class

  • fingerprint/neighbour.py - Contains the class implementation of our Neighbourhood based fingerprinting scheme

    • NeighbourFingerprint - Derived from BaseFingerprint, this class produces a fingerprint of length N*5, where N is the number of ligand atoms in the complex and the factor 5 comes from the N, CA, C, O and R groups. Each entry is the count of that atom/group within a certain radius of the ligand atom
  • fingerprint/alphabets.py - Contains the various AAR recoding schemes used in Kmer based fingerprinting

  • fingerprint/kmer.py - Contains the class implementations for the K-mer based fingerprinting scheme

    • KmerBasis - Class to store a kmer basis set and transform new vectors into the fitted basis
    • KmerSet - Given an alphabet and k, creates an iterator over the kmer basis set
    • KmerVec - Generates kmer vectors by searching the protein for all kmers in the set
  • fingerprint/transformer.py - Contains the transformer implementation that encodes a neighbourhood based fingerprint into a fixed length vector

    • AutoencoderTransformer - This model encodes a long, variable length fingerprint into a fixed length vector using a Transformer based architecture
  • fingerprint/utils.py - Contains utility functions used in the fingerprinting process

  • data/PDB - Folder where PDB files are stored/downloaded

  • data/SMILES - Folder where SMILES files are stored/downloaded

  • train.py - Contains the code to train the AutoencoderTransformer model. It first generates the original Neighbourhood based fingerprint for a set of ~190 protein-ligand complexes from PDBBind and feeds these fingerprints to the encoder of the Transformer. The Transformer decoder is trained to reconstruct the encoder input as closely as possible. The fixed length encoding is obtained by taking the mean of the encoder sequence and passing it through a simple linear network followed by a sigmoid layer, which gives values in the range [0,1]

  • binding_affinity_prediction.py - Feeds fingerprints from the Encoded Neighbourhood based fingerprinting scheme to a RandomForestRegressor model to predict the binding affinity of a protein-ligand complex. The predictions are then compared against standard PDBBind data

  • similarity.py - Studies cosine-similarity patterns between fingerprints from the Encoded Neighbourhood based fingerprinting scheme

  • gen_fingerprint.py - The main script. Given any PDB id or PDB file as input, it generates the 4 fingerprints we have implemented

Getting Started

To get a local copy up and running follow these simple steps.

Prerequisites

  • Python
    To run the code in this project, Python 3 must be installed on your system.

Installation

To set up a local copy of the project, follow the steps below. Steps 1 and 2 are alternative ways of obtaining the source. Once the local copy is set up, the steps listed in Usage can be used to interact with the system.

  1. Clone the repo
    git clone https://github.com/debajyotidasgupta/Protein-Ligand-Fingerprinting.git
  2. Alternatively, unzip the attached submission zip file to unpack all the files included with the project.
    unzip <submission_file.zip>
  3. Change directory to the Protein-Ligand-Fingerprinting directory
    cd Protein-Ligand-Fingerprinting
  4. Create a virtual environment to install the required dependencies
    virtualenv venv
    or
    python3 -m venv venv
  5. Activate the virtual environment venv
    source venv/bin/activate
  6. Install the required dependencies with the following command
    pip install -r requirements.txt
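
After installation, a quick way to confirm that the libraries listed under Built With are importable is a check like the one below. It is an optional sketch and not part of the repository. The list uses the usual import names of those packages (for example sklearn for scikit-learn, Bio for Biopython and torch for PyTorch), and the function name is made up.

```python
from importlib.util import find_spec

# Import names of the libraries listed under "Built With".
REQUIRED = ["numpy", "torch", "sklearn", "deepchem", "Bio", "rdkit",
            "pandas", "matplotlib", "seaborn", "tqdm", "scipy"]


def missing_modules(names=REQUIRED):
    """Return the names that cannot be found in the active environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("missing:", ", ".join(missing) if missing else "none")
```

Run it with the virtual environment activated. Any name it reports can usually be fixed by re-running the pip install step.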

Usage

Once the local copy of the project has been set up, follow these steps to generate fingerprints.

Generate fingerprint for a particular PDB id

To generate the fingerprints for a particular PDB id, follow these steps:

  1. Open terminal from the main project directory

  2. Run the gen_fingerprint.py file with only the PDB id or PDB filename as argument

    python gen_fingerprint.py <pdbid>

    Example

    python gen_fingerprint.py 2XNI
    or
    python gen_fingerprint.py 2XNI.pdb
  3. The fingerprints obtained using all 4 techniques mentioned earlier will be displayed on the screen

Outputs

The following four outputs are generated and saved in the files listed:

  1. Neighbourhood based fingerprint - saved in output/<pdb_id>/<pdb_id>_neighbour.txt
  2. Encoded Neighbourhood based fingerprint - saved in output/<pdb_id>/<pdb_id>_neighbour_transformer.txt
  3. Kmer based fingerprint - saved in output/<pdb_id>/<pdb_id>_aar_kmer.json
  4. Ligand MACCS key - saved in output/<pdb_id>/<pdb_id>_maacs.txt
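
For scripting around these files, the naming pattern above can be captured in a small helper. This is a sketch only: the function and dictionary names are made up, and it assumes the id in the path is exactly the one passed in, since this README does not say whether gen_fingerprint.py normalises its case.

```python
from pathlib import Path

# File name suffixes taken from the list of outputs above.
SUFFIXES = {
    "neighbour": "_neighbour.txt",
    "neighbour_transformer": "_neighbour_transformer.txt",
    "aar_kmer": "_aar_kmer.json",
    "maacs": "_maacs.txt",
}


def expected_outputs(pdb_id, root="output"):
    """Map each fingerprint type to the path where it is expected to be saved."""
    base = Path(root) / pdb_id
    return {key: base / f"{pdb_id}{suffix}" for key, suffix in SUFFIXES.items()}


if __name__ == "__main__":
    for kind, path in expected_outputs("2XNI").items():
        print(f"{kind}: {path} (exists: {path.exists()})")
```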

Running an ML model to predict binding affinity of complexes

To train and test a RandomForestRegressor that predicts the binding affinity of complexes:

  1. Open terminal from the main project directory
  2. Run the binding_affinity_prediction.py file
    python binding_affinity_prediction.py
  3. The R^2 score of the model, evaluated against the PDBBind dataset, will be displayed on the screen
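
For readers unfamiliar with the metric, R^2 (the coefficient of determination) is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand to show what the number means. The function name is made up, and the script itself presumably relies on scikit-learn's scoring rather than code like this.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes y_true is not constant, otherwise SS_tot is zero.
    """
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

A perfect prediction scores 1, always predicting the mean of the true affinities scores 0, and a model that does worse than that baseline scores below 0.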

License

Distributed under the Apache License 2.0. See LICENSE for more information.

Contact

Name                Roll No.    Email
Debajyoti Dasgupta  18CS30051   debajyotidasgupta6@gmail.com
Somnath Jena        18CS30047   somnathjena.2011@gmail.com

Acknowledgments

List of resources we found helpful and we would like to give them some credits.
