sultanlive / Protein_Sequence_Classification

A case study on Pfam dataset to classify protein families.

Protein_Sequence_Classification

Blog Post: https://towardsdatascience.com/protein-sequence-classification-99c80d0ad2df

Description

Proteins are large, complex biomolecules that play many critical roles in living organisms. A protein is made up of one or more long chains of amino acids. The sequence of a protein is the order of its amino acids, which are held together by peptide bonds. Proteins are built from 20 different kinds of amino acids, and the structure and function of each protein are determined by which amino acids it contains and how they are arranged.

Understanding the relationship between amino acid sequence and protein function is a long-standing problem in molecular biology with far-reaching scientific implications. Can a deep learning model learn the relationship between unaligned amino acid sequences and their functional annotations across all 17929 families of the Pfam database?

Pfam is a database of protein families that includes their annotations and multiple sequence alignments.

Problem Statement

  • Classify a protein's amino acid sequence into one of the protein family accessions, based on the Pfam dataset.
  • In other words: given the amino acid sequence of a protein domain, predict which class it belongs to.
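As a rough illustration of how a raw sequence could be turned into model input for this task, here is a minimal sketch of integer and one-hot encoding. The alphabet ordering, the `PAD` and `UNKNOWN` indices, the `max_length` of 100, and the function names are all assumptions made for this example; they are not taken from the notebooks in this repository.

```python
# Hypothetical encoding scheme: index 0 is padding, index 1 is any
# character outside the 20 standard amino acids (for example X, B, Z).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard amino acids
PAD, UNKNOWN = 0, 1
CHAR_TO_INDEX = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}


def encode_sequence(sequence, max_length=100):
    """Map a sequence to integer codes, truncated or right-padded to max_length."""
    codes = [CHAR_TO_INDEX.get(ch, UNKNOWN) for ch in sequence.upper()[:max_length]]
    return codes + [PAD] * (max_length - len(codes))


def one_hot(codes, depth=len(AMINO_ACIDS) + 2):
    """Turn integer codes into one-hot rows of width `depth`."""
    return [[1 if j == c else 0 for j in range(depth)] for c in codes]


if __name__ == "__main__":
    codes = encode_sequence("ACX", max_length=5)
    print(codes)
    print(one_hot(codes[:2], depth=4))
```

A fixed-length encoding like this suits batch training; a classifier would then map the encoded sequence to one family accession label.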

Deep Learning Models

I referred to this paper to define the model architectures and trained two separate models: a bidirectional LSTM and a CNN-based architecture inspired by ResNet.

ProtCNN

This model uses residual blocks inspired by the ResNet architecture. The blocks include dilated convolutions, which offer a larger receptive field without increasing the number of model parameters.
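The effect of dilation on receptive field and parameter count can be checked with a little arithmetic. The sketch below is plain Python and only illustrates the idea; the kernel size, channel count, and dilation rates are example values chosen here, not the settings used in this repository's ProtCNN.

```python
def receptive_field(kernel_size, dilation_rates):
    """Receptive field (in sequence positions) of stacked stride-1 1D convolutions."""
    field = 1
    for rate in dilation_rates:
        # Each layer widens the field by (kernel_size - 1) * dilation positions.
        field += (kernel_size - 1) * rate
    return field


def conv1d_weight_count(kernel_size, in_channels, out_channels):
    """Weights plus biases of one 1D convolution; the dilation rate does not appear."""
    return kernel_size * in_channels * out_channels + out_channels


def dilated_conv1d(signal, kernel, dilation):
    """'Valid' dilated 1D convolution (cross-correlation form) over a list of numbers."""
    span = (len(kernel) - 1) * dilation
    return [
        sum(kernel[k] * signal[i + k * dilation] for k in range(len(kernel)))
        for i in range(len(signal) - span)
    ]


if __name__ == "__main__":
    # Three layers with kernel size 3: plain vs. dilated (illustrative rates).
    print(receptive_field(3, [1, 1, 1]))   # 7
    print(receptive_field(3, [1, 2, 4]))   # 15
    # The parameter count is identical in both cases.
    print(conv1d_weight_count(3, 128, 128))
    print(dilated_conv1d([1, 2, 3, 4, 5], [1, 0, -1], dilation=2))
```

With the same three kernels per layer, the dilated stack sees 15 positions instead of 7, while each layer keeps the same number of weights.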



Conclusion

The ProtCNN model gave the best results of the two models, reaching a test accuracy of 0.988 compared with 0.958 for the bidirectional LSTM. This approach is more accurate and computationally efficient than current state-of-the-art techniques such as BLASTp for annotating protein sequences. These results suggest that deep learning models will be a core component of future protein function prediction tools.

Model                 Train Acc   Val Acc   Test Acc
Bidirectional LSTM    0.964       0.957     0.958
ProtCNN               0.996       0.988     0.988
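To make the gap between the two models easier to read, the accuracies above can be converted to error rates. This is a small sketch using only the numbers reported in the table; the helper names are made up for this example.

```python
# Accuracies copied from the results table above.
results = {
    "Bidirectional LSTM": {"train": 0.964, "val": 0.957, "test": 0.958},
    "ProtCNN": {"train": 0.996, "val": 0.988, "test": 0.988},
}


def error_rate(accuracy):
    """Fraction of sequences assigned to the wrong family."""
    return 1.0 - accuracy


def relative_error_reduction(baseline_acc, improved_acc):
    """Share of the baseline's errors that the improved model removes."""
    baseline_err = error_rate(baseline_acc)
    return (baseline_err - error_rate(improved_acc)) / baseline_err


if __name__ == "__main__":
    for name, acc in results.items():
        print(f"{name}: test error {error_rate(acc['test']):.1%}")
    reduction = relative_error_reduction(
        results["Bidirectional LSTM"]["test"], results["ProtCNN"]["test"]
    )
    print(f"Relative error reduction: {reduction:.0%}")
```

On the test set this works out to an error of about 4.2% for the bidirectional LSTM and about 1.2% for ProtCNN, a relative reduction of roughly 71%.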
