nicolagulmini / spaan

Model that computes the probability of a protein to be an adhesin. The best predictor you will find for protein sequences ;)

Adhesin classifier

This model is inspired by SPAAN (Software Program for prediction of Adhesins and Adhesin-like proteins using Neural network), originally described in this paper.

Dataset

Bacterial adhesins were obtained by running a jackhmmer search against the reference proteomes of eubacteria, with default parameters and the domains listed in this file as queries. Non-adhesin proteins were obtained with the following UniProt query:

(taxonomy_id:2) AND (reviewed:true) NOT (keyword:KW-1217) NOT (keyword:KW-1233) NOT (keyword:KW-0130) NOT (cc_function:adhesion) NOT (cc_function:"cell adhesion")

A subset of the non-adhesin proteins was randomly selected to match the size of the adhesin dataset. Redundant sequences (60% and 25% identity thresholds) were removed using CD-HIT.
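The balancing step can be sketched as follows. This is only an illustration, not the code used to build the dataset: the function name `subsample_to_match`, the `seed` parameter, and the toy identifiers are assumptions made for the example.

```python
import random


def subsample_to_match(negatives, n_positives, seed=0):
    """Randomly pick as many negatives as there are positives.

    `seed` is a hypothetical parameter, added here only for reproducibility.
    """
    if n_positives > len(negatives):
        raise ValueError("not enough negative sequences to match the positives")
    rng = random.Random(seed)
    # sample() draws without replacement, so no sequence is picked twice.
    return rng.sample(negatives, n_positives)


# Toy identifiers standing in for real sequence records.
adhesins = ["adh_1", "adh_2", "adh_3"]
non_adhesins = [f"neg_{i}" for i in range(10)]

balanced_negatives = subsample_to_match(non_adhesins, len(adhesins))
print(len(balanced_negatives))  # same size as the adhesin set
```

Fixing the seed makes the selected subset reproducible across runs, which helps when comparing models trained on the same split.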

Feature computation

Features are computed with iFeature, and its output files are then parsed to obtain the vectors that are fed to the model. The descriptors are:

  • AAC: amino acid composition
  • DPC: dipeptide composition
  • CTDC: composition
  • CTDT: transition
  • CTDD: distribution

(See here for more information about what they are and how to compute them.)

Here is a brief tutorial on how to compute them:

!rm -rf iFeature    # remove any previous clone; -f avoids an error if it does not exist
!git clone https://github.com/Superzchen/iFeature
!python iFeature/iFeature.py --file ./input.fasta --type AAC --out aac.out    # amino acid composition
!python iFeature/iFeature.py --file ./input.fasta --type DPC --out dpc.out    # dipeptide composition
!python iFeature/iFeature.py --file ./input.fasta --type CTDC --out ctdc.out  # composition
!python iFeature/iFeature.py --file ./input.fasta --type CTDT --out ctdt.out  # transition
!python iFeature/iFeature.py --file ./input.fasta --type CTDD --out ctdd.out  # distribution
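A parser for these output files could look like the sketch below. It assumes each file is a tab-separated table whose first row is a header and whose first column is the sequence name; check this against your own `.out` files. The function names `parse_ifeature_table` and `concat_feature_tables` are hypothetical and are not taken from the notebook.

```python
import csv
import io


def parse_ifeature_table(handle):
    """Read one tab-separated table and return (names, rows of floats)."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row (assumed to hold feature names)
    names, rows = [], []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        names.append(row[0])
        rows.append([float(value) for value in row[1:]])
    return names, rows


def concat_feature_tables(handles):
    """Concatenate the per-descriptor rows into one vector per sequence."""
    ref_names, vectors = None, None
    for handle in handles:
        names, rows = parse_ifeature_table(handle)
        if ref_names is None:
            ref_names, vectors = names, [list(row) for row in rows]
        else:
            if names != ref_names:
                raise ValueError("tables list different sequences or orders")
            for vector, row in zip(vectors, rows):
                vector.extend(row)
    return ref_names, vectors


# Toy tables standing in for files such as aac.out and dpc.out.
table_a = io.StringIO("#\tA\tC\nseq1\t0.5\t0.5\nseq2\t1.0\t0.0\n")
table_b = io.StringIO("#\tAA\tAC\tCA\nseq1\t0.1\t0.2\t0.3\nseq2\t0.4\t0.5\t0.6\n")

names, vectors = concat_feature_tables([table_a, table_b])
print(names, [len(v) for v in vectors])
```

With real data, the in-memory tables would be replaced by file handles opened on `aac.out`, `dpc.out`, `ctdc.out`, `ctdt.out` and `ctdd.out`, in a fixed order so that every sequence gets the same feature layout.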

PCA

Since every sequence has a 693-dimensional feature vector (20 + 400 + 39 + 39 + 195), we performed Principal Component Analysis (PCA) to reduce the dimensionality. Here are the results:

[Figure: explained variance]

So we can keep just the first 350 components, reducing the dimensionality by about 50%.
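For readers who want to reproduce this kind of analysis, here is a minimal PCA sketch based on NumPy's SVD. It is not the notebook's code, which may well rely on a library implementation; the helper names `fit_pca` and `project`, the random toy matrix, and the value of `k` are assumptions chosen for illustration.

```python
import numpy as np


def fit_pca(X):
    """Return (mean, principal axes, explained variance ratio) via SVD."""
    mean = X.mean(axis=0)
    centered = X - mean
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    variances = singular_values ** 2 / (X.shape[0] - 1)
    return mean, vt, variances / variances.sum()


def project(X, mean, components, k):
    """Project the centered data onto the first k principal axes."""
    return (X - mean) @ components[:k].T


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))  # toy stand-in for the 693-dimensional matrix

mean, components, ratio = fit_pca(X)
cumulative = np.cumsum(ratio)  # the curve shown in an explained-variance plot
X_reduced = project(X, mean, components, k=4)
print(X_reduced.shape)  # (50, 4)
```

The cumulative curve is what guides the choice of how many components to keep: the number of components is read off where the curve flattens or reaches the desired share of variance.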

Model

We decided to use the smallest model possible: with just one 10-unit Dense layer and K=400 components from PCA, we get our best results so far.

_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_11 (InputLayer)       [(None, 400)]             0         
                                                                 
 dense_21 (Dense)            (None, 10)                4010      
                                                                 
 dense_22 (Dense)            (None, 1)                 11        
                                                                 
=================================================================
Total params: 4,021
Trainable params: 4,021
Non-trainable params: 0
_________________________________________________________________
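The parameter counts in the summary can be checked by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. The helper name `dense_params` below is made up for this check.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units


hidden = dense_params(400, 10)  # dense_21: 400 PCA components -> 10 units
output = dense_params(10, 1)    # dense_22: 10 units -> 1 output
total = hidden + output
print(hidden, output, total)  # 4010 11 4021
```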

Results

[Figure: loss]

[Figure: accuracy]

test_loss = 0.214177668094635
test_accuracy = 0.9396551847457886

Note that by removing the regularizers and increasing the number of neurons in the Dense layer, it is possible to obtain roughly the same results (slightly more overfitted), but in about 20 epochs.

You can follow every step in this notebook.
