ahmadpgh / simDEF

simDEF is an NLP-based model for gene function analysis using Gene Ontology annotations of gene products and proteins.

Home Page: http://kiwi.cs.dal.ca/Software/SimDEF

simDEF: Definition-based Semantic Similarity Measure of GO Terms for Functional Similarity Analysis of Genes

Background

The rapid growth of biomedical data annotated with the Gene Ontology (GO) vocabulary calls for reliable methods of measuring semantic similarity between GO terms, which in turn support the analysis of functional similarity between genes. Compared with sequence and structural similarity, functional similarity is more informative for understanding the biological roles and functions of genes. Many important applications in computational molecular biology, such as gene clustering, protein function prediction, protein interaction evaluation, and disease gene prioritization, require functional similarity. Some existing semantic similarity measures combine the similarity scores of individual GO term pairs to estimate gene functional similarity, whereas others compare terms in groups. Nevertheless, all of these measures depend strictly on the ever-changing topological structure of GO, they are highly task-dependent and therefore hard to generalize, and none of them takes the valuable textual definitions of GO terms into account. These limitations make it challenging to measure gene functional similarity reliably.
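
To make the pairwise strategy mentioned above concrete, here is a minimal Python sketch of one common way to combine term-pair scores into a gene-level score, the best-match average. It is an illustration only: the source does not say which combination rule simDEF or the compared measures use, and the term labels and the `toy_term_sim` function are hypothetical.

```python
def best_match_average(terms_a, terms_b, term_sim):
    """Combine term-pair similarities into one gene-pair score.

    For every term of one gene, take its best match among the other
    gene's terms, then average all best-match scores from both sides.
    """
    if not terms_a or not terms_b:
        return 0.0
    best_for_a = [max(term_sim(a, b) for b in terms_b) for a in terms_a]
    best_for_b = [max(term_sim(a, b) for a in terms_a) for b in terms_b]
    return (sum(best_for_a) + sum(best_for_b)) / (len(terms_a) + len(terms_b))


def toy_term_sim(a, b):
    # Hypothetical stand-in for a real GO term similarity measure.
    return 1.0 if a == b else 0.2


# Hypothetical annotation sets for two genes (labels are not real GO IDs).
gene_x_terms = ["T1", "T2"]
gene_y_terms = ["T2", "T3"]
gene_score = best_match_average(gene_x_terms, gene_y_terms, toy_term_sim)
print(gene_score)
```

Any term-level measure, including a definition-based one, can be passed in as `term_sim` without changing the combination step.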

Results and conclusions

This project introduces simDEF, an efficient method for measuring the semantic similarity of GO terms using their GO definitions. In essence, simDEF is an optimized version of the Gloss Vector measure, which is commonly used in natural language processing (NLP); pointwise mutual information (PMI) is employed for this optimization. Once the optimized definition vectors of all GO terms have been constructed, the cosine of the angle between two terms' definition vectors represents their degree of similarity. Experimental studies show that simDEF outperforms existing semantic similarity measures in correlation with sequence homology and gene expression data, and that it is better at separating true from false interactions in a protein-protein interaction (PPI) task. Relative to existing similarity measures, when validated on a yeast (Saccharomyces cerevisiae) reference database, simDEF improves correlation with sequence homology by up to 50%, improves correlation with gene expression by more than 4% in the biological process hierarchy of GO, and increases PPI predictability by more than 2.5% in F1-score for the molecular function hierarchy.
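
As an illustration of the pipeline described above, the following Python sketch builds PMI-weighted word vectors, sums them into definition vectors, and compares terms by cosine. It is not the released code. The toy definitions, the term labels, the document-level co-occurrence counts, and the choice to keep only positive PMI values are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict
from itertools import permutations


def pmi_word_vectors(documents):
    """Map each word to a sparse vector of positive PMI weights.

    Co-occurrence is counted per document (an assumption of this sketch).
    """
    n_docs = len(documents)
    word_df = Counter()
    pair_df = Counter()
    for tokens in documents:
        vocab = sorted(set(tokens))
        word_df.update(vocab)
        pair_df.update(permutations(vocab, 2))
    vectors = defaultdict(dict)
    for (word, context), n_pair in pair_df.items():
        pmi = math.log((n_pair * n_docs) / (word_df[word] * word_df[context]))
        if pmi > 0:  # keep positive associations only (assumption)
            vectors[word][context] = pmi
    return vectors


def definition_vector(tokens, word_vectors):
    """Sum the word vectors of all words in one definition."""
    total = Counter()
    for word in tokens:
        for context, weight in word_vectors.get(word, {}).items():
            total[context] += weight
    return total


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy definitions written for this example, not real GO definitions.
toy_definitions = {
    "TERM_A": "catalysis of the transfer of a phosphate group to a protein",
    "TERM_B": "catalysis of the removal of a phosphate group from a protein",
    "TERM_C": "binding to a strand of dna",
}
tokenized = {term: text.split() for term, text in toy_definitions.items()}
word_vecs = pmi_word_vectors(list(tokenized.values()))
def_vecs = {t: definition_vector(toks, word_vecs) for t, toks in tokenized.items()}

sim_ab = cosine(def_vecs["TERM_A"], def_vecs["TERM_B"])
sim_ac = cosine(def_vecs["TERM_A"], def_vecs["TERM_C"])
print(sim_ab, sim_ac)
```

In this toy setting the two catalysis-related definitions score higher with each other than either does with the binding definition. A realistic run would estimate the co-occurrence statistics from a much larger corpus than the definitions themselves.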

Availability

The code is free and can be used, modified, and redistributed without any restrictions.
Release date: September 2015
Documentation: Please refer to the provided instruction file before use (highly recommended).

Datasets for the evaluation

The datasets built for the study and used in the evaluation analyses are listed below (see the 'Validation datasets' subsection of the 'EXPERIMENTAL DATA' section for details):

  1. Sequence Homology Data (20,167 protein pairs)
  2. Gene Expression Data (4,800 protein pairs)
  3. PPI Data (6,000 protein pairs)
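
The evaluation reports correlation (for the sequence homology and gene expression data) and F1-score (for the PPI data). The Python sketch below shows how such scores could be computed for a list of protein pairs. It rests on assumptions: the source does not name the correlation coefficient, so Pearson correlation is used here, the fixed decision threshold for the PPI task is hypothetical, and all numbers are toy values rather than results from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def f1_score(labels, scores, threshold):
    """F1-score when pairs scoring at or above the threshold are called interactions."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(predicted, labels))
    fp = sum(p and not l for p, l in zip(predicted, labels))
    fn = sum(l and not p for p, l in zip(predicted, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy values for four protein pairs (illustrative only).
functional_sim = [0.9, 0.4, 0.6, 0.1]     # similarity scores per protein pair
sequence_sim = [0.8, 0.5, 0.3, 0.2]       # e.g. a sequence-based score
is_interaction = [True, True, False, False]

toy_correlation = pearson(functional_sim, sequence_sim)
toy_f1 = f1_score(is_interaction, functional_sim, threshold=0.5)
print(toy_correlation, toy_f1)
```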

Citation

simDEF: Definition-based Semantic Similarity Measure of Gene Ontology Terms for Functional Similarity Analysis of Genes
Ahmad Pesaranghader; Stan Matwin; Marina Sokolova; Robert G. Beiko
Bioinformatics, 2015
doi: 10.1093/bioinformatics/btv755 (supplementary material file)



Ahmad Pesaranghader © 2015

About

License: MIT License


Languages

Perl 43.0%, Perl 6 29.9%, R 13.9%, MATLAB 13.2%