aminhbl / bioinformatics-algorithms

Three fundamental algorithms of bioinformatics


Bioinformatics Algorithms

Overview

This repository implements three algorithms that, in my humble opinion, every student interested in bioinformatics should know.

Semi-Global Alignment

First we initialize and fill in the scoring matrix, since we follow the dynamic programming approach. The idea is that the score of the best possible alignment ending at (i, j) in the matrix equals the score of the best alignment ending just before those positions, at (i-1, j-1), plus the score for aligning Xi and Yj (residue i in protein sequence X and residue j in protein sequence Y). The two alternatives that place a gap in one of the sequences, coming from (i-1, j) or (i, j-1) plus the gap score, are compared as well, and the maximum is kept.
For scoring we use PAM250 for matches and mismatches, and a fixed score of -9 for gaps.
After finding the score of the optimal alignment, which is the highest score in the bottom row or right-hand column of the matrix, we trace back through the matrix to recover that alignment.
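The fill and traceback steps described above can be sketched as follows. This is a minimal illustration, not the repository's code: the function name `semi_global_align` and the `toy_score` function are assumptions made for the example, and `toy_score` is a stand-in for a real PAM250 lookup, so the scores it produces will differ from the ones shown in the Quick Start.

```python
def semi_global_align(x, y, score, gap=-9):
    """Semi-global alignment: gaps before or after the overlap are free."""
    n, m = len(x), len(y)
    # F[i][j]: best score of an alignment ending at x[i-1] and y[j-1].
    # Row 0 and column 0 stay at 0, so leading gaps cost nothing.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # align the two residues
                F[i - 1][j] + gap,                            # gap in y
                F[i][j - 1] + gap,                            # gap in x
            )
    # The optimum is the highest score in the bottom row or right-hand column.
    cells = [(F[n][j], n, j) for j in range(m + 1)]
    cells += [(F[i][m], i, m) for i in range(n + 1)]
    best, i, j = max(cells)
    end_i, end_j = i, j

    # Trace back from the best cell until we hit the first row or column.
    ax, ay = [], []
    while i > 0 and j > 0:
        if F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    core_x, core_y = "".join(reversed(ax)), "".join(reversed(ay))

    # Pad the unaligned overhangs with free gaps on both ends.
    lead_x, lead_y = x[:i] + "-" * j, "-" * i + y[:j]
    tail_x = x[end_i:] + "-" * (m - end_j)
    tail_y = "-" * (n - end_i) + y[end_j:]
    return best, lead_x + core_x + tail_x, lead_y + core_y + tail_y


def toy_score(a, b):
    # Stand-in for PAM250 (assumption): +2 for a match, -1 for a mismatch.
    return 2 if a == b else -1


print(semi_global_align("ACGT", "CG", toy_score))  # (4, 'ACGT', '-CG-')
```

Passing the substitution score as a function keeps the sketch independent of how the PAM250 table is stored; any callable that maps two residues to a score can be plugged in.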

Quick Start

Input two protein sequences, each at most 100 residues long, in capital letters.

HEAGAWGHE
PAWHEA

Output will be the score of the alignment along with the pairwise aligned sequences.

20
HEAGAWGHE-
---PAW-HEA

Block-Based Star Alignment

First we perform star alignment: we compute the pairwise alignment score for each pair of protein sequences and form the distance matrix. We then pick as the center the sequence with the minimum total distance to all the others. The remaining sequences are added to the alignment in decreasing order of similarity (increasing distance) with respect to the center sequence.
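The center-selection step can be sketched as below, assuming the distance matrix has already been computed from the pairwise alignments. The function name `pick_center` and the sample matrix are hypothetical and only serve the illustration.

```python
def pick_center(dist):
    """Return the center index and the order in which to add the other sequences.

    dist is a square matrix where dist[a][b] is the distance between
    sequences a and b (how it is derived from alignment scores is left open here).
    """
    totals = [sum(row) for row in dist]
    # The center is the sequence with the smallest total distance to the rest.
    center = min(range(len(dist)), key=totals.__getitem__)
    # Add the remaining sequences from most similar (smallest distance) to least.
    order = sorted(
        (k for k in range(len(dist)) if k != center),
        key=lambda k: dist[center][k],
    )
    return center, order


# Made-up distances between four sequences.
dist = [
    [0, 3, 4, 7],
    [3, 0, 2, 5],
    [4, 2, 0, 6],
    [7, 5, 6, 0],
]
print(pick_center(dist))  # (1, [2, 0, 3])
```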
To get a better alignment for more divergent sequences, we then improve the resulting multiple sequence alignment with block-based alignment. This works as follows:
We find blocks of at least two columns within the alignment that contain gaps or mismatches, then apply star_alignment to these blocks alone. If realigning a block improves the overall score of the alignment, we switch to the realigned block.
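One possible reading of the block-finding step is sketched below: a block is taken to be a run of at least two consecutive columns, each of which contains a gap or a mismatch. That interpretation, and the function name `find_blocks`, are assumptions; the repository may delimit blocks differently.

```python
def find_blocks(alignment, min_width=2):
    """Return half-open (start, end) column ranges of non-conserved runs."""
    blocks, start = [], None
    for col, column in enumerate(zip(*alignment)):
        # A column is conserved when it has no gap and a single residue type.
        conserved = "-" not in column and len(set(column)) == 1
        if not conserved and start is None:
            start = col                      # a new run begins here
        elif conserved and start is not None:
            if col - start >= min_width:
                blocks.append((start, col))  # keep runs that are wide enough
            start = None
    # Close a run that reaches the last column.
    width = len(alignment[0]) if alignment else 0
    if start is not None and width - start >= min_width:
        blocks.append((start, width))
    return blocks


msa = [
    "TAGCTA-C-CAGGA",
    "CAGCTA-C-CAGG-",
    "TAGCTA-C-CA-GT",
    "CAGCTATCGC-GGC",
    "CAGCTA-C-CAGGA",
]
print(find_blocks(msa))  # [(10, 12)]
```

Each returned range can then be sliced out of every row, realigned on its own, and kept only if the overall score improves.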

Quick Start

Input the number of protein sequences to be aligned, then each sequence on a separate line in capital letters.

5
TAGCTACCAGGA
CAGCTACCAGG
TAGCTACCAGT
CAGCTATCGCGGC
CAGCTACCAGGA

Output will be the overall score of the multiple sequence alignment, followed by the aligned sequences, one per line, in the same order as the input.

240
TAGCTA-C-CAGGA
CAGCTA-C-CAGG-
TAGCTA-C-CA-GT
CAGCTATCGC-GGC
CAGCTA-C-CAGGA

Profile-Based Alignment

Here we construct a statistical model by first building a profile from the given multiple sequences, calculating the probability of each residue appearing at its respective position. In this process we use a pseudocount of 2 to avoid zero probabilities. We then locally align a given long sequence to the multiple sequences, using the profile to score the alignment, and report the subsequence with the highest alignment score.
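The profile-building step might look like the sketch below. Several details are assumptions rather than the repository's choices: the alphabet is taken to be the symbols seen in the input (including the gap character), the pseudocount is added to every symbol in every column, and `log_score` is a hypothetical helper that scores a candidate of the same length as the profile by summing log-probabilities. The local alignment itself is not shown.

```python
import math


def build_profile(msa, pseudocount=2):
    """Per-column symbol probabilities with a pseudocount added to every symbol."""
    alphabet = sorted(set("".join(msa)))  # assumption: only symbols seen in the input
    denom = len(msa) + pseudocount * len(alphabet)
    profile = []
    for column in zip(*msa):
        profile.append(
            {s: (column.count(s) + pseudocount) / denom for s in alphabet}
        )
    return profile


def log_score(profile, candidate):
    """Sum of log-probabilities of a candidate that is as long as the profile."""
    return sum(math.log(col[s]) for col, s in zip(profile, candidate))


msa = ["HVLIP", "H-MIP", "HVL-P", "LVLIP"]
profile = build_profile(msa)
print(round(profile[0]["H"], 3))  # 0.278, i.e. (3 + 2) / (4 + 2 * 7)
print(log_score(profile, "HVLIP") > log_score(profile, "PPPPP"))  # True
```

Because every symbol receives the pseudocount, no probability is zero and the logarithm is always defined.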

Quick Start

Input the number of sequences to build the profile from, then each sequence on a separate line, followed by the long sequence.

4
HVLIP
H-MIP
HVL-P
LVLIP
LIVPHHVPIPVLVIHPVLPPHIVLHHIHVHIHLPVLHIVHHLVIHLHPIVL

Output will show the aligned subsequence with the highest score.

H-L-P

Languages

Language:Python 100.0%