ria18405 / ProtienAlignment

Implemented algorithm to align protein sequences

PROTEIN ALIGNMENT ALGORITHM

  • Python script: ProtienAlignment.py

  • Input: FASTA format file (Default: protein.fa)

  • Output: 2 csv format files (Default: dotplot_output.csv, sum_matrix_output.csv)

  • Requirements:

    • Install the pandas library for Python 3 before running the script.
  • Command to run the script:

    • To use default input and output files:

      $ python3 ProtienAlignment.py

    • To provide your own input file:

      $ python3 ProtienAlignment.py -i [filename]

  • STEPS:

    • Step 0: Reading the original sequences

      The getopt.getopt method parses the command-line options and argument list; we store the parsed options and remaining arguments in opts and args. If parsing fails, we print the correct format of the command and exit. Otherwise each option is read, and if the option is -i (meaning an input file follows), we assign its argument to the input file. We then open the input file and, for each line that is not blank and does not start with the '>' symbol (which marks a FASTA header line), add that line to the list seq[]. seq[0] and seq[1] hold the original sequences to be aligned. In effect, each sequence stored in the FASTA file is converted to the string data type.
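The option parsing and FASTA reading described in Step 0 might look roughly like the sketch below. The function names parse_input_path and read_sequences are hypothetical and not taken from the script, and the reader assumes each sequence sits on a single line, which is what "add that line to the list" suggests.

```python
import getopt
import sys


def parse_input_path(argv, default="protein.fa"):
    """Return the file given with -i, or the default input file."""
    try:
        opts, _args = getopt.getopt(argv, "i:")
    except getopt.GetoptError:
        # Show the expected command format and stop.
        print("usage: python3 ProtienAlignment.py -i [filename]")
        sys.exit(2)
    path = default
    for opt, arg in opts:
        if opt == "-i":
            path = arg
    return path


def read_sequences(lines):
    """Collect sequence lines, skipping blank lines and '>' header lines."""
    seq = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith(">"):
            seq.append(line)
    return seq
```

With a file on disk, one could call read_sequences(open(parse_input_path(sys.argv[1:]))) and take the first two entries as the sequences to align.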

    • Step 1: Creating the dotplot

      We create a matrix named dotplot of dimensions (length of sequence 1 × length of sequence 2) and initialize all cells to zero. We then iterate over each cell and compare the corresponding elements of the two sequence strings. If the two elements are the same (match), we place a 1 in that cell; otherwise (mismatch) the cell stays 0.

      Converting the dotplot into a dataframe and writing it to a csv file: We create two lists, colnames, holding the elements of sequence string 1, and rownames, holding the elements of sequence string 2. We then create a dataframe df with the elements of rownames as the index and the elements of colnames as the columns, and load it with the 1/0 values from the dotplot created earlier. To write the dataframe to a csv file named dotplot_output.csv, we use the to_csv method with commas as separators.
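As a minimal sketch of Step 1, the dotplot can be built as a list of lists and then wrapped in a pandas DataFrame. The names build_dotplot, save_matrix_csv, top_seq and left_seq are hypothetical. The sketch assumes rows follow sequence 2 and columns follow sequence 1, matching the rownames/colnames description.

```python
def build_dotplot(top_seq, left_seq):
    """Return a matrix with 1 where the residues match and 0 elsewhere.

    Rows follow left_seq (sequence 2), columns follow top_seq (sequence 1).
    """
    return [[1 if a == b else 0 for a in top_seq] for b in left_seq]


def save_matrix_csv(matrix, top_seq, left_seq, path):
    """Label the matrix with the residues and write it as a csv file."""
    # pandas is imported here so build_dotplot works without it installed.
    import pandas as pd

    colnames = list(top_seq)   # one column label per residue of sequence 1
    rownames = list(left_seq)  # one row label per residue of sequence 2
    df = pd.DataFrame(matrix, index=rownames, columns=colnames)
    df.to_csv(path, sep=",")
    return df
```

For example, save_matrix_csv(build_dotplot("ABC", "AC"), "ABC", "AC", "dotplot_output.csv") would write a 2 × 3 table of 1/0 values.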

    • Step 2: Making a sum matrix and loading it into a csv file

      We make a copy of dotplot named sum_matrix using deepcopy (so that changes to sum_matrix are not reflected in dotplot), and then fill in sum_matrix. Because the last column and last row are always the same as those of dotplot, we start our iteration at index len(sequence)-2 and decrease the index by 1 at every step down to index 0 (the loop stops at -1). For each cell, we apply the following rule:

      new_value(row r, column c) = value(r,c) + max{ value(r+1,c+1), cells(r+1, c+2 to c_max), cells(r+2 to r_max, c+2) }

      We then create a dataframe df_sum and write it to a csv file named sum_matrix_output.csv, as we did with dotplot.
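The fill order in Step 2 can be sketched as below; build_sum_matrix is a hypothetical name. One caveat: the text writes the column scan as column c+2. This sketch scans column c+1 instead, which is how the Needleman-Wunsch sum step is usually described. That substitution is an assumption of the sketch and has not been checked against the script.

```python
from copy import deepcopy


def build_sum_matrix(dotplot):
    """Fill a copy of the dotplot from the bottom-right toward the top-left."""
    sum_matrix = deepcopy(dotplot)  # leave the dotplot itself untouched
    n_rows = len(sum_matrix)
    n_cols = len(sum_matrix[0]) if n_rows else 0
    # The last row and last column keep their dotplot values,
    # so both loops start at the second-to-last index.
    for r in range(n_rows - 2, -1, -1):
        for c in range(n_cols - 2, -1, -1):
            candidates = [sum_matrix[r + 1][c + 1]]   # diagonal neighbour
            candidates += sum_matrix[r + 1][c + 2:]   # rest of row r+1
            # Assumption: column c+1 is scanned here (the text says c+2).
            candidates += [sum_matrix[k][c + 1] for k in range(r + 2, n_rows)]
            sum_matrix[r][c] += max(candidates)
    return sum_matrix
```

Every cell depends only on rows below it, so the order of the inner loop does not affect the result.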

    • Step 3: Backtracking to find the required alignment

      We start from the top-left corner, which has the greatest value in the complete matrix. From there we move diagonally downwards to the next maximum value. If reaching the next maximum value requires moving down a column, we introduce a gap in the sequence on the top. If all values from the current cell to the end of a column are 0, we introduce a gap in the sequence on the left. Otherwise, we append the corresponding elements of the top and left sequences to first_seq_allign and sec_seq_allign. We repeat these steps until we reach the end of the matrix. The final alignments, computed with the Needleman-Wunsch algorithm (with a gap penalty of 0), are given by first_seq_allign and sec_seq_allign. The optimal alignments are displayed on the terminal, along with the dotplot and the matrix of sums.
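One possible reading of the backtracking in Step 3 is sketched below. It is not the script's code. The name traceback, the parameter names, the tie-breaking (the diagonal wins ties) and the handling of leftover residues are all assumptions. It starts at the top-left cell because the text does. At each step it looks at the diagonal neighbour, the rest of the next row, and the rest of the next column (column c+1, also an assumption), and it expects non-empty sequences.

```python
def traceback(sum_matrix, top_seq, left_seq):
    """Walk the sum matrix from the top-left cell and return two gapped strings."""
    n_rows, n_cols = len(left_seq), len(top_seq)
    first, second = [], []  # aligned top sequence, aligned left sequence

    def emit_skipped(top_part, left_part):
        # Residues skipped in one sequence are paired with gaps in the other.
        for ch in top_part:
            first.append(ch)
            second.append("-")
        for ch in left_part:
            first.append("-")
            second.append(ch)

    r, c = 0, 0
    while True:
        # Pair the residues of the current cell (match or mismatch).
        first.append(top_seq[c])
        second.append(left_seq[r])
        if r + 1 >= n_rows or c + 1 >= n_cols:
            break  # no cell lies below and to the right
        candidates = [(r + 1, c + 1)]
        candidates += [(r + 1, k) for k in range(c + 2, n_cols)]
        candidates += [(k, c + 1) for k in range(r + 2, n_rows)]
        # max() returns the first maximum, so the diagonal wins ties.
        nr, nc = max(candidates, key=lambda rc: sum_matrix[rc[0]][rc[1]])
        # Moving right skips top residues (gaps in the left sequence);
        # moving down skips left residues (gaps in the top sequence).
        emit_skipped(top_seq[c + 1:nc], left_seq[r + 1:nr])
        r, c = nr, nc
    # Whatever remains in either sequence is aligned against gaps.
    emit_skipped(top_seq[c + 1:], left_seq[r + 1:])
    return "".join(first), "".join(second)
```

For instance, with top sequence "ABC", left sequence "AC" and sum matrix [[2, 1, 0], [0, 0, 1]], the walk jumps from the top-left cell to the last cell of the second row, so B is paired with a gap.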
