appledora / BioInfo_running

Undergraduate work on Alignment and Assembly of paired-end IGH reads.

BioInfo_running

Summary of Work done so far

The overall process is explained here. The work is divided into **two** modules. In the *first*, information is extracted from the sequence files. In the *second*, the results are compared with the pRESTO output TailRC_assemble-pass.fastq. The datasets for the whole process are included as well.

MODULE 1

  • Take read1 and read2 from the sequence files. Reverse complement read2.
  • Find the LCS (Longest Common Substring) of the two sequences using a Suffix Automaton.
  • Calculate the region to merge around the LCS. Call it areaX from now on.

[figure]
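The first steps of Module 1 can be sketched as follows. This is a minimal illustration, not the repository's own code: the function names (`reverse_complement`, `build_suffix_automaton`, `longest_common_substring`) and the toy reads are assumptions made for the example. It builds a suffix automaton over read 1 and scans the reverse-complemented read 2 through it to find the longest contiguous shared stretch, together with its start index in each read.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (N is kept as N)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def build_suffix_automaton(s):
    """Build a suffix automaton of s as three parallel lists."""
    nxt, link, length = [{}], [-1], [0]  # state 0 is the initial state
    last = 0
    for ch in s:
        cur = len(nxt)
        nxt.append({})
        link.append(-1)
        length.append(length[last] + 1)
        p = last
        while p != -1 and ch not in nxt[p]:
            nxt[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split state q by cloning it with a shorter length.
                clone = len(nxt)
                nxt.append(dict(nxt[q]))
                link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(ch) == q:
                    nxt[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return nxt, link, length


def longest_common_substring(r1, r2):
    """Return (lcs, start index in r1, start index in r2)."""
    nxt, link, length = build_suffix_automaton(r1)
    state = cur_len = best_len = best_end = 0
    for i, ch in enumerate(r2):
        # Follow suffix links until a transition on ch exists.
        while state != 0 and ch not in nxt[state]:
            state = link[state]
            cur_len = length[state]
        if ch in nxt[state]:
            state = nxt[state][ch]
            cur_len += 1
        else:
            cur_len = 0  # back at the initial state with no match
        if cur_len > best_len:
            best_len, best_end = cur_len, i + 1
    lcs = r2[best_end - best_len:best_end]
    return lcs, r1.find(lcs), best_end - best_len


# Toy example: read2 is given as sequenced, so it is reverse complemented first.
read1 = "GATTACAGGCTT"
read2_rc = reverse_complement("ACGGTTAAGCCT")
lcs, index_r1, index_r2 = longest_common_substring(read1, read2_rc)
```

The two indices correspond to what the CSV calls Index_R1 and Index_R2; here the first occurrence in read 1 is reported, which is a choice made for this sketch.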

  • Produce final string in two ways:
    1. Merge areaX normally. In case of mismatches choose base depending on quality scores. -> FinalRead column in CSV
    2. Perform Global Alignment on areaX. Merge based on quality score. -> FinalReadGA column in CSV
  • Calculate the merge accuracy in areaX.
    1. In the normal case, [formula image] -> Total accuracy in region column in CSV
    2. In the Global Align case, [formula image] -> Total accuracy in region after GlobalAlign column in CSV

In most cases, these two columns have the same value.
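The quality-based merge of areaX might look like the sketch below. It assumes the two overlap segments have already been cut to equal length and that qualities are Phred+33 encoded. The accuracy formula in the original is only available as an image, so the one used here (percentage of positions where both reads agree) is an assumption, as are the function names `phred_scores` and `merge_overlap`.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(c) - offset for c in quality_string]


def merge_overlap(seq1, qual1, seq2, qual2):
    """Merge two equal-length overlap segments position by position.

    On a mismatch, the base with the higher quality score is kept
    (ties go to read 1). Returns (merged sequence, accuracy in percent).
    """
    if not (len(seq1) == len(seq2) == len(qual1) == len(qual2)):
        raise ValueError("overlap segments and qualities must have equal length")
    merged = []
    matches = 0
    for b1, q1, b2, q2 in zip(seq1, qual1, seq2, qual2):
        if b1 == b2:
            merged.append(b1)
            matches += 1
        else:
            merged.append(b1 if q1 >= q2 else b2)
    # ASSUMPTION: accuracy = share of agreeing positions; the source formula is not available.
    accuracy = 100.0 * matches / len(seq1) if seq1 else 0.0
    return "".join(merged), accuracy


# Read 2 has a low-quality base at the mismatching position, so read 1 wins there.
merged, accuracy = merge_overlap(
    "ACGT", phred_scores("IIII"),
    "ACCT", phred_scores("II#I"),
)
```

Breaking ties in favour of read 1 is arbitrary; a real merger may prefer to emit N or combine the two quality scores instead.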

MODULE 2

  • Read strings from pRESTO TailRC_assembled-pair.fastq.
  • Compare them with the FinalRead and FinalReadGA columns. -> pRESTO vs Code GA_score column
  • Calculate the number of strings that match (420 strings in this case). -> isMatch column
  • Align pRESTO strings with FinalRead. Calculate the number of indels and substitutions.
  • Calculate the accuracy of the alignment as: [formula image] -> Accuracy after aligning with pResto column
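The alignment step of Module 2 can be illustrated with a plain Needleman-Wunsch global alignment that also counts substitutions and indels during traceback. This is a sketch under stated assumptions: the scoring values (match 1, mismatch -1, gap -1), the function names, and the accuracy definition (matches divided by alignment length) are all hypothetical, since the original formula is only available as an image.

```python
def global_align_counts(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return (score, matches, substitutions, indels)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back one optimal path, preferring diagonal moves.
    i, j = n, m
    matches = subs = indels = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                if a[i - 1] == b[j - 1]:
                    matches += 1
                else:
                    subs += 1
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
        indels += 1
    return score[n][m], matches, subs, indels


def alignment_accuracy(matches, subs, indels):
    """ASSUMPTION: accuracy = matches / alignment length, in percent."""
    total = matches + subs + indels
    return 100.0 * matches / total if total else 0.0


# Hypothetical tool output vs. merged read: one base is missing in the second string.
ga_score, n_match, n_sub, n_indel = global_align_counts("ACGT", "ACT")
acc = alignment_accuracy(n_match, n_sub, n_indel)
```

The quadratic table is fine for reads of a few hundred bases; longer inputs would call for a banded or library implementation.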

pRESTO ->

1658/1702 : exact match
remaining : 100% > accuracy > 99%

NGmerge ->

1678/1701 : exact match
remaining : 100% > accuracy > 99%

The datasets used and the resulting CSV files are attached for both NGmerge and pRESTO. The datasets were produced using IgSim.

CSV column explanations =>

name : Sequence Name

LCS : LCS found with suffix automata

Index_R1 : index of where LCS started in Read 1

Index_R2 : index of where LCS started in Read 2

Prefix : in Read 1

Suffix : in Read 2

Overlap Score : N/a

Global Align Score in Overlapping area : N/a

InDels in GA, SUbstitutions in GA : N/a

Final Read : final output after using algorithm

Final Read Length : N/a

Final Read after GA : final output using only Global alignment without the algorithm

Final Read GA Length : N/a

Total accuracy in region : N/a

Total accuracy in region after GA : N/a

presto string/ng string : string obtained from tools

presto length: N/a

Matches: whether tool output matches algorithm output

pRESTO/Ng vs Code GA_score : global alignment score between tool and algorithm output

Accuracy after aligning with pResto/Ng : Accuracy of algorithm output with tool output
