- Take read1 and read2 from the sequence files. Reverse complement read2.
- Find the LCS (longest common substring) of the two sequences using a suffix automaton.
- Calculate the region to merge around the LCS. Call it areaX from now on.
- Produce final string in two ways:
  - Merge areaX normally. On a mismatch, choose the base according to the quality scores. -> FinalRead column in CSV
  - Perform global alignment on areaX, then merge based on the quality scores. -> FinalReadGA column in CSV
- Calculate the merge accuracy in areaX.
- Read strings from pRESTO TailRC_assembled-pair.fastq.
- Compare them with the FinalRead and FinalReadGA columns. -> pRESTO vs Code GA_score column
- Count the strings that match (420 strings in this case). -> isMatch column
- Align the pRESTO strings with FinalRead. Count the indels and substitutions.
- Calculate accuracy of alignment, as : -> Accuracy after aligning with pResto column
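
The first steps above (reverse complement, LCS search, quality-based merge) can be sketched in Python as follows. This is a minimal illustration, not the code behind the attached CSVs: the function names `reverse_complement`, `longest_common_substring` and `merge_by_quality` are hypothetical, quality strings are assumed to be Phred+33 ASCII, and ties on a mismatch are assumed to favour read 1. The suffix automaton here returns the longest contiguous common substring together with its start index in each read.

```python
def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string (N stays N)."""
    return seq.translate(str.maketrans("ACGTNacgtn", "TGCANtgcan"))[::-1]


def longest_common_substring(a: str, b: str):
    """Return (length, start_in_a, start_in_b) via a suffix automaton of a."""
    nxt, link, length, first_end = [{}], [-1], [0], [-1]
    last = 0
    for i, ch in enumerate(a):
        cur = len(nxt)
        nxt.append({})
        link.append(-1)
        length.append(length[last] + 1)
        first_end.append(i)  # first position in a where this state's strings end
        p = last
        while p != -1 and ch not in nxt[p]:
            nxt[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split q: the clone keeps the shorter strings of q.
                clone = len(nxt)
                nxt.append(dict(nxt[q]))
                link.append(link[q])
                length.append(length[p] + 1)
                first_end.append(first_end[q])
                while p != -1 and nxt[p].get(ch) == q:
                    nxt[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur

    # Feed b through the automaton, tracking the current match length.
    state, cur_len, best = 0, 0, (0, 0, 0)
    for j, ch in enumerate(b):
        while state != 0 and ch not in nxt[state]:
            state = link[state]
            cur_len = length[state]
        if ch in nxt[state]:
            state = nxt[state][ch]
            cur_len += 1
        else:
            cur_len = 0
        if cur_len > best[0]:
            best = (cur_len, first_end[state] - cur_len + 1, j - cur_len + 1)
    return best


def merge_by_quality(seq1: str, qual1: str, seq2: str, qual2: str) -> str:
    """Merge two equal-length, position-aligned fragments.

    Assumption: on a mismatch the base with the higher Phred+33 quality
    character wins, and ties go to read 1.
    """
    return "".join(
        b1 if b1 == b2 or q1 >= q2 else b2
        for b1, q1, b2, q2 in zip(seq1, qual1, seq2, qual2)
    )
```

For example, `longest_common_substring("ACGTACGGT", "TTACGGA")` returns `(5, 3, 1)`, the shared substring `TACGG`. Note that when read2 is reverse-complemented, its quality string has to be reversed as well before it is passed to `merge_by_quality`.
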
The dataset used and the resulting CSV files for both NgMerge and pRESTO are attached. The datasets were produced using IgSim.
name : Sequence name
LCS : LCS found with suffix automata
Index_R1 : index of where LCS started in Read 1
Index_R2 : index of where LCS started in Read 2
Prefix : in Read 1
Suffix : in Read 2
Overlap Score : N/a
Global Align Score in Overlapping area : N/a
InDels in GA, Substitutions in GA : N/a
Final Read : final output after using algorithm
Final Read Length : N/a
Final Read after GA : final output using only Global alignment without the algorithm
Final Read GA Length : N/a
Total accuracy in region : N/a
Total accuracy in region after GA : N/a
presto string/ng string : string obtained from tools
presto length: N/a
Matches: whether tool output matches algorithm output
pRESTO/Ng vs Code GA_score : global alignment score between tool and algorithm output
Accuracy after aligning with pResto/Ng : Accuracy of algorithm output with tool output
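
As an illustration of how the alignment-derived columns above (the global alignment score and the indel and substitution counts) could be computed, here is a minimal Needleman-Wunsch sketch. The function name `global_align` is hypothetical, and the scoring scheme (match +1, mismatch -1, gap -1) is an assumption; the scheme used to produce the attached CSVs is not stated here.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, indels, substitutions) for a global alignment of a and b.

    Assumed scoring: linear gap penalty, no affine gaps.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i: int, j: int) -> int:
        # Score for aligning a[i-1] against b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # match or substitution
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back one optimal path, preferring the diagonal move.
    i, j, indels, subs = n, m, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            if a[i - 1] != b[j - 1]:
                subs += 1
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            indels += 1
            i -= 1
        else:
            indels += 1
            j -= 1
    return score[n][m], indels, subs
```

With these assumed scores, `global_align("ACGT", "AGGT")` gives `(2, 0, 1)` (one substitution) and `global_align("ACGT", "ACT")` gives `(2, 1, 0)` (one indel). When several alignments share the best score, the counts depend on the traceback preference, so they may differ from another tool's report.
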