Molecular sequences are widely regarded as the definitive record of an organism's properties, but sifting through large volumes of high-confidence biological data is challenging. Recent methods are far better at localization and comparison, and they come with evaluation metrics that support those claims, but regex-based algorithms remain the logical foundation of the traditional approaches.
The information gained from sequence comparisons can give insight into the causes of observed phenotypes (this does not always hold, but it is a good starting point). The code has been tested for scalability but is still limited by Excel's capabilities, and it uses a custom similarity metric.
Given the demo Excel files, running the code should capture the following patterns in its outputs.
The first row of the input file must contain the comparison string, and each run iterates over k-mers of a single size only.
Match pattern
Mismatch pattern
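The match and mismatch patterns can be illustrated with a small sketch of regex-based k-mer comparison. This is not the project's code: the function names `kmers`, `compare`, and `similarity` are hypothetical, and the similarity score (the fraction of comparison-string k-mers found in the target) is an assumed stand-in for the project's custom metric, which is not described here.

```python
import re


def kmers(sequence, k):
    """Yield (offset, k-mer) for every window of length k in sequence."""
    for i in range(len(sequence) - k + 1):
        yield i, sequence[i:i + k]


def compare(reference, target, k):
    """Sort the reference k-mers into matches and mismatches against target."""
    matches, mismatches = [], []
    for offset, kmer in kmers(reference, k):
        # re.escape keeps the k-mer literal; the lookahead lets
        # finditer report overlapping occurrences as well.
        pattern = f"(?={re.escape(kmer)})"
        hits = [m.start() for m in re.finditer(pattern, target)]
        if hits:
            matches.append((offset, kmer, hits))
        else:
            mismatches.append((offset, kmer))
    return matches, mismatches


def similarity(reference, target, k):
    """Assumed metric: share of reference k-mers that occur in target."""
    matches, mismatches = compare(reference, target, k)
    total = len(matches) + len(mismatches)
    return len(matches) / total if total else 0.0


if __name__ == "__main__":
    found, missing = compare("ACGTAC", "TACGTT", k=3)
    print("matches:", found)
    print("mismatches:", missing)
    print("similarity:", similarity("ACGTAC", "TACGTT", k=3))
```

Only one value of `k` is used per call, mirroring the single-size k-mer restriction noted above. Scanning the target once per k-mer is simple but slow for long sequences.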
Although the code is extremely inefficient, the focus of this project was to introduce myself to the domain concepts, which I intend to carry into future projects.
The documentation for this project can be found here
The Python code for the project can be found here
Prerequisites
You will need the openpyxl library, which can be found here
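As a rough illustration of how openpyxl fits in, the sketch below reads a workbook whose first row holds the comparison string. The helper names `split_rows` and `load_sequences` are hypothetical, and the layout (one sequence per row, in the first column of the active sheet) is an assumption, not the documented format of the demo files.

```python
def split_rows(rows):
    """Split worksheet rows into (comparison string, other sequences).

    Assumes each row is a tuple of cell values with the sequence in
    the first cell; rows whose first cell is empty are skipped.
    """
    rows = list(rows)
    if not rows or not rows[0] or rows[0][0] is None:
        raise ValueError("the first row must contain the comparison string")

    def clean(value):
        return str(value).strip().upper()

    reference = clean(rows[0][0])
    sequences = [clean(r[0]) for r in rows[1:] if r and r[0] is not None]
    return reference, sequences


def load_sequences(path):
    """Read the active sheet of an .xlsx file (hypothetical helper)."""
    # Imported here so split_rows stays usable without openpyxl installed.
    from openpyxl import load_workbook  # pip install openpyxl

    workbook = load_workbook(path, read_only=True)
    try:
        sheet = workbook.active
        return split_rows(sheet.iter_rows(values_only=True))
    finally:
        workbook.close()
```

Calling `load_sequences` on a workbook path returns the comparison string and the list of remaining sequences, which can then be compared one k-mer size at a time.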