Regex-based Causal Gene Identifier using Python

Molecular sequences are regarded as the definitive record of an organism's properties, but sifting through large volumes of high-confidence biological data is challenging. Recent methods are far superior at localization and comparison, and they come with evaluation metrics that justify those claims, but regex-based algorithms remain the logical foundation of the traditional approaches.

The information gained from sequence comparisons can give insight into the causes of the observed phenotypes (this does not always hold, but it is a good starting point). The code has been tested for scalability but is still limited by what Excel can handle, and it uses a custom similarity metric.

Running the code on the given demo Excel files should capture the following patterns in the outputs.

The first row of the input file must contain the comparison string, and each run iterates over k-mers of a single size.
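As a rough illustration of what iterating fixed-size k-mers against a comparison string can look like, here is a minimal sketch. It is not the project's actual code: the function names `kmers` and `match_kmers`, the sample sequences, and the choice of a literal `re.search` per k-mer are all assumptions made for the example.

```python
import re


def kmers(sequence, k):
    """Yield (start, substring) for every contiguous substring of length k."""
    for start in range(len(sequence) - k + 1):
        yield start, sequence[start:start + k]


def match_kmers(comparison, candidate, k):
    """Split the k-mers of `candidate` into those found in `comparison` and those not.

    Assumption: a k-mer "matches" when it occurs anywhere in the comparison string.
    """
    matches, mismatches = [], []
    for start, kmer in kmers(candidate, k):
        # re.escape keeps the k-mer literal, so no character is read as a regex operator.
        if re.search(re.escape(kmer), comparison):
            matches.append((start, kmer))
        else:
            mismatches.append((start, kmer))
    return matches, mismatches


# Made-up sequences, only to show the shape of the result.
matches, mismatches = match_kmers("ATGCGTAC", "TGCGAA", k=3)
print(matches)     # [(0, 'TGC'), (1, 'GCG')]
print(mismatches)  # [(2, 'CGA'), (3, 'GAA')]
```

Only one value of k is used per call, which mirrors the single-size restriction described above. Trying several sizes would mean calling `match_kmers` once per size.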

Match pattern

Mismatch pattern

Although the code is extremely inefficient, the focus of this project was to introduce myself to the domain concepts, which will carry over to future projects.

Documentation

The documentation for this project can be found here

Run Locally

The Python code for the project can be found here

Prerequisites

You will need the openpyxl library, which can be found here
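openpyxl can be installed with `pip install openpyxl`. The sketch below shows one way an input file could be read with it, assuming the sequences sit in the first column of the active sheet with the comparison string in the first row. The helper names `split_rows` and `load_sequences` are hypothetical, and the column layout is an assumption rather than something the project specifies.

```python
def split_rows(rows):
    """Return (comparison, candidates) from an iterable of single-cell row tuples.

    Assumption: empty cells are skipped and sequences are normalized to upper case.
    """
    values = [str(row[0]).strip().upper() for row in rows if row and row[0] is not None]
    if not values:
        raise ValueError("the sheet has no sequences in its first column")
    return values[0], values[1:]


def load_sequences(path):
    """Read column A of the active sheet of an .xlsx file."""
    # Imported here so split_rows can be used and tested without openpyxl installed.
    from openpyxl import load_workbook

    workbook = load_workbook(path, read_only=True)
    try:
        rows = workbook.active.iter_rows(min_col=1, max_col=1, values_only=True)
        return split_rows(list(rows))
    finally:
        workbook.close()
```

A call such as `comparison, candidates = load_sequences("demo.xlsx")` (the file name is a placeholder) would then give the comparison string and the remaining sequences as a list.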

Socials plug

B.E.Pranav Kumaar

Student ID @Amrita Vishwa Vidyapeetham - CB.EN.U4AIE20052

🔥 twitter

⚡ LinkedIn

❄️ Github

About

The Regex-based Causal Gene Identifier is a simple, straightforward solution for gene identification using regular expressions. The goal is to parse Excel files and predict the most probable gene segment corresponding to a given phenotype.
