swarna0712 / Of-Genomes-and-Genetics-Data-Analytics

Predicting genetic disorder and disorder subclass based on familial history and its effects

Of-Genomes-and-Genetics-Data-Analytics

To run the code, place the test and train CSV files from this repository in the same folder as the notebook. Then open the of Genomes and Genetics.ipynb file in Google Colab, Jupyter Notebook, or any other suitable platform, and run it either cell by cell in order or with the Run All option.

Dataset

This study classifies genetic disorders as ‘Mitochondrial genetic inheritance disorders’, ‘Multifactorial genetic inheritance disorder’ or ‘Single-gene inheritance diseases’ based on the Disorder Subclass, which is one of ‘Cystic fibrosis’, ‘Diabetes’, ‘Leigh syndrome’, ‘Cancer’, ‘Tay-Sachs’, ‘Hemochromatosis’ and ‘Mitochondrial myopathy’.
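The relationship between the two labels can be written as a simple lookup. The sketch below assumes the usual medical grouping of the seven subclasses listed above, since this README does not spell the mapping out. The names `SUBCLASS_TO_DISORDER` and `disorder_for` are illustrative and are not taken from the notebook.

```python
# Assumed grouping of the seven disorder subclasses into the three
# genetic disorder classes (not confirmed by the repository text).
SUBCLASS_TO_DISORDER = {
    "Leigh syndrome": "Mitochondrial genetic inheritance disorders",
    "Mitochondrial myopathy": "Mitochondrial genetic inheritance disorders",
    "Diabetes": "Multifactorial genetic inheritance disorder",
    "Cancer": "Multifactorial genetic inheritance disorder",
    "Cystic fibrosis": "Single-gene inheritance diseases",
    "Tay-Sachs": "Single-gene inheritance diseases",
    "Hemochromatosis": "Single-gene inheritance diseases",
}


def disorder_for(subclass: str) -> str:
    """Return the genetic disorder class for a given disorder subclass."""
    return SUBCLASS_TO_DISORDER[subclass]


print(disorder_for("Tay-Sachs"))
```

With a lookup like this, a predicted subclass determines the disorder class directly, which is one way to read the "subclass first, then disorder" workflow described in this README.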

Detecting the genetic disorder subclass in children enables early medical intervention and so helps patients with disorders live a better quality of life. Our project predicts the genetic disorder subclass a child may have from hereditary factors, such as the presence of a certain defective gene in the mother, the father, or their families; the parents' ages; and pregnancy factors, such as periconceptional folic acid details, history of substance abuse, serious illness, anomalies in previous pregnancies, and exposure to radiation. Given, in addition, the outcomes of 5 masked tests, the presence of 5 masked symptoms, blood cell count, gender, the presence of birth asphyxia, and whether an autopsy shows any birth abnormalities in the patient, we predict what genetic disorder the child may have. Using a Gradient Boosting model, we predict the genetic disorder subclass with an accuracy of 75.9%, and we use that prediction to classify the genetic disorder the patient may have. Given the genetic disorder, the same model predicts the disorder subclass with an accuracy of 88.87%.
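The two settings described above (predicting the subclass from the features alone, and predicting it when the genetic disorder is already known) can be sketched with scikit-learn's `GradientBoostingClassifier`. This is a minimal illustration on synthetic data, not the notebook's code: the features come from `make_classification`, the grouping of 7 synthetic subclasses into 3 disorder classes is arbitrary, and the hyperparameters are assumptions. The accuracies it prints are unrelated to the figures reported in this README.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features; 7 classes mimic the 7 subclasses.
X, y_subclass = make_classification(
    n_samples=600, n_features=20, n_informative=10,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)

# Arbitrary (assumed) grouping of the 7 subclasses into 3 disorder classes.
y_disorder = y_subclass // 3
X_with_disorder = np.column_stack([X, y_disorder])

# Split every array with the same shuffle so the rows stay aligned.
X_tr, X_te, Xd_tr, Xd_te, y_tr, y_te = train_test_split(
    X, X_with_disorder, y_subclass,
    test_size=0.25, stratify=y_subclass, random_state=0,
)

# Setting 1: predict the subclass from the features alone.
features_only = GradientBoostingClassifier(n_estimators=50, random_state=0)
features_only.fit(X_tr, y_tr)
acc_features_only = accuracy_score(y_te, features_only.predict(X_te))

# Setting 2: predict the subclass when the disorder class is known.
with_disorder = GradientBoostingClassifier(n_estimators=50, random_state=0)
with_disorder.fit(Xd_tr, y_tr)
acc_with_disorder = accuracy_score(y_te, with_disorder.predict(Xd_te))

print(f"Subclass accuracy, features only:     {acc_features_only:.3f}")
print(f"Subclass accuracy, disorder provided: {acc_with_disorder:.3f}")
```

Knowing the disorder class narrows the candidate subclasses, which is consistent with the higher accuracy reported above for the second setting.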

The purpose of this study was to find a reliable way to predict the “Genetic Disorder” and to reduce the risk of prediction error when too few samples are available for modeling. The base models we used are LogisticRegression, XGBClassifier, RandomForestClassifier, GradientBoostingClassifier, SVM, DecisionTreeClassifier, and MLPClassifier, each with the best parameters found after tuning. The distribution of our modeling performance indicated that different datasets required different methods in order to give the best results.
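Tuning each base model and comparing the results could look like the sketch below. It assumes `GridSearchCV` as the tuning method, which this README does not confirm, and it uses synthetic data and small illustrative parameter grids. Only three of the listed models are included to keep the example short; XGBClassifier is left out because it requires the separate xgboost package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (3 classes, 10 features).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=6,
    n_classes=3, random_state=0,
)

# Each candidate pairs an estimator with an assumed, illustrative grid.
candidates = {
    "LogisticRegression": (
        LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]},
    ),
    "DecisionTreeClassifier": (
        DecisionTreeClassifier(random_state=0), {"max_depth": [3, 5, None]},
    ),
    "RandomForestClassifier": (
        RandomForestClassifier(random_state=0), {"n_estimators": [50, 100]},
    ),
}

# Tune every model with 3-fold cross-validation and keep its best result.
results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=3, scoring="accuracy")
    search.fit(X, y)
    results[name] = (search.best_score_, search.best_params_)

# Report models from best to worst cross-validated accuracy.
for name, (score, params) in sorted(
    results.items(), key=lambda item: item[1][0], reverse=True
):
    print(f"{name}: {score:.3f} with {params}")
```

Running the same loop on each target (genetic disorder and disorder subclass) would show whether one model wins on both or, as noted above, different datasets favor different methods.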