Machine Learning for early detection of Parkinson's disease from RNA Sequencing data

This project is part of the Galvanize Data Science Immersive capstone, carried out in collaboration with Simpatica Medicine, an artificial-intelligence-driven precision medicine startup. Please read project presentation.pdf for more details about the project.

Parkinson’s disease is a disorder of the nervous system that causes tremor, stiffness, and slowed movement. Although the disease cannot be cured, some of its symptoms can be alleviated if it is detected early. Next-generation sequencing technology has made it possible to sequence the entire human genome. One such technology, RNA sequencing, also makes it possible to measure which genes are being expressed in a cell. I worked with labelled data for Parkinson’s disease; the objective is to detect Parkinson’s disease in a patient given their RNA sequencing data.

The data is typical of biological datasets: small sample size and high dimensionality. To separate the signal from the noise, we need to focus on a subset of features, that is, a combination of genes that are actually relevant to the disease. This calls for dimensionality reduction. I ranked the features by random forest importance and used the most important ones to predict the disease. Finally, I built a Gradient Boosting classifier on the most important features combined with Principal Component Analysis (PCA), which resulted in 75% cross-validated classification accuracy.

For comparison, a research paper published in the journal Neurology, using neuropathologic findings as the reference, reported only 26% accuracy for a clinical diagnosis in untreated or not clearly responsive subjects, 53% accuracy in early Parkinson's patients responsive to medication (<5 years' duration), and >85% diagnostic accuracy for longer-duration, medication-responsive Parkinson's disease. Against the 26% and 53% figures, my machine learning model reaches 75% accuracy on data from only 53 patients, which compares favorably. This suggests that conventional diagnostic accuracy is poor in the early stages, when the symptoms of the disease are not fully formed, and that machine learning offers a real opportunity for more accurate, faster, and earlier diagnosis.

I further applied k-means clustering and found that almost 90% of the genes belong to one big cluster, while the remaining 10% are spread across several small clusters. This result led me to hypothesize that the genes in the small clusters behave differently and are important biomarkers of Parkinson’s disease. To test this hypothesis, I ran a random forest model on the genes from the minority clusters and found the cross-validated accuracy to be 50%, which is very low. The hypothesis was therefore not supported; this is a negative result.
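The workflow described above can be sketched roughly as follows. This is a minimal illustration with scikit-learn, not the project's actual code: the function names (`build_pipeline`, `cross_validated_accuracy`, `gene_cluster_sizes`), the number of genes kept, the number of PCA components, the number of clusters, and the synthetic data are all assumptions made for the example. In particular, "combined with PCA" is interpreted here as applying PCA to the selected genes, which may differ from what was actually done.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline


def build_pipeline(n_genes_kept=50, n_components=5, seed=0):
    """Random-forest feature selection -> PCA -> gradient boosting (hypothetical settings)."""
    # threshold=-inf together with max_features keeps exactly the top-ranked genes.
    selector = SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=seed),
        max_features=n_genes_kept,
        threshold=-np.inf,
    )
    return Pipeline([
        ("select", selector),
        ("pca", PCA(n_components=n_components, random_state=seed)),
        ("gb", GradientBoostingClassifier(random_state=seed)),
    ])


def cross_validated_accuracy(X, y, n_splits=5, seed=0):
    """Mean accuracy over stratified folds; selection and PCA are refit inside each fold."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(build_pipeline(seed=seed), X, y, cv=cv, scoring="accuracy")
    return float(scores.mean())


def gene_cluster_sizes(X, n_clusters=5, seed=0):
    """Cluster genes (columns of X) with k-means and return the size of each cluster."""
    # Transposing makes each gene one point, described by its expression across samples.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X.T)
    return np.bincount(labels, minlength=n_clusters)


# Synthetic stand-in for an expression matrix: 53 samples x 500 genes (NOT real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(53, 500))
y = np.array([0, 1] * 26 + [0])
X[y == 1, :10] += 1.0  # make the first 10 genes weakly informative

accuracy = cross_validated_accuracy(X, y)
sizes = gene_cluster_sizes(X)
print(f"cross-validated accuracy: {accuracy:.2f}")
print("genes per cluster:", sizes)
```

Keeping the feature selection and PCA inside the pipeline means they are refit on the training portion of every fold. With so few samples and so many genes, selecting features on the full dataset before cross-validating would leak label information and inflate the reported accuracy.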
