raysas / data-mining-gene-expression

Data Mining project (Fall 2023) involving the classification and clustering of SARS-CoV-2 gene expression RNA-seq data

Data Mining Project

Made with love & tears & enthusiasm

This is a multiphase project for the Data Mining course at LAU in Fall 2023. The project is divided into 3 phases; this repo contains:

Data: SARS-CoV-2 gene expression data from GEO. The data is available in the data folder.

Phases 2 and 3 use the same dataset, retrieved from GEO. The dataset is a collection of RNA-seq raw counts for 455 samples, covering 60,705 genes. Of the 455 samples, 417 have a clear COVID status (positive/negative). The dataset is available in the data folder.

Phase 2: Classification

In this phase, we preprocessed the data to clean it, then used 3 different classification algorithms to classify the samples as COVID-positive or COVID-negative. The algorithms used are:

  • Logistic Regression
  • Linear Discriminant Analysis
  • Quadratic Discriminant Analysis

We used different resampling techniques to evaluate the performance of the models and compare test errors. The resampling techniques used are:

  • Validation set approach 80/20
  • 5-fold cross validation (empirical k=5)
  • Leave one out cross validation (LOOCV)
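The three resampling schemes differ only in how sample indices are split into training and test sets. The sketch below illustrates that idea in plain Python; it is not the project's actual code. The function names, the fixed seed, and the use of 417 samples (the labeled subset mentioned above) are assumptions made for illustration.

```python
import random

def validation_split(n, test_frac=0.2, seed=0):
    """Shuffle indices 0..n-1 and split them once into (train, test)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_frac)
    return idx[n_test:], idx[:n_test]

def k_fold_splits(n, k=5, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

def loocv_splits(n):
    """LOOCV is k-fold with k = n: every sample is a test set of size 1."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]

n_samples = 417  # assumed: the samples with a clear covid status
train, test = validation_split(n_samples)       # 80/20 validation set
folds = list(k_fold_splits(n_samples, k=5))     # 5-fold CV
```

The test error for each scheme would then be the error of a model fitted on `train` and evaluated on `test`, averaged over the splits for the two cross-validation variants.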

Feature selection was applied as well to reduce the number of features and improve the performance of the models. The feature selection techniques used are:

  • Forward selection
  • Backward selection
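As a rough illustration of forward selection, the sketch below greedily adds whichever feature most improves a score and stops when no addition helps. The `score` callable, the toy weights, and the gene names are hypothetical stand-ins; in practice the score would be something like a cross-validated accuracy of one of the classifiers. Backward selection is the mirror image: start from all features and greedily remove the one whose removal helps most.

```python
def forward_selection(features, score, max_features=None):
    """Greedy forward selection.

    `score` is a hypothetical callable taking a list of features and
    returning a number where higher is better (e.g. CV accuracy).
    """
    selected = []
    remaining = list(features)
    best_score = float("-inf")
    while remaining and (max_features is None or len(selected) < max_features):
        # Score every candidate subset formed by adding one more feature.
        scored = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(scored, key=lambda t: t[0])
        if top_score <= best_score:
            break  # no single addition improves the score: stop
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score

# Toy score (assumption): two informative genes, small penalty per feature.
weights = {"geneA": 0.3, "geneB": 0.2}

def toy_score(subset):
    return 0.5 + sum(weights.get(f, 0.0) for f in subset) - 0.01 * len(subset)

selected, best = forward_selection(["geneA", "geneB", "geneC", "geneD"], toy_score)
```

With this toy score the procedure keeps the two informative genes and stops before adding the uninformative ones.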

Note that out of the ~60K genes, ~18K were left after normalization and preprocessing. We then applied a highest-variance filter, i.e. kept the top 100 genes with the largest between-sample variance, and ran the feature selection techniques on those 100 genes. Ideally we would have applied dimensionality reduction (PCA), which rests on the same idea of retaining the directions of greatest variance. However, this will be applied in phase 3.
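The highest-variance filter can be sketched as follows. This is a minimal illustration, assuming expression values are stored as a mapping from gene name to a list of normalized values (one per sample); the function name, data layout, and toy values are assumptions, not the project's code.

```python
from statistics import variance

def top_variance_genes(expr, n_top=100):
    """Return the n_top gene names with the largest between-sample variance.

    `expr` maps gene name -> list of normalized expression values,
    one value per sample (assumed layout).
    """
    ranked = sorted(expr, key=lambda g: variance(expr[g]), reverse=True)
    return ranked[:n_top]

# Toy data (made up for illustration): 4 genes measured in 4 samples.
toy = {
    "gene1": [1.0, 1.1, 0.9, 1.0],
    "gene2": [0.0, 5.0, 10.0, 2.0],
    "gene3": [3.0, 3.0, 3.0, 3.0],
    "gene4": [2.0, 4.0, 1.0, 6.0],
}
top2 = top_variance_genes(toy, n_top=2)
```

Genes that barely change across samples (such as `gene3` here) carry little information for separating classes, which is the motivation for dropping them before running forward or backward selection.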

Details on this phase here.

Phase 3: Classification and Clustering

In this phase we extended classification to include Decision Trees (DT) and moved to unsupervised learning approaches, including Dimensionality Reduction (DR) and Clustering.

Contributors: