Daniboy370 / Machine-Learning

Research project and hands-on work


This project explores the data in search of hidden patterns using classical (non-deep-learning) classification tools. The findings are analyzed in the context of clinical aspects using interactive data visualizations. In this way I hope to establish a robust understanding of machine learning in the context of healthcare.



Abstract

The project is based on the Wisconsin Breast Cancer Dataset (WBCD) and aims to implement ML algorithms for accurate diagnosis of cancerous cells.

                                         

Discrimination between malignant and benign cells can be obtained by extracting meaningful features before they are delivered to the classifier. The raw dataset (including indices and id) is randomly shuffled.
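As a minimal sketch of the loading and shuffling step, the snippet below uses the copy of the dataset bundled with scikit-learn. This is an assumption: the notebook may read the raw file instead, and the bundled copy has no id column. The `diagnosis` column name and the seed are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.utils import shuffle

# Assumption: scikit-learn's bundled copy of the dataset (30 features, no id column).
data = load_breast_cancer(as_frame=True)
df = data.frame.copy()

# In scikit-learn's encoding, target 0 is malignant and 1 is benign.
df["diagnosis"] = df["target"].map({0: "malignant", 1: "benign"})

# Shuffle the rows; random_state is an arbitrary seed used only for reproducibility.
df = shuffle(df, random_state=0).reset_index(drop=True)

print(df.shape)
print(df["diagnosis"].value_counts())
```

The bundled copy contains 569 samples, 357 benign and 212 malignant, so the classes are moderately imbalanced.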

         

Method

After thorough inspection of the data, the samples undergo principal component analysis (PCA). Each variable (feature) is drawn as a red arrow (after applying a scaling factor) that shows how it contributes to the principal components (PCs), the directions of maximal variance. Consider the following 3D demonstration:

                                   

The blue points are benign instances after compression (PCA) to 3D space; the orange points denote malignant instances. Note how the Concave points feature contributes strongly to the 1st PC's variance. In contrast, Fractal dimension and Symmetry contribute poorly to the 3rd PC.
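The projection and the per-feature arrows described above can be sketched roughly as follows. The sketch assumes standardized features and scikit-learn's `PCA`; the notebook's own preprocessing and arrow scaling factor may differ.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target

# Standardize so that no feature dominates the variance through its units alone.
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=3)
X_3d = pca.fit_transform(X_std)  # compressed samples, shape (n_samples, 3)

# Each row of components_ is one PC; transposed, each row holds one feature's
# loadings, i.e. the direction of that feature's arrow in a biplot.
loadings = pca.components_.T  # shape (n_features, 3)

for k in range(3):
    top = np.argsort(-np.abs(loadings[:, k]))[:3]
    names = [str(data.feature_names[i]) for i in top]
    ratio = pca.explained_variance_ratio_[k]
    print(f"PC{k + 1}: {ratio:.1%} of variance, largest loadings: {names}")
```

Features with large absolute loadings on a component are the ones whose arrows point most strongly along that axis.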

The ROC curve reflects a binary classifier's ability to discriminate between classes, using a probabilistic analysis. Each decision threshold corresponds to a point on the ROC curve, denoting the tradeoff between the true positive rate (TPR) and the false positive rate (FPR).
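To make the threshold-to-point relationship concrete, here is a small sketch built on scikit-learn's `roc_curve`. Logistic regression is only a stand-in classifier, and the 70/30 split and seed are arbitrary assumptions; none of these choices are taken from the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Stand-in probabilistic classifier (an assumption, not the project's model).
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# Predicted probability of label 1 (benign in scikit-learn's encoding).
scores = clf.predict_proba(X_test)[:, 1]

# Each entry of `thresholds` yields one (FPR, TPR) point on the curve.
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)
print(f"AUC = {auc:.3f}")
for f, t, th in list(zip(fpr, tpr, thresholds))[:5]:
    print(f"threshold={th:.3f}  FPR={f:.3f}  TPR={t:.3f}")
```

Lowering the threshold moves along the curve toward the upper right: more positives are caught (higher TPR) at the cost of more false alarms (higher FPR).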

                       

Results

The performance can be summarized by running the following pipeline:


All classifiers exhibit satisfactory results, but the SVM outperformed the others. To be honest, the relatively modest number of samples in the dataset may cause overfitting. Consequently, the classifiers' results showed slight sensitivity to the random initialization.
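One way to probe the sensitivity to random initialization is to repeat cross-validation under several seeds and compare the spread. The sketch below is illustrative only: the scaling step, the three classifiers with default hyperparameters, the 5-fold setup, and the seeds are all assumptions rather than the project's actual pipeline.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)


def make_models(seed):
    """Hypothetical set of classifiers, each preceded by feature scaling."""
    return {
        "SVM": make_pipeline(StandardScaler(), SVC()),
        "kNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
        "Tree": make_pipeline(
            StandardScaler(), DecisionTreeClassifier(random_state=seed)),
    }


seeds = [0, 1, 2]  # arbitrary seeds; each one changes the fold assignment
results = {}
for seed in seeds:
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    for name, model in make_models(seed).items():
        acc = cross_val_score(model, X, y, cv=cv).mean()
        results.setdefault(name, []).append(acc)

for name, accs in results.items():
    spread = max(accs) - min(accs)
    print(f"{name}: mean accuracy={np.mean(accs):.3f}, spread over seeds={spread:.3f}")
```

A spread that is not negligible relative to the gaps between classifiers suggests the ranking should be read with some caution on a dataset of this size.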

The following figure presents the data projected onto a 2D coordinate system, with different classification tools applied:

                                 

It is interesting to see how strongly the decision boundaries are influenced by the different classification methods 🧐.
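A rough sketch of how such a comparison could be computed: project the data to 2D, fit several classifiers, and evaluate each on a grid covering the plane. PCA as the projection, the three models, and the grid resolution are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Assumption: the 2D projection is PCA on standardized features.
X_2d = make_pipeline(StandardScaler(), PCA(n_components=2)).fit_transform(X)

# Regular grid covering the projected data, with a small margin.
pad = 1.0
xs = np.linspace(X_2d[:, 0].min() - pad, X_2d[:, 0].max() + pad, 100)
ys = np.linspace(X_2d[:, 1].min() - pad, X_2d[:, 1].max() + pad, 100)
xx, yy = np.meshgrid(xs, ys)
grid = np.c_[xx.ravel(), yy.ravel()]

# Hypothetical selection of classifiers to compare.
models = {
    "linear SVM": SVC(kernel="linear"),
    "RBF SVM": SVC(kernel="rbf"),
    "kNN": KNeighborsClassifier(n_neighbors=5),
}

regions, train_acc = {}, {}
for name, model in models.items():
    model.fit(X_2d, y)
    # Predicted class for every grid cell; the boundary is where the class changes.
    regions[name] = model.predict(grid).reshape(xx.shape)
    train_acc[name] = model.score(X_2d, y)
    print(f"{name}: training accuracy = {train_acc[name]:.3f}")

# Fraction of the plane on which two methods assign different classes.
disagree = float(np.mean(regions["linear SVM"] != regions["kNN"]))
print(f"linear SVM vs kNN disagree on {disagree:.1%} of the grid")
```

Each array in `regions` can be passed to a filled contour plot (for example matplotlib's `contourf(xx, yy, regions[name])`) to draw the decision regions behind the scattered samples.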

Citation

  • Wisconsin Breast Cancer Dataset (WBCD):
@misc{Dua:2019 ,
author = "Dua, Dheeru and Graff, Casey",
year = "2019",
title = "{UCI} Machine Learning Repository",
url = "http://archive.ics.uci.edu/ml",
institution = "University of California, Irvine, School of Information and Computer Sciences" }

Requirements

  • Google Colab, or Python 3.4 or later.

  • List of imported packages can be found in the first block of ML_Proj.ipynb.
