
Heart Disease Analysis

by: Aviv Farag, Hope Birdsong, Joshua Geller and Willie Hood

Abstract:

This project presents a data analytics approach to, and a breakdown of, a dataset of heart disease patient symptoms. Our analysis explores heart disease among a population of males and females between 29 and 77 years of age, using risk factors that determine its prevalence. We used popular Python libraries for data exploration and interpretation, and plotted several views of the dataset to show the audience which symptoms were most important. We also used TensorFlow and other Python machine learning packages to evaluate overall precision on the dataset.

Python Packages:

pandas
import pandas as pd
numpy
import numpy as np
matplotlib.pyplot
import matplotlib.pyplot as plt
tensorflow
import tensorflow as tf
seaborn
import seaborn as sns

sklearn:

sklearn.metrics:
1. f1_score
2. precision_score
3. recall_score
4. confusion_matrix
sklearn.linear_model: LogisticRegression
sklearn.model_selection: train_test_split
sklearn.pipeline: Pipeline
sklearn.ensemble: RandomForestClassifier
sklearn.decomposition: PCA
sklearn.preprocessing: StandardScaler

from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline 
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
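
As a rough illustration of how the sklearn imports above can fit together, here is a minimal sketch. It is not the notebook's actual code: the synthetic data from make_classification, the pipeline step names, n_components=5 and the evaluate helper are all assumptions made so the example runs on its own. The 20% test size, random_state of 42 and 10 estimators mirror the defaults listed in the Functions section.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in data (13 features, binary label), used only
# so the sketch runs without the heart disease data file.
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=42
)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# Hypothetical pipeline: standardize, reduce with PCA, fit logistic regression.
log_reg = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("clf", LogisticRegression(max_iter=1000)),
])
log_reg.fit(X_train, y_train)

# Random forest with 10 trees, fitted on the unscaled features.
forest = RandomForestClassifier(n_estimators=10, random_state=42)
forest.fit(X_train, y_train)


def evaluate(model, X_eval, y_eval):
    """Return precision, recall, F1 and the confusion matrix for a fitted model."""
    pred = model.predict(X_eval)
    return {
        "precision": precision_score(y_eval, pred),
        "recall": recall_score(y_eval, pred),
        "f1": f1_score(y_eval, pred),
        "confusion_matrix": confusion_matrix(y_eval, pred, labels=[0, 1]),
    }


results = {
    "logistic_regression": evaluate(log_reg, X_test, y_test),
    "random_forest": evaluate(forest, X_test, y_test),
}
for name, scores in results.items():
    print(name, scores["precision"], scores["recall"], scores["f1"])
```

Wrapping the scaler and PCA in a Pipeline means both are fitted on the training split only, so no information from the test rows leaks into preprocessing.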

Functions

outliers(data, column, outliers)
Removes outliers from a specific column in data. The outliers argument is a list, for example [25,75]; any data outside this range is removed.
clean_data(data)
Removes NaNs from data and replaces the values 1, 2, 3 and 4 in the "target" column with 1 (heart disease).
discretize(data, column, threshold)
Discretizes a specific column in a pandas DataFrame (data) according to a threshold value (double).
max_HR_percent(data, percent = 0.85)
Creates a new column in a pandas DataFrame (data). The new column contains 1 for patients who did not reach at least 85% (the default value of percent) of their target heart rate, and 0 otherwise. Target heart rate is calculated by the formula: Target Heart Rate = 220 - age
heatmap_cor(dataset, plot_title, method = "spearman")
Plots the correlation table as a heat map.
run_random_forest(x_train,x_test,y_train,y_test, estimator = 10)
Fits a random forest to the training data and returns the fitted model.
plot_confusion_matrix(y_test,x_pred,plot_title)
Plots a confusion matrix based on the test data and predictions.
ml_train_test_split(x,y,size = 0.20,rs = 42)
Splits data into training and test sets. The size of the test set is given by "size", and rs is the random_state argument of sklearn's train_test_split.
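
To make the descriptions above more concrete, here is a minimal sketch of how clean_data, discretize and max_HR_percent might be written. It is an illustration and not the notebook's implementation. The "age" and "thalach" (maximum heart rate achieved) column names, the new column name "below_target_hr", the strict "greater than" comparison in discretize and the three-row sample table are all assumptions.

```python
import numpy as np
import pandas as pd


def clean_data(data):
    """Drop rows containing NaNs and map target values 2, 3, 4 to 1."""
    cleaned = data.dropna().copy()
    cleaned["target"] = cleaned["target"].replace({2: 1, 3: 1, 4: 1})
    return cleaned


def discretize(data, column, threshold):
    """Replace a column with 1 where the value exceeds threshold, else 0.

    Assumption: the comparison is strictly greater than.
    """
    result = data.copy()
    result[column] = (result[column] > threshold).astype(int)
    return result


def max_HR_percent(data, percent=0.85):
    """Flag patients whose maximum heart rate is below percent of 220 - age.

    Assumption: "thalach" holds the maximum heart rate achieved, and the
    new column is named "below_target_hr".
    """
    result = data.copy()
    target_hr = 220 - result["age"]
    result["below_target_hr"] = (
        result["thalach"] < percent * target_hr
    ).astype(int)
    return result


# Hypothetical sample rows, made up for this example.
sample = pd.DataFrame({
    "age": [50, 60, 40, 70],
    "thalach": [150.0, 120.0, np.nan, 160.0],
    "target": [0, 2, 1, 4],
})

cleaned = clean_data(sample)        # drops the row with the missing heart rate
flagged = max_HR_percent(cleaned)   # adds the below_target_hr column
binned = discretize(cleaned, "age", 55)
print(flagged)
```

For the 60-year-old in the sample, the target heart rate is 220 - 60 = 160 and 85% of that is 136, so a maximum heart rate of 120 is flagged with 1.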

Setup and running the code:

Clone the repo using the following command in terminal:
git clone https://github.com/avivfaraj/DSCI521-project.git

After cloning the repo, open hd_analysis.ipynb and run the cells one at a time in the order in which they are presented. Alternatively, run the whole notebook in a single step from the menu: Cell -> Run All.

The first two sections of the notebook import the packages and define the functions that the rest of the code requires. Make sure to run those two sections before running anything else.

Acknowledgements

UCI Heart Disease Data Set

Creators:

Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

Donor: David W. Aha (aha '@' ics.uci.edu) (714) 856-8779

About

A Python program that analyzes a data set of attributes related to heart disease.