avivfaraj / DSCI521-project

A Python program that analyzes a data set containing attributes related to heart disease.


Heart Disease Analysis

by: Aviv Farag, Hope Birdsong, Joshua Geller and Willie Hood



Abstract:

This research presents a data analytics approach to, and a breakdown of, a dataset of heart disease patient symptoms. Our analysis explores heart disease among a population of males and females between 29 and 77 years of age, using risk factors that determine its prevalence. We used popular Python libraries for data exploration and interpretation, and plotted several views of the dataset to show which symptoms matter most. We also used TensorFlow and other Python machine learning packages to evaluate the overall precision of predictions on the dataset.


Python Packages:

  1. pandas
    import pandas as pd

  2. numpy
    import numpy as np

  3. matplotlib.pyplot
    import matplotlib.pyplot as plt

  4. tensorflow
    import tensorflow as tf

  5. seaborn
    import seaborn as sns

  6. sklearn:

    1. sklearn.metrics:
      1. f1_score
      2. precision_score
      3. recall_score
      4. confusion_matrix
    2. sklearn.linear_model: LogisticRegression
    3. sklearn.model_selection: train_test_split
    4. sklearn.pipeline: Pipeline
    5. sklearn.ensemble: RandomForestClassifier
    6. sklearn.decomposition: PCA
    7. sklearn.preprocessing: StandardScaler
    from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline 
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    
    

Functions

  1. outliers(data,column ,outliers)
    Removes outliers from a specific column in data. The outliers argument is a list, for example [25,75], and any data outside this range is removed.

  2. clean_data(data)
    Removes NaNs from data and replaces the values 1,2,3,4 in the "target" column with 1 (heart disease).

  3. discretize(data, column, threshold)
    Discretizes a specific column in a pandas dataframe (data) according to a threshold value (double).

  4. max_HR_percent(data, percent = 0.85)
    Creates a new column in a pandas dataframe (data). This new column contains 1 for patients who did not reach at least 85% (the default value of percent) of their target heart rate, and 0 otherwise. Target heart rate is calculated by the formula: Target Heart Rate = 220 - age

  5. heatmap_cor(dataset, plot_title, method = "spearman")
    Plot correlation table as a heat map.

  6. run_random_forest(x_train,x_test,y_train,y_test, estimator = 10)
    Fits a random forest to the training data and returns the fitted model.

  7. plot_confusion_matrix(y_test,x_pred,plot_title)
    Plot confusion matrix based on test data and predictions

  8. ml_train_test_split(x,y,size = 0.20,rs = 42)
    Splits data into training and test sets. The size of the test set is defined by "size", and rs is the random_state argument of sklearn's train_test_split method.
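
The data preparation helpers above could look roughly like the sketch below. This is a minimal illustration based only on the descriptions in this list, not the notebook's actual code. The column names "age", "thalach" (maximum heart rate achieved) and "chol" follow the UCI naming and are assumptions here, the output column name "below_target_hr" is hypothetical, and the rule that values above the threshold map to 1 in discretize is also an assumption.

```python
import numpy as np
import pandas as pd


def clean_data(data):
    """Drop rows containing NaNs and collapse target values 1-4 into 1."""
    cleaned = data.dropna().copy()
    # Any of the disease levels 1..4 becomes 1 (heart disease); 0 stays 0.
    cleaned.loc[cleaned["target"].isin([1, 2, 3, 4]), "target"] = 1
    return cleaned


def discretize(data, column, threshold):
    """Binarize a column. Assumption: values above the threshold become 1."""
    result = data.copy()
    result[column] = (result[column] > threshold).astype(int)
    return result


def max_HR_percent(data, percent=0.85):
    """Flag patients whose peak heart rate is below percent * (220 - age)."""
    result = data.copy()
    target_hr = 220 - result["age"]
    # "thalach" and "below_target_hr" are assumed / hypothetical column names.
    result["below_target_hr"] = (
        result["thalach"] < percent * target_hr
    ).astype(int)
    return result


# Tiny made-up sample, only to show the call order.
sample = pd.DataFrame(
    {
        "age": [50, 60, 40, 70],
        "thalach": [150.0, 120.0, np.nan, 140.0],
        "chol": [200.0, 260.0, 230.0, 300.0],
        "target": [0, 3, 1, 2],
    }
)

prepared = max_HR_percent(discretize(clean_data(sample), "chol", 240))
print(prepared)
```

Each helper returns a copy rather than modifying its input, so notebook cells can be re-run in any order without compounding changes. Whether the notebook's own functions work in place is not stated here.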


Setup and running the code:

Clone the repo using the following command in terminal:
git clone https://github.com/avivfaraj/DSCI521-project.git

After cloning the repo, open hd_analysis.ipynb and run each cell one at a time in the order that they are presented. You can run the whole notebook in a single step by clicking on the menu Cell -> Run All.

The first two sections are packages and functions which are required for the code to run. Make sure to run those two sections before running the program.
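
Once the packages and functions sections have been run, the modeling helpers could be combined along the lines of the sketch below. The function bodies are assumptions reconstructed from the signatures and descriptions in the Functions section, and the synthetic data merely stands in for the heart disease features so that the example is self-contained.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split


def ml_train_test_split(x, y, size=0.20, rs=42):
    """Thin wrapper around sklearn's train_test_split (assumed body)."""
    return train_test_split(x, y, test_size=size, random_state=rs)


def run_random_forest(x_train, x_test, y_train, y_test, estimator=10):
    """Fit a random forest on the training data and return it (assumed body).

    x_test and y_test are accepted only to match the documented signature.
    """
    model = RandomForestClassifier(n_estimators=estimator, random_state=42)
    model.fit(x_train, y_train)
    return model


# Synthetic stand-in data: 200 rows, 5 features, binary label.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 5))
y = (x[:, 0] + x[:, 1] > 0).astype(int)

x_train, x_test, y_train, y_test = ml_train_test_split(x, y)
model = run_random_forest(x_train, x_test, y_train, y_test)
y_pred = model.predict(x_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)
print("F1:", f1_score(y_test, y_pred))
```

With the default size of 0.20, 40 of the 200 synthetic rows end up in the test set. The resulting matrix is what plot_confusion_matrix would then draw.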


Acknowledgements

UCI Heart Disease Data Set

Creators:

  1. Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.

  2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.

  3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.

  4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

Donor: David W. Aha (aha '@' ics.uci.edu) (714) 856-8779


