Animesh1911 / Cardiovascular_Disease_Prediction

An end to end ML model to predict whether a person has cardiovascular disease or not based on various features.

Data

Dataset taken from Kaggle. Link - https://www.kaggle.com/sulianova/cardiovascular-disease-dataset

The dataset contains 70,000 rows.

Independent Features:

  • Age (in days)
  • Gender (1-Female, 2-Male)
  • Height, Weight
  • Systolic BP, Diastolic BP
  • Cholesterol - (1 normal, 2 above normal, 3 well above normal)
  • Glucose - (1 normal, 2 above normal, 3 well above normal)
  • Smoking, Alcohol intake, Physical activity

Output Feature: Cardio

Preprocessing and EDA

  • Converting Age from days to years.
  • Plotting boxplots, histograms, a correlation graph and count plots to gather insights from the data.
  • Encoding each level of cholesterol and glucose as a separate feature to see which level has the greatest impact on the output.
  • Removing outliers from the data using the interquartile range (IQR).
  • Scaling the data with StandardScaler.
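
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names (`age`, `ap_hi`, `ap_lo`, `height`, `weight`, `cholesterol`, `gluc`, `cardio`) are assumed from the Kaggle CSV, the 1.5 x IQR fence and the 365.25-day year are assumptions, and the sample rows are made up.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Assumed column names for the blood pressure and body measurements.
OUTLIER_COLUMNS = ["ap_hi", "ap_lo", "height", "weight"]


def remove_iqr_outliers(df, columns, k=1.5):
    """Keep rows lying within [Q1 - k*IQR, Q3 + k*IQR] for every listed column."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        keep &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]


def preprocess(df):
    out = df.copy()
    # Age is stored in days; convert to whole years (365.25 is an assumption).
    out["age"] = (out["age"] / 365.25).round().astype(int)
    out = remove_iqr_outliers(out, OUTLIER_COLUMNS)
    # One indicator column per cholesterol / glucose level.
    return pd.get_dummies(out, columns=["cholesterol", "gluc"], dtype=int)


# Made-up sample rows; the last one has an implausible systolic reading.
raw = pd.DataFrame({
    "age": [18393, 20228, 18857, 17623, 17474, 21914, 22113, 22584],
    "ap_hi": [110, 140, 130, 150, 100, 120, 130, 16020],
    "ap_lo": [80, 90, 70, 100, 60, 80, 80, 90],
    "height": [168, 156, 165, 169, 156, 151, 157, 178],
    "weight": [62, 85, 64, 82, 56, 67, 93, 95],
    "cholesterol": [1, 3, 3, 1, 1, 2, 3, 3],
    "gluc": [1, 1, 1, 1, 1, 2, 1, 3],
    "cardio": [0, 1, 1, 1, 0, 0, 0, 1],
})

clean = preprocess(raw)
features = clean.drop(columns=["cardio"])
scaled = StandardScaler().fit_transform(features)  # zero mean, unit variance per column
print(clean.shape, scaled.shape)
```

In a real pipeline the scaler would normally be fitted on the training split only and then reused for the test split and for deployment.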

CORRELATION PLOT BEFORE PREPROCESSING

[Figure: corr_before]

The main risk factors for cardiovascular disease are high blood pressure, being overweight, smoking and high cholesterol.

In this correlation graph, only age, weight and cholesterol appear to have some impact on the output variable, while blood pressure shows almost none. The missing blood pressure signal is due to the presence of outliers.
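
To see why a few extreme readings can hide a real relationship, here is a small sketch on synthetic data (none of it comes from the dataset; the distribution parameters and the value 16000 are made up for illustration). It compares the Pearson correlation between a blood-pressure-like feature and a binary label before and after ten implausible entries are injected.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic systolic readings and a label that genuinely depends on them.
systolic = rng.normal(125, 15, n)
cardio = (systolic + rng.normal(0, 15, n) > 125).astype(int)

corr_clean = np.corrcoef(systolic, cardio)[0, 1]

# Overwrite ten readings with an implausible value, mimicking data-entry errors.
corrupted = systolic.copy()
corrupted[:10] = 16000
corr_corrupted = np.corrcoef(corrupted, cardio)[0, 1]

print(f"clean: {corr_clean:.2f}, with outliers: {corr_corrupted:.2f}")
```

The handful of huge values dominates the variance of the feature, so the correlation collapses towards zero even though the underlying relationship is unchanged.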

Cholesterol and glucose each have 3 levels, so we should also find out which level affects the output the most.

PLOTS
There are many outliers present in the Systolic BP, Diastolic BP, Height and Weight columns.

[Figures: sys, dia, h, w]

Boxplots for Systolic BP, Diastolic BP, Height and Weight showing the outliers.

[Figures: gluc, chol, gender, smoke]

Some information obtained from the dataset:

  • 35% of the people are male.
  • 9% are smokers.
  • 5% report alcohol intake.
  • 80% are physically active.

We can see that as the levels of glucose and cholesterol increase, the chance of having cardiovascular disease may also increase.
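
One way to check a trend like this numerically is to compute the share of class 1 rows at each level. The snippet below is only a sketch on toy rows (not taken from the dataset), with the column names `cholesterol` and `cardio` assumed from the Kaggle CSV.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "cholesterol": [1, 1, 1, 1, 2, 2, 3, 3],
    "cardio":      [0, 0, 1, 0, 1, 0, 1, 1],
})

# Mean of a 0/1 column is the fraction of rows with cardio == 1.
rate_by_level = toy.groupby("cholesterol")["cardio"].mean()
print(rate_by_level)
```

The same `groupby` call works for the glucose column.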

CORRELATION PLOT AFTER OUTLIER REMOVAL

[Figure: corr_after]

After the preprocessing, we can clearly see which features contribute directly and indirectly to the output variable.

  1. Age, weight and level 3 cholesterol contribute positively to the output.
  2. Systolic BP and Diastolic BP contribute positively to the output.
  3. Level 1 cholesterol contributes negatively to the output.

The rest of the features have little effect on the output.

Each level of cholesterol is also highly correlated with the same level of glucose. This means that a person with level 3 cholesterol has a high chance of also having level 3 glucose. Systolic and diastolic BP behave similarly and are very highly correlated with each other.

Weight is also correlated with systolic and diastolic BP, so a person's BP may increase as their weight increases.


Training and Testing

After preprocessing, 61,774 rows were left. A train-test split of 80:20 was used.
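
An 80:20 split on that many rows could look like the sketch below. The placeholder arrays only stand in for the real features and labels, and `random_state` and `stratify` are assumptions added for reproducibility and class balance, not settings confirmed by the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 61774
X = np.zeros((n_rows, 1))    # placeholder feature matrix
y = np.arange(n_rows) % 2    # placeholder 0/1 labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% of the rows go to the test set
    random_state=42,   # assumed seed, any fixed value works
    stratify=y,        # assumed: keep the class ratio equal in both splits
)
print(len(X_train), len(X_test))
```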

5 different models were used for training and testing:

  • Logistic Regression
  • Random Forest
  • SVM
  • XGBoost
  • K Nearest Neighbours
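
A comparison loop over the models listed above might be sketched like this. It runs on synthetic data from `make_classification`, uses default hyperparameters, and leaves XGBoost out to keep the example dependency-light; `xgboost.XGBClassifier` follows the same `fit`/`predict` interface and could be added to the dictionary in the same way. None of the numbers it prints correspond to the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed data (sizes are arbitrary).
X, y = make_classification(n_samples=500, n_features=11, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
    "K Nearest Neighbours": KNeighborsClassifier(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "class_1_recall": recall_score(y_test, pred, pos_label=1),
    }

for name, scores in results.items():
    print(f"{name}: {scores}")
```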

(NOTE: The models have not yet been tuned using cross-validation. Will update soon.)

Comparison of the 5 models:

Accuracy

Class 1 Recall

ROC vs  MODELS

In medical classification, the main aim should be to reduce the number of false negatives, which means class 1 recall should be high: we do not want the model to predict that a person who has the disease (class 1) does not have it (class 0).
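
Class 1 recall is the fraction of true class 1 cases that the model also predicts as class 1, i.e. TP / (TP + FN). A minimal sketch with made-up labels (the helper name `class1_recall` is hypothetical):

```python
import numpy as np


def class1_recall(y_true, y_pred):
    """Recall for class 1: TP / (TP + FN). Assumes at least one true class 1 row."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    true_pos = np.sum((y_true == 1) & (y_pred == 1))
    false_neg = np.sum((y_true == 1) & (y_pred == 0))  # sick, but predicted healthy
    return true_pos / (true_pos + false_neg)


y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
print(class1_recall(y_true, y_pred))  # 3 true positives, 1 false negative -> 0.75
```

Note that the false positive at position 5 does not affect recall; it only lowers precision and accuracy.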

The XGBoost model outperforms all the other models in both accuracy and class 1 recall, so it is the model used for deployment.

Deployment

Saved the StandardScaler model and the XGBoost model for deployment.

Used Flask for the backend and HTML/CSS for the frontend.

When the user enters their data on the website, we store it in a numpy array. We scale this array with the saved StandardScaler and then pass it to the saved model to make a prediction.

The output is given as a percentage that indicates the estimated chance of the person having the disease.
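
The prediction path described above can be sketched without the web layer. Everything here is an assumption for illustration: the data is synthetic, `LogisticRegression` stands in for the XGBoost model (both expose `predict_proba`), pickle is used for saving although the actual serialization method is not stated, and `predict_percentage` is a hypothetical helper that a Flask route handler could call.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Training side: fit and save the scaler and a stand-in model.
X, y = make_classification(n_samples=300, n_features=11, random_state=0)
scaler = StandardScaler().fit(X)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X), y)
saved_scaler = pickle.dumps(scaler)
saved_model = pickle.dumps(model)


def predict_percentage(form_values, scaler_bytes, model_bytes):
    """Scale one row of user input and return the class 1 probability in percent."""
    loaded_scaler = pickle.loads(scaler_bytes)
    loaded_model = pickle.loads(model_bytes)
    row = np.array(form_values, dtype=float).reshape(1, -1)  # one sample, many features
    prob = loaded_model.predict_proba(loaded_scaler.transform(row))[0, 1]
    return round(float(prob) * 100, 2)


pct = predict_percentage(X[0], saved_scaler, saved_model)
print(f"Estimated chance of disease: {pct}%")
```

The input row must use the same feature order as the data the scaler was fitted on, otherwise the scaled values are meaningless.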

LINK - https://cardio-disease-prediction-api.herokuapp.com/

WEB APP

[Figure: app]


Languages

Jupyter Notebook 99.0%, HTML 0.6%, Python 0.5%