caesarmario / heart-disease-prediction-with-logistic-regression-SAS-studio

Heart disease prediction with logistic regression using SAS Studio. The dataset is taken from UCI Machine Learning about heart disease.

💕💔 Heart Disease EDA & Prediction 🔮

w/ Logistic Regression using SAS Studio 🖥


🖋 About Project:

👉 This dataset contains diagnostic information about heart disease patients. A machine learning model is needed to determine whether a person has heart disease or not.

📌 Objectives:

  • Explore the dataset using various types of visualizations.
  • Perform EDA on the given dataset.
  • Build a logistic regression model to predict heart disease status.

🧾 Dataset Description:

👉 There are 14 variables in this dataset:

  • 9 categorical variables, and
  • 5 continuous variables.

👉 The structure of the dataset:

| Variable Name | Description | Sample Data |
|---|---|---|
| Age | Patient age (in years) | 63; 37; ... |
| Sex | Sex of patient (1 = male; 0 = female) | 1; 0; ... |
| cp | Chest pain type (4 values: 0, 1, 2, 3) | 3; 1; 2; ... |
| trestbps | Resting blood pressure (in mm Hg) | 145; 130; ... |
| chol | Serum cholesterol (in mg/dl) | 233; 250; ... |
| fbs | Fasting blood sugar > 120 mg/dl (1 = true; 0 = false) | 1; 0; ... |
| restecg | Resting electrocardiographic results (values 0, 1, 2) | 0; 1; ... |
| thalach | Maximum heart rate achieved | 150; 187; ... |
| exang | Exercise induced angina (1 = yes; 0 = no) | 1; 0; ... |
| oldpeak | ST depression induced by exercise relative to rest | 2.3; 3.5; ... |
| slope | The slope of the peak exercise ST segment (values 0, 1, 2) | 0; 2; ... |
| ca | Number of major vessels (0-4) colored by fluoroscopy | 0; 3; ... |
| thal | Original UCI coding: 3 = normal; 6 = fixed defect; 7 = reversible defect (this dataset uses values 0 to 3) | 1; 3; ... |
| Target | Target column (1 = Yes; 0 = No) | 1; 0; ... |


📊 EDA:

Dataset Summary:

Dataset Summary 1
Dataset Summary 2

  • As mentioned above, there are 14 variables with 303 observations.


๐Ÿ” Univariate Analysis:

โ–ถ Univariate - Categorical:

  • sex (Gender)
    sex - UVC
    • The distribution of male patients are highest compared to female patients.

  • cp (Chest Pain Type)
    cp - UVC
    • Chest pain type 0 have the highest number compared to other types of chest pain.

  • fbs (Fasting Blood Sugar)
    fbs - UVC
    • It can be seen that the number of patients with fasting blood sugar less than 120 mg/dl have the highest numbers.

  • restecg (Resting Electrocardiographic Results)
    restecg - UVC
    • Resting electrocardiographic with results 1 and 0 has a higher distribution than result 2.
    • In addition, result 1 has the highest distribution compared to the other results.

  • exang (Exercise Induced Angina)
    exang - UVC
    • Patients with no exercise induced angina are the highest compared to patients with exercise induced angina.

  • slope (Slope of the Peak Exercise)
    slope - UVC
    • The distribution of slope 1 and 2 are almost the same.
    • Moreover, slope 2 has the highest distribution compared to others.

  • ca (Number of Major Vessels)
    ca - UVC
    • People with 0 major vessel has the highest distribution compared to others.

  • thal
    thal - UVC
    • Patients with 2 "thal" has the highest distribution compared to others.

  • target (Heart Diseases Status)
    target - UVC
    • The total number of patients that have heart diseases are higher than patients that have no heart diseases.


▶ Univariate - Numerical:

  • age (Patient Age)
    age - UNC
    • From the histogram and boxplot, this column looks approximately normally distributed. This is supported by its skewness value (-0.2).
    • The kurtosis value is -0.5, which indicates that the column is platykurtic.
    • In the Q-Q plot, the data values closely follow the 45-degree line, which means the data is likely normally distributed (as stated previously).

  • trestbps (Resting Blood Pressure in mm Hg)
    trestbps - UNC
    • From the histogram, this column is moderately right skewed. This is supported by its skewness value (0.7).
    • Some outliers are detected at the upper part of the boxplot.
    • At the upper part of the Q-Q plot, the data values move away from the 45-degree line (there is a gap between the points and the line), which means the data is likely moderately right skewed (as stated previously).
    • The kurtosis value is 0.9. SAS reports excess kurtosis, for which a normal distribution scores 0, so this positive value indicates that the column is mildly leptokurtic.

  • chol (Serum Cholesterol in mg/dl)
    chol - UNC
    • From the histogram, this column is highly right skewed. This is supported by its skewness value (1.1).
    • Some outliers are detected at the upper part of the boxplot.
    • At the upper part of the Q-Q plot, there is a gap between the points and the 45-degree line, which means the data is likely highly right skewed (as stated previously).
    • The kurtosis value is 4.5, which indicates that the column is leptokurtic.

  • thalach (Maximum Heart Rate)
    thalach - UNC
    • From the histogram, this column is moderately left skewed. This is supported by its skewness value (-0.5).
    • One outlier is detected at the bottom part of the boxplot.
    • At the bottom part of the Q-Q plot, there is a gap between the points and the 45-degree line, which means the data is likely moderately left skewed (as stated previously).
    • The kurtosis value is -0.06, which is close to 0, so the column is only very slightly platykurtic.

  • oldpeak
    oldpeak - UNC
    • From the histogram, this column is highly right skewed. This is supported by its skewness value (1.3).
    • Some outliers are detected at the upper part of the boxplot.
    • In the Q-Q plot, there is a gap between the points and the 45-degree line at the bottom part, which is consistent with a highly right skewed column (as stated previously).
    • The kurtosis value is 1.57. SAS reports excess kurtosis, for which a normal distribution scores 0, so this positive value indicates that the column is leptokurtic.
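The skewness and kurtosis readings above can be reproduced outside SAS. The following is a minimal Python sketch, not the project's SAS code. It assumes the bias-adjusted sample formulas and the excess-kurtosis convention (a normal distribution scores about 0). The cut-offs in `describe_skewness` are a common rule of thumb and an assumption here, and all function names are hypothetical.

```python
import math


def sample_skewness(values):
    """Bias-adjusted sample skewness (0 for a symmetric sample)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / std) ** 3 for x in values)


def excess_kurtosis(values):
    """Bias-adjusted sample excess kurtosis (about 0 for normal data)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    fourth = sum(((x - mean) / std) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction


def describe_skewness(skew):
    # Rule-of-thumb cut-offs (an assumption, not taken from the project).
    if abs(skew) < 0.5:
        return "approximately symmetric"
    side = "right" if skew > 0 else "left"
    level = "highly" if abs(skew) > 1 else "moderately"
    return f"{level} {side} skewed"


def describe_kurtosis(kurt):
    # Sign of the excess kurtosis decides the label.
    if kurt > 0:
        return "leptokurtic"
    if kurt < 0:
        return "platykurtic"
    return "mesokurtic"


# Values quoted in the bullets above for two of the columns.
for name, skew, kurt in [("age", -0.2, -0.5), ("chol", 1.1, 4.5)]:
    print(name, "->", describe_skewness(skew), "/", describe_kurtosis(kurt))
```

Passing a real column (for example the 303 `chol` values) to `sample_skewness` and `excess_kurtosis` should give figures comparable to the ones in the SAS output, provided the same formulas are in use.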


1๏ธโƒฃ EDA 1:

EDA1

2๏ธโƒฃ EDA 2:

EDA2

3๏ธโƒฃ EDA 3:

EDA3

4๏ธโƒฃ EDA 4:

EDA4

5๏ธโƒฃ EDA 5:

EDA5


⚙ Dataset Pre-processing:

  • In data pre-processing, one-hot encoding is performed on these columns:
    • cp (into cp_0, cp_1, cp_2, and cp_3)
    • thal (into thal_0, thal_1, thal_2, and thal_3)
    • slope (into slope_0, slope_1, and slope_2)
  • After one-hot encoding, the original columns (cp, thal, and slope) are dropped from the table.
  • Then, the observations are split into 80% train and 20% test sets using PROC SURVEYSELECT.
    Split Data
  • Next, the new column (Selected) is dropped from both the train and test data.
  • Finally, the target values in the test set are changed to NULL (missing) values.

Each step of data pre-processing is available in part no. 3 of the main.sas file.
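The pre-processing steps above can be sketched as follows. This is a standard-library Python illustration, not the project's SAS code, and it does not reproduce PROC SURVEYSELECT itself. The helper names, the seed, the synthetic rows, and the choice to round the training size up (which gives 243 of 303 rows) are all assumptions made for the example.

```python
import math
import random


def one_hot(rows, column, levels):
    """Replace `column` with indicator columns <column>_<level>."""
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}  # drop original
        for level in levels:
            new_row[f"{column}_{level}"] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return encoded


def split_train_test(rows, train_rate=0.8, seed=42):
    """Simple random split; test rows get their target blanked out."""
    rng = random.Random(seed)
    n_train = math.ceil(len(rows) * train_rate)  # rounding up is an assumption
    train_idx = set(rng.sample(range(len(rows)), n_train))
    train = [dict(r) for i, r in enumerate(rows) if i in train_idx]
    test = [dict(r, target=None) for i, r in enumerate(rows) if i not in train_idx]
    return train, test


# Synthetic stand-in for the 303 observations (values are made up).
rows = [
    {"cp": i % 4, "thal": i % 4, "slope": i % 3, "target": i % 2}
    for i in range(303)
]
for column, levels in [("cp", range(4)), ("thal", range(4)), ("slope", range(3))]:
    rows = one_hot(rows, column, levels)

train, test = split_train_test(rows)
print(len(train), len(test))
```

Blanking the target in the test rows mirrors the last step in the list: the model scores those rows without seeing their true status.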


๐Ÿ‘จโ€๐Ÿ’ป Logistic Regression:

โ–ถ Building Logistic Regression Model:

Summary LR - 1 Summary LR - 2 Summary LR - 3

  • [Image 1] - In train set, there are 243 observations (no missing values detected). In addition, the number of patients with and without heart disease are equally balanced.
  • [Image 2] - The "Model Convergence Status" is Satisified, indicates that the developed logistic regression is good predictor in predicting patients status. This convergence status also supported from smaller AIC value compared to SC value.
  • [Image 3] - p-value under the column "Pr > ChiSq", that not all variables are significant in the model. The p-value has to be less than 0.05 in order for the variable to be significantly impacting the variation in the heart disease status. (Example of great values for prediction: sex, cp_0, exang, etc.)
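For context on the AIC and SC values in [Image 2], the usual definitions are AIC = -2 log L + 2k and SC = -2 log L + k ln(n), where k is the number of estimated parameters and n the number of observations. The Python sketch below uses a hypothetical log-likelihood and parameter count (neither is taken from the project's output) to show that SC exceeds AIC whenever ln(n) > 2, which holds for the 243 training rows.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: -2 log L + 2k."""
    return -2.0 * log_likelihood + 2.0 * n_params


def sc(log_likelihood, n_params, n_obs):
    """Schwarz criterion (also called BIC): -2 log L + k ln(n)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


# Hypothetical fit: these two numbers are made up for illustration.
log_l = -80.0
k = 20
n = 243  # size of the train set reported above

print("AIC:", round(aic(log_l, k), 2))
print("SC: ", round(sc(log_l, k, n), 2))
print("penalty per parameter: AIC = 2, SC =", round(math.log(n), 3))
```

Both criteria are mainly useful for comparing candidate models fitted to the same data, where lower values are preferred.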

▶ Probability in Training:

Probability in Training

▶ Predictions on Test:

Probability in Test
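As a rough illustration of how a fitted logistic regression turns a row into a predicted probability and then a class, here is a minimal Python sketch. The intercept, the coefficients, and the 0.5 cut-off are hypothetical values chosen for the example and are not the estimates from the project's SAS output.

```python
import math


def predict_probability(intercept, coefficients, row):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_j * x_j))))."""
    logit = intercept + sum(coefficients[name] * row[name] for name in coefficients)
    return 1.0 / (1.0 + math.exp(-logit))


def classify(probability, cutoff=0.5):
    """Label a row as heart disease (1) when p reaches the cut-off."""
    return 1 if probability >= cutoff else 0


# Hypothetical estimates, for illustration only.
intercept = 1.8
coefficients = {"sex": -1.5, "cp_0": -1.2, "exang": -0.9}

patient = {"sex": 1, "cp_0": 0, "exang": 0}
p = predict_probability(intercept, coefficients, patient)
print(round(p, 3), classify(p))
```

Scoring the test set amounts to applying the same calculation to every row whose target was blanked out during pre-processing.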


📥 Output Delivery System:

  • The Output Delivery System (ODS) presents the output of a SAS program as a formatted report, which helps the user understand the results of the analysis more easily. In this case, the predictions are exported as a PDF file (.pdf).
  • The prediction report can be seen here.

Each step for creating the output (ODS) file is available in part no. 5 of the main.sas file.


🙌 Support me!

👉 If you find this project useful, please ⭐ this repository 😆!

🎈 Check out my work on Kaggle here using various machine learning models!


👉 More about myself: here


Languages

Language:SAS 100.0%