SC1015 Mini-Project (AY21/22 Semester 2)

This project was completed in partial fulfilment of the module SC1015 Introduction to Data Science & Artificial Intelligence.

It was done by SC8 Team 04, which consists of:

Choo Jin Cheng (U2121190C)
Chua Min Min (U2121126G)
Poh Shi Qian (U2122452J)

Date completed: 24 April 2022

Below is just a summary of our project. For more information, please read mini_project.ipynb.

Real-life Problem

Stroke can often be caused by an unhealthy lifestyle and other health problems. Are there any unconventional causes?

According to the World Stroke Organization (n.d.), stroke is a "leading cause of death and disability globally". In 2019 alone, 6.6 million people died from strokes of varying severity (American Heart Association, 2021).

While age and chronic health conditions like heart disease are commonly known to increase a person's chances of getting a stroke, there might be unconventional factors that lead to an otherwise healthy person getting a stroke. Hence, this project aims to uncover correlations, if any, between unconventional factors like marital status and a person's chance of getting a stroke.

Data Science Question

Do unconventional features help to better predict whether a person will have, or has already had, a stroke?

This is a Classification problem. Our goal is to find out if there is any unconventional feature that makes one more likely to get a stroke.

Dataset

The dataset was obtained from Kaggle. It has the following fields:

id: unique identifier
gender: "Male", "Female" or "Other"
age: age of the patient
hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
ever_married: "No" or "Yes"
work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
Residence_type: "Rural" or "Urban"
avg_glucose_level: average glucose level in blood
bmi: body mass index
smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
stroke: 1 if the patient had a stroke or 0 if not
*Note: "Unknown" in smoking_status means that the information is unavailable for this patient
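
As a rough illustration of how the categorical fields above could be turned into numbers before modelling, the sketch below label-encodes a binary field and one-hot encodes a non-binary one. It is dependency-free on purpose: the sample records are made up, and the helper names `label_encode` and `one_hot_encode` are hypothetical, so the notebook may well use a library encoder instead.

```python
# Made-up sample records that reuse two of the dataset's field names.
records = [
    {"ever_married": "Yes", "work_type": "Private"},
    {"ever_married": "No", "work_type": "children"},
    {"ever_married": "Yes", "work_type": "Self-employed"},
]


def label_encode(rows, field, mapping):
    """Replace a binary category with an integer code (hypothetical helper)."""
    return [{**row, field: mapping[row[field]]} for row in rows]


def one_hot_encode(rows, field):
    """Replace a multi-valued category with one 0/1 column per observed value."""
    categories = sorted({row[field] for row in rows})
    encoded_rows = []
    for row in rows:
        # Copy every field except the one being expanded.
        new_row = {key: value for key, value in row.items() if key != field}
        for category in categories:
            new_row[f"{field}_{category}"] = int(row[field] == category)
        encoded_rows.append(new_row)
    return encoded_rows


# Binary field -> 0/1 label; non-binary field -> one column per category.
labelled = label_encode(records, "ever_married", {"No": 0, "Yes": 1})
encoded = one_hot_encode(labelled, "work_type")

for row in encoded:
    print(row)
```

One-hot encoding avoids implying an order between categories such as "Private" and "children", which a single integer label would do.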

Unconventional variables

ever_married
work_type
Residence_type
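
All three unconventional variables are categorical, and the project lists a chi-square test as its correlation check for categorical features. The sketch below shows how such a test statistic can be computed by hand for a 2x2 contingency table of `ever_married` against `stroke`. The counts are made up purely for illustration and are not results from the dataset; the function name `chi_square_statistic` is hypothetical, and the notebook may rely on a statistics library instead.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Made-up counts. Rows: ever_married = "No", "Yes". Columns: stroke = 0, 1.
table = [
    [90, 10],
    [70, 30],
]

statistic = chi_square_statistic(table)

# Critical value of the chi-square distribution for 1 degree of freedom
# at the 5% significance level.
CRITICAL_VALUE = 3.841

print(f"chi-square statistic: {statistic:.2f}")
print("reject independence" if statistic > CRITICAL_VALUE else "fail to reject independence")
```

A statistic above the critical value suggests the two variables are not independent, but it does not by itself say how strong the association is or which way it runs.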

What did we do in this project?

Exploratory Data Analysis on the features

Plotting of graphs

Statistical summaries

Simple calculations

Correlation checks

Data Cleaning and Preparation

Removal of rows/columns

Replacement of values

Encoding (Label/One-Hot)

Post-cleaning Work

Correlation checks

Feature Selection (SelectKBest)

Machine Learning

k-Nearest Neighbors

XGBoost

Artificial Neural Network

Naive Bayes

Conclusion

New things we tried!

Chi-square test - this is to check for correlation between categorical features
One-Hot encoding - this is for categorical features that are non-binary
SelectKBest feature selection - this is to provide insights on variable importance
Synthetic Minority Over-sampling Technique (SMOTE) - this is to compensate for our heavily imbalanced data
k-Nearest Neighbors - model
XGBoost - model
Artificial Neural Network - model
Naive Bayes - model
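
To make the SMOTE item above more concrete, here is a minimal sketch of its core idea: each synthetic minority sample is placed at a random point on the line segment between a real minority sample and one of its nearest minority-class neighbours. This is a simplified, dependency-free illustration rather than the implementation used in the notebook; the function name `smote_sketch`, the choice of `k`, and the sample points are all assumptions made for the example.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=2, seed=0):
    """Create synthetic points by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # The k nearest minority samples to `base`, excluding `base` itself.
        neighbours = sorted(
            (point for point in minority if point is not base),
            key=lambda point: math.dist(base, point),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Made-up (age, avg_glucose_level) pairs standing in for the minority class (stroke = 1).
minority = [(60.0, 200.0), (70.0, 110.0), (75.0, 90.0), (80.0, 150.0)]

new_points = smote_sketch(minority, n_synthetic=6)
for point in new_points:
    print(point)
```

Because distances are computed on raw values here, the feature with the larger range dominates the neighbour search; scaling the features first would be a sensible step in practice. Oversampling is also normally applied to the training split only, so that synthetic points do not leak into the evaluation data.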

Conclusion

Naive Bayes is the most suitable model for this dataset
Unconventional features can help to better predict if a person will have, or has already had, a stroke
'work_type' is the most significant unconventional feature, followed by 'ever_married' and 'Residence_type'

References:

American Heart Association (2021). 2021 Heart Disease & Stroke Statistical Update Fact Sheet Global Burden of Disease. Professional Heart Daily. https://professional.heart.org/-/media/PHD-Files-2/Science-News/2/2021-Heart-and-Stroke-Stat-Update/2021_Stat_Update_factsheet_Global_Burden_of_Disease.pdf
Bariatric Department at Lafayette General Medical Center (2019). How Obesity Affects Stroke Risk. Ochsner Lafayette General. https://ochsnerlg.org/about-us/news/how-obesity-affects-stroke-risk
Huang, Y., Xu, S., Hua, J., Zhu, D., Liu, C., Hu, Y., Liu, T. & Xu, D. (2015). Association between job strain and risk of incident stroke: A meta-analysis. Neurology, 85(19), 1648-1654. https://doi.org/10.1212/WNL.0000000000002098
WebMD (2021). Top 10 Causes of Strokes - Risk Factors and How You Can Lower Your Risks. WebMD. https://www.webmd.com/stroke/guide/stroke-causes-risks
World Stroke Organization (n.d.). Learn about stroke. World Stroke Organization. https://www.world-stroke.org/world-stroke-day-campaign/why-stroke-matters/learn-about-stroke
Wyller, T. B. (1999). Stroke and gender. The journal of gender-specific medicine : JGSM : the official journal of the Partnership for Women's Health at Columbia, 2(3), 41–45.
