ThompsonBethany01/Predicting-Diabetes-Onset

About the Project
a. Goals
b. Background
c. Deliverables
d. Project Outline
e. Acknowledgments
Data Dictionary
a. Original Dataframe
b. Added Features
Initial Thoughts & Hypotheses
a. Thoughts
b. Hypotheses
Project Steps
a. Acquire
b. Prepare
c. Explore
d. Model
e. Conclusions
How to Reproduce
a. Steps
Author

About the Project

Goals

The major goal of this project is to create a machine learning model that can predict whether or not a patient has diabetes. The model bases its prediction on other diagnostic measures in the data, such as BMI and age. The data sample covers female patients at least 21 years of age with Pima Native American heritage.

Background

According to the U.S. Department of Health and Human Services here,

"Early detection and treatment of diabetes is an important step toward keeping people with diabetes healthy. It can help to reduce the risk of serious complications such as premature heart disease and stroke, blindness, limb amputations, and kidney failure... Many people with type 2 diabetes have no signs or symptoms, but do have risk factors... Early diagnosis of diabetes and pre-diabetes is important so that patients can begin to manage the disease early and potentially prevent or delay the serious disease complications that can decrease quality of life."

Deliverables

Jupyter notebook with full analysis process
- Titled Data_Analysis within this repo; can also click here
Presentation on key insights and model performance
- View Canva presentation here
Tableau Storybooks
- Visualizing Clusters here

Project Outline

README: Project description, outline, etc.
Data_Analysis.ipynb: Complete Data Science pipeline of the project
Prepare.py: Module holding functions to prepare the dataframe

Acknowledgments

Data from UCI Machine Learning here.

Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261--265). IEEE Computer Society Press.

Data Dictionary

Included in Original Data

Feature Name	Description
Outcome	Binary class for diabetic patient or non-diabetic patient
Pregnancies	Number of times pregnant
Glucose	Plasma glucose concentration at 2 hours in an oral glucose tolerance test
Blood Pressure	Diastolic blood pressure (mm Hg)
Skin Thickness	Triceps skin fold thickness (mm)
Insulin	2-Hour serum insulin (mu U/ml)
BMI	Body mass index: weight in kg/(height in m)^2
Diabetes Pedigree Function	Measure of genetic influence
Age	Age of patient in years

Domain Knowledge

Glucose
An oral glucose tolerance test measures blood glucose after not eating for at least 8 hours and 2 hours after drinking a glucose-containing beverage. This test is used to diagnose diabetes (200 mg/dl and above) or pre-diabetes (between 140 mg/dl and 199 mg/dl). read more here
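The thresholds above can be written as a small helper. This is only a sketch: the function name `classify_ogtt` and the label strings are hypothetical and are not part of the project code.

```python
def classify_ogtt(glucose_mg_dl):
    """Label a 2-hour oral glucose tolerance test result given in mg/dl.

    Thresholds follow the text above: 200 and above suggests diabetes,
    140 to 199 suggests pre-diabetes. The label strings are illustrative.
    """
    if glucose_mg_dl >= 200:
        return "diabetes"
    if glucose_mg_dl >= 140:
        return "pre-diabetes"
    return "below pre-diabetes range"


# Example readings (made-up values)
labels = [classify_ogtt(value) for value in (120, 150, 210)]
```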

Blood Pressure
High blood pressure means that blood is pumping through the heart and blood vessels with too much force. Over time, consistently high blood pressure tires the heart muscle and can enlarge it. Diabetes damages arteries and makes them targets for hardening, called atherosclerosis. That can cause high blood pressure. It is believed that factors such as BMI and diet contribute to both conditions. read more here

Skin Thickness
Triceps (back side of the middle upper arm) - A skinfold caliper is used to assess the skinfold thickness, so that a prediction of the total amount of body fat can be made. This method is based on the hypothesis that the body fat is equally distributed over the body and that the thickness of the skinfold is a measure for subcutaneous fat. read more here

Insulin
During prolonged fasting, when the patient's glucose level is reduced to <40 mg/dL, an elevated insulin level plus elevated levels of proinsulin and C-peptide suggest insulinoma. Insulin levels generally decline in patients with type 1 diabetes mellitus. In the early stage of type 2 diabetes, insulin levels are either normal or elevated. In the late stage of type 2 diabetes, insulin levels decline. In normal individuals, insulin levels parallel blood glucose levels. read more here

BMI (Body Mass Index)
BMI is used to determine obesity along with the skinfold thickness test. The World Health Organization (WHO) defines BMI as weight in kilograms divided by the square of your height in metres (kg/m^2). High BMI is a risk factor for diabetes. read more here
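As a quick illustration of the WHO definition, the formula is a single division. The function name `bmi` and the example weight and height are assumptions chosen for illustration.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Hypothetical patient: 70 kg and 1.75 m tall
example_bmi = bmi(70, 1.75)
```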

Diabetes Pedigree Function
The hereditary risk one might have with the onset of diabetes mellitus. read more here

Features Created

New features were created using pandas qcut to build equal-sized bins, or KMeans to build clusters on one or two features. Clusters were then split into dummy variables.

Feature Name	Description
age_bins	4 bins based on Age: (21, 24] < (24, 29] < (29, 41] < (41, 81] labeled 1,2,3,4 respectively
bmi_bins	3 bins based on BMI: (19, 29] < (29, 35] < (35, 67] labeled 1,2,3 respectively
bp_bins	3 bins based on blood pressure: (24, 68] < (68, 76] < (76, 122] labeled 1,2,3 respectively
high_bmi_bp	Boolean if patient has BMI in levels 2 or 3 and Blood Pressure in level 3
age_bmi_cluster	Cluster created on scaled train features Age and BMI
pregnancy_cluster	Cluster created on scaled train feature Pregnancies
insulin_and_glucose_cluster	Cluster created on scaled train features Insulin and Glucose
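The binning and clustering described above might look roughly like the sketch below. The toy values, the number of clusters (3), and the `n_init` setting are assumptions; the actual project settings live in Prepare.py and are not shown here.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Toy data standing in for the train split (values are made up).
train = pd.DataFrame({
    "Age": [21, 23, 25, 28, 30, 35, 42, 50, 60, 22, 27, 45],
    "BMI": [22.0, 31.5, 28.1, 36.2, 33.0, 25.4, 40.1, 29.9, 34.7, 27.3, 38.8, 30.2],
})

# qcut cuts on quantiles, so each bin holds roughly the same number of rows.
train["age_bins"] = pd.qcut(train["Age"], q=4, labels=[1, 2, 3, 4])

# Scale first so Age and BMI contribute on the same 0-1 range.
scaled = MinMaxScaler().fit_transform(train[["Age", "BMI"]])

# Cluster count is an assumption for this sketch.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=123)
train["age_bmi_cluster"] = kmeans.fit_predict(scaled)

# One dummy column per cluster label.
dummies = pd.get_dummies(train["age_bmi_cluster"], prefix="age_bmi_cluster")
train = pd.concat([train, dummies], axis=1)
```

Fitting the scaler and KMeans on train only, then reusing them on validate and test, keeps information from the held-out splits out of the features.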

Initial Thoughts & Hypotheses

Thoughts

Research has shown that diabetes has many risk factors - health risks that increase a patient's predisposition to the disease. These include having a higher body mass index. Will this be reflected in the data from this project?

Hypotheses

Hypothesis - Age vs. Outcome

Null hypothesis: Age does not influence the rate of diabetes diagnosis.   
Alternative hypothesis: As age increases, so does the rate of diabetes diagnosis (in female patients +21 with Pima Indian heritage). 
Test: Pearson correlation coefficient   
Results: With a p-value less than alpha and a correlation coefficient of .24, we reject the null hypothesis.
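A test of this kind can be run with `scipy.stats.pearsonr`. The sketch below uses made-up values and assumes alpha = 0.05, which the README does not state; it does not reproduce the reported coefficient of .24.

```python
from scipy import stats

# Toy data (made up): patient ages and binary Outcome labels.
age = [21, 22, 24, 25, 29, 33, 41, 45, 52, 60]
outcome = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

# pearsonr returns the correlation coefficient and a two-sided p-value.
r, p_value = stats.pearsonr(age, outcome)

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha
```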

Hypothesis - Body Mass Index vs. Outcome

Null hypothesis: There is no significant difference between BMI and diabetes diagnosis.   
Alternative hypothesis: Populations with higher BMI have a significantly higher rate of diabetes (in female patients +21 with Pima Indian heritage).  
Test: One-tailed, one-sample T-test  
Results: With a p-value less than alpha and a t-value of 3.0, we reject the null hypothesis.
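One way to run a one-tailed, one-sample T-test is to compare the BMI of diabetic patients against the overall mean BMI. That framing, the toy values, and alpha = 0.05 are assumptions made for this sketch; the notebook may set the test up differently.

```python
from scipy import stats

# Hypothetical overall mean BMI for the whole sample.
overall_mean_bmi = 32.0

# Toy BMI values (made up) for patients with Outcome == 1.
diabetic_bmi = [35.1, 33.4, 38.0, 36.2, 34.9, 37.5, 33.8, 36.6]

# ttest_1samp returns a two-sided p-value by default.
t_stat, p_two_sided = stats.ttest_1samp(diabetic_bmi, popmean=overall_mean_bmi)

# Convert to a one-tailed p-value for the "greater than" alternative.
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2

alpha = 0.05  # assumed significance level
reject_null = p_one_sided < alpha
```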

Hypothesis - Blood Pressure vs. Outcome

Null hypothesis: There is no significant difference between blood pressure and diabetes diagnosis.  
Alternative hypothesis: Populations with higher blood pressure have a significantly higher rate of diabetes (in female patients +21 with Pima Indian heritage).    
Test: One-tailed, one-sample T-test  
Results: With a p-value greater than alpha, we fail to reject the null hypothesis.

Project Steps

Acquire

Data acquired from Kaggle here. The dataframe is saved as a csv file and has over 700 observations. The nine features in the original dataframe are diagnostic measures of the patients (observations). Null values appear to be filled with 0.

Prepare

Functions to prepare the dataframe are stored within the Prepare.py module. The module functions:

replace 0 with the feature mean where appropriate (i.e. a patient cannot have a 0 BMI)
bin features by pandas qcut and kmeans clustering
split into train, validate, test (70% - 20% - 10% respectively)
scale the df's using MinMaxScaler
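The preparation steps listed above can be sketched as follows. This is not the code in Prepare.py: the toy dataframe, the choice to impute with the mean of the non-zero values, the unstratified split, and fitting the scaler on train only are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy dataframe (made-up values) with two zeros standing in for missing BMI.
rng = np.random.default_rng(123)
df = pd.DataFrame({
    "Glucose": rng.integers(70, 200, size=100),
    "BMI": rng.uniform(19, 45, size=100).round(1),
    "Outcome": rng.integers(0, 2, size=100),
})
df.loc[[3, 10], "BMI"] = 0

# Replace impossible zeros with the mean of the non-zero values.
nonzero_mean = df.loc[df["BMI"] != 0, "BMI"].mean()
df["BMI"] = df["BMI"].replace(0, nonzero_mean)

# Split 70% / 20% / 10%: hold out 10% for test, then 2/9 of the rest for validate.
train_validate, test = train_test_split(df, test_size=0.10, random_state=123)
train, validate = train_test_split(train_validate, test_size=2 / 9, random_state=123)

# Fit the scaler on train only, then apply it to every split.
features = ["Glucose", "BMI"]
scaler = MinMaxScaler().fit(train[features])
train_scaled = scaler.transform(train[features])
validate_scaled = scaler.transform(validate[features])
test_scaled = scaler.transform(test[features])
```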

Explore

During exploration, I looked at the interaction of independent features vs. Outcome, independent vs. independent features, and cluster subgroups vs. Outcome.

Model

First, a baseline model was created as a point of comparison for the performance of the models that follow. The baseline was based on the most common outcome from the train df - 0 (not diabetic). Using 0 as the prediction for each observation, the baseline was 66% accurate on train. Note: Because each observation is predicted negative, the recall rate is 0%.
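The baseline arithmetic can be checked with a few lines of plain Python. The label list below is a toy stand-in built to have 66% negative cases, matching the reported figure; it is not the real train df.

```python
from collections import Counter

# Toy labels: 66 non-diabetic (0) and 34 diabetic (1) observations.
y_train = [0] * 66 + [1] * 34

# The baseline predicts the most common class for every observation.
most_common = Counter(y_train).most_common(1)[0][0]
baseline_pred = [most_common] * len(y_train)

# Accuracy: share of predictions that match the true label.
accuracy = sum(p == y for p, y in zip(baseline_pred, y_train)) / len(y_train)

# Recall on the positive class: true positives / all actual positives.
true_pos = sum(p == 1 and y == 1 for p, y in zip(baseline_pred, y_train))
actual_pos = sum(y == 1 for y in y_train)
recall = true_pos / actual_pos
```

Because no observation is ever predicted positive, `true_pos` is 0 and recall is 0 regardless of how many diabetic patients are in the data.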

Various classification models were created by fitting to the train df. Models evaluated on train were:

Decision Tree
Random Forest
K-Nearest Neighbors
Ridge Classifier
SGD Classifier

Models evaluated on the validate df were:

Decision Tree
Random Forest
K-Nearest Neighbors

Final Model

Random Forest was the final model selected. It performed the best not only on accuracy, but also on recall and precision for Positive (predicted diabetic) cases. Because False Negative cases are the most harmful, emphasis was placed on selecting a model that did best on Recall.

Model	RandomForest	(max_depth=5, random_state=123)	'Glucose', 'Age', 'BMI', 'insulin_glucose_cluster', 'DiabetesPedigreeFunction'
DF	Accuracy	Recall on Positive (predicting diabetic)	Precision on Positive (predicting diabetic)
Train	86%	75%	84%
Validate	78%	64%	73%
Test	75%	63%	70%
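A model with the hyperparameters in the table can be fit and scored along these lines. The data here is synthetic (`make_classification`), so the numbers it produces will not match the table; the helper name `report` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the five selected features and the Outcome column.
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=123
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=123
)

# Same hyperparameters as listed in the table above.
model = RandomForestClassifier(max_depth=5, random_state=123)
model.fit(X_train, y_train)


def report(X_part, y_part):
    """Return accuracy plus recall and precision on the positive class."""
    pred = model.predict(X_part)
    return {
        "accuracy": accuracy_score(y_part, pred),
        "recall": recall_score(y_part, pred, pos_label=1),
        "precision": precision_score(y_part, pred, pos_label=1),
    }


train_metrics = report(X_train, y_train)
validate_metrics = report(X_val, y_val)
```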

How It Works

A random forest model contains many decision trees that operate together. Each tree is trained on a random sample of the observations drawn with replacement ("bagging") and considers a random subset of features at each split before making its own prediction. The outcome with the most votes becomes the prediction.
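The voting idea can be sketched by hand with individual decision trees. This is a simplified illustration on synthetic data, not how the project's model was built: each tree is fit on a bootstrap sample of rows, considers a random subset of features at each split, and the majority vote across trees is the prediction. The tree count of 25 is an arbitrary odd number chosen to avoid ties.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for the real features).
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=123
)
rng = np.random.default_rng(123)

trees = []
for i in range(25):
    # Bootstrap sample: draw row indices with replacement.
    rows = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" makes each split consider a random feature subset.
    tree = DecisionTreeClassifier(max_depth=5, max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[rows], y[rows]))

# Each row of votes holds one tree's 0/1 predictions for every observation.
votes = np.array([tree.predict(X) for tree in trees])

# Majority vote: predict 1 when more than half of the trees say 1.
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
```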

Conclusions

Glucose had the highest impact on modeling, followed by age and BMI. However, only one cluster subgroup was significantly important in the final model, which was a cluster based on glucose. Next steps include creating different clusters to improve model performance.

The final model chosen was a random forest with max depth of 5. The model not only performed best on accuracy (predicting patient Outcome correctly) but also on positive case recall and precision (predicting a patient Outcome of diabetic correctly). This is important because the longer a diabetic patient goes without a diagnosis, the more complications can arise such as blindness and limb amputation.

How to Reproduce

~~Go over this Readme.md file.~~ ✅
Download Data_Analysis.ipynb, Prepare.py, and the dataset in your working directory.
Run this notebook.

Author

Bethany Thompson
