covid-19 data-science data-visualization data-analysis machine-learning

Estimating the Probability of Confirmed COVID-19 Cases Taking into the Intensive Care Unit (ICU)

This repository contains the slides and code for this project.

This project was carried out by Eda AYDIN and Zilan EROL under the supervision of Engin Deniz ALPMAN in the Data Science for the Public Good program.

The dataset for this project was obtained from Kaggle: COVID-19 - Clinical Data to assess diagnosis.

Notebook of this project: Notebook

Note: The datasets used in the project comply with health-ethics rules, and their license permits this use.

A. BUSINESS UNDERSTANDING

Business Goal Declaration

The COVID-19 pandemic has exposed the inadequacy of health systems across the whole world and has caused us to rethink how the health system is organized.

Sentences we hear every day as case numbers rise, such as

“The occupancy of the intensive care unit is increasing”,
“The intensive care units are full.”

motivated this project.

Based on these statements, this project aims to determine whether a COVID-19 patient will need treatment in the intensive care unit, so that the capacity of the health services system can be preserved.

We have two purposes here.

Our first purpose is to give tertiary hospitals the most accurate answer possible, based on the available data, about whether a patient will need intensive care support. In this way, intensive care resources can be organized or patient transfers can be planned.
Our second purpose is to give local and temporary hospitals an accurate answer, based on a subsample of widely available data, about which patients will not need intensive care support. Thus, physicians on the front line can safely discharge patients and monitor them remotely.


Our Business Problem

In line with our business goal declaration, our aim is to predict ICU admission for confirmed COVID-19 cases.

B. DATA UNDERSTANDING

This dataset consists of:
- Patient demographic information
- Patients' previous diseases (grouped)
- Blood results
- Vital signs
  - Diastolic Blood Pressure
  - Systolic Blood Pressure
  - Heart Rate
  - Respiratory Rate
  - Temperature
  - Oxygen Saturation

This dataset includes:
- Number of rows: 1925
- Number of inpatients: 385
- Number of features: 230 + 1 (target)
- Each patient's medical record is spread across 5 rows, one per window interval.
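
The figures above are consistent with one another: 385 patients with 5 window rows each give 1925 rows. As a minimal sketch of how this layout could be checked with pandas, the example below builds a tiny synthetic frame with the same shape. The column names `PATIENT_VISIT_IDENTIFIER`, `WINDOW` and `ICU` and the window labels are assumptions made for illustration and may not match the notebook.

```python
import pandas as pd

# Assumed window labels (hypothetical spelling of the five intervals).
WINDOWS = ["0-2", "2-4", "4-6", "6-12", "ABOVE_12"]

# Tiny synthetic stand-in for the real dataset: 3 patients x 5 windows.
toy = pd.DataFrame({
    "PATIENT_VISIT_IDENTIFIER": [pid for pid in range(3) for _ in WINDOWS],
    "WINDOW": WINDOWS * 3,
    "ICU": [0, 0, 0, 0, 0,   # never admitted
            0, 0, 1, 1, 1,   # admitted in window 4-6
            1, 1, 1, 1, 1],  # admitted in window 0-2
})

n_rows = len(toy)
n_patients = toy["PATIENT_VISIT_IDENTIFIER"].nunique()
rows_per_patient = toy.groupby("PATIENT_VISIT_IDENTIFIER").size()

# Every patient should contribute exactly one row per window.
print(n_rows, n_patients, rows_per_patient.unique().tolist())
```

On the real data, the same three quantities would be expected to come out as 1925, 385 and 5.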

0 stands for negative and 1 stands for positive.

Patient with ID 1 was admitted to the ICU within the first 2 hours of hospital admission (0-2).
Patient with ID 2 was admitted to the ICU more than 12 hours after hospital admission (ABOVE_12).
Patient with ID 11 was admitted to the ICU between 6 and 12 hours after hospital admission (6-12).
Patient with ID 12 was not admitted to the ICU.
Patient with ID 14 was admitted to the ICU between 4 and 6 hours after hospital admission (4-6).

PATIENT - ICU - WINDOW RELATIONSHIP (CUMULATIVE)

Number of patients admitted to the ICU between:
- 0-2 hours: 32
- 2-4 hours: 27
- 4-6 hours: 40
- 6-12 hours: 31
- above 12 hours: 65
Number of patients admitted to the ICU: 195
Number of patients not admitted to the ICU: 190
Number of patients who returned to a non-ICU state after ICU admission: 0
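
The counts above can be derived from the per-window `ICU` flag. The following is a minimal sketch on synthetic data, not the notebook's code: it finds each patient's first window with ICU = 1, counts admitted and never-admitted patients, and counts patients whose flag drops from 1 back to 0. The column names and window labels are assumptions, and `summarize_icu` is a hypothetical helper.

```python
import pandas as pd

WINDOWS = ["0-2", "2-4", "4-6", "6-12", "ABOVE_12"]  # assumed labels
ORDER = {w: i for i, w in enumerate(WINDOWS)}
PID = "PATIENT_VISIT_IDENTIFIER"  # assumed patient ID column name


def summarize_icu(df):
    """Summarize first ICU admission window per patient (hypothetical helper)."""
    ordered = df.assign(order=df["WINDOW"].map(ORDER)).sort_values([PID, "order"])
    admitted = ordered[ordered["ICU"] == 1]
    # Rows are sorted by window, so the first row per patient is the earliest.
    first_window = admitted.groupby(PID)["WINDOW"].first()
    # A drop from 1 to 0 between consecutive windows means the patient left the ICU.
    left_icu = ordered.groupby(PID)["ICU"].diff().lt(0)
    return {
        "per_window": first_window.value_counts().reindex(WINDOWS, fill_value=0).to_dict(),
        "admitted": int(first_window.size),
        "never_admitted": int(df[PID].nunique() - first_window.size),
        "returned": int(ordered.loc[left_icu, PID].nunique()),
    }


# Synthetic example: 4 patients x 5 windows.
toy = pd.DataFrame({
    PID: [pid for pid in range(4) for _ in WINDOWS],
    "WINDOW": WINDOWS * 4,
    "ICU": [0, 0, 0, 0, 0,   # never admitted
            0, 0, 1, 1, 1,   # first admitted in 4-6
            1, 1, 1, 1, 1,   # first admitted in 0-2
            0, 0, 0, 0, 1],  # first admitted in ABOVE_12
})
summary = summarize_icu(toy)
print(summary)
```

Applied to the real dataset, the per-window counts should sum to the number of admitted patients, as they do in the list above (32 + 27 + 40 + 31 + 65 = 195).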

ICU distribution by number of patients

Window Intervals, ICU = 1 Distribution by patient number

C. DATA ANALYSIS

Preparatory Data Analysis

Data Dropping
- Column Uniqueness: Remove the duplicate columns
- Row Uniqueness: Reorganize all columns by patient number; drop patient rows where ICU = 1 in window 0-2, since they cannot be used in the modelling part.
- Drop Illogical Rows: No illogical rows were found.
- Drop Null-Target Rows: Drop the records of patient ID 199.
Data Splitting (Train / Test)
Outlier Handling: The dataset was already scaled, so no further action is needed.
Missing Data Handling: Fill all NaN values using the ffill and bfill methods.
Feature Engineering
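
Two of the steps above, removing duplicate columns and filling NaN values with ffill and bfill, can be sketched as follows. This is an illustration on synthetic data under stated assumptions: the column names are hypothetical, and filling within each patient's rows (rather than across the whole table) is an assumption, since the text does not say how the fill was grouped.

```python
import numpy as np
import pandas as pd

PID = "PATIENT_VISIT_IDENTIFIER"  # assumed patient ID column name

toy = pd.DataFrame({
    PID: [0, 0, 0, 1, 1, 1],
    "HEART_RATE_MEAN": [np.nan, 0.2, np.nan, 0.5, np.nan, np.nan],
    "HEART_RATE_MEAN_COPY": [np.nan, 0.2, np.nan, 0.5, np.nan, np.nan],  # duplicate column
    "ICU": [0, 0, 1, 0, 0, 0],
})

# Column uniqueness: keep only the first of any columns with identical values.
is_duplicate = toy.T.duplicated().to_numpy()
deduped = toy.loc[:, ~is_duplicate].copy()

# Missing data: forward-fill, then backward-fill, within each patient's rows
# so that values never leak from one patient to another (assumption).
feature_cols = [c for c in deduped.columns if c not in (PID, "ICU")]
filled = deduped.copy()
filled[feature_cols] = (
    deduped.groupby(PID)[feature_cols].transform(lambda s: s.ffill().bfill())
)

print(filled)
```

Running ffill before bfill means a missing value takes the most recent earlier measurement when one exists, and only falls back to a later measurement for gaps at the start of a patient's record.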

Exploratory Data Analysis

Visualizations After Preparatory Data Analysis
- Age Percentile by patients
- ICU Distribution by Window Intervals
- Do age and disease grouping influence whether a patient is taken to an ICU bed?
Visualizations for Specific Features After Preparatory Data Analysis
- Time-Variant Features (Max)
- Time-Variant Features (Min)

D. DATA MODELLING

WINDOW 1 (0-2 Hours)

Data Modelling Process for Window 1

Preprocessing Data → Retrieve all patient data for the 0-2 hour interval
Correlations → Identify the important features that affect the ICU value in the 0-2 hour range
- Respiratory Rate is the most important feature.
Feature Encoding → Handle categorical features
Data Modelling
- KNN → Accuracy: 74%
- Random Forest Classifier → Accuracy: 78%
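
The model comparison in this step can be sketched with scikit-learn as below. This is a minimal illustration, not the notebook's code: the data comes from `make_classification`, and the split ratio and hyperparameters are assumptions, so the scores will not reproduce the accuracies reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data standing in for one window's features.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hyperparameters are illustrative defaults, not the authors' settings.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in scores.items():
    print(f"{name}: {acc:.2%}")
```

Stratifying the split keeps the ICU / non-ICU ratio the same in the training and test sets, which matters when one class is much smaller than the other.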

WINDOW 2-3 (2-6 HOURS)

Data Modelling Process for Window 2-3

Preprocessing Data → Retrieve all patient data for the 2-6 hour interval | Remove the patients with ICU = 1 in Window 1 to prevent bias
Correlations → Identify the important features that affect the ICU value in the 2-6 hour range
- Respiratory Rate is the most important feature.
Feature Encoding → Handle categorical features
Data Modelling
- KNN → Accuracy: 72%
- Random Forest Classifier → Accuracy: 84%
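
The preprocessing step for this window, keeping only the current interval and removing patients already admitted in an earlier window, can be sketched as follows. The column names, window labels and the `select_window` helper are assumptions made for illustration.

```python
import pandas as pd

WINDOWS = ["0-2", "2-4", "4-6", "6-12", "ABOVE_12"]  # assumed labels
PID = "PATIENT_VISIT_IDENTIFIER"  # assumed patient ID column name


def select_window(df, current, earlier):
    """Keep rows in `current` windows for patients with no ICU == 1 in `earlier` windows."""
    in_earlier = df["WINDOW"].isin(earlier)
    already_admitted = df.loc[in_earlier & (df["ICU"] == 1), PID].unique()
    keep = df["WINDOW"].isin(current) & ~df[PID].isin(already_admitted)
    return df[keep]


# Synthetic example: 3 patients x 5 windows.
toy = pd.DataFrame({
    PID: [pid for pid in range(3) for _ in WINDOWS],
    "WINDOW": WINDOWS * 3,
    "ICU": [0, 0, 0, 0, 0,   # never admitted
            0, 0, 1, 1, 1,   # admitted in window 4-6
            1, 1, 1, 1, 1],  # admitted in window 0-2
})

# Window 2-3 (2-6 hours): exclude anyone already in the ICU during 0-2 hours.
window_2_3 = select_window(toy, current=["2-4", "4-6"], earlier=["0-2"])
print(window_2_3)
```

The same helper would cover the later window by passing `current=["6-12", "ABOVE_12"]` and `earlier=["0-2", "2-4", "4-6"]`.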

WINDOW 4-5 (ABOVE 6 HOURS)

Data Modelling Process for Window 4-5

Preprocessing Data → Retrieve all patient data above 6 hours | Remove the patients with ICU = 1 between 0 and 6 hours to prevent bias
Correlations → Identify the important features that affect the ICU value above 6 hours
- Respiratory Rate is the most important feature.
Feature Encoding → Handle categorical features
Data Modelling
- KNN → Accuracy: 96%
- Random Forest Classifier → Accuracy: 98%
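
The correlation step used in each window can be sketched as ranking features by the absolute value of their correlation with the ICU target. The data below is synthetic and deliberately constructed so that one feature tracks the target, purely to show the mechanics; it is not evidence for the finding above. The feature names are hypothetical, and Pearson correlation is an assumption, since the text does not name the measure used.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
icu = rng.integers(0, 2, size=n)

# Synthetic features with different, hand-picked strengths of association.
toy = pd.DataFrame({
    "RESPIRATORY_RATE_MEAN": icu + rng.normal(0, 0.5, size=n),   # built to track ICU
    "HEART_RATE_MEAN": 0.3 * icu + rng.normal(0, 1.0, size=n),   # weak association
    "TEMPERATURE_MEAN": rng.normal(0, 1.0, size=n),              # unrelated noise
    "ICU": icu,
})

# Pearson correlation of every feature with the target, strongest first.
ranking = toy.corr()["ICU"].drop("ICU").abs().sort_values(ascending=False)
print(ranking)
```

Taking the absolute value matters because a strongly negative correlation is as informative as a strongly positive one.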

E. MODEL EVALUATION

Random Forest gave better results than KNN in every window.
The untuned Random Forest Classifier (RFC) models gave the same results as the tuned RFC models.
As the number of patients admitted to the intensive care unit increased over time, accuracy increased accordingly: Random Forest accuracy rose from 78% (0-2 hours) to 84% (2-6 hours) to 98% (above 6 hours).
The most important factor in the likelihood of being admitted to the intensive care unit is the patient's respiratory rate.
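
A comparison between an untuned and a tuned Random Forest can be sketched as below. The text does not say how tuning was done, so the use of `GridSearchCV`, the parameter grid and the synthetic data are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for one window's features and ICU target.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline: default hyperparameters.
default_rfc = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Tuned: small illustrative grid, 3-fold cross-validation on the training split.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
search.fit(X_train, y_train)

default_acc = default_rfc.score(X_test, y_test)
tuned_acc = search.best_estimator_.score(X_test, y_test)
print(f"default: {default_acc:.2%}, tuned: {tuned_acc:.2%}, best: {search.best_params_}")
```

Both models are scored on the same held-out split, so any difference reflects the hyperparameters rather than the data each model saw.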

F. RESOURCES

Notebook Resources

Website Resources

Stack Overflow
Pandas and Seaborn Documentation
GeeksforGeeks
Others
Book Resources
- Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow
- Introduction to Machine Learning
- Data Science from Scratch
- Python for Data Science Handbook
- Practical Statistics for Data Scientists
- Python for Data Analysis
- Fundamentals of Data Visualization


Apache License 2.0
