sindhri / titanic

Titanic survival prediction from the Kaggle competition.


Summary

This project uses Machine Learning to predict the survival outcome of individual passengers on the Titanic, based on data from a Kaggle competition.

  • End-to-end Python-based predictive modeling
  • Logistic Regression, K-Nearest Neighbors, Decision Tree, Random Forest, Support Vector Classification (SVC), XGBoost
  • Cross-validation
  • Grid search and random search for model tuning
  • Voting ensemble for creating the best prediction (inspired by Ken Jee!)
  • The model reached 85% accuracy on the training data and 77% accuracy on the test data.
  • All parameters were fitted using only the training data, which is critical for industry applications.
File | Notes
/module/helpers.py | tools built to facilitate EDA and preprocessing
titanic_EDA.ipynb | EDA (Exploratory Data Analysis)
titanic_preprocessing_feature_Engineering.ipynb | data preprocessing and feature engineering
titanic_model.ipynb | Machine Learning model building and tuning

1. DEFINE the problem

The Titanic still occupies our minds more than 100 years after the disaster. Among the more than 2,000 passengers, about 1,500 lost their lives. It is worth investigating whether survival was related to factors such as:

  • sex
  • fare
  • age
  • cabin
  • class
  • ticket
  • number of siblings and spouse
  • number of children and parents
  • embarked location

In this project, the training data has information on 891 passengers, the test data has information on 418 passengers, and the task is to predict whether each passenger in the test set survived.

2. DISCOVERY

2.1 EDA: study the training data and all the variables

2.1.1 Check missing values and inspect the variables in the datasets
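A minimal sketch of this check, assuming the standard Kaggle file names (the notebook may load the data differently):

```python
import pandas as pd

# Assumed file names from the Kaggle competition download.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Missing values per column and the dtype of each variable.
print(train.isnull().sum())
print(train.dtypes)
```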

2.1.2 Distribution of the numeric variables and their correlation
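One way these plots could be produced with pandas and seaborn, reusing the `train` DataFrame from the sketch above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

numeric_cols = ["Age", "SibSp", "Parch", "Fare"]

# Histograms of the numeric variables.
train[numeric_cols].hist(figsize=(8, 6))
plt.tight_layout()
plt.show()

# Pairwise correlation of the numeric variables (plus the target).
sns.heatmap(train[numeric_cols + ["Survived"]].corr(), annot=True, cmap="coolwarm")
plt.show()
```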

Observations:

  • Age is roughly normally distributed; the other numeric variables need normalization
  • Parch (number of parents/children aboard) is positively correlated with SibSp (number of siblings/spouses aboard)
  • Age is negatively correlated with SibSp

2.1.3 Bar plots of the categorical variables

Observations:

  • more passengers died than survived
  • most passengers were in 3rd class
  • more males than females
  • more passengers embarked at S than at C or Q


2.1.4 Compare the survival rate across the numeric variables (Age, SibSp, Parch, and Fare) and the categorical variables (Sex, Pclass, Embarked)
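A possible way to compute these comparisons, again assuming the `train` DataFrame loaded in the earlier sketch:

```python
# Mean survival rate per level of each categorical variable.
for col in ["Sex", "Pclass", "Embarked"]:
    print(train.groupby(col)["Survived"].mean(), "\n")

# Mean of each numeric variable for survivors vs non-survivors.
print(train.groupby("Survived")[["Age", "SibSp", "Parch", "Fare"]].mean())
```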

Observations:

  • higher Fare is associated with a higher survival rate
  • higher Parch is associated with a higher survival rate
  • lower SibSp and lower Age are associated with a higher survival rate
  • Survival: female > male
  • Survival: Pclass 1 > 2 > 3
  • Survival: Embarked C > Q > S

2.1.5 Experimenting with feature engineering

Simplify Cabin by the number of cabins; NaN is counted as 0 cabins.
Observation:

  • passengers with 1, 2, or 4 cabins have a higher proportion of survivors than non-survivors


Simplify Cabin by the first letter of the cabin.
Observation:

  • more passengers in the following categories survived: B, D, E, F

Simplify Ticket by the first letter of the ticket.
Observations:

  • more survivors with the following ticket_firstletter: F, P
  • very few survivors with the following ticket_firstletter: A, W
  • moderate survival rate with the following ticket_firstletter: C, None


Simplify Name by extracting the title (a sketch of these transformations is shown below).
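A minimal sketch of the engineered columns described above; the names match the EDA, but the exact implementation in module/helpers.py may differ:

```python
import pandas as pd

train = pd.read_csv("train.csv")  # assumed path

# Number of cabins listed for the passenger; NaN counts as 0.
train["cabin_total"] = train["Cabin"].fillna("").apply(lambda c: len(c.split()))

# First letter of the cabin; 'n' when the cabin is missing.
train["cabin_firstletter"] = train["Cabin"].fillna("n").str[0]

# First letter of the ticket prefix, or 'None' for purely numeric tickets.
train["ticket_firstletter"] = train["Ticket"].apply(
    lambda t: t[0] if not t.split()[0].isdigit() else "None"
)

# Title extracted from the name, e.g. "Braund, Mr. Owen Harris" -> "Mr".
train["name_title_adv"] = train["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
```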

2.1.6 Plot the survival rate in relation to multiple other features
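These multi-feature plots can be sketched with seaborn, for example (assuming the `train` DataFrame from the earlier sketches):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Age distribution of survivors vs non-survivors, split by Sex and Pclass.
g = sns.FacetGrid(train, row="Sex", col="Pclass", hue="Survived")
g.map(sns.histplot, "Age")
g.add_legend()
plt.show()

# Survival against Age and Fare on a single scatter plot.
sns.scatterplot(data=train, x="Age", y="Fare", hue="Survived")
plt.show()
```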

Survival rate ~ Sex + Age

Observations:

  • males aged 20-40 mostly did not survive
  • females have a high survival rate across all ages

Survival ~ Age + Sex + Pclass

Observation:

  • males aged 20-40 in Pclass 2 and 3 mostly did not survive

Survival ~ Embarked + Age + Pclass

Observations:

  • Pclass 3 has a much lower survival rate than Pclass 1 and 2 across Sex and Embarked
  • males who embarked at Q have a particularly low survival rate compared with S and C

Survival ~ Age + Fare

Observation:

  • higher fare is associated with a higher survival rate across most of the age spectrum
  • younger passengers (age 0-10) have a higher survival rate
  • older passengers (60+) have a lower survival rate

Survival ~ cabin_firstletter

Observation:

  • most passengers fall in the n category, which means no cabin was recorded
  • the survival rate in the n category is lower than in the other categories

Survival ~ ticket_firstletter

Observations:

  • most passengers fall in the None category, i.e. tickets with no letter prefix
  • the survival rates are lower in the following categories: None, A, S, C, W

Survival ~ name_title_adv

Observations:

  • most passengers fall in the Mr category, which has a low survival rate
  • the Mrs, Miss, and Master categories have higher survival rates

Conclusion:

  • based on the EDA, the following variables should be included as features:
  • Pclass, name_title_adv, Sex, Age, SibSp, Parch, Fare, Embarked, cabin_total, cabin_firstletter, ticket_firstletter

2.2 Preprocessing + feature engineering: extract features from Name, Cabin, and Ticket

Organized and prepared a helper module for feature engineering (based on the EDA) so it can be readily applied to both the training and test sets; a minimal sketch follows the list below.

  • Convert Pclass to categorical
  • Fill in the empty cells of 'Embarked'
  • Normalize then fill in the empty cells for 'Fare'
  • Simplify Name by creating 'name_title_adv'
  • Simplify Cabin by creating 'cabin_firstletter' and 'cabin_total'
  • Simplify Ticket by creating 'ticket_firstletter'
  • Replace values of 'name_title_adv' that appear in the test set but not in the training set with the training-set mode
  • Fill the empty cells of Age with values aggregated by 'name_title_adv'
  • Remove extra columns
  • Merge training and test together to create a consistent dummy-variable set across train and test, then separate the datasets
  • Scale the numeric columns for both datasets
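A rough sketch of a few of these steps (the helper module may implement them differently); it assumes `train` and `test` DataFrames that already carry the engineered columns from section 2.1.5:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Fill missing Age with the median Age of the passenger's title group,
# using statistics computed from the training data only.
title_age = train.groupby("name_title_adv")["Age"].median()
train["Age"] = train["Age"].fillna(train["name_title_adv"].map(title_age))
test["Age"] = test["Age"].fillna(test["name_title_adv"].map(title_age))

# One consistent dummy-variable set across train and test, then split again.
combined = pd.concat([train.drop(columns="Survived"), test], keys=["train", "test"])
combined = pd.get_dummies(combined, columns=["Pclass", "Sex", "Embarked", "name_title_adv",
                                             "cabin_firstletter", "ticket_firstletter"])
X_train = combined.loc["train"].copy()
X_test = combined.loc["test"].copy()
y_train = train["Survived"]

# Scale the numeric columns with statistics learned from the training data only.
num_cols = ["Age", "SibSp", "Parch", "Fare", "cabin_total"]
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
```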

3. DEVELOP: Model building and tuning of several Machine Learning models

  • sklearn
  • Tested multiple ML models: Naive Bayes, Logistic Regression, Decision Tree, K-Nearest Neighbors, Random Forest, SVC, XGBoost
  • Used sklearn.ensemble to create a voting system
  • Used the average accuracy from 5-fold cross-validation
  • Tuned each model by either grid search or random search to improve accuracy (a sketch of the setup follows this list)
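A hedged sketch of the tuning and voting setup, assuming `X_train`/`y_train` from the preprocessing step; the parameter grid here is a placeholder, not the values used in the notebook:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Baseline 5-fold cross-validated accuracy for one model.
rf = RandomForestClassifier(random_state=42)
print(cross_val_score(rf, X_train, y_train, cv=5, scoring="accuracy").mean())

# Grid search over a small placeholder parameter grid.
grid = GridSearchCV(rf, {"n_estimators": [100, 300], "max_depth": [5, 10, None]},
                    cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
best_rf = grid.best_estimator_

# Soft-voting ensemble combining several tuned models.
vote = VotingClassifier(
    estimators=[("rf", best_rf), ("svc", SVC(probability=True)), ("xgb", XGBClassifier())],
    voting="soft",
)
print(cross_val_score(vote, X_train, y_train, cv=5, scoring="accuracy").mean())
```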

Accuracy improvement after tuning:

Algorithm | Accuracy with default parameters | Accuracy after tuning
Naive Bayes | 0.4668 | N/A
Logistic Regression | 0.8193 | 0.8215
Decision Tree | 0.7924 | N/A
K-Nearest Neighbors | 0.8149 | 0.8249
Random Forest | 0.81 | 0.8372
SVC | 0.8306 | 0.8350
XGBoost | 0.8305 | 0.8451

4. DEPLOY: Prepare the submission file using the algorithm of choice (tuned XGBoost)

The final model accuracy was 85% on the training data and 77% on the test data.
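A minimal sketch of the final fit, submission file, and feature-importance check; the XGBoost parameters are placeholders, and `X_train`, `X_test`, `y_train`, and `test` come from the earlier sketches:

```python
import pandas as pd
from xgboost import XGBClassifier

# Fit the tuned model on the full training set (placeholder hyperparameters).
model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)

# Kaggle submission: PassengerId plus the predicted Survived flag.
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": model.predict(X_test),
})
submission.to_csv("submission.csv", index=False)

# Which features the model relied on most (discussed below).
importances = pd.Series(model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```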

Feature Importance

  • Sex being male is the most important feature
  • The next most important feature is whether the person's title is Master
  • The next is whether the passenger is in Pclass 3

More feature engineering could be investigated to increase the accuracy.

