TDung939 / CECS1020

CECS1020 Final Project - Titanic Prediction

Introduction

This is the group project for the CECS1020 class (Introduction to Machine Learning) at VinUniversity. We are Group 6:

  • Nguyen Tiet Nguyen Khoi
  • Nguyen Duong Tung
  • Nguyen Hoang Trung Dung

This zip folder includes:

  • Final report (written in LaTeX)
  • .ipynb notebook with the code implementation
  • Slides for the group presentation

Additional links:

Note:

  • Due to the report's length requirement, we could not include everything we did. Please see our .ipynb file for the full implementation.
  • We made our slides in Google Slides. The exported PowerPoint version may have some visualization errors; if you encounter any, please use our Google Slides link above.
  • The report's length requirement is 6 pages excluding references; ours runs 7 pages (excluding the references and the first 2 pages, which contain the outline).

The Challenge (Kaggle)

"The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc)."

Implementation (ipynb file)

Importing Necessary Libraries
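The notebook's import cell is not reproduced in this README; a minimal set covering the steps below might look like this (the notebook's actual cell may include more, e.g. matplotlib/seaborn for exploratory plots):

```python
# Core data handling
import numpy as np
import pandas as pd

# Models and utilities used in the three methods below
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
```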

Preprocessing Part
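The preprocessing code itself lives in the notebook. As a rough sketch of the kind of pipeline that produces the `train_scaled` array used below (column names follow the Kaggle dataset; the tiny DataFrame and the imputation/encoding choices here are illustrative, not necessarily the ones we used):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in for Kaggle's train.csv (the real data has 891 rows).
train = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass":   [3, 1, 3, 2],
    "Sex":      ["male", "female", "female", "male"],
    "Age":      [22.0, 38.0, None, 35.0],
    "Fare":     [7.25, 71.28, 7.92, 8.05],
})

# Fill missing ages with the median and encode Sex as 0/1.
train["Age"] = train["Age"].fillna(train["Age"].median())
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})

# Scale the feature columns; the target column stays separate.
features = ["Pclass", "Sex", "Age", "Fare"]
train_scaled = StandardScaler().fit_transform(train[features])
```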

Method 1: Logistic Regression

from sklearn.model_selection import train_test_split

# Separate features and target, then hold out 20% for validation
y = train['Survived']
x = train_scaled
X_train, X_valid, y_train, y_valid = train_test_split(x, y, test_size=0.2, random_state=42)

# Build the logistic regression model
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(max_iter=10000)
lr.fit(X_train, y_train)

# Evaluate on the validation set
from sklearn.metrics import accuracy_score
y_pred = lr.predict(X_valid)
accuracy_score(y_valid, y_pred)

from sklearn.metrics import confusion_matrix
confusion_matrix(y_valid, y_pred)
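A single 80/20 split can give a noisy accuracy estimate. One standard complement (not shown in the snippet above; the data here is a synthetic stand-in for `train_scaled` and `train['Survived']`) is k-fold cross-validation, which averages accuracy over several splits:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (train_scaled, train['Survived']).
x, y = make_classification(n_samples=200, n_features=4, random_state=42)

# 5-fold cross-validation: fit and score on five different splits.
scores = cross_val_score(LogisticRegression(max_iter=10000), x, y, cv=5)
print(scores.mean())
```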

Method 2: Decision Tree

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report as cr

# Same 80/20 split as before
X_train, X_valid, y_train, y_valid = train_test_split(x, y, test_size=0.2, random_state=42)

# Fit the decision tree and predict on the validation set
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
y_predict = dtc.predict(X_valid)

confusion_matrix(y_valid, y_predict)

accuracy_score(y_valid, y_predict)

print(cr(y_valid, y_predict))

Method 3: Random Forest

from sklearn.ensemble import RandomForestClassifier
X_train, X_valid, y_train, y_valid = train_test_split(x, y, test_size=0.2, random_state=42)

# Fit the random forest and evaluate on the validation set
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
rfc_y_pred = rfc.predict(X_valid)
accuracy_score(y_valid, rfc_y_pred)

print(cr(y_valid, rfc_y_pred))
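The Kaggle challenge expects a CSV with PassengerId and Survived columns. The submission step is not shown above; a rough sketch, assuming the test set is preprocessed the same way as the training set (the `test` DataFrame and `predictions` list here are illustrative stand-ins):

```python
import pandas as pd

# Illustrative stand-ins: in the notebook, `test` comes from test.csv and
# predictions from the fitted model, e.g. rfc.predict(test_scaled).
test = pd.DataFrame({"PassengerId": [892, 893, 894]})
predictions = [0, 1, 0]

# Kaggle's required two-column format, without the index column.
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": predictions,
})
submission.to_csv("submission.csv", index=False)
```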

Contributing

  • Nguyen Tiet Nguyen Khoi
  • Nguyen Duong Tung
  • Nguyen Hoang Trung Dung

License

MIT

Languages

Language: Jupyter Notebook 100.0%