ppainuly / Titanic-Machine-Learning

Predicting Survivors from Titanic Dataset. This is an entry for Kaggle.

Analysing the Titanic data and predicting survivors based on passenger class, sex, fare, and embarkation port.

This is an entry for Kaggle's machine learning prediction competition (https://www.kaggle.com/c/titanic/data).

# import dependencies
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import math
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Load dataset
titanic_df = pd.read_csv('data/train.csv')
# Set default figure size
sns.set(rc={'figure.figsize':(12,8)})
# Quick look at the data
titanic_df.head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
# Total number of passengers
print(f'There are {len(titanic_df)} total passengers in this training dataset')
There are 891 total passengers in this training dataset

Data Cleanup

# Looking at column types and count
titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
#Check for null values in each column
titanic_df.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
# Heatmap of null values per column to get a better idea. Dark blue marks a null;
# off-white means no null value for that column at that point.
sns.heatmap(titanic_df.isnull(), cmap="YlGnBu")
[figure: heatmap of null values per column]

We can see that about 20% of Age values and the large majority of Cabin values are null, plus 2 nulls in the Embarked column.

# Drop the Cabin column (too sparse to be useful)
titanic_df.drop('Cabin', axis=1, inplace=True)
# Drop the remaining rows containing nulls (missing Age or Embarked)
titanic_df.dropna(inplace=True)
# Verify that we don't have any more nulls:
# no dark bars on the heatmap means all nulls have been dropped.

sns.heatmap(titanic_df.isnull(), cmap="YlGnBu")
[figure: null-value heatmap after cleanup]
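Dropping rows is the simplest option, but it throws away every passenger whose Age is missing. A common alternative, not used in this notebook, is imputation; a minimal sketch, assuming a fresh copy of the raw training file:

# Alternative (not applied above): impute missing values instead of dropping rows
raw_df = pd.read_csv('data/train.csv')
raw_df = raw_df.drop('Cabin', axis=1)                                           # still too sparse to keep
raw_df['Age'] = raw_df['Age'].fillna(raw_df['Age'].median())                    # 177 nulls -> median age
raw_df['Embarked'] = raw_df['Embarked'].fillna(raw_df['Embarked'].mode()[0])    # 2 nulls -> most common port

This keeps all 891 rows, at the cost of concentrating the Age distribution around the median.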

Data Analysis

How many survivors were there among males and females?

sns.set_style("whitegrid") 
sns.countplot(x="Survived", hue="Sex", data=titanic_df, palette="Set3")
[figure: survival counts by sex]

Most passengers who did not survive were male.

How many survivors by passenger class?

sns.countplot(x="Survived", hue="Pclass", data=titanic_df, palette="Set3")
[figure: survival counts by passenger class]

Most passengers who did not survive belonged to Class 3, i.e. the lowest class. Among survivors, Class 1 had the most, followed by Class 3 and then Class 2.
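Because the class sizes differ, survival rates are more telling than raw counts; a quick check on the same dataframe:

# Survival rate per passenger class (mean of the binary Survived column)
titanic_df.groupby('Pclass')['Survived'].mean()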

# Age distribution on the Titanic
titanic_df['Age'].hist()
[figure: age distribution histogram]

As we can see, the Titanic skewed young: most passengers were under 30, meaning many children and young adults.

What were the ages of the survivors?

titanic_df[titanic_df['Survived'] == 1]['Age'].hist()
[figure: age distribution of survivors]

Ages 20-40 account for the most survivors, followed by ages below 5. We might be inclined to think that children were more likely to survive, but ages 10-20 show a lower count. We need to ask ourselves whether that is simply because there were few passengers aged 10-20 in the first place.

(There were close to 40 passengers aged 10-20, and close to 20 of them survived.)
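To make that concrete, we can bin ages and look at the passenger count and survival rate per bin; a small sketch using pd.cut (the bin edges are an arbitrary choice):

# Passenger count and survival rate per age bracket
age_bins = pd.cut(titanic_df['Age'], bins=[0, 5, 10, 20, 40, 60, 80])
titanic_df.groupby(age_bins)['Survived'].agg(['count', 'mean'])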

Analysing the fare distribution: how many people paid what fares?

titanic_df['Fare'].hist(bins=20,color='y')
[figure: fare distribution histogram]

What were the Fares of the survivors?

Fare distribution of the survivors below:

titanic_df[titanic_df['Survived'] == 1]['Fare'].hist(color='y')
[figure: fare distribution of survivors]

Analysing how many survived from each of the three embarkation ports:

S - Southampton

C - Cherbourg

Q - Queenstown

sns.countplot(x="Survived", hue="Embarked", data=titanic_df, palette="Set1")
[figure: survival counts by embarkation port]

Most people who survived boarded at Southampton, but most people who did not survive also boarded there. It's safe to say the majority of passengers came from Southampton, so the raw counts say more about port size than about survival odds; the normalized view below separates the two.
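A sketch using pd.crosstab, normalizing within each port:

# Fraction of survivors vs. non-survivors within each embarkation port
pd.crosstab(titanic_df['Embarked'], titanic_df['Survived'], normalize='index')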

How do the Age values vary across passenger classes?

sns.violinplot(x='Pclass',y="Age",data=titanic_df)
sns.swarmplot(x='Pclass',y="Age",data=titanic_df,color='0.2')
[figure: age distribution by passenger class (violin and swarm plots)]

The mean age for Class 1 is higher than for Class 2, which in turn is higher than for Class 3. This is intuitive because wealthier passengers tend to be older, and Class 1 fares are the most expensive, so Class 1 has the highest mean age.

Putting it together, we can plot Survived/Not Survived for males and females by age and by fare paid.

0 - Did not survive

1 - Survived

Red markers are male; blue markers are female.

sns.scatterplot(x="Age", y="Fare", hue="Sex", size="Survived",
                data=titanic_df, palette="Set1")
sns.set_style("ticks", {"xtick.major.size": 12, "ytick.major.size": 12})
sns.set_context("paper", font_scale=1.4)

[figure: Age vs. Fare scatter, coloured by sex, sized by survival]

Preparing data for Logistic Regression

We need to convert the categorical string columns into binary (0/1) dummy variables.

# Dummy-encode the Embarked column into binary Q and S columns (the first
# category, C, is dropped). If both Q and S are 0, the passenger embarked at C.

embarked = pd.get_dummies(titanic_df['Embarked'], drop_first=True)
embarked.head()
|   | Q | S |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 0 |
| 2 | 0 | 1 |
| 3 | 0 | 1 |
| 4 | 0 | 1 |
# Dummy-encode the Pclass column into binary 2 and 3 columns (class 1 is
# dropped). If both 2 and 3 are 0, the passenger travelled in Class 1.

pcl = pd.get_dummies(titanic_df['Pclass'], drop_first=True)
pcl.head()
|   | 2 | 3 |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 0 |
| 2 | 0 | 1 |
| 3 | 0 | 0 |
| 4 | 0 | 1 |
# Dummy-encode the Sex column into a single binary male column.
# If male is 0, the passenger is female.
sex = pd.get_dummies(titanic_df['Sex'], drop_first=True)
sex.head()
|   | male |
|---|------|
| 0 | 1 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 1 |
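As an aside, the three get_dummies calls above can be collapsed into a single pass over the whole dataframe. A sketch; note pandas would then prefix the new columns (Embarked_Q, Pclass_2, Sex_male) instead of the bare Q, 2, male names used below:

# Equivalent one-step encoding of all three categorical columns
encoded = pd.get_dummies(titanic_df, columns=['Embarked', 'Pclass', 'Sex'], drop_first=True)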
# Combining the dummy dataframes with our titanic dataframe
titanic_df = pd.concat([titanic_df, embarked, pcl, sex], axis=1)
titanic_df.head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | Q | S | 2 | 3 | male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S | 0 | 1 | 0 | 1 | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | 0 | 1 | 0 | 1 | 0 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | S | 0 | 1 | 0 | 0 | 0 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | S | 0 | 1 | 0 | 1 | 1 |
df_binary = titanic_df[["Survived","SibSp","Parch","Fare","Q","S",2,3,"male"]]
# Final dataset for Regression
df_binary.head()
|   | Survived | SibSp | Parch | Fare | Q | S | 2 | 3 | male |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 7.2500 | 0 | 1 | 0 | 1 | 1 |
| 1 | 1 | 1 | 0 | 71.2833 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 7.9250 | 0 | 1 | 0 | 1 | 0 |
| 3 | 1 | 1 | 0 | 53.1000 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 8.0500 | 0 | 1 | 0 | 1 | 1 |

Logistic Regression

Fitting a logistic regression model to the prepared dataset:

# Assigning dependent and independent variables

# Survived is our dependent variable: the one we are trying to predict.
y = df_binary['Survived']

# The other columns are our independent variables, so we drop Survived from the dataframe.
X = df_binary.drop('Survived', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=1)
classifier = LogisticRegression()
classifier
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
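Survivors are the minority class, so it can help to keep the class ratio identical in both halves of the split; a variant of the split above using train_test_split's stratify parameter:

# Variant that preserves the survived/not-survived ratio in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)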
classifier.fit(X_train, y_train)
/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test, y_test)}")
Training Data Score: 0.7978910369068541
Testing Data Score: 0.7482517482517482
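A single 80/20 split can be noisy; k-fold cross-validation averages the score over several splits. A sketch, pinning the solver to liblinear (the pre-0.22 default) to silence the FutureWarning seen above:

# 5-fold cross-validated accuracy for a more stable estimate
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(LogisticRegression(solver='liblinear'), X, y, cv=5)
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")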
# Predict
predictions = classifier.predict(X_test)
predictions
array([1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0,
       1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0,
       0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0])
# Comparing our predictions with the actual results
pd.DataFrame({"Prediction": predictions, "Actual": y_test})
Prediction Actual
689 1 1
279 1 1
508 0 0
9 1 1
496 1 1
150 0 0
474 1 0
469 1 1
794 0 0
864 0 0
553 0 1
226 0 1
204 0 1
713 0 0
751 0 1
349 0 0
74 0 1
321 0 0
743 0 0
873 0 0
647 0 1
327 1 1
684 0 0
769 0 0
91 0 0
272 1 1
770 0 0
27 1 0
141 1 1
733 0 0
... ... ...
741 0 0
636 0 0
672 0 0
345 1 1
68 0 1
357 1 0
514 0 0
81 0 1
231 0 0
881 0 0
174 0 0
188 0 0
419 1 0
319 1 1
876 0 0
808 0 0
706 1 1
534 1 0
554 1 1
90 0 0
99 0 0
608 1 1
869 0 1
148 0 0
666 0 0
582 0 0
44 1 1
236 0 0
780 1 1
884 0 0

143 rows × 2 columns

from sklearn.metrics import classification_report, accuracy_score
# Note: sklearn's convention is (y_true, y_pred)
print(classification_report(y_test, predictions))
              precision    recall  f1-score   support

           0       0.80      0.78      0.79        87
           1       0.67      0.70      0.68        56

    accuracy                           0.75       143
   macro avg       0.74      0.74      0.74       143
weighted avg       0.75      0.75      0.75       143

Accuracy of our model

accuracy_score(y_test, predictions)
0.7482517482517482
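Accuracy alone hides which way the errors go; a confusion matrix separates false positives from false negatives directly:

# Rows are actual classes, columns are predicted classes
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, predictions)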

Testing Kaggle's test dataset

test_df = pd.read_csv('data/test.csv')
passengerId = test_df['PassengerId']
test_df.head()
|   | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |

Prepare/Clean Kaggle's test data for our model

# Apply the same dummy encoding to Kaggle's test data: Embarked -> Q/S,
# Pclass -> 2/3, Sex -> male (the first category is dropped in each case).

embarked_test = pd.get_dummies(test_df['Embarked'], drop_first=True)
pcl_test = pd.get_dummies(test_df['Pclass'], drop_first=True)
sex_test = pd.get_dummies(test_df['Sex'], drop_first=True)
test_df = pd.concat([test_df, embarked_test, pcl_test, sex_test], axis=1)
test_df = test_df[["SibSp","Parch","Fare","Q","S",2,3,"male"]]
test_df.head()
test_df.head()
|   | SibSp | Parch | Fare | Q | S | 2 | 3 | male |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 7.8292 | 1 | 0 | 0 | 1 | 1 |
| 1 | 1 | 0 | 7.0000 | 0 | 1 | 0 | 1 | 0 |
| 2 | 0 | 0 | 9.6875 | 1 | 0 | 1 | 0 | 1 |
| 3 | 0 | 0 | 8.6625 | 0 | 1 | 0 | 1 | 1 |
| 4 | 1 | 1 | 12.2875 | 0 | 1 | 0 | 1 | 0 |
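One caveat with encoding train and test separately: if a category were missing from the test file, the resulting dummy columns would no longer line up with the ones the model was trained on. A defensive sketch, assuming X (the training feature frame from above) is still in scope:

# Align test columns to the training features, filling any missing dummy with 0
test_df = test_df.reindex(columns=X.columns, fill_value=0)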

Make Predictions using our model with Kaggle's test dataset

# Check for nulls in the prepared test features
test_df.isnull().sum()
SibSp    0
Parch    0
Fare     1
Q        0
S        0
2        0
3        0
male     0
dtype: int64
# Fill the missing Fare value with the column mean, then predict
test_df = test_df.fillna(test_df.mean())
prediction = classifier.predict(test_df)
output = pd.DataFrame({"PassengerId": passengerId,"Survived" : prediction})
output
PassengerId Survived
0 892 0
1 893 1
2 894 0
3 895 0
4 896 1
5 897 0
6 898 1
7 899 0
8 900 1
9 901 0
10 902 0
11 903 0
12 904 1
13 905 0
14 906 1
15 907 1
16 908 0
17 909 0
18 910 1
19 911 1
20 912 0
21 913 0
22 914 1
23 915 0
24 916 1
25 917 0
26 918 1
27 919 0
28 920 0
29 921 0
... ... ...
388 1280 0
389 1281 0
390 1282 0
391 1283 1
392 1284 0
393 1285 0
394 1286 0
395 1287 1
396 1288 0
397 1289 1
398 1290 0
399 1291 0
400 1292 1
401 1293 0
402 1294 1
403 1295 0
404 1296 0
405 1297 0
406 1298 0
407 1299 1
408 1300 1
409 1301 1
410 1302 1
411 1303 1
412 1304 1
413 1305 0
414 1306 1
415 1307 0
416 1308 0
417 1309 0

418 rows × 2 columns

# Export predictions to CSV for Kaggle submission
output.to_csv('data/output.csv', index=False)
