dseo23 / enterprise-paired-nn-reg


TensorFlow and TensorBoard with Regularization

Purpose

The purpose of this lab is threefold.

  1. to review using TensorFlow and TensorBoard for modeling and evaluation with neural networks
  2. to review using data science pipelines and cross-validation with neural networks
  3. to review using TensorFlow for neural network regularization

We'll be continuing our investigation of the canonical Titanic dataset that we began previously.

The Titanic

The Titanic and its data

RMS Titanic was a British passenger liner built by Harland and Wolff and operated by the White Star Line. It sank in the North Atlantic Ocean in the early morning hours of 15 April 1912, after striking an iceberg during her maiden voyage from Southampton, England, to New York City, USA.

Of the estimated 2,224 passengers and crew aboard, more than 1,500 died, making the sinking one of modern history's deadliest peacetime commercial marine disasters.

Though there were about 2,224 passengers and crew members aboard, we are given data for only about 1,300 passengers. Of those roughly 1,300 records, about 900 are used for training and the remaining 400 for testing. The test data has had the Survived column removed, and we'll use neural networks to predict whether the passengers in the test data survived. Neither the training nor the test data is perfectly clean, as we'll see.

Below is a picture of the Titanic Museum in Belfast, Northern Ireland.

from IPython.display import Image
from IPython.core.display import HTML 
Image(url= "https://upload.wikimedia.org/wikipedia/commons/c/c0/Titanic_Belfast_HDR.jpg", width=400, height=400)

Data Dictionary

  • Survival : 0 = No, 1 = Yes
  • Pclass : A proxy for socio-economic status (SES)
      • 1st = Upper
      • 2nd = Middle
      • 3rd = Lower
  • SibSp : The number of siblings / spouses aboard the Titanic
      • Sibling = brother, sister, stepbrother, stepsister
      • Spouse = husband, wife (mistresses and fiancés were ignored)
  • Parch : The number of parents / children aboard the Titanic
      • Parent = mother, father
      • Child = daughter, son, stepdaughter, stepson
      • Some children travelled only with a nanny, so Parch = 0 for them.
  • Ticket : Ticket number
  • Fare : Passenger fare (British pounds)
  • Cabin : Cabin number
  • Embarked : Port of embarkation
      • C = Cherbourg (now Cherbourg-en-Cotentin), France
      • Q = Queenstown (now Cobh), Ireland
      • S = Southampton, England
  • Name, Sex, and Age (years) are all self-explanatory

Libraries and the Data

Importing libraries

# Load the germane libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV

import tensorflow as tf
import keras
from keras import models
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.losses import binary_crossentropy
from keras.callbacks import EarlyStopping
from keras.regularizers import l2
from keras.wrappers.scikit_learn import KerasClassifier

# Load the TensorBoard notebook extension and related libraries
%load_ext tensorboard
import datetime

Loading the data

# Load the data

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# We need this for when we make our predictions from the test data at the end
ids = test[['PassengerId']]

EDA and Preprocessing

Exploratory Data Analysis

You have already performed EDA on this data set. Look back on what you did before or see here.

Of course, feel free to re-run what you have done before or try out some other EDA as you find useful.
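If you want a quick refresher, here is a minimal sketch of the kind of EDA that tends to be useful on this data. The column names come straight from the data dictionary above, and sns.histplot assumes a reasonably recent version of seaborn.

# A quick look at the raw training data: dtypes and missing values
print(train.info())

# Survival rate by sex and passenger class -- two of the strongest signals
print(train.groupby(['Sex', 'Pclass'])['Survived'].mean())

# Age distribution, split by survival (missing ages are simply dropped by the plot)
sns.histplot(data=train, x='Age', hue='Survived', multiple='stack')
plt.show()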

Preprocessing

Let's do the same preprocessing as before.

# Preprocessing the train and test data will be more efficient if we combine the two data sets.
combined = pd.concat([train, test], axis=0, sort=False)

#Age column
combined['Age'].fillna(combined['Age'].median(),inplace=True) # Age

# Embarked and Fare columns
combined['Embarked'].fillna(combined['Embarked'].value_counts().index[0], inplace=True) # Embarked
combined['Fare'].fillna(combined['Fare'].median(), inplace=True)                        # Fare

# Class column
d = {1:'1st',2:'2nd',3:'3rd'} #Pclass
combined['Pclass'] = combined['Pclass'].map(d) #Pclass

# Binarize Age into a 0/1 indicator (1 if 18 or older, 0 otherwise), stored in the 'Child' column
combined['Child'] = combined['Age'].apply(lambda age: 1 if age>=18 else 0) 

# Break up the string that has the title and names
combined['Title'] = combined['Name'].str.split('.').str.get(0)  # output : 'Futrelle, Mrs'
combined['Title'] = combined['Title'].str.split(',').str.get(1) # output : 'Mrs '
combined['Title'] = combined['Title'].str.strip()               # output : 'Mrs'
combined.groupby('Title').count()

# Replace the French titles with English equivalents
french_titles = ['Don', 'Dona', 'Mme', 'Ms', 'Mra', 'Mlle']
english_titles = ['Mr', 'Mrs', 'Mrs', 'Mrs', 'Mrs', 'Miss']
for french, english in zip(french_titles, english_titles):
    combined['Title'] = combined['Title'].str.replace(french, english)

# Separate the titles into "major" and "others" (the latter being, e.g., Reverend)
major_titles = ['Mr','Mrs','Miss','Master']
combined['Title'] = combined['Title'].apply(lambda title: title if title in major_titles else 'Others')

# Dropping the irrelevant columns
combined.drop(['PassengerId','Name','Ticket','Cabin'], axis=1, inplace=True)

# Getting Dummy Variables and Dropping the Original Categorical Variables
categorical_vars = combined[['Pclass','Sex','Embarked','Title','Child']] # Get Dummies of Categorical Variables
dummies = pd.get_dummies(categorical_vars,drop_first=True)
combined = combined.drop(['Pclass','Sex','Embarked','Title','Child'],axis=1)
combined = pd.concat([combined, dummies],axis=1)

# Separating the data back into train and test sets
test = combined[combined['Survived'].isnull()].drop(['Survived'],axis=1)
train = combined[combined['Survived'].notnull()]

# Training
X_train = train.drop(['Survived'], axis=1)
y_train = train['Survived']

# Scaling: fit the scaler on the training data only, then apply the same scaling to the test data
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
test = sc.transform(test)

Neural Network Model

Building the model

Define the model as a pipeline

Let's use the data science pipeline for our neural network model.

Since you are now using regularization to guard against high variance, i.e., overfitting the data, include dropout and/or L2 regularization in the definition of the model below; one possible way to do this is sketched after the stub. Also, feel free to experiment with different activation functions.

# It will help to define our model in terms of a pipeline
def build_classifier(optimizer):
    # insert Sequential and layers here

    return classifier
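For reference, here is one way the stub could be filled in. This is a minimal sketch, not the definitive architecture: the layer widths (16 and 8), dropout rate (0.2), and L2 strength (0.01) are illustrative assumptions you should tune, and input_dim is read off the preprocessed training matrix.

# One possible build_classifier -- a sketch; layer widths, dropout rate,
# and L2 strength are illustrative assumptions, not tuned values
def build_classifier(optimizer):
    classifier = Sequential()
    # Hidden layer 1: the L2 weight penalty discourages large weights
    classifier.add(Dense(16, activation='relu', kernel_regularizer=l2(0.01),
                         input_dim=X_train.shape[1]))
    # Dropout randomly zeroes 20% of activations during training
    classifier.add(Dropout(0.2))
    # Hidden layer 2, also regularized
    classifier.add(Dense(8, activation='relu', kernel_regularizer=l2(0.01)))
    classifier.add(Dropout(0.2))
    # Sigmoid output for the binary Survived target
    classifier.add(Dense(1, activation='sigmoid'))
    classifier.compile(optimizer=optimizer,
                       loss='binary_crossentropy',
                       metrics=['accuracy'])
    return classifier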

Use grid search to help you tune the hyperparameters

You can play with optimizers, epochs, and batch sizes. The ones that we're suggesting are not necessarily the best.

# Grid Search
classifier = KerasClassifier(build_fn = build_classifier)
param_grid = dict(optimizer = ['Adam'],
                  epochs=[10, 20, 50],
                  batch_size=[16, 25, 32])
grid = GridSearchCV(estimator=classifier, param_grid=param_grid, scoring='accuracy')
grid_result = grid.fit(X_train, y_train)
best_parameters = grid.best_params_
best_accuracy = grid.best_score_
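It can help to print what the grid search found; a small sketch using the objects defined just above:

# Report the best hyperparameter combination found by the grid search
print(f"Best accuracy: {best_accuracy:.4f} using {best_parameters}")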

TensorBoard

TensorBoard is TensorFlow's visualization toolkit. It is a dashboard that provides visualization and tooling that is needed for machine learning experimentation. The code immediately below will allow us to use TensorBoard.

N.B. When we loaded the libraries, we loaded the TensorBoard notebook extension. (We did so near the end of the first code chunk.)

# Clear out any prior log data.
!rm -rf logs
# Be careful not to run this command if you have already trained your model and want to use TensorBoard on its logs.

# Sets up a timestamped log directory
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")

# Creates a file writer for the log directory.
file_writer = tf.summary.create_file_writer(log_dir)


# The callback function, which will be called in the fit()
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

Fitting the optimal model and evaluating with TensorBoard

Define the early stopping callback, use your best values from the grid search with KerasClassifier, and finally fit the model.

# Define the EarlyStopping object
early_stop = EarlyStopping(monitor='val_loss', min_delta=1e-8,
                           verbose=1, patience=5,
                           mode='min')

# Using KerasClassifier
classifier = KerasClassifier(build_fn = build_classifier,
                             optimizer=best_parameters['optimizer'],
                             batch_size=best_parameters['batch_size'],
                             epochs=best_parameters['epochs'])

# Fit the model with the tensorboard_callback
# validation_split holds out part of the training data so that val_loss exists
# for EarlyStopping to monitor; the 0.2 split is an illustrative choice
classifier.fit(X_train,
               y_train,
               validation_split=0.2,
               verbose=1,
               callbacks=[early_stop, tensorboard_callback])


# Warning: If verbose = 0 (silent) or 2 (one line per epoch), then TensorBoard's Graphs tab will show an error.
# The other TensorBoard tabs will still function, but if you want the graphs, verbose needs to be 1 (progress bar).
# Call TensorBoard within SaturnCloud [Comment this out if you are not in SaturnCloud]
import os
print(f"https://{os.getenv('SATURN_JUPYTER_BASE_DOMAIN')}/proxy/8000/")
%tensorboard --logdir logs/fit --port 8000 --bind_all
# This will generate a hyperlink. Click on that to open TensorBoard!
# (You'll see a 404 error below the link, just ignore that.)

# Call TensorBoard [Not in SaturnCloud]
# Uncomment the next line if you are not in SaturnCloud
# %tensorboard --logdir logs/fit

Results and Predictions

Calculate the predictions, save them as a csv, and print them.

# Your code here (use more cells if you need to)
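If you want a starting point, here is a minimal sketch assuming the objects defined above (classifier, test, ids); the output file name is an arbitrary choice.

# Predict survival on the scaled test data (the wrapper returns 0/1 class labels)
preds = classifier.predict(test)

# Attach the PassengerIds we saved at the start and write out a submission file
results = ids.assign(Survived=preds.astype(int).ravel())
results.to_csv('titanic_predictions.csv', index=False)
print(results)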

Continue to tweak your model until you are happy with the results based on model evaluation.

Conclusion

Now that you have TensorBoard to help you examine your model, you can better understand how to tweak it.

How do your predictions compare to what you did last time?

Remember that your "fancier" model may be less accurate, but that is okay: we are deliberately guarding against variance with our regularization techniques.
