In this lesson, we'll dig deeper into the workhorse of deep learning: Multi-Layer Perceptrons! We'll build and train a couple of different MLPs with Keras and explore the tradeoffs that come with adding extra hidden layers. We'll also try switching between some of the activation functions we learned about in the previous lesson to see how they affect training and performance.
- Build a deep neural network using Keras
Run the cell below to import everything we'll need for this lab.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import keras
from keras.models import Sequential
from keras.layers import Dense
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler, LabelBinarizer
For this lab, we'll be working with the Wisconsin Breast Cancer Dataset. Although we're importing this dataset directly from scikit-learn, it is also hosted on Kaggle, where you'll find a detailed explanation of each feature, in case you're interested. We recommend you take a minute to familiarize yourself with the dataset before digging in.
In the cell below:
- Call `load_breast_cancer()` to store the dataset
- Access the `.data`, `.target`, and `.feature_names` attributes and store them in the appropriate variables below
bc_dataset = None
data = None
target = None
col_names = None
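If you get stuck, one possible way to fill in the cell looks like this (the attributes come straight from the object `load_breast_cancer()` returns):

```python
# Load the dataset and unpack the attributes we need
bc_dataset = load_breast_cancer()
data = bc_dataset.data                # feature matrix, one row per sample
target = bc_dataset.target           # labels: 0 = malignant, 1 = benign
col_names = bc_dataset.feature_names # names for all 30 feature columns
```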
Now, let's create a DataFrame so that we can see the data and explore it a bit more easily with the column names attached.
- In the cell below, create a pandas DataFrame from `data` (use `col_names` for column names)
- Print the `.head()` of the DataFrame
df = None
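A possible solution sketch:

```python
# Wrap the feature matrix in a DataFrame with readable column names
df = pd.DataFrame(data, columns=col_names)
df.head()
```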
In order to pass this data into a neural network, we'll need to make sure that the data:
- is purely numerical
- contains no missing values
- is normalized
Let's begin by calling the DataFrame's `.info()` method to check the datatype of each feature.
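The call itself is a one-liner:

```python
df.info()
```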
From the output above, we can see that the entire dataset is already in numerical format. We can also see from the counts that each feature has the same number of entries as the number of rows in the DataFrame -- that means that no feature contains any missing values. Great!
Now, let's check to see if our data needs to be normalized. Instead of doing statistical tests here, let's just take a quick look at the `.head()` of the DataFrame again. Do this in the cell below.
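Again, this is just:

```python
df.head()
```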
As we can see from comparing `mean radius` and `mean area`, the columns are clearly on different scales, which means that we need to normalize our dataset. To do this, we'll make use of scikit-learn's `StandardScaler` class.
In the cell below, instantiate a `StandardScaler` and use it to create a normalized version of our dataset.
scaler = None
scaled_data = None
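One way to do this, using `fit_transform()` to fit the scaler and transform the data in a single step:

```python
# Standardize each feature to mean 0 and standard deviation 1
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
```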
If you took a look at the data dictionary on Kaggle, then you probably noticed that the target for this dataset is whether the sample is "M" (malignant) or "B" (benign). This means that this is a binary classification task, so we'll need to binarize our labels.
In the cell below, make use of scikit-learn's `LabelBinarizer` class to create a binarized version of our labels.
binarizer = None
labels = None
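A possible solution -- note that scikit-learn's copy of the dataset already stores the labels as 0/1 integers, so here the binarizer mostly just reshapes them into the column vector Keras expects:

```python
# Encode the labels as a binary column vector
binarizer = LabelBinarizer()
labels = binarizer.fit_transform(target)
```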
Now, we'll build a small Multi-Layer Perceptron using Keras in the cell below. Our first model will act as a baseline, and then we'll make it bigger to see what happens to model performance.
In the cell below:
- Instantiate a `Sequential()` Keras model
- Use the model's `.add()` method to add a `Dense` layer with 10 neurons and a `'tanh'` activation function. Also set the `input_shape` argument to `(30,)`, since we have 30 features
- Since this is a binary classification task, the output layer should be a `Dense` layer with a single neuron, and the activation set to `'sigmoid'`
model_1 = None
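One way to build this model, following the specifications above:

```python
# Baseline MLP: one hidden layer of 10 tanh neurons, sigmoid output
model_1 = Sequential()
model_1.add(Dense(10, activation='tanh', input_shape=(30,)))
model_1.add(Dense(1, activation='sigmoid'))
```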
Now that we've created the model, the next step is to compile it.
In the cell below, compile the model. Set the following hyperparameters:
- `loss='binary_crossentropy'`
- `optimizer='sgd'`
- `metrics=['acc']`
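With those hyperparameters, the compile step might look like:

```python
model_1.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['acc'])
```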
Now, let's fit the model. Set the following hyperparameters:
- `epochs=25`
- `batch_size=1`
- `validation_split=0.2`
results_1 = None
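A possible fit call; this sketch assumes we're training on the scaled data and binarized labels created above:

```python
results_1 = model_1.fit(scaled_data, labels,
                        epochs=25, batch_size=1, validation_split=0.2)
```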
Note that when you call a Keras model's `.fit()` method, it returns a `History` object (a type of Keras callback) containing information on the training process of the model. If you examine the object's `.history` attribute, you'll find a dictionary containing both the training and validation loss, as well as any metrics we specified when compiling the model (in this case, just accuracy).
Let's quickly plot our loss and accuracy curves and see if we notice anything. Since we'll want to do this anytime we train an MLP, it's worth wrapping this code in a function so that we can easily reuse it.
In the cell below, we created a function for visualizing the loss and accuracy metrics.
def visualize_training_results(results):
    """Plot the loss and accuracy curves stored in a Keras History object."""
    history = results.history

    # Training vs. validation loss
    plt.figure()
    plt.plot(history['val_loss'])
    plt.plot(history['loss'])
    plt.legend(['val_loss', 'loss'])
    plt.title('Loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.show()

    # Training vs. validation accuracy
    plt.figure()
    plt.plot(history['val_acc'])
    plt.plot(history['acc'])
    plt.legend(['val_acc', 'acc'])
    plt.title('Accuracy')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.show()
visualize_training_results(results_1)
You'll probably notice that the model did pretty well! It's always recommended to plot your training and validation metrics against each other after training a model; visualized this way, it's easy to detect when the model is starting to overfit. Overfitting shows up when the model's training performance keeps steadily improving long after its validation performance has plateaued. We can see the beginnings of this in the plots above: the training loss continues to decrease and the training accuracy continues to increase, while the gap between the training and validation curves widens as the epochs go by.
By adding another hidden layer, we give the model the ability to capture more high-level abstractions in the data. However, increasing the depth of the model also increases the amount of data the model needs in order to converge, because a more complex model brings the "Curse of Dimensionality" along with it, thanks to all the extra trainable parameters that come from adding more size to our network.
If there is complexity in the data that our smaller model was not big enough to capture, then a larger model may improve performance. However, if our dataset isn't big enough for the new, larger model, then we may see performance decrease as the model "thrashes" about a bit, failing to converge. Let's try it and see what happens.
In the cell below, recreate the model that you created above, with one exception. In the model below, add a second `Dense` layer with a `'tanh'` activation function and 5 neurons after the first. The network's output layer should still be a `Dense` layer with a single neuron and a `'sigmoid'` activation function, since this is still a binary classification task.
Create, compile, and fit the model in the cells below, and then visualize the results to compare the training histories.
model_2 = None
results_2 = None
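For reference, one possible solution for the cells above, reusing the same compile and fit hyperparameters as `model_1`:

```python
# Deeper MLP: a second hidden layer of 5 tanh neurons after the first
model_2 = Sequential()
model_2.add(Dense(10, activation='tanh', input_shape=(30,)))
model_2.add(Dense(5, activation='tanh'))
model_2.add(Dense(1, activation='sigmoid'))

model_2.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['acc'])
results_2 = model_2.fit(scaled_data, labels,
                        epochs=25, batch_size=1, validation_split=0.2)
```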
visualize_training_results(results_2)
Although the final validation score for both models is the same, this model is clearly worse because it hasn't converged yet. We can tell because of the greater variance in the movement of the `val_loss` and `val_acc` lines. This suggests that we could remedy the problem by either:
- Decreasing the size of the network, or
- Increasing the size of our training data
As a final exercise, let's create a third model that is the same as the first model we created earlier. The only difference is that we will train it on our raw dataset, not the normalized version. This way, we can see how much of a difference normalizing our input data makes.
Create, compile, and fit a model in the cell below. The only change in parameters will be using `data` instead of `scaled_data` during the `.fit()` step.
model_3 = None
results_3 = None
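A possible solution sketch -- the only change from `model_1` is passing the raw `data` to `.fit()`:

```python
# Same architecture as model_1, but trained on the unscaled data
model_3 = Sequential()
model_3.add(Dense(10, activation='tanh', input_shape=(30,)))
model_3.add(Dense(1, activation='sigmoid'))

model_3.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['acc'])
results_3 = model_3.fit(data, labels,
                        epochs=25, batch_size=1, validation_split=0.2)
```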
visualize_training_results(results_3)
Wow! Our results were much worse -- over 20% poorer performance when working with non-normalized input data!
In this lab, we got some practice creating Multi-Layer Perceptrons, and explored how things like the number of layers in a model and data normalization affect our overall training results!