In this lab, you'll learn how to use scikit-learn's implementation of a KNN classifier on the classic Titanic dataset from Kaggle!
In this lab you will:
- Conduct a parameter search to find the optimal value for K
- Use a KNN classifier to generate predictions on a real-world dataset
- Evaluate the performance of a KNN model
Start by importing the dataset, stored in the `titanic.csv` file, and previewing it.
# Your code here
# Import pandas and set the standard alias
# Import the data from 'titanic.csv' and store it in a pandas DataFrame
raw_df = None
# Print the head of the DataFrame to ensure everything loaded correctly
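One way to fill in this cell might look like the sketch below, assuming `titanic.csv` sits in the same folder as this notebook:

```python
# Import pandas and set the standard alias
import pandas as pd

# Import the data from 'titanic.csv' and store it in a pandas DataFrame
raw_df = pd.read_csv('titanic.csv')

# Print the head of the DataFrame to ensure everything loaded correctly
raw_df.head()
```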
Great! Next, you'll perform some preprocessing steps such as removing unnecessary columns and normalizing features.
Preprocessing is an essential component of any data science pipeline. It's not as glamorous as an engaging data visualization or an impressive neural network, but cleaning and normalizing raw data is what produces the useful, reliable datasets that form the backbone of every data-powered project. This can include changing column types, as in:
df['col_name'] = df['col_name'].astype('int')
Or extracting subsets of information, such as:
import re
df['street'] = df['address'].map(lambda x: re.findall(r'(.*)\n', x)[0])
Note: While outside the scope of this particular lesson, regular expressions (used above) are powerful tools for pattern matching! See the official documentation for Python's `re` module for more details.
Since you've done this before, you should be able to complete these steps on your own without much hand-holding. In the cells below, complete the following steps:
- Remove unnecessary columns (`'PassengerId'`, `'Name'`, `'Ticket'`, and `'Cabin'`)
- Convert `'Sex'` to a binary encoding, where female is `0` and male is `1`
- Detect and deal with any missing values in the dataset:
  - For `'Age'`, replace missing values with the median age for the dataset
  - For `'Embarked'`, drop the rows that contain missing values
- One-hot encode categorical columns such as `'Embarked'`
- Store the target column, `'Survived'`, in a separate variable and remove it from the DataFrame
# Drop the unnecessary columns
df = None
df.head()
# Convert Sex to binary encoding
df['Sex'] = None
df.head()
# Find the number of missing values in each column
# Impute the missing values in 'Age'
df['Age'] = None
df.isna().sum()
# Drop the rows missing values in the 'Embarked' column
df = None
df.isna().sum()
# One-hot encode the categorical columns
one_hot_df = None
one_hot_df.head()
# Assign the 'Survived' column to labels
labels = None
# Drop the 'Survived' column from one_hot_df
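Filled in, these cells might look something like the following sketch, which assumes the column names of the standard Kaggle Titanic dataset and that `pd` and `raw_df` are already defined from the cells above:

```python
# Drop columns that carry little predictive signal for this model
df = raw_df.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'])

# Encode 'Sex' as a binary column: female -> 0, male -> 1
df['Sex'] = df['Sex'].map({'female': 0, 'male': 1})

# Replace missing ages with the median age of the dataset
df['Age'] = df['Age'].fillna(df['Age'].median())

# Drop the rows with a missing 'Embarked' value
df = df.dropna(subset=['Embarked'])

# One-hot encode the remaining categorical columns
one_hot_df = pd.get_dummies(df, columns=['Embarked'])

# Separate the target from the features
labels = one_hot_df['Survived']
one_hot_df = one_hot_df.drop(columns=['Survived'])
```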
Now that you've preprocessed the data, it's time to split it into training and test sets.
In the cell below:
- Import `train_test_split` from the `sklearn.model_selection` module
- Use `train_test_split()` to split the data into training and test sets, with a `test_size` of `0.25`. Set the `random_state` to `42`
# Import train_test_split
# Split the data
X_train, X_test, y_train, y_test = None
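A minimal completion, continuing with `one_hot_df` and `labels` from the preprocessing cells above:

```python
# Import train_test_split
from sklearn.model_selection import train_test_split

# Hold out 25% of the data for testing; fix the random seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(one_hot_df, labels,
                                                    test_size=0.25,
                                                    random_state=42)
```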
The final step in your preprocessing efforts for this lab is to normalize the data. You normalize *after* splitting the data into training and test sets to avoid information "leaking" from the test set into the training set (a problem known as data leakage). Remember that normalization (also sometimes called standardization or scaling) means making sure that all of your data is represented at the same scale. The most common way to do this is to convert all numerical values to z-scores.
Since KNN is a distance-based classifier, features on larger scales would otherwise have a disproportionate impact on the distances between points.
To scale your data, use `StandardScaler`, found in the `sklearn.preprocessing` module.
In the cell below:
- Import and instantiate `StandardScaler`
- Use the scaler's `.fit_transform()` method to create a scaled version of the training dataset
- Use the scaler's `.transform()` method to create a scaled version of the test dataset
- The results returned by the `.fit_transform()` and `.transform()` methods will be NumPy arrays, not pandas DataFrames. Create a new pandas DataFrame out of the scaled training data called `scaled_df_train`. To set the column names back to their original state, set the `columns` parameter to `one_hot_df.columns`
- Print the head of `scaled_df_train` to ensure everything worked correctly
# Import StandardScaler
# Instantiate StandardScaler
scaler = None
# Transform the training and test sets
scaled_data_train = None
scaled_data_test = None
# Convert into a DataFrame
scaled_df_train = None
scaled_df_train.head()
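One way to complete these cells; note that the scaler is fit on the training data only, and that same fitted transformation is then applied to the test data:

```python
# Import and instantiate StandardScaler
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit on the training set, then reuse that fit to transform the test set,
# so no information from the test set leaks into the scaling parameters
scaled_data_train = scaler.fit_transform(X_train)
scaled_data_test = scaler.transform(X_test)

# Wrap the NumPy array back into a DataFrame with the original column names
scaled_df_train = pd.DataFrame(scaled_data_train, columns=one_hot_df.columns)
scaled_df_train.head()
```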
You may have noticed that the scaler scaled our binary/one-hot encoded columns, too! Although it doesn't look as pretty, this has no negative effect on the model. Each 1 and 0 has been replaced with a corresponding decimal value, but each binary column still contains only 2 distinct values, meaning the overall information content of each column has not changed.
Now that you've preprocessed the data, it's time to train a KNN classifier and validate its accuracy.
In the cells below:
- Import `KNeighborsClassifier` from the `sklearn.neighbors` module
- Instantiate the classifier. For now, you can just use the default parameters
- Fit the classifier to the training data/labels
- Use the classifier to generate predictions on the test data. Store these predictions inside the variable `test_preds`
# Import KNeighborsClassifier
# Instantiate KNeighborsClassifier
clf = None
# Fit the classifier
# Predict on the test set
test_preds = None
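A sketch of what a completed version might look like, fit on the scaled data from the previous step:

```python
# Import and instantiate KNeighborsClassifier (the default uses k=5 neighbors)
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier()

# Fit the classifier on the scaled training data
clf.fit(scaled_data_train, y_train)

# Generate predictions on the scaled test set
test_preds = clf.predict(scaled_data_test)
```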
Now, in the cells below, import all the necessary evaluation metrics from `sklearn.metrics` and complete the `print_metrics()` function so that it prints out Precision, Recall, Accuracy, and F1-Score when given a set of `labels` (the true values) and `preds` (the model's predictions).

Finally, use `print_metrics()` to print the evaluation metrics for the test predictions stored in `test_preds` and the corresponding labels in `y_test`.
# Your code here
# Import the necessary functions
# Complete the function
def print_metrics(labels, preds):
print("Precision Score: {}".format(None))
print("Recall Score: {}".format(None))
print("Accuracy Score: {}".format(None))
print("F1 Score: {}".format(None))
print_metrics(y_test, test_preds)
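A possible completion, assuming binary labels so that scikit-learn's default `average='binary'` applies to each metric:

```python
# Import the necessary metric functions
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

def print_metrics(labels, preds):
    print("Precision Score: {}".format(precision_score(labels, preds)))
    print("Recall Score: {}".format(recall_score(labels, preds)))
    print("Accuracy Score: {}".format(accuracy_score(labels, preds)))
    print("F1 Score: {}".format(f1_score(labels, preds)))

print_metrics(y_test, test_preds)
```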
Interpret each of the metrics above, and explain what they tell you about your model's capabilities. If you had to pick one score to best describe the performance of the model, which would you choose? Explain your answer.
Write your answer below this line:
While your overall model results should be better than random chance, they're probably mediocre at best given that you haven't tuned the model yet. For the remainder of this notebook, you'll focus on improving your model's performance. Remember that modeling is an iterative process, and developing a baseline with an out-of-the-box model such as the one above is always a good start.
First, try to find the optimal number of neighbors to use for the classifier. To do this, complete the `find_best_k()` function below to iterate over multiple values of K and find the value of K that returns the best overall performance.
The function takes in six arguments:

- `X_train`
- `y_train`
- `X_test`
- `y_test`
- `min_k` (default is 1)
- `max_k` (default is 25)
Pseudocode Hint:
- Create two variables, `best_k` and `best_score`
- Iterate through every odd number between `min_k` and `max_k + 1`. For each iteration:
  - Create a new `KNN` classifier, and set the `n_neighbors` parameter to the current value for k, as determined by the loop
  - Fit this classifier to the training data
  - Generate predictions for `X_test` using the fitted classifier
  - Calculate the F1-score for these predictions
  - Compare this F1-score to `best_score`. If better, update `best_score` and `best_k`
- Once all iterations are complete, print the best value for k and the F1-score it achieved
def find_best_k(X_train, y_train, X_test, y_test, min_k=1, max_k=25):
# Your code here
pass
find_best_k(scaled_data_train, y_train, scaled_data_test, y_test)
# Expected Output:
# Best Value for k: 17
# F1-Score: 0.7468354430379746
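One implementation that follows the pseudocode hint might look like this:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

def find_best_k(X_train, y_train, X_test, y_test, min_k=1, max_k=25):
    best_k = 0
    best_score = 0.0
    # Step by 2 so k stays odd (assuming min_k is odd), which avoids tied
    # votes in binary classification
    for k in range(min_k, max_k + 1, 2):
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)
        preds = knn.predict(X_test)
        score = f1_score(y_test, preds)
        if score > best_score:
            best_score = score
            best_k = k
    print("Best Value for k: {}".format(best_k))
    print("F1-Score: {}".format(best_score))
```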
If all went well, you'll notice that model performance improved by about 3 percent once you found an optimal value for k. For further tuning, you can use scikit-learn's built-in `GridSearchCV` to perform a similar exhaustive check of hyperparameter combinations and fine-tune model performance, as sketched below. For a full list of model parameters, see the sklearn documentation for `KNeighborsClassifier`!
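As a rough sketch, a grid search over a few KNN hyperparameters might look like the following; the parameter grid here is purely illustrative:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Illustrative grid: number of neighbors, vote weighting, and distance metric
param_grid = {
    'n_neighbors': list(range(1, 26, 2)),
    'weights': ['uniform', 'distance'],
    'p': [1, 2]  # 1 = Manhattan distance, 2 = Euclidean distance
}

# Search all combinations with 5-fold cross-validation, scored by F1
grid = GridSearchCV(KNeighborsClassifier(), param_grid, scoring='f1', cv=5)
grid.fit(scaled_data_train, y_train)

print(grid.best_params_)
print(grid.best_score_)
```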
As an optional (but recommended!) exercise, think about the decisions you made during the preprocessing steps that could have affected the overall model performance. For instance, you were asked to replace the missing age values with the column median. Could this have affected the overall performance? How might the model have fared if you had just dropped those rows, instead of using the column median? What if you reduced the data's dimensionality by ignoring some less important columns altogether?
In the cells below, revisit your preprocessing stage and see if you can improve the overall results of the classifier by doing things differently. Consider dropping certain columns, dealing with missing values differently, or using an alternative scaling function. Then see how these different preprocessing techniques affect the performance of the model. Remember that the `find_best_k()` function handles all of the fitting; use it to iterate quickly as you try different strategies for data preprocessing!
Well done! In this lab, you worked with the classic Titanic dataset and practiced fitting and tuning KNN classification models with scikit-learn, getting more practice with your data wrangling and model tuning skills along the way!