In this lab, you'll investigate how regularization in scikit-learn can be used to produce better logistic regression models.
- Compare logistic regression models with different settings (intercept and regularization strength) and determine the optimal model
# Import the necessary packages
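# A minimal sketch of imports that should cover the steps in this lab
# (an assumption; add or remove packages as needed):
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc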
Import the dataset stored in 'heart.csv'.
# Import the data
df = None
# Print the first five rows of the data
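One possible way to fill in this cell (a sketch; it assumes heart.csv sits in the working directory):
df = pd.read_csv('heart.csv')  # load the dataset
df.head()                      # inspect the first five rows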
Define X and y, where the latter is the target variable. This time, follow best practices and also implement a standard train-test split. Assign 25% to the test set and set the random_state to 17.
# Define X and y
y = None
X = None
# Split the data into training and test sets
X_train, X_test, y_train, y_test = None
print(y_train.value_counts(),'\n\n', y_test.value_counts())
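A sketch of one possible completion (this assumes the target column in heart.csv is named 'target'):
y = df['target']                 # target variable
X = df.drop(columns=['target'])  # predictor columns
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=17)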
Use scikit-learn to build the logistic regression model. Turn off the intercept and set the regularization parameter, C, to a ridiculously large number such as 1e16. (In scikit-learn, C is the inverse of the regularization strength, so a very large value effectively turns regularization off.)
# Your code here
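A minimal sketch, assuming the model is named logreg so it matches the plotting code further down (the liblinear solver is one reasonable choice for a small dataset):
logreg = LogisticRegression(fit_intercept=False, C=1e16, solver='liblinear')
logreg.fit(X_train, y_train)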
Compute the model scores, ROC curves, and AUC values, then plot the ROC curves. Use both the training and test sets.
# Your code here
y_train_score = None
y_test_score = None
train_fpr, train_tpr, train_thresholds = None
test_fpr, test_tpr, test_thresholds = None
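# One possible completion of the placeholders above (assuming logreg is the
# fitted no-intercept model; decision_function returns the continuous scores
# that roc_curve expects):
y_train_score = logreg.decision_function(X_train)
y_test_score = logreg.decision_function(X_test)
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, y_train_score)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, y_test_score)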
print('Train AUC: {}'.format(auc(train_fpr, train_tpr)))
print('Test AUC: {}'.format(auc(test_fpr, test_tpr)))
plt.figure(figsize=(10, 8))
lw = 2
plt.plot(train_fpr, train_tpr, color='blue',
lw=lw, label='Train ROC curve')
plt.plot(test_fpr, test_tpr, color='darkorange',
lw=lw, label='Test ROC curve')
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.yticks([i/20.0 for i in range(21)])
plt.xticks([i/20.0 for i in range(21)])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()
Now add an intercept to the scikit-learn model. Keep the regularization parameter C set to a very large number such as 1e16.
# Create new model
logregi = None
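A minimal sketch, assuming logregi mirrors the first model except that the intercept is now fitted:
logregi = LogisticRegression(fit_intercept=True, C=1e16, solver='liblinear')
logregi.fit(X_train, y_train)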
Generate predictions for the training and test sets.
# Generate predictions
y_hat_train = None
y_hat_test = None
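One possible completion (a sketch, using the new intercept model): the ROC code below expects continuous scores rather than hard class labels, so decision_function (or predict_proba) is used here instead of predict.
y_hat_train = logregi.decision_function(X_train)
y_hat_test = logregi.decision_function(X_test)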
Plot all three models' ROC curves on the same graph.
# Initial model plots
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, y_hat_test)
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, y_hat_train)
print('Custom Model Test AUC: {}'.format(auc(test_fpr, test_tpr)))
print('Custom Model Train AUC: {}'.format(auc(train_fpr, train_tpr)))
plt.figure(figsize=(10,8))
lw = 2
plt.plot(test_fpr, test_tpr, color='darkorange',
lw=lw, label='Custom Model Test ROC curve')
plt.plot(train_fpr, train_tpr, color='blue',
lw=lw, label='Custom Model Train ROC curve')
# Second model plots
y_test_score = logreg.decision_function(X_test)
y_train_score = logreg.decision_function(X_train)
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, y_test_score)
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, y_train_score)
print('Scikit-learn Model 1 Test AUC: {}'.format(auc(test_fpr, test_tpr)))
print('Scikit-learn Model 1 Train AUC: {}'.format(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, color='yellow',
lw=lw, label='Scikit learn Model 1 Test ROC curve')
plt.plot(train_fpr, train_tpr, color='gold',
lw=lw, label='Scikit learn Model 1 Train ROC curve')
# Third model plots
y_test_score = None
y_train_score = None
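# One possible completion (assuming logregi is the model fitted with an intercept):
y_test_score = logregi.decision_function(X_test)
y_train_score = logregi.decision_function(X_train)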
test_fpr, test_tpr, test_thresholds = roc_curve(y_test, y_test_score)
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, y_train_score)
print('Scikit-learn Model 2 with intercept Test AUC: {}'.format(auc(test_fpr, test_tpr)))
print('Scikit-learn Model 2 with intercept Train AUC: {}'.format(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, color='purple',
lw=lw, label='Scikit learn Model 2 with intercept Test ROC curve')
plt.plot(train_fpr, train_tpr, color='red',
lw=lw, label='Scikit learn Model 2 with intercept Train ROC curve')
# Formatting
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.yticks([i/20.0 for i in range(21)])
plt.xticks([i/20.0 for i in range(21)])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
Now, experiment with altering the regularization parameter. At a minimum, create 5 different subplots with varying regularization (C) parameters. For each, plot the ROC curve of the training and test set for that specific model. Regularization parameters between 1 and 20 are recommended. Observe the difference in test and training AUC as you go along.
# Your code here
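A sketch of one possible approach (the specific C values, the no-intercept choice, and the one-row grid layout are all illustrative assumptions):
C_values = [1, 3, 7, 12, 20]
fig, axes = plt.subplots(nrows=1, ncols=len(C_values), figsize=(25, 5), sharey=True)

for ax, c in zip(axes, C_values):
    # Fit a model with the current regularization strength
    model = LogisticRegression(fit_intercept=False, C=c, solver='liblinear')
    model.fit(X_train, y_train)

    # Continuous scores for the ROC curves
    train_score = model.decision_function(X_train)
    test_score = model.decision_function(X_test)
    train_fpr, train_tpr, _ = roc_curve(y_train, train_score)
    test_fpr, test_tpr, _ = roc_curve(y_test, test_score)

    # Plot train and test ROC curves with their AUC values in the legend
    ax.plot(train_fpr, train_tpr, color='blue', lw=2,
            label='Train AUC: {:.3f}'.format(auc(train_fpr, train_tpr)))
    ax.plot(test_fpr, test_tpr, color='darkorange', lw=2,
            label='Test AUC: {:.3f}'.format(auc(test_fpr, test_tpr)))
    ax.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    ax.set_title('C = {}'.format(c))
    ax.set_xlabel('False Positive Rate')
    ax.legend(loc='lower right')

axes[0].set_ylabel('True Positive Rate')
plt.show()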
How did the regularization parameter impact the ROC curves plotted above?
In this lab, you reviewed several evaluation measures for classification algorithms and observed the impact of additional model tuning using intercepts and regularization.