
Principal Component Analysis in scikit-learn - Lab

Introduction

Now that you've seen a brief introduction to PCA, it's time to try implementing the algorithm on your own.

Objectives

You will be able to:

  • Perform PCA in Python and scikit-learn using the Iris dataset
  • Measure the impact of PCA on the accuracy of classification algorithms
  • Plot the decision boundary of different classification experiments to visually inspect their performance

Iris Dataset

To practice PCA, you'll take a look at the iris dataset. Run the cell below to load it.

from sklearn import datasets
import pandas as pd
 
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['Target'] = iris.get('target')
df.head()
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) Target
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0

In a minute, you'll perform PCA and visualize the dataset's principal components. Before that, it's helpful to get a little more context regarding the data that you'll be working with. Run the cell below to visualize the pairwise feature plots, and notice how well the individual features and feature pairs separate the target classes.

import matplotlib.pyplot as plt
%matplotlib inline

pd.plotting.scatter_matrix(df, figsize=(10,10));

(pairwise scatter matrix of the four iris features)

# Create features and Target dataset


# Your code here 
# Standardize the features


# Your code here 
sepal length sepal width petal length petal width
0 -0.900681 1.032057 -1.341272 -1.312977
1 -1.143017 -0.124958 -1.341272 -1.312977
2 -1.385353 0.337848 -1.398138 -1.312977
3 -1.506521 0.106445 -1.284407 -1.312977
4 -1.021849 1.263460 -1.341272 -1.312977
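
A minimal sketch of one way to complete the two cells above and produce standardized features like those shown (the variable names X, y, X_std, and scaler are illustrative, not prescribed by the lab):

from sklearn.preprocessing import StandardScaler

# Separate the features from the target column added earlier
X = df.drop('Target', axis=1)
y = df['Target']

# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
X_std = pd.DataFrame(scaler.fit_transform(X),
                     columns=['sepal length', 'sepal width', 'petal length', 'petal width'])
X_std.head()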

PCA Projection to 2D Space

Now it's time to perform PCA! Project the original data, which is four-dimensional, into two dimensions. The new components are just the two main directions of variance present in the data.

  • Initialize an instance of PCA from scikit-learn with 2 components
  • Fit the data to the model
  • Extract the first 2 principal components from the trained model
# Run the PCA algorithm


# Your code here 
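
A minimal sketch of these three steps, assuming the standardized features are stored in a variable such as X_std from the sketch above:

from sklearn.decomposition import PCA

# Initialize PCA with 2 components and project the standardized features onto them
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_std)
principal_components[:5]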

To visualize the components, it will be useful to also look at the target associated with the particular observation. As such, append the target (flower name) to the principal components in a pandas dataframe.

# Create a new dataset from principal components 


# Your code here 
PC1 PC2 target
0 -2.264542 0.505704 Iris-setosa
1 -2.086426 -0.655405 Iris-setosa
2 -2.367950 -0.318477 Iris-setosa
3 -2.304197 -0.575368 Iris-setosa
4 -2.388777 0.674767 Iris-setosa
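
One way to assemble such a DataFrame (a sketch assuming pca and principal_components from the previous sketch; the mapping from numeric targets to species names is illustrative):

# Wrap the principal components in a DataFrame and attach the species names
pca_df = pd.DataFrame(principal_components, columns=['PC1', 'PC2'])
pca_df['target'] = pd.Series(iris.target).map({0: 'Iris-setosa',
                                               1: 'Iris-versicolor',
                                               2: 'Iris-virginica'})
pca_df.head()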

Great, you now have the data reduced from four dimensions to two, stored alongside the target variable, the flower name.

Visualize Principal Components

Using the target data, we can visualize the principal components according to the class distribution.

  • Create a scatter plot from principal components while color coding the examples
# Principal Components scatter plot


# Your code here 

(scatter plot of the first two principal components, color coded by class)
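
A minimal sketch of how such a plot could be produced, assuming the pca_df DataFrame from the sketch above:

# Scatter plot of the two principal components, one color per class
fig, ax = plt.subplots(figsize=(8, 6))
for species, color in zip(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'],
                          ['r', 'g', 'b']):
    subset = pca_df[pca_df['target'] == species]
    ax.scatter(subset['PC1'], subset['PC2'], c=color, label=species, s=40)
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.legend();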

Explained Variance

You can see above that the three classes in the dataset are fairly well separable. As such, this compressed representation of the data is probably sufficient for the classification task at hand. Compare the variance in the overall dataset to that captured by your two principal components.

# Calculate the variance explained by principal components


# Your code here 
Variance of each component: [0.72770452 0.23030523]

 Total Variance Explained: 95.8
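
A sketch of how these numbers could be obtained from the fitted PCA object (assuming it is named pca as in the sketch above):

# Each entry of explained_variance_ratio_ is the fraction of total variance captured by a component
print('Variance of each component:', pca.explained_variance_ratio_)
print('\n Total Variance Explained:', round(sum(pca.explained_variance_ratio_) * 100, 2))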

As you should see, these first two principal components account for the vast majority of the overall variance in the dataset. This indicates how much of the information from the original encoding is retained in the compressed representation.

Compare the Performance of a Classifier with PCA

Since the principal components explain 95% of the variance in the data, it is interesting to consider how a classifier trained on the compressed version would compare to one trained on the original dataset.

  • Run a KNeighborsClassifier to classify the Iris dataset
  • Use a train/test split of 80/20
  • For reproducibility of results, set random_state=9 for the split
  • Time the process of splitting, training, and making predictions
# Classification on the complete Iris dataset

# Your code here 
Accuracy: 1.0
Time Taken: 0.0017656260024523363
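
One way to set this up (a sketch; the timing approach with timeit.default_timer and the variable names are illustrative, and X_std and y are assumed from the earlier sketches):

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import timeit

start = timeit.default_timer()

# 80/20 train/test split with random_state=9 for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X_std, y, test_size=0.2, random_state=9)

# Train a KNN classifier on the full four-feature data and evaluate it
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

stop = timeit.default_timer()
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Time Taken:', stop - start)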

Great, so you can see that the data can be classified with 100% accuracy in the time shown. Remember that the time taken will vary based on the load on your CPU and the number of processes running on your machine.

Now repeat the above process for the dataset made from the principal components:

  • Run a KNeighborsClassifier to classify the Iris dataset with principal components
  • Use a train/test split of 80/20
  • For reproducibility of results, set random_state=9 for the split
  • Time the process of splitting, training, and making predictions
# Run the classifier on the PCA-transformed data


# Your code here 
Accuracy: 0.9666666666666667
Time Taken: 0.00035927799763157964
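
The same sketch applied to the two principal components (again assuming the illustrative names from the sketches above):

start = timeit.default_timer()

# Same split and classifier, but using only the two principal components as features
X_train, X_test, y_train, y_test = train_test_split(pca_df[['PC1', 'PC2']], y,
                                                    test_size=0.2, random_state=9)

knn_pca = KNeighborsClassifier()
knn_pca.fit(X_train, y_train)
y_pred = knn_pca.predict(X_test)

stop = timeit.default_timer()
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Time Taken:', stop - start)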

While some accuracy is lost in this representation, the training time has improved considerably. In more complex cases, PCA can even improve the accuracy of some machine learning tasks. In particular, PCA can be useful for reducing overfitting.

Finally, plot the decision boundary of the classifier trained on the principal components.

# Plot decision boundary using principal components
# Note: X, y, and model are assumed to hold the two principal components, the numeric
# targets, and the classifier trained on them in the previous step
import numpy as np

def decision_boundary(pred_func):

    # Set the plotting region slightly beyond the range of the data
    x_min, x_max = X.iloc[:, 0].min() - 0.5, X.iloc[:, 0].max() + 0.5
    y_min, y_max = X.iloc[:, 1].min() - 0.5, X.iloc[:, 1].max() + 0.5
    h = 0.01

    # Build a meshgrid over the region and predict the class at every grid point
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = pred_func(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Plot the predicted regions as filled contours with the actual points on top
    plt.figure(figsize=(15, 10))
    plt.contourf(xx, yy, Z, cmap=plt.cm.afmhot)
    plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=y, cmap=plt.cm.Spectral, marker='x')

decision_boundary(lambda x: model.predict(x))

plt.title("decision boundary")

(decision boundary of the classifier plotted over the first two principal components)

Summary

In this lab, you applied PCA to the popular Iris dataset. You looked at the performance of a simple classifier and the impact of PCA on it. From here, you'll continue to explore PCA at a more fundamental level.
