hoffm386 / get-dummies-vs-ohe-for-ml

A short example demonstrating why scikit-learn's OneHotEncoder is a better solution than pd.get_dummies for machine learning.


pd.get_dummies vs. OneHotEncoder for Machine Learning

This notebook demonstrates why OneHotEncoder is better than pd.get_dummies for encoding categorical variables in a machine learning context.

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

For this example, let's use a made-up dataset of total purchase amounts from customers of various ages in different states.

np.random.seed(2020)
amounts = np.random.choice(1000, 10)
ages = np.random.choice(100, 10)
states = np.random.choice(["Washington", "California", "Illinois"], 10)
df = pd.DataFrame([amounts, ages, states]).T
df.columns = ["Amount", "Age", "State"]
df
Amount Age State
0 864 29 Illinois
1 392 48 California
2 323 32 Washington
3 630 24 Washington
4 707 74 Washington
5 91 9 Washington
6 637 51 Illinois
7 643 11 Washington
8 583 55 California
9 952 62 California

OK, let's say this is our training dataset. We want a linear regression model that predicts the purchase amount based on the age and state of the customer.

Preprocessing with pd.get_dummies

To use this data in a linear regression model, we need to convert the categorical State column into numeric dummy variables. First, let's try doing that with pd.get_dummies.

dummies_df = pd.get_dummies(df, columns=["State"])
dummies_df
Amount Age State_California State_Illinois State_Washington
0 864 29 0 1 0
1 392 48 1 0 0
2 323 32 0 0 1
3 630 24 0 0 1
4 707 74 0 0 1
5 91 9 0 0 1
6 637 51 0 1 0
7 643 11 0 0 1
8 583 55 1 0 0
9 952 62 1 0 0

Fitting a Model to Training Data

That was very easy. Now let's fit a linear regression model:

dummies_model = LinearRegression()
dummies_model.fit(dummies_df.drop("Amount", axis=1), dummies_df["Amount"])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
dummies_coef = dummies_model.coef_
dummies_coef
array([  4.89704939, -46.83843632, 134.78397121, -87.94553489])
dummies_intercept = dummies_model.intercept_
dummies_intercept
419.83405316798473
dummies_model.score(dummies_df.drop("Amount", axis=1), dummies_df["Amount"])
0.3343722589232698

Testing on Unseen Data

So we have an r-squared of about 0.33 on the training data. Let's make up a few more records to test on unseen data:

np.random.seed(1)
test_amounts = np.random.choice(1000, 5)
test_ages = np.random.choice(100, 5)
test_states = np.random.choice(["Washington", "California", "Illinois"], 5)
test_df = pd.DataFrame([test_amounts, test_ages, test_states]).T
test_df.columns = ["Amount", "Age", "State"]
test_df
Amount Age State
0 37 9 Washington
1 235 75 California
2 908 5 Washington
3 72 79 California
4 767 64 Washington

The only states we have here are Washington and California. Let's dummy those out:

test_dummies_df = pd.get_dummies(test_df, columns=["State"])
test_dummies_df
Amount Age State_California State_Washington
0 37 9 0 1
1 235 75 1 0
2 908 5 0 1
3 72 79 1 0
4 767 64 0 1

Now let's try to score our model on these:

dummies_model.score(test_dummies_df.drop("Amount", axis=1), test_dummies_df["Amount"])
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-16-1561818d2ab8> in <module>
----> 1 dummies_model.score(test_dummies_df.drop("Amount", axis=1), test_dummies_df["Amount"])


~/.conda/envs/prework-labs/lib/python3.7/site-packages/sklearn/base.py in score(self, X, y, sample_weight)
    420         from .metrics import r2_score
    421         from .metrics._regression import _check_reg_targets
--> 422         y_pred = self.predict(X)
    423         # XXX: Remove the check in 0.23
    424         y_type, _, _, _ = _check_reg_targets(y, y_pred, None)


~/.conda/envs/prework-labs/lib/python3.7/site-packages/sklearn/linear_model/_base.py in predict(self, X)
    223             Returns predicted values.
    224         """
--> 225         return self._decision_function(X)
    226 
    227     _preprocess_data = staticmethod(_preprocess_data)


~/.conda/envs/prework-labs/lib/python3.7/site-packages/sklearn/linear_model/_base.py in _decision_function(self, X)
    207         X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
    208         return safe_sparse_dot(X, self.coef_.T,
--> 209                                dense_output=True) + self.intercept_
    210 
    211     def predict(self, X):


~/.conda/envs/prework-labs/lib/python3.7/site-packages/sklearn/utils/extmath.py in safe_sparse_dot(a, b, dense_output)
    149             ret = np.dot(a, b)
    150     else:
--> 151         ret = a @ b
    152 
    153     if (sparse.issparse(a) and sparse.issparse(b)


ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 4 is different from 3)

Error!

We get an error because the model was trained on a dataset with 4 features (Age plus three state dummies), but the test data only produced 3 features: pd.get_dummies has no memory of the columns it created on the training data.
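For what it's worth, the get_dummies approach can be patched up by hand by reindexing the test columns against the training columns. Here is a minimal sketch of that workaround (assuming the dummies_df and test_dummies_df from above); note that you would have to remember to repeat it for every new batch of data:

# Align the test columns to the training columns; the missing
# State_Illinois dummy gets filled with 0
aligned_test_df = test_dummies_df.reindex(columns=dummies_df.columns, fill_value=0)
dummies_model.score(aligned_test_df.drop("Amount", axis=1), aligned_test_df["Amount"])

This manual bookkeeping is exactly what OneHotEncoder automates.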

Preprocessing with OneHotEncoder

This process is a bit more involved, but it won't break on the new data.

# sparse=False returns a dense array: more readable, but less memory-efficient
# (in scikit-learn >= 1.2 this parameter is spelled sparse_output);
# handle_unknown="ignore" encodes unseen categories as all zeros instead of raising
ohe = OneHotEncoder(categories="auto", handle_unknown="ignore", sparse=False)
ohe_states_array = ohe.fit_transform(df[["State"]])
ohe_states_array
array([[0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [1., 0., 0.]])
ohe_states_df = pd.DataFrame(ohe_states_array, index=df.index, columns=ohe.categories_[0])
ohe_states_df
California Illinois Washington
0 0.0 1.0 0.0
1 1.0 0.0 0.0
2 0.0 0.0 1.0
3 0.0 0.0 1.0
4 0.0 0.0 1.0
5 0.0 0.0 1.0
6 0.0 1.0 0.0
7 0.0 0.0 1.0
8 1.0 0.0 0.0
9 1.0 0.0 0.0
ohe_df = pd.concat([df.drop("State", axis=1), ohe_states_df], axis=1)
ohe_df
Amount Age California Illinois Washington
0 864 29 0.0 1.0 0.0
1 392 48 1.0 0.0 0.0
2 323 32 0.0 0.0 1.0
3 630 24 0.0 0.0 1.0
4 707 74 0.0 0.0 1.0
5 91 9 0.0 0.0 1.0
6 637 51 0.0 1.0 0.0
7 643 11 0.0 0.0 1.0
8 583 55 1.0 0.0 0.0
9 952 62 1.0 0.0 0.0

Fitting a Model to Training Data

This will look the same as the pd.get_dummies version; the fitted coefficients and intercept come out identical.

ohe_model = LinearRegression()
ohe_model.fit(ohe_df.drop("Amount", axis=1), ohe_df["Amount"])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
print("Dummies Model:", dummies_coef)
print("OHE Model:", ohe_model.coef_)
Dummies Model: [  4.89704939 -46.83843632 134.78397121 -87.94553489]
OHE Model: [  4.89704939 -46.83843632 134.78397121 -87.94553489]
print("Dummies Model:", dummies_intercept)
print("OHE Model:", ohe_model.intercept_)
Dummies Model: 419.83405316798473
OHE Model: 419.83405316798473

Testing on Unseen Data

This is where the encoder makes a difference!

# Reminder that this is our test data
test_df
Amount Age State
0 37 9 Washington
1 235 75 California
2 908 5 Washington
3 72 79 California
4 767 64 Washington
test_ohe_states_array = ohe.transform(test_df[["State"]])
test_ohe_states_df = pd.DataFrame(test_ohe_states_array, index=test_df.index, columns=ohe.categories_[0])
test_ohe_states_df
California Illinois Washington
0 0.0 0.0 1.0
1 1.0 0.0 0.0
2 0.0 0.0 1.0
3 1.0 0.0 0.0
4 0.0 0.0 1.0

Notice that we now have the same columns as the training data, even though there were no "Illinois" values in the testing data.
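This robustness goes even further: because we constructed the encoder with handle_unknown="ignore", a state that never appeared in the training data would not break the transform either. A quick sketch with a hypothetical "Oregon" record:

# "Oregon" was never seen during fit, so with handle_unknown="ignore"
# it simply encodes as all zeros instead of raising an error
ohe.transform(pd.DataFrame({"State": ["Oregon"]}))
array([[0., 0., 0.]])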

test_ohe_df = pd.concat([test_df.drop("State", axis=1), test_ohe_states_df], axis=1)
test_ohe_df
Amount Age California Illinois Washington
0 37 9 0.0 0.0 1.0
1 235 75 1.0 0.0 0.0
2 908 5 0.0 0.0 1.0
3 72 79 1.0 0.0 0.0
4 767 64 0.0 0.0 1.0
ohe_model.score(test_ohe_df.drop("Amount", axis=1), test_ohe_df["Amount"])
-0.7632751620783784

No Error!

That is a very bad r-squared score, but that is to be expected for truly random data like this. The point is that we were able to make predictions on the new data, even though its categories were not exactly the same as the categories in the training data!
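As a final note, the idiomatic way to package this in scikit-learn is to bundle the encoder and the model into a single pipeline, so the encoding fitted on the training data is reused automatically at prediction time. A minimal sketch, assuming the same df and test_df as above:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# One-hot encode State; pass the numeric Age column through untouched
preprocessor = ColumnTransformer(
    [("state_ohe", OneHotEncoder(categories="auto", handle_unknown="ignore", sparse=False), ["State"])],
    remainder="passthrough",
)
pipe = Pipeline([("preprocess", preprocessor), ("model", LinearRegression())])

# fit learns both the categories and the regression coefficients;
# score reuses the fitted encoding on the unseen test data
pipe.fit(df.drop("Amount", axis=1), df["Amount"])
pipe.score(test_df.drop("Amount", axis=1), test_df["Amount"])

This is equivalent to the manual steps above, with the added benefit that the encoder can never accidentally be refit on the test data.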


License: MIT

