This notebook demonstrates why OneHotEncoder is better than pd.get_dummies for creating dummy variables from categorical features in a machine learning context.
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
Let's use a made-up dataset for this example: say we have total purchase amounts from customers of various ages in different states.
np.random.seed(2020)
amounts = np.random.choice(1000, 10)
ages = np.random.choice(100, 10)
states = np.random.choice(["Washington", "California", "Illinois"], 10)
df = pd.DataFrame([amounts, ages, states]).T
df.columns = ["Amount", "Age", "State"]
df
|   | Amount | Age | State |
|---|--------|-----|-------|
| 0 | 864 | 29 | Illinois |
| 1 | 392 | 48 | California |
| 2 | 323 | 32 | Washington |
| 3 | 630 | 24 | Washington |
| 4 | 707 | 74 | Washington |
| 5 | 91 | 9 | Washington |
| 6 | 637 | 51 | Illinois |
| 7 | 643 | 11 | Washington |
| 8 | 583 | 55 | California |
| 9 | 952 | 62 | California |
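(An aside: building the frame by transposing a list of arrays gives every column object dtype. That works here because scikit-learn coerces features to floats, but if you want Amount and Age to stay numeric, a dict-based construction does it; df_alt is just an illustrative name and isn't used below.)

# Equivalent construction that keeps numeric dtypes (illustrative only)
df_alt = pd.DataFrame({"Amount": amounts, "Age": ages, "State": states})
df_alt.dtypes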
OK, let's say this is our training dataset. We want a linear regression model that predicts the amount based on the age and state of the customer.
To use this data in a linear regression model, we need to convert the categorical column into numeric dummy variables. First, let's try doing that with pd.get_dummies.
dummies_df = pd.get_dummies(df, columns=["State"])
dummies_df
|   | Amount | Age | State_California | State_Illinois | State_Washington |
|---|--------|-----|------------------|----------------|------------------|
| 0 | 864 | 29 | 0 | 1 | 0 |
| 1 | 392 | 48 | 1 | 0 | 0 |
| 2 | 323 | 32 | 0 | 0 | 1 |
| 3 | 630 | 24 | 0 | 0 | 1 |
| 4 | 707 | 74 | 0 | 0 | 1 |
| 5 | 91 | 9 | 0 | 0 | 1 |
| 6 | 637 | 51 | 0 | 1 | 0 |
| 7 | 643 | 11 | 0 | 0 | 1 |
| 8 | 583 | 55 | 1 | 0 | 0 |
| 9 | 952 | 62 | 1 | 0 | 0 |
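(A brief aside: with an intercept plus all three state columns, the features are collinear, since the three dummies always sum to 1. scikit-learn's least-squares solver tolerates this, but if you'd rather drop one level per category, both tools support it; this variant is not used in the rest of the notebook.)

# Variant: drop the first level per category to avoid redundant columns
pd.get_dummies(df, columns=["State"], drop_first=True)
# OneHotEncoder(drop="first") is the equivalent option in scikit-learn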
That was very easy. Now let's fit a linear regression model.
dummies_model = LinearRegression()
dummies_model.fit(dummies_df.drop("Amount", axis=1), dummies_df["Amount"])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
dummies_coef = dummies_model.coef_
dummies_coef
array([ 4.89704939, -46.83843632, 134.78397121, -87.94553489])
dummies_intercept = dummies_model.intercept_
dummies_intercept
419.83405316798473
dummies_model.score(dummies_df.drop("Amount", axis=1), dummies_df["Amount"])
0.3343722589232698
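(For reference, score on a regressor is just the R-squared of its predictions; we can compute the same number explicitly:)

# .score() is equivalent to r2_score on the model's predictions
from sklearn.metrics import r2_score
r2_score(dummies_df["Amount"], dummies_model.predict(dummies_df.drop("Amount", axis=1)))
0.3343722589232698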
So, we have an R-squared of 0.33 on our training data. Let's make up a few more records for testing on unseen data.
np.random.seed(1)
test_amounts = np.random.choice(1000, 5)
test_ages = np.random.choice(100, 5)
test_states = np.random.choice(["Washington", "California", "Illinois"], 5)
test_df = pd.DataFrame([test_amounts, test_ages, test_states]).T
test_df.columns = ["Amount", "Age", "State"]
test_df
|   | Amount | Age | State |
|---|--------|-----|-------|
| 0 | 37 | 9 | Washington |
| 1 | 235 | 75 | California |
| 2 | 908 | 5 | Washington |
| 3 | 72 | 79 | California |
| 4 | 767 | 64 | Washington |
The only states we have here are Washington and California. Let's dummy those out:
test_dummies_df = pd.get_dummies(test_df, columns=["State"])
test_dummies_df
|   | Amount | Age | State_California | State_Washington |
|---|--------|-----|------------------|------------------|
| 0 | 37 | 9 | 0 | 1 |
| 1 | 235 | 75 | 1 | 0 |
| 2 | 908 | 5 | 0 | 1 |
| 3 | 72 | 79 | 1 | 0 |
| 4 | 767 | 64 | 0 | 1 |
Now let's try to score our model on these:
dummies_model.score(test_dummies_df.drop("Amount", axis=1), test_dummies_df["Amount"])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-16-1561818d2ab8> in <module>
----> 1 dummies_model.score(test_dummies_df.drop("Amount", axis=1), test_dummies_df["Amount"])
~/.conda/envs/prework-labs/lib/python3.7/site-packages/sklearn/base.py in score(self, X, y, sample_weight)
420 from .metrics import r2_score
421 from .metrics._regression import _check_reg_targets
--> 422 y_pred = self.predict(X)
423 # XXX: Remove the check in 0.23
424 y_type, _, _, _ = _check_reg_targets(y, y_pred, None)
~/.conda/envs/prework-labs/lib/python3.7/site-packages/sklearn/linear_model/_base.py in predict(self, X)
223 Returns predicted values.
224 """
--> 225 return self._decision_function(X)
226
227 _preprocess_data = staticmethod(_preprocess_data)
~/.conda/envs/prework-labs/lib/python3.7/site-packages/sklearn/linear_model/_base.py in _decision_function(self, X)
207 X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
208 return safe_sparse_dot(X, self.coef_.T,
--> 209 dense_output=True) + self.intercept_
210
211 def predict(self, X):
~/.conda/envs/prework-labs/lib/python3.7/site-packages/sklearn/utils/extmath.py in safe_sparse_dot(a, b, dense_output)
149 ret = np.dot(a, b)
150 else:
--> 151 ret = a @ b
152
153 if (sparse.issparse(a) and sparse.issparse(b)
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 4 is different from 3)
We get an error: the model was trained on 4 features (Age plus three state columns), but the test data only has 3 (Age plus two state columns), because pd.get_dummies only creates columns for the categories it actually sees.
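(You could patch this by hand, reindexing every new batch of data to the training columns; a sketch of that workaround, using pandas' reindex:)

# Workaround sketch: force the test frame to use the training columns,
# filling any dummy column that is missing with 0
test_aligned = test_dummies_df.reindex(columns=dummies_df.columns, fill_value=0)

But that is bookkeeping you have to remember everywhere the model is used, which is exactly what OneHotEncoder automates.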
Now let's use OneHotEncoder instead. The process is a bit more involved, but it won't break on new data.
# sparse=False returns a dense array: more readable, less memory-efficient
# (newer scikit-learn versions call this parameter sparse_output)
# handle_unknown="ignore" encodes unseen categories as all zeros instead of raising
ohe = OneHotEncoder(categories="auto", handle_unknown="ignore", sparse=False)
ohe_states_array = ohe.fit_transform(df[["State"]])
ohe_states_array
array([[0., 1., 0.],
[1., 0., 0.],
[0., 0., 1.],
[0., 0., 1.],
[0., 0., 1.],
[0., 0., 1.],
[0., 1., 0.],
[0., 0., 1.],
[1., 0., 0.],
[1., 0., 0.]])
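The fitted encoder remembers which categories it saw, in sorted order; we can use them as column names below:

ohe.categories_
[array(['California', 'Illinois', 'Washington'], dtype=object)]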
ohe_states_df = pd.DataFrame(ohe_states_array, index=df.index, columns=ohe.categories_[0])
ohe_states_df
|   | California | Illinois | Washington |
|---|------------|----------|------------|
| 0 | 0.0 | 1.0 | 0.0 |
| 1 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 |
| 3 | 0.0 | 0.0 | 1.0 |
| 4 | 0.0 | 0.0 | 1.0 |
| 5 | 0.0 | 0.0 | 1.0 |
| 6 | 0.0 | 1.0 | 0.0 |
| 7 | 0.0 | 0.0 | 1.0 |
| 8 | 1.0 | 0.0 | 0.0 |
| 9 | 1.0 | 0.0 | 0.0 |
ohe_df = pd.concat([df.drop("State", axis=1), ohe_states_df], axis=1)
ohe_df
|   | Amount | Age | California | Illinois | Washington |
|---|--------|-----|------------|----------|------------|
| 0 | 864 | 29 | 0.0 | 1.0 | 0.0 |
| 1 | 392 | 48 | 1.0 | 0.0 | 0.0 |
| 2 | 323 | 32 | 0.0 | 0.0 | 1.0 |
| 3 | 630 | 24 | 0.0 | 0.0 | 1.0 |
| 4 | 707 | 74 | 0.0 | 0.0 | 1.0 |
| 5 | 91 | 9 | 0.0 | 0.0 | 1.0 |
| 6 | 637 | 51 | 0.0 | 1.0 | 0.0 |
| 7 | 643 | 11 | 0.0 | 0.0 | 1.0 |
| 8 | 583 | 55 | 1.0 | 0.0 | 0.0 |
| 9 | 952 | 62 | 1.0 | 0.0 | 0.0 |
This looks the same as the pd.get_dummies version.
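(As a quick sanity check, the encoded state columns match the get_dummies output exactly; only the column names and dtypes differ:)

# Compare the two encodings value by value
dummies_states = dummies_df[["State_California", "State_Illinois", "State_Washington"]]
(ohe_states_df.values == dummies_states.values).all()
True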
ohe_model = LinearRegression()
ohe_model.fit(ohe_df.drop("Amount", axis=1), ohe_df["Amount"])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
print("Dummies Model:", dummies_coef)
print("OHE Model:", ohe_model.coef_)
Dummies Model: [ 4.89704939 -46.83843632 134.78397121 -87.94553489]
OHE Model: [ 4.89704939 -46.83843632 134.78397121 -87.94553489]
print("Dummies Model:", dummies_intercept)
print("OHE Model:", ohe_model.intercept_)
Dummies Model: 419.83405316798473
OHE Model: 419.83405316798473
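(We can also confirm the fits are numerically identical in one line:)

np.allclose(dummies_coef, ohe_model.coef_)
True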
This is where the encoder makes a difference!
# Reminder that this is our test data
test_df
|   | Amount | Age | State |
|---|--------|-----|-------|
| 0 | 37 | 9 | Washington |
| 1 | 235 | 75 | California |
| 2 | 908 | 5 | Washington |
| 3 | 72 | 79 | California |
| 4 | 767 | 64 | Washington |
test_ohe_states_array = ohe.transform(test_df[["State"]])
test_ohe_states_df = pd.DataFrame(test_ohe_states_array, index=test_df.index, columns=ohe.categories_[0])
test_ohe_states_df
|   | California | Illinois | Washington |
|---|------------|----------|------------|
| 0 | 0.0 | 0.0 | 1.0 |
| 1 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 |
| 3 | 1.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 1.0 |
Notice that we now have the same columns as the training data, even though there were no "Illinois" values in the testing data.
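handle_unknown="ignore" goes even further: a state the encoder never saw during training encodes as all zeros instead of raising an error. ("Texas" here is just an illustrative unseen value.)

# An entirely unseen category becomes a row of zeros rather than an error
ohe.transform(pd.DataFrame({"State": ["Texas"]}))
array([[0., 0., 0.]])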
test_ohe_df = pd.concat([test_df.drop("State", axis=1), test_ohe_states_df], axis=1)
test_ohe_df
|   | Amount | Age | California | Illinois | Washington |
|---|--------|-----|------------|----------|------------|
| 0 | 37 | 9 | 0.0 | 0.0 | 1.0 |
| 1 | 235 | 75 | 1.0 | 0.0 | 0.0 |
| 2 | 908 | 5 | 0.0 | 0.0 | 1.0 |
| 3 | 72 | 79 | 1.0 | 0.0 | 0.0 |
| 4 | 767 | 64 | 0.0 | 0.0 | 1.0 |
ohe_model.score(test_ohe_df.drop("Amount", axis=1), test_ohe_df["Amount"])
-0.7632751620783784
That is a very bad (negative) R-squared, meaning the model does worse than just predicting the mean, but that is to be expected for truly random data like this. The point is that we were able to make predictions on the new data, even though the categories present were not the exact same as the categories in the training data!
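In practice, you would often go one step further and wrap the encoder and the model together with a ColumnTransformer and Pipeline, so the exact same preprocessing is applied at fit time and at predict time. A minimal sketch of that pattern, assuming the same df and test_df as above:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Encode State, pass Age through untouched, then fit the regression on top
preprocessor = ColumnTransformer(
    [("state", OneHotEncoder(handle_unknown="ignore", sparse=False), ["State"])],
    remainder="passthrough",
)
pipeline = Pipeline([("preprocess", preprocessor), ("model", LinearRegression())])
pipeline.fit(df.drop("Amount", axis=1), df["Amount"])
pipeline.score(test_df.drop("Amount", axis=1), test_df["Amount"])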