This notebook demonstrates why OneHotEncoder is better than pd.get_dummies for creating dummy variables from categorical features in a machine learning context.
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
Let's use a made-up dataset for this example: say we have total purchase amounts from customers of various ages in different states.
np.random.seed(2020)
amounts = np.random.choice(1000, 10)
ages = np.random.choice(100, 10)
states = np.random.choice(["Washington", "California", "Illinois"], 10)
df = pd.DataFrame([amounts, ages, states]).T
df.columns = ["Amount", "Age", "State"]
df
|   | Amount | Age | State |
|---|--------|-----|-------|
| 0 | 864 | 29 | Illinois |
| 1 | 392 | 48 | California |
| 2 | 323 | 32 | Washington |
| 3 | 630 | 24 | Washington |
| 4 | 707 | 74 | Washington |
| 5 | 91 | 9 | Washington |
| 6 | 637 | 51 | Illinois |
| 7 | 643 | 11 | Washington |
| 8 | 583 | 55 | California |
| 9 | 952 | 62 | California |
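(An aside: building the frame by transposing a list of arrays gives every column object dtype. That works here because scikit-learn coerces features to floats, but if you want Amount and Age to stay numeric, a dict-based construction does it; df_alt is just an illustrative name and isn't used below.)

# Equivalent construction that keeps numeric dtypes (illustrative only)
df_alt = pd.DataFrame({"Amount": amounts, "Age": ages, "State": states})
df_alt.dtypes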
OK, let's say this is our training dataset. We want a linear regression model that predicts the amount based on the age and state of the customer.
To use this data in a linear regression model, we need to convert the categorical column into numeric dummy variables. First, let's try doing that with pd.get_dummies.
dummies_df = pd.get_dummies(df, columns=["State"])
dummies_df
|   | Amount | Age | State_California | State_Illinois | State_Washington |
|---|--------|-----|------------------|----------------|------------------|
| 0 | 864 | 29 | 0 | 1 | 0 |
| 1 | 392 | 48 | 1 | 0 | 0 |
| 2 | 323 | 32 | 0 | 0 | 1 |
| 3 | 630 | 24 | 0 | 0 | 1 |
| 4 | 707 | 74 | 0 | 0 | 1 |
| 5 | 91 | 9 | 0 | 0 | 1 |
| 6 | 637 | 51 | 0 | 1 | 0 |
| 7 | 643 | 11 | 0 | 0 | 1 |
| 8 | 583 | 55 | 1 | 0 | 0 |
| 9 | 952 | 62 | 1 | 0 | 0 |
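(A brief aside: with an intercept plus all three state columns, the features are collinear, since the three dummies always sum to 1. scikit-learn's least-squares solver tolerates this, but if you'd rather drop one level per category, both tools support it; this variant is not used in the rest of the notebook.)

# Variant: drop the first level per category to avoid redundant columns
pd.get_dummies(df, columns=["State"], drop_first=True)
# OneHotEncoder(drop="first") is the equivalent option in scikit-learn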
That was very easy. Now let's fit a linear regression model.
dummies_model = LinearRegression()
dummies_model.fit(dummies_df.drop("Amount", axis=1), dummies_df["Amount"])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
dummies_coef = dummies_model.coef_
dummies_coef
array([ 4.89704939, -46.83843632, 134.78397121, -87.94553489])
dummies_intercept = dummies_model.intercept_
dummies_intercept
419.83405316798473
dummies_model.score(dummies_df.drop("Amount", axis=1), dummies_df["Amount"])
0.3343722589232698
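(For reference, score on a regressor is just the R-squared of its predictions; we can compute the same number explicitly:)

# .score() is equivalent to r2_score on the model's predictions
from sklearn.metrics import r2_score
r2_score(dummies_df["Amount"], dummies_model.predict(dummies_df.drop("Amount", axis=1)))
0.3343722589232698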
So, we have an R-squared of 0.33 on our training data. Let's make up a few more records for testing on unseen data.
np.random.seed(1)
test_amounts = np.random.choice(1000, 5)
test_ages = np.random.choice(100, 5)
test_states = np.random.choice(["Washington", "California", "Illinois"], 5)
test_df = pd.DataFrame([test_amounts, test_ages, test_states]).T
test_df.columns = ["Amount", "Age", "State"]
test_df
|   | Amount | Age | State |
|---|--------|-----|-------|
| 0 | 37 | 9 | Washington |
| 1 | 235 | 75 | California |
| 2 | 908 | 5 | Washington |
| 3 | 72 | 79 | California |
| 4 | 767 | 64 | Washington |
The only states we have here are Washington and California. Let's dummy those out:
test_dummies_df = pd.get_dummies(test_df, columns=["State"])
test_dummies_df
|   | Amount | Age | State_California | State_Washington |
|---|--------|-----|------------------|------------------|
| 0 | 37 | 9 | 0 | 1 |
| 1 | 235 | 75 | 1 | 0 |
| 2 | 908 | 5 | 0 | 1 |
| 3 | 72 | 79 | 1 | 0 |
| 4 | 767 | 64 | 0 | 1 |
Now let's try to score our model on these:
dummies_model.score(test_dummies_df.drop("Amount", axis=1), test_dummies_df["Amount"])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-16-1561818d2ab8> in <module>
----> 1 dummies_model.score(test_dummies_df.drop("Amount", axis=1), test_dummies_df["Amount"])
~/.conda/envs/prework-labs/lib/python3.7/site-packages/sklearn/base.py in score(self, X, y, sample_weight)
420 from .metrics import r2_score
421 from .metrics._regression import _check_reg_targets
--> 422 y_pred = self.predict(X)
423 # XXX: Remove the check in 0.23
424 y_type, _, _, _ = _check_reg_targets(y, y_pred, None)
~/.conda/envs/prework-labs/lib/python3.7/site-packages/sklearn/linear_model/_base.py in predict(self, X)
223 Returns predicted values.
224 """
--> 225 return self._decision_function(X)
226
227 _preprocess_data = staticmethod(_preprocess_data)
~/.conda/envs/prework-labs/lib/python3.7/site-packages/sklearn/linear_model/_base.py in _decision_function(self, X)
207 X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
208 return safe_sparse_dot(X, self.coef_.T,
--> 209 dense_output=True) + self.intercept_
210
211 def predict(self, X):
~/.conda/envs/prework-labs/lib/python3.7/site-packages/sklearn/utils/extmath.py in safe_sparse_dot(a, b, dense_output)
149 ret = np.dot(a, b)
150 else:
--> 151 ret = a @ b
152
153 if (sparse.issparse(a) and sparse.issparse(b)
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 4 is different from 3)
We get an error: the model was trained on 4 features (Age plus three state columns), but the test data only has 3 (Age plus two state columns), because pd.get_dummies only creates columns for the categories it actually sees.
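(You could patch this by hand, reindexing every new batch of data to the training columns; a sketch of that workaround, using pandas' reindex:)

# Workaround sketch: force the test frame to use the training columns,
# filling any dummy column that is missing with 0
test_aligned = test_dummies_df.reindex(columns=dummies_df.columns, fill_value=0)

But that is bookkeeping you have to remember everywhere the model is used, which is exactly what OneHotEncoder automates.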
Now let's use OneHotEncoder instead. The process is a bit more involved, but it won't break on new data.
# sparse=False returns a dense array: more readable, less memory-efficient
# (newer scikit-learn versions call this parameter sparse_output)
# handle_unknown="ignore" encodes unseen categories as all zeros instead of raising
ohe = OneHotEncoder(categories="auto", handle_unknown="ignore", sparse=False)
ohe_states_array = ohe.fit_transform(df[["State"]])
ohe_states_array
array([[0., 1., 0.],
[1., 0., 0.],
[0., 0., 1.],
[0., 0., 1.],
[0., 0., 1.],
[0., 0., 1.],
[0., 1., 0.],
[0., 0., 1.],
[1., 0., 0.],
[1., 0., 0.]])
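The fitted encoder remembers which categories it saw, in sorted order; we can use them as column names below:

ohe.categories_
[array(['California', 'Illinois', 'Washington'], dtype=object)]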
ohe_states_df = pd.DataFrame(ohe_states_array, index=df.index, columns=ohe.categories_[0])
ohe_states_df
|   | California | Illinois | Washington |
|---|------------|----------|------------|
| 0 | 0.0 | 1.0 | 0.0 |
| 1 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 |
| 3 | 0.0 | 0.0 | 1.0 |
| 4 | 0.0 | 0.0 | 1.0 |
| 5 | 0.0 | 0.0 | 1.0 |
| 6 | 0.0 | 1.0 | 0.0 |
| 7 | 0.0 | 0.0 | 1.0 |
| 8 | 1.0 | 0.0 | 0.0 |
| 9 | 1.0 | 0.0 | 0.0 |
ohe_df = pd.concat([df.drop("State", axis=1), ohe_states_df], axis=1)
ohe_df
|   | Amount | Age | California | Illinois | Washington |
|---|--------|-----|------------|----------|------------|
| 0 | 864 | 29 | 0.0 | 1.0 | 0.0 |
| 1 | 392 | 48 | 1.0 | 0.0 | 0.0 |
| 2 | 323 | 32 | 0.0 | 0.0 | 1.0 |
| 3 | 630 | 24 | 0.0 | 0.0 | 1.0 |
| 4 | 707 | 74 | 0.0 | 0.0 | 1.0 |
| 5 | 91 | 9 | 0.0 | 0.0 | 1.0 |
| 6 | 637 | 51 | 0.0 | 1.0 | 0.0 |
| 7 | 643 | 11 | 0.0 | 0.0 | 1.0 |
| 8 | 583 | 55 | 1.0 | 0.0 | 0.0 |
| 9 | 952 | 62 | 1.0 | 0.0 | 0.0 |
This looks the same as the pd.get_dummies version.
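(As a quick sanity check, the encoded state columns match the get_dummies output exactly; only the column names and dtypes differ:)

# Compare the two encodings value by value
dummies_states = dummies_df[["State_California", "State_Illinois", "State_Washington"]]
(ohe_states_df.values == dummies_states.values).all()
True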
ohe_model = LinearRegression()
ohe_model.fit(ohe_df.drop("Amount", axis=1), ohe_df["Amount"])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
print("Dummies Model:", dummies_coef)
print("OHE Model:", ohe_model.coef_)
Dummies Model: [ 4.89704939 -46.83843632 134.78397121 -87.94553489]
OHE Model: [ 4.89704939 -46.83843632 134.78397121 -87.94553489]
print("Dummies Model:", dummies_intercept)
print("OHE Model:", ohe_model.intercept_)
Dummies Model: 419.83405316798473
OHE Model: 419.83405316798473
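(We can also confirm the fits are numerically identical in one line:)

np.allclose(dummies_coef, ohe_model.coef_)
True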
This is where the encoder makes a difference!
# Reminder that this is our test data
test_df
|   | Amount | Age | State |
|---|--------|-----|-------|
| 0 | 37 | 9 | Washington |
| 1 | 235 | 75 | California |
| 2 | 908 | 5 | Washington |
| 3 | 72 | 79 | California |
| 4 | 767 | 64 | Washington |
test_ohe_states_array = ohe.transform(test_df[["State"]])
test_ohe_states_df = pd.DataFrame(test_ohe_states_array, index=test_df.index, columns=ohe.categories_[0])
test_ohe_states_df
|   | California | Illinois | Washington |
|---|------------|----------|------------|
| 0 | 0.0 | 0.0 | 1.0 |
| 1 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 |
| 3 | 1.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 1.0 |
Notice that we now have the same columns as the training data, even though there were no "Illinois" values in the testing data.
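handle_unknown="ignore" goes even further: a state the encoder never saw during training encodes as all zeros instead of raising an error. ("Texas" here is just an illustrative unseen value.)

# An entirely unseen category becomes a row of zeros rather than an error
ohe.transform(pd.DataFrame({"State": ["Texas"]}))
array([[0., 0., 0.]])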
test_ohe_df = pd.concat([test_df.drop("State", axis=1), test_ohe_states_df], axis=1)
test_ohe_df
|   | Amount | Age | California | Illinois | Washington |
|---|--------|-----|------------|----------|------------|
| 0 | 37 | 9 | 0.0 | 0.0 | 1.0 |
| 1 | 235 | 75 | 1.0 | 0.0 | 0.0 |
| 2 | 908 | 5 | 0.0 | 0.0 | 1.0 |
| 3 | 72 | 79 | 1.0 | 0.0 | 0.0 |
| 4 | 767 | 64 | 0.0 | 0.0 | 1.0 |
ohe_model.score(test_ohe_df.drop("Amount", axis=1), test_ohe_df["Amount"])
-0.7632751620783784
That is a very bad (negative) R-squared, meaning the model does worse than just predicting the mean, but that is to be expected for truly random data like this. The point is that we were able to make predictions on the new data, even though the categories present were not the exact same as the categories in the training data!
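In practice, you would often go one step further and wrap the encoder and the model together with a ColumnTransformer and Pipeline, so the exact same preprocessing is applied at fit time and at predict time. A minimal sketch of that pattern, assuming the same df and test_df as above:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Encode State, pass Age through untouched, then fit the regression on top
preprocessor = ColumnTransformer(
    [("state", OneHotEncoder(handle_unknown="ignore", sparse=False), ["State"])],
    remainder="passthrough",
)
pipeline = Pipeline([("preprocess", preprocessor), ("model", LinearRegression())])
pipeline.fit(df.drop("Amount", axis=1), df["Amount"])
pipeline.score(test_df.drop("Amount", axis=1), test_df["Amount"])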