amueller / introduction_to_ml_with_python

Notebooks and code for the book "Introduction to Machine Learning with Python"

Problem with Boston Housing Data

sdempwolf opened this issue · comments

Hello,
On Sep 28, 2022, I was working with the Boston Housing data and the exercises in module 02 supervised-learning. We received a message that there was an ethical problem with the Boston Housing data and that scikit-learn recommended switching to the California Housing data, for which it provided links.
I ended up modifying the mglearn/datasets.py file, adding the import line and a function load_extended_california(). This allows the rest of the code in the notebook to run as written with the California Housing data.

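from sklearn.datasets import fetch_california_housing
# Only needed if the file does not already import these two names.
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

def load_extended_california():
    housing = fetch_california_housing()

    # Scale each feature to [0, 1], then add squares and pairwise products.
    X = MinMaxScaler().fit_transform(housing.data)
    X = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
    return X, housing.target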
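
As a rough check on what this function produces, here is a minimal sketch of the same scale-then-expand steps applied to a synthetic array, so it runs without downloading anything. The helper name extend_features and the random stand-in data X_fake are made up for illustration and are not part of mglearn. The array has 8 columns because California Housing has 8 features.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures


def extend_features(X):
    """Scale features to [0, 1], then add degree-2 polynomial terms.

    Hypothetical helper mirroring the body of load_extended_california().
    """
    X_scaled = MinMaxScaler().fit_transform(X)
    return PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_scaled)


# Synthetic stand-in data (an assumption, not the real dataset):
# 20 rows, 8 columns to match the California Housing feature count.
rng = np.random.RandomState(0)
X_fake = rng.rand(20, 8)
X_ext = extend_features(X_fake)

# Original columns plus all squares and pairwise products.
n = X_fake.shape[1]
expected_columns = n + n * (n + 1) // 2
print(X_ext.shape, expected_columns)
```

With 8 input features the expansion yields 44 columns, compared with 104 for the 13-feature Boston data, so plots and scores in the notebook will differ from the book even though the code runs unchanged.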

Hi!
Yes, I was part of the discussion of making that change in sklearn. Since the book is using this dataset, the repo will continue to use that dataset. If I end up revising the book (somewhat unlikely at this point), I will replace the dataset.

Hi Andreas,
I love using your book and notebooks in my classes. However, I don't want to have to revert to sklearn <1.2. I tried simply replacing the references to the Boston Housing dataset with the California Housing data, but was unsuccessful. Can you please point me to the files where this change needs to be made? I must be missing one somehow. Or will this approach just not work?

Please update the mglearn library; that should solve the issue.