interpretml / interpret-community

Interpret Community extends the Interpret repository with additional interpretability techniques and utility functions to handle real-world datasets and workflows.

Home Page: https://interpret-community.readthedocs.io/en/latest/index.html

Repository from GitHub: https://github.com/interpretml/interpret-community

Replace load_boston with alternate regression dataset

gaugup opened this issue · comments

Describe the bug
As noted in the scikit-learn documentation (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html), the Boston housing dataset is being deprecated due to a significant ethical concern:

Warning: The Boston housing prices dataset has an ethical problem: as investigated in [1], the authors of this dataset engineered a non-invertible variable “B” assuming that racial self-segregation had a positive impact on house prices [2]. Furthermore the goal of the research that led to the creation of this dataset was to study the impact of air quality but it did not give adequate demonstration of the validity of this assumption.

The scikit-learn maintainers therefore strongly discourage the use of this dataset unless the purpose of the code is to study and educate about ethical issues in data science and machine learning.

scikit-learn suggests either the California housing dataset or the Ames housing dataset as a reasonable alternative (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html). We should change references in interpret-community accordingly.

To Reproduce
Steps to reproduce the behavior:

  1. Search for load_boston in the interpret-community repository
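
The search step can be scripted. Below is a minimal sketch that walks a local checkout and reports every line mentioning load_boston. The function name `find_references` and the list of file suffixes are assumptions made for illustration; they are not part of interpret-community.

```python
from pathlib import Path


def find_references(root, needle="load_boston",
                    suffixes=(".py", ".ipynb", ".md", ".rst")):
    """Return (path, line number, stripped line) for each line containing needle.

    Hypothetical helper: the suffix list is a guess at where references
    are likely to live (source, notebooks, docs).
    """
    hits = []
    for path in sorted(Path(root).rglob("*")):
        # Skip directories and file types we are not interested in.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((str(path), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    # Run from the root of a local clone of the repository.
    for path, lineno, line in find_references("."):
        print(f"{path}:{lineno}: {line}")
```

Running `git grep load_boston` from the repository root is an equivalent one-liner; the script is only useful if the results need further processing.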

Expected behavior
No code in interpret-community references load_boston. Each existing reference is replaced with one of the alternatives scikit-learn suggests, either the California housing dataset or the Ames housing dataset (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html).
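
As a rough illustration of what a replacement could look like, the sketch below loads the California housing dataset with scikit-learn's `fetch_california_housing` and splits it for a regression test or notebook. The helper name `load_regression_data` and its defaults are assumptions for this example, not existing interpret-community code. Note that, unlike load_boston, this loader downloads the data on first use and caches it locally, so it needs network access once.

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split


def load_regression_data(test_size=0.2, random_state=0):
    """Return a train/test split of the California housing data.

    Hypothetical helper standing in for call sites that used load_boston.
    """
    # Downloads the dataset on the first call, then reads it from the cache.
    housing = fetch_california_housing()
    x_train, x_test, y_train, y_test = train_test_split(
        housing.data, housing.target,
        test_size=test_size, random_state=random_state)
    return x_train, x_test, y_train, y_test, housing.feature_names


if __name__ == "__main__":
    x_train, x_test, y_train, y_test, feature_names = load_regression_data()
    print(x_train.shape, x_test.shape, feature_names)
```

The California dataset has different features and many more rows than the Boston one, so any test that hard-codes feature names, feature counts, or expected metric values would need to be adjusted as well.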