This Responsible-AI-Toolbox-Mitigations repo contains a Python library that helps data scientists and ML developers measure the balance of their datasets and the representation of different dataset cohorts, and gives them access to mitigation techniques for addressing errors and fairness issues in their datasets. Combining the measurement and mitigation steps helps ML professionals build more accurate and fairer models.
This repo is a part of the Responsible AI Toolbox, a suite of tools providing a collection of model and data exploration and assessment user interfaces and libraries that enable a better understanding of AI systems. These interfaces and libraries empower developers and stakeholders of AI systems to develop and monitor AI more responsibly, and take better data-driven actions.
The Responsible AI Mitigations Library helps AI practitioners explore different measurements and mitigation steps that may be most appropriate when the model underperforms for a given data cohort. The library currently has three modules:
- DataProcessing offers mitigation techniques for improving model performance for specific cohorts.
- DataBalanceAnalysis provides metrics for diagnosing errors that originate from data imbalance either on class labels or feature values.
- Cohort provides classes for handling and managing cohorts, which allows the creation of custom pipelines for each cohort in an easy and intuitive interface. The module also provides techniques for learning different decoupled estimators (models) for different cohorts and combining them in a way that optimizes different definitions of group fairness.
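To make the idea of decoupled estimators concrete, here is a minimal sketch in plain Python. It does not use the raimitigations API: the helper names (`fit_threshold`, `fit_decoupled`, `predict_decoupled`, `accuracy`) and the toy data are hypothetical, and a one-feature threshold rule stands in for whatever estimator would really be trained. It only illustrates fitting one model per cohort; it does not attempt the fairness-aware combination step mentioned above.

```python
from collections import defaultdict


def fit_threshold(xs, ys):
    """Pick the threshold t that maximizes accuracy of the rule `x >= t -> 1`."""
    candidates = sorted(set(xs)) + [max(xs) + 1]
    best_t, best_acc = candidates[0], -1.0
    for t in candidates:
        acc = sum((x >= t) == bool(y) for x, y in zip(xs, ys)) / len(xs)
        if acc > best_acc:  # ties keep the smallest threshold
            best_t, best_acc = t, acc
    return best_t


def fit_decoupled(rows):
    """rows: iterable of (cohort, x, y). Returns one threshold per cohort."""
    grouped = defaultdict(lambda: ([], []))
    for cohort, x, y in rows:
        grouped[cohort][0].append(x)
        grouped[cohort][1].append(y)
    return {c: fit_threshold(xs, ys) for c, (xs, ys) in grouped.items()}


def predict_decoupled(models, cohort, x):
    """Route the instance to the model trained for its cohort."""
    return int(x >= models[cohort])


def accuracy(predict, rows):
    return sum(predict(c, x) == y for c, x, y in rows) / len(rows)


# Toy data: the feature means something different in each cohort.
rows = [
    ("A", 1, 0), ("A", 2, 0), ("A", 3, 1), ("A", 4, 1),
    ("B", 5, 0), ("B", 6, 0), ("B", 7, 1), ("B", 8, 1),
]

# One pooled model for everyone versus one model per cohort.
pooled_t = fit_threshold([x for _, x, _ in rows], [y for _, _, y in rows])
models = fit_decoupled(rows)

pooled_acc = accuracy(lambda c, x: int(x >= pooled_t), rows)
decoupled_acc = accuracy(lambda c, x: predict_decoupled(models, c, x), rows)
print(pooled_acc, decoupled_acc)
```

On this toy data, no single threshold separates both cohorts, while the per-cohort thresholds do, which is the situation decoupled estimators are meant for.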
In this library, we take a targeted approach to mitigating errors in Machine Learning models. This complements, and differs from, traditional blanket approaches, which aim to maximize a single-score performance number, such as overall accuracy, merely by increasing the size of the training data or the model architecture. Blanket approaches are often costly yet ineffective at improving the model in its areas of poorest performance, so targeted approaches instead focus improvement efforts on areas previously identified as having more errors, and on the underlying diagnoses of those errors. For example, a practitioner who has identified, using Error Analysis in the Responsible AI Dashboard, that the model underperforms for a cohort of interest may continue debugging with Data Balance Analysis and find that there is class imbalance for this particular cohort. To mitigate the issue, they then focus on reducing class imbalance for the cohort of interest by using the Responsible AI Mitigations library. This and several other examples in the documentation of each mitigation function illustrate how targeted approaches may help practitioners mitigate errors while giving them more control over the model improvement process.
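The class-imbalance example above can be sketched in plain Python. This is an illustration of the idea only, not the raimitigations API: the helpers `class_counts_by_cohort` and `oversample_cohort`, the column names, and the toy rows are all hypothetical, and simple random duplication stands in for the rebalancing techniques the library offers.

```python
import random
from collections import Counter


def class_counts_by_cohort(rows, cohort_key, label_key):
    """Count class labels separately inside each cohort."""
    counts = {}
    for row in rows:
        counts.setdefault(row[cohort_key], Counter())[row[label_key]] += 1
    return counts


def oversample_cohort(rows, cohort_key, label_key, cohort, seed=0):
    """Duplicate random minority-class rows of ONE cohort until its classes
    are balanced. Rows from every other cohort are left untouched."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        if row[cohort_key] == cohort:
            by_label.setdefault(row[label_key], []).append(row)
    if not by_label:  # unknown cohort: nothing to rebalance
        return list(rows)
    target = max(len(group) for group in by_label.values())
    extra = []
    for group in by_label.values():
        extra.extend(rng.choice(group) for _ in range(target - len(group)))
    return list(rows) + extra


# Toy dataset: "north" is balanced, "south" has a 9:1 class imbalance.
rows = (
    [{"region": "north", "label": 1}] * 5
    + [{"region": "north", "label": 0}] * 5
    + [{"region": "south", "label": 1}] * 1
    + [{"region": "south", "label": 0}] * 9
)

before = class_counts_by_cohort(rows, "region", "label")
balanced = oversample_cohort(rows, "region", "label", cohort="south")
after = class_counts_by_cohort(balanced, "region", "label")
print(before["south"], after["south"])
```

The diagnosis step is per cohort, and the mitigation touches only the cohort that was found to be imbalanced, which is what distinguishes a targeted approach from rebalancing the whole dataset.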
Use the following pip commands to install the Responsible AI Mitigations library. Make sure you are using Python 3.7, 3.8, 3.9 or 3.10. If running in Jupyter, make sure to restart the Jupyter kernel after installing. There are three installation options for the raimitigations package:
- To install the minimum dependencies, use:
pip install raimitigations
- To install the minimum dependencies + the packages required to run all of the notebooks in the notebooks/ folder:
pip install raimitigations[all]
- To install all the dependencies used for development (such as pytest), use:
pip install raimitigations[dev]
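As a quick sanity check after installing, the following sketch verifies that the running interpreter is one of the Python versions listed above and that the raimitigations package can be found. It uses only the standard library; the helper names `python_is_supported` and `is_installed` are hypothetical and not part of the package.

```python
import importlib.util
import sys

# Python versions listed as supported in the installation notes above.
SUPPORTED = {(3, 7), (3, 8), (3, 9), (3, 10)}


def python_is_supported(version_info=sys.version_info):
    """True if the (major, minor) version is one of the supported ones."""
    return (version_info[0], version_info[1]) in SUPPORTED


def is_installed(package="raimitigations"):
    """True if a top-level package with this name can be located."""
    return importlib.util.find_spec(package) is not None


print("Python version supported:", python_is_supported())
print("raimitigations installed:", is_installed())
```

If this is run in Jupyter right after installing, remember that the kernel has to be restarted first, as noted above.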
To learn more about the supported dataset measurements and mitigation techniques covered in the raimitigations package, please check out this documentation.
Here is a set of tutorial notebooks that aim to explain how to use each one of the mitigation methods offered in the dataprocessing module.
- Encoders
- Scalers
- Basic Imputer
- Iterative Imputer
- KNN Imputer
- Sequential Feature Selection
- Feature Selection using Catboost
- Identifying correlated features: tutorial
- Data Rebalance using imblearn
- Data Rebalance using SDV
- Using scikit-learn's Pipeline
Here is a set of case study scenarios where we use the transformations available in the dataprocessing module in order to train a model for a real-world dataset.
Here is a set of tutorial notebooks that aim to explain how to manage cohorts.
- Creating Single Cohorts
- Creating Different Pipelines for each Cohort
- Different Pre-processing Scenarios using cohorts
- Using Decoupled Classifiers
Here is a set of case study notebooks showing how creating customized dataprocessing pipelines for each cohort can help in some scenarios.
- Cohort Case Study 1
- Cohort Case Study 1 - Rebalancing only specific cohorts
- Cohort Case Study 1 - Using RAI Toolbox
- Cohort Case Study 2
- Cohort Case Study 3
- Decoupled Classifier Case 1
- Decoupled Classifier Case 2
- Decoupled Classifier Case 3
RAI Toolbox Mitigations uses several libraries internally. The direct dependencies are the following:
- Numpy
- Pandas
- SciPy
- Scikit Learn
- ResearchPY
- Statsmodels
- Imbalanced Learn
- SDV
- CatBoost
- XGBoost
- MLxtend
- UCI Dataset
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
After cloning this repo and moving to its root folder, install the package in editable mode with the development dependencies using:
> pip install -e .[dev]
This repository uses pre-commit hooks to keep the code format consistent. For development, make sure to activate pre-commit before creating a pull request. Any code pushed to this repository is checked for consistency using GitHub Actions, so a commit made without pre-commit may fail the format check workflow. Using pre-commit avoids this.
To use pre-commit with this repository, first install pre-commit (NOTE: when installing the package with the [dev] tag, the pre-commit package will already be installed):
> pip install pre-commit
Once it is installed, navigate to the root directory of this repository and activate pre-commit with the following command:
> pre-commit install
With pre-commit installed and activated, every new commit triggers the pre-commit hooks configured in the .pre-commit-config.yaml file, located in the root of the repository, which check all new code. Some of the hooks might make formatting changes to some of the committed files. If any file is changed or if any other hook fails, the commit will fail. If that happens, make the necessary modifications, add the files to the commit, and try committing again. Repeat this until all hooks succeed. Note that these same checks run after every push, so if your commit succeeded while using pre-commit, it will also pass the format check workflow.
The documentation is built using Sphinx, Pandoc, and Graphviz (to build the class diagrams). Graphviz and Pandoc must be installed separately (detailed instructions here for Graphviz and here for Pandoc). On Linux, this can be done with apt or yum (depending on your distribution):
> sudo apt install graphviz pandoc
> sudo yum install graphviz pandoc
Make sure Graphviz and Pandoc are installed before recompiling the docs. After that, update the documentation files, which are all located inside the docs/ folder. Finally, use:
> cd docs/
> make html
To view the documentation, open the file docs/_build/html/index.html in your browser.
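If the build fails, one thing worth checking is whether the external tools are actually on the PATH. The sketch below does that with the standard library only. It assumes the usual executable names, `dot` for Graphviz and `pandoc` for Pandoc; the helper `missing_doc_tools` is hypothetical and not part of this repository.

```python
import shutil

# Assumed executable names: Graphviz ships `dot`, Pandoc ships `pandoc`.
REQUIRED_TOOLS = {"Graphviz": "dot", "Pandoc": "pandoc"}


def missing_doc_tools(which=shutil.which):
    """Return the names of the required tools that are not found on PATH."""
    return [name for name, exe in REQUIRED_TOOLS.items() if which(exe) is None]


missing = missing_doc_tools()
if missing:
    print("Install before building the docs:", ", ".join(missing))
else:
    print("Graphviz and Pandoc found; run `make html` inside docs/.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.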
Note for Windows users: if you are trying to update the docs in a Windows environment, you might get an error regarding the _sqlite3 module:
ImportError: DLL load failed while importing _sqlite3: The specified module could not be found.
To fix this, follow the instructions found in this link.
This project uses GitHub Issues to track bugs and feature requests. Please search the existing issues before filing new issues to avoid duplicates. For new issues, file your bug or feature request as a new Issue.
For help and questions about using this project, please post your question on Stack Overflow using the raimitigations tag.
Support for this package is limited to the resources listed above.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.
Current Maintainers: Marah Abdin, Matheus Mendonça, Dany Rouhana, Mark Encarnación
Past Maintainers: Akshara Ramakrishnan, Irina Spiridonova
Research Contributors: Besmira Nushi, Rahee Ghosh Peshawaria, Ece Kamar