Responsible AI Mitigations

The Responsible-AI-Toolbox-Mitigations repo consists of a Python library that helps data scientists and ML developers measure dataset balance and the representation of different dataset cohorts, and gives them mitigation techniques for addressing errors and fairness issues in their datasets. Together, the measurement and mitigation steps help ML professionals build more accurate and fairer models.

This repo is a part of the Responsible AI Toolbox, a suite of tools providing a collection of model and data exploration and assessment user interfaces and libraries that enable a better understanding of AI systems. These interfaces and libraries empower developers and stakeholders of AI systems to develop and monitor AI more responsibly, and take better data-driven actions.

The Responsible AI Mitigations Library helps AI practitioners explore different measurements and mitigation steps that may be most appropriate when the model underperforms for a given data cohort. The library currently has three modules:

DataProcessing offers mitigation techniques for improving model performance for specific cohorts.
DataBalanceAnalysis provides metrics for diagnosing errors that originate from data imbalance either on class labels or feature values.
Cohort provides classes for handling and managing cohorts, which allows the creation of custom pipelines for each cohort in an easy and intuitive interface. The module also provides techniques for learning different decoupled estimators (models) for different cohorts and combining them in a way that optimizes different definitions of group fairness.
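To make the idea of measuring balance per cohort concrete, here is a minimal sketch in plain Python. It does not use the raimitigations API: the function name label_balance_by_cohort, the "region" and "label" fields, and the toy records are all hypothetical and serve only to illustrate the kind of question the DataBalanceAnalysis and Cohort modules address.

```python
from collections import Counter, defaultdict


def label_balance_by_cohort(rows, cohort_key, label_key):
    """Return {cohort: {label: fraction}} for a list of dict records.

    Hypothetical helper for illustration; not part of raimitigations.
    """
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[cohort_key]][row[label_key]] += 1

    balance = {}
    for cohort, label_counts in counts.items():
        total = sum(label_counts.values())
        balance[cohort] = {label: n / total for label, n in label_counts.items()}
    return balance


# Hypothetical toy data: "region" defines the cohorts, "label" is the class.
rows = [
    {"region": "north", "label": 1},
    {"region": "north", "label": 0},
    {"region": "north", "label": 1},
    {"region": "north", "label": 0},
    {"region": "south", "label": 0},
    {"region": "south", "label": 0},
    {"region": "south", "label": 0},
    {"region": "south", "label": 1},
]

balance = label_balance_by_cohort(rows, "region", "label")
for cohort, fractions in sorted(balance.items()):
    print(cohort, fractions)
```

In this toy data the "north" cohort is evenly split between the two classes, while only a quarter of the "south" rows carry the positive label. An overall label count would hide that difference, which is why the measurement is done per cohort.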

In this library, we take a targeted approach to mitigating errors in machine learning models. This approach complements, and differs from, traditional blanket approaches that aim to maximize a single performance score, such as overall accuracy, merely by increasing the size of the training data or of the model architecture. Blanket approaches are often costly and still ineffective at improving the model in the areas where it performs worst. Targeted approaches instead focus the improvement effort on areas previously identified as having more errors, and on the diagnosed causes of those errors. For example, a practitioner who has identified, using Error Analysis in the Responsible AI Dashboard, that the model underperforms for a cohort of interest may continue debugging with Data Balance Analysis and find that there is class imbalance for this particular cohort. To mitigate the issue, they can then focus on reducing class imbalance for that cohort using the Responsible AI Mitigations library. This and several other examples in the documentation of each mitigation function illustrate how targeted approaches can help practitioners mitigate errors more effectively by giving them more control over the model improvement process.
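The class-imbalance example above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: simple random oversampling stands in for whichever mitigation the library would apply, and the function oversample_cohort, the in_cohort predicate, and the toy records are hypothetical names that are not part of raimitigations.

```python
import random
from collections import Counter


def oversample_cohort(rows, in_cohort, label_key, seed=0):
    """Balance class labels inside one cohort by random oversampling.

    Rows outside the cohort are returned unchanged. Hypothetical helper
    for illustration; not part of raimitigations.
    """
    rng = random.Random(seed)
    cohort_rows = [r for r in rows if in_cohort(r)]
    other_rows = [r for r in rows if not in_cohort(r)]

    # Group the cohort's rows by class label.
    by_label = {}
    for r in cohort_rows:
        by_label.setdefault(r[label_key], []).append(r)
    if not by_label:
        return list(rows)

    # Every class in the cohort is grown to the size of its largest class.
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the target size.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return other_rows + balanced


# Hypothetical toy data: the "south" cohort has 3 negatives and 1 positive.
rows = (
    [{"region": "north", "label": 1}] * 2
    + [{"region": "north", "label": 0}] * 2
    + [{"region": "south", "label": 0}] * 3
    + [{"region": "south", "label": 1}] * 1
)

result = oversample_cohort(rows, lambda r: r["region"] == "south", "label")
south_counts = Counter(r["label"] for r in result if r["region"] == "south")
north_counts = Counter(r["label"] for r in result if r["region"] == "north")
print("south:", dict(south_counts))
print("north:", dict(north_counts))
```

Only the cohort that was diagnosed with imbalance is modified; the rest of the dataset is left as it was, which is the point of a targeted mitigation as opposed to rebalancing the whole dataset.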

Installation

Use pip to install the raimitigations package. Make sure you are using Python 3.7, 3.8, 3.9, or 3.10. If you are running in Jupyter, restart the Jupyter kernel after installing. There are three installation options for the raimitigations package:

To install the minimum dependencies, use:

pip install raimitigations

To install the minimum dependencies plus the packages required to run all of the notebooks in the notebooks/ folder, use:

pip install raimitigations[all]

To install all the dependencies used for development (such as pytest), use:

pip install raimitigations[dev]
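Before running any of the install commands above, the interpreter version can be checked with a short script such as the following sketch. The set of supported versions is taken from the list given in this section; the helper name is_supported is hypothetical.

```python
import sys

# Versions listed as supported in this README.
SUPPORTED = {(3, 7), (3, 8), (3, 9), (3, 10)}


def is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) pair is one of the supported versions."""
    return (version_info[0], version_info[1]) in SUPPORTED


current = "{}.{}".format(sys.version_info[0], sys.version_info[1])
if is_supported():
    print("Python", current, "is in the supported list.")
else:
    print("Python", current, "is not in the supported list (3.7 to 3.10).")
```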

Documentation

To learn more about the supported dataset measurements and mitigation techniques covered in the raimitigations package, please check out this documentation.

Data Balance Analysis: Examples

Data Processing/Mitigations: Examples

Here is a set of tutorial notebooks that aim to explain how to use each one of the mitigation methods offered in the dataprocessing module.

Here is a set of case study scenarios where we use the transformations available in the dataprocessing module in order to train a model for a real-world dataset.

Handling Cohorts

Here is a set of tutorial notebooks that aim to explain how to manage cohorts.

Here is a set of case study notebooks showing how creating customized dataprocessing pipelines for each cohort can help in some scenarios.

Dependencies

RAI Toolbox Mitigations uses several libraries internally. The direct dependencies are the following:

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Installing Using `dev` Mode

After cloning this repo and moving to its root folder, install the package in editable mode with the development dependencies using:

> pip install -e .[dev]

Pre-Commit

This repository uses pre-commit hooks to keep the code format consistent. For development, make sure to activate pre-commit before creating a pull request. Any code pushed to this repository is checked for consistency using GitHub Actions, so a commit made without pre-commit may fail the format check workflow. Using pre-commit avoids this.

To use pre-commit with this repository, first install pre-commit (NOTE: when installing the package with the [dev] tag, the pre-commit package will already be installed):

> pip install pre-commit

Once it is installed, navigate to the root directory of this repository and activate pre-commit with the following command:

> pre-commit install

With pre-commit installed and activated, each new commit triggers the hooks configured in the .pre-commit-config.yaml file, located in the root of the repository, which check all new code. Some hooks might make formatting changes to the committed files. If any file is changed or any hook fails, the commit fails. If that happens, make the necessary modifications, add the files to the commit, and try committing again. Repeat until all hooks succeed. The same checks run after every push, so if your commit succeeded with pre-commit, it will also pass the format check workflow.

Updating the Docs

The documentation is built using Sphinx, Pandoc, and Graphviz (to build the class diagrams). Graphviz and Pandoc must be installed separately (detailed instructions here for Graphviz and here for Pandoc). On Linux, this can be done with apt or yum (depending on your distribution):

> sudo apt install graphviz pandoc

> sudo yum install graphviz pandoc

Make sure Graphviz and Pandoc are installed before recompiling the docs. After that, update the documentation files, which are all located inside the docs/ folder. Finally, use:

> cd docs/
> make html

To view the documentation, open the file docs/_build/html/index.html in your browser.
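Before running make html, the presence of the external tools can be checked from Python. The sketch below assumes that Graphviz is reachable through its dot executable and that pandoc and make are expected on the PATH; the helper name missing_tools is hypothetical.

```python
import shutil


def missing_tools(tools=("dot", "pandoc", "make")):
    """Return the subset of tool names that cannot be found on the PATH.

    "dot" is the Graphviz executable. Hypothetical helper for illustration.
    """
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("Graphviz (dot), pandoc, and make were all found.")
```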

Note for Windows users: if you are trying to update the docs in a Windows environment, you might get an error regarding the _sqlite3 module:

ImportError: DLL load failed while importing _sqlite3: The specified module could not be found.

To fix this, follow the instructions found in this link.

Support

How to file issues and get help

This project uses GitHub Issues to track bugs and feature requests. Please search the existing issues before filing new issues to avoid duplicates. For new issues, file your bug or feature request as a new Issue.

For help and questions about using this project, please post your question on Stack Overflow using the raimitigations tag.

Microsoft Support Policy

Support for this package is limited to the resources listed above.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.