DanielNobbe / Self-Explaining-NNs

Repository for the UvA 2020 course 'Fairness, Accountability, Confidentiality and Transparency in AI'

Revisiting Robust Interpretability of Self-Explaining Neural Networks

As machine learning is adopted in an increasing range of applications, the demand for a better understanding of model predictions grows. Self-explaining models are models that provide human-interpretable explanations alongside their predictions. In this study we revisit the approach of Alvarez-Melis & Jaakkola (2018), which incorporates interpretability directly into the structure of the model, resulting in self-explaining neural networks, and we propose an adjustment to their approach aimed at improving interpretability. The adjusted approach is evaluated against the original on three desiderata: explicitness, faithfulness, and stability. Our adjustment improves faithfulness and offers a different, arguably more intuitive, perspective on intelligibility, but slightly degrades stability. The latter could be improved by adding a local-Lipschitz stability property for the concepts.
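The stability desideratum can be quantified with a local-Lipschitz estimate of how much an explanation changes relative to small changes in the input. The sketch below is a minimal illustration of that idea and is not the repository's implementation; the `explain` callable is a hypothetical placeholder for any function mapping an input to its relevance scores.

```python
# Minimal sketch of a local-Lipschitz stability estimate for explanations:
# sample points in a small box around x and take the largest ratio of
# explanation change to input change.
import numpy as np

def local_lipschitz_estimate(explain, x, radius=0.1, n_samples=100, seed=0):
    """Estimate L(x) = max_{x' near x} ||explain(x) - explain(x')|| / ||x - x'||."""
    rng = np.random.default_rng(seed)
    base = explain(x)
    best = 0.0
    for _ in range(n_samples):
        x_pert = x + rng.uniform(-radius, radius, size=x.shape)
        num = np.linalg.norm(explain(x_pert) - base)
        den = np.linalg.norm(x_pert - x) + 1e-12
        best = max(best, num / den)
    return best

# Toy usage: a linear "explainer" with constant relevance scores is perfectly
# stable, so the estimate is ~0.
if __name__ == "__main__":
    w = np.array([0.5, -1.0, 2.0])
    toy_explain = lambda x: w  # relevance scores independent of x
    print(local_lipschitz_estimate(toy_explain, np.zeros(3)))
```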

This research was performed on the COMPAS dataset, a recidivism dataset. The accompanying paper can be found in the root folder, and the results can be reproduced with the included Jupyter Notebook.
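For context, a minimal sketch of loading a COMPAS-style table for a tabular classifier is shown below; the file and column names follow the public ProPublica release and may differ from the preprocessed data used in the notebook.

```python
import pandas as pd

# Assumed ProPublica release file; the notebook may use a preprocessed version.
df = pd.read_csv("compas-scores-two-years.csv")

# A few numeric features and the binary two-year recidivism label from the public data.
features = ["age", "priors_count", "juv_fel_count", "juv_misd_count"]
X = df[features].to_numpy(dtype=float)
y = df["two_year_recid"].to_numpy()
print(X.shape, y.shape)
```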

Authors:
Joosje Goedhart, 10738193, joosjegoedhart@gmail.com
Lennert Jansen, 10488952, lennertjansen95@gmail.com
Hannah Lim, 10588973, hannah_lim@outlook.com
Daniel Nobbe, 12891517, daniellnobbe@gmail.com

Teaching Assistant: Simon Passenheim


Languages

Python 95.3%, Jupyter Notebook 4.7%