
Papers on Explainable Artificial Intelligence

This is an ongoing attempt to consolidate interesting efforts in understanding / interpreting / explaining / visualizing pre-trained ML models.


GUI tools

  • DeepVis: Deep Visualization Toolbox. Yosinski et al. ICML 2015 code | pdf
  • SWAP: Generate adversarial poses of objects in a 3D space. Alcorn et al. CVPR 2019 code | pdf
  • AllenNLP: Query online NLP models with user-provided inputs and observe explanations (Gradient, Integrated Gradient, SmoothGrad). Last accessed 03/2020 demo
  • 3DB: A framework for analyzing computer vision models with simulated data code

Libraries

Surveys

  • Methods for Interpreting and Understanding Deep Neural Networks. Montavon et al. 2017 pdf
  • Visualizations of Deep Neural Networks in Computer Vision: A Survey. Seifert et al. 2017 pdf
  • How convolutional neural network see the world - A survey of convolutional neural network visualization methods. Qin et al. 2018 pdf
  • A brief survey of visualization methods for deep learning models from the perspective of Explainable AI. Chalkiadakis 2018 pdf
  • A Survey Of Methods For Explaining Black Box Models. Guidotti et al. 2018 pdf
  • Understanding Neural Networks via Feature Visualization: A survey. Nguyen et al. 2019 pdf
  • Explaining Explanations: An Overview of Interpretability of Machine Learning. Gilpin et al. 2019 pdf
  • DARPA updates on the XAI program pdf
  • Explainable Artificial Intelligence: a Systematic Review. Vilone et al. 2020 pdf

Opinions

  • Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Rudin. Nature Machine Intelligence 2019 pdf
  • Towards falsifiable interpretability research. Leavitt & Morcos 2020 pdf
  • Four principles of Explainable Artificial Intelligence. Phillips et al. 2021 (NIST.gov) pdf

Open research questions

  • Interpretable Machine Learning: Fundamental Principles and 10 Grand Challenges. Rudin et al. 2021 pdf

Definitions of Interpretability

  • The Mythos of Model Interpretability. Lipton 2016 pdf
  • Towards A Rigorous Science of Interpretable Machine Learning. Doshi-Velez & Kim. 2017 pdf
  • Interpretable machine learning: definitions, methods, and applications. Murdoch et al. 2019 pdf

Books

  • Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. Molnar 2019 pdf

A. Explaining model inner-workings

A1. Visualizing Preferred Stimuli

Synthesizing images / Activation Maximization

  • AM: Visualizing higher-layer features of a deep network. Erhan et al. 2009 pdf
  • Deep inside convolutional networks: Visualising image classification models and saliency maps. Simonyan et al. 2013 pdf
  • DeepVis: Understanding Neural Networks through Deep Visualization. Yosinski et al. ICML workshop 2015 pdf | url
  • MFV: Multifaceted Feature Visualization: Uncovering the different types of features learned by each neuron in deep neural networks. Nguyen et al. ICML workshop 2016 pdf | code
  • DGN-AM: Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. Nguyen et al. NIPS 2016 pdf | code
  • PPGN: Plug and Play Generative Networks. Nguyen et al. CVPR 2017 pdf | code
  • Feature Visualization. Olah et al. 2017 url
  • Diverse feature visualizations reveal invariances in early layers of deep neural networks. Cadena et al. 2018 pdf
  • Computer Vision with a Single (Robust) Classifier. Santurkar et al. NeurIPS 2019 pdf | blog | code
  • BigGAN-AM: A cost-effective method for improving and re-purposing large, pre-trained GANs by fine-tuning their class-embeddings. Li et al. ACCV 2020 pdf | code
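
For orientation, the sketch below shows the bare activation-maximization loop these methods build on: gradient ascent on the input to maximize one class logit. The torchvision model (assuming a recent torchvision), class index, step count, and the weak L2 image prior are illustrative assumptions; the papers above add much stronger priors (jitter, blurring, deep generator networks) to obtain recognizable images.

```python
# Minimal activation-maximization sketch in the spirit of Erhan et al. 2009 / Simonyan et al. 2013.
# Model, class index, learning rate, and the L2 prior are illustrative assumptions.
import torch
import torchvision.models as models

model = models.resnet50(weights="IMAGENET1K_V2").eval()
for p in model.parameters():
    p.requires_grad_(False)

target_class = 130                                     # assumed ImageNet class index
x = torch.zeros(1, 3, 224, 224, requires_grad=True)    # start from a neutral image
optimizer = torch.optim.Adam([x], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    logits = model(x)
    # Ascend the class logit; the L2 term is a crude image prior.
    loss = -logits[0, target_class] + 1e-3 * (x ** 2).mean()
    loss.backward()
    optimizer.step()

preferred_stimulus = x.detach()   # visualize after un-normalizing / clamping as appropriate
```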

Real images / Segmentation Masks

  • Visualizing and Understanding Recurrent Networks. Karpathy et al. ICLR 2015 pdf
  • Object Detectors Emerge in Deep Scene CNNs. Zhou et al. ICLR 2015 pdf
  • Understanding Deep Architectures by Interpretable Visual Summaries. Godi et al. BMVC 2019 pdf

A2. Inverting Neural Networks

A2.1 Inverting Classifiers

  • Understanding Deep Image Representations by Inverting Them. Mahendran & Vedaldi. CVPR 2015 pdf
  • Inverting Visual Representations with Convolutional Networks. Dosovitskiy & Brox. CVPR 2016 pdf
  • Neural network inversion beyond gradient descent. Wong & Kolter. NIPS workshop 2017 pdf
  • Inverting Adversarially Robust Networks for Image Synthesis. Rojas-Gomez et al. 2021 pdf | code

A2.2 Inverting Generators

  • Image Processing Using Multi-Code GAN Prior. Gu et al. 2019 pdf

A3. Distilling DNNs into more interpretable models

  • Interpreting CNNs via Decision Trees pdf
  • Distilling a Neural Network Into a Soft Decision Tree pdf
  • Distill-and-Compare: Auditing Black-Box Models Using Transparent Model Distillation. Tan et al. 2018 pdf
  • Improving the Interpretability of Deep Neural Networks with Knowledge Distillation. Liu et al. 2018 pdf

A4. Quantitatively characterizing hidden features

  • TCAV: Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors. Kim et al. 2018 pdf | code
    • DTCAV: Automating Interpretability: Discovering and Testing Visual Concepts Learned by Neural Networks. Ghorbani et al. 2019 pdf
  • SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability. Raghu et al. 2017 pdf | code
  • A Peek Into the Hidden Layers of a Convolutional Neural Network Through a Factorization Lens. Saini et al. 2018 pdf
  • Network Dissection: Quantifying Interpretability of Deep Visual Representations. Bau et al. CVPR 2017 url | pdf
    • GAN Dissection: Visualizing and Understanding Generative Adversarial Networks. Bau et al. ICLR 2019 pdf
    • Net2Vec: Quantifying and Explaining how Concepts are Encoded by Filters in Deep Neural Networks. Fong & Vedaldi CVPR 2018 pdf
    • Intriguing generalization and simplicity of adversarially trained neural networks. Chen, Agarwal, Nguyen 2020 pdf
    • Understanding the Role of Individual Units in a Deep Neural Network. Bau et al. PNAS 2020 pdf
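
As a rough illustration of the concept-vector idea behind TCAV (Kim et al. 2018), the sketch below fits a linear probe that separates a concept's layer activations from random activations and then counts how often the class logit's gradient points along that direction. The model, layer, class index, and all image tensors are stand-in assumptions; a real run uses actual concept, random, and class images.

```python
# Rough CAV sketch in the spirit of TCAV. All image tensors are random stand-ins.
import numpy as np
import torch
import torchvision.models as models
from sklearn.linear_model import LogisticRegression

model = models.resnet18(weights="IMAGENET1K_V1").eval()
acts = {}
model.layer4.register_forward_hook(lambda m, i, o: acts.update(out=o))

def layer_acts(batch):
    model(batch)
    return acts["out"].flatten(1)                      # (N, D) activations at layer4

concept_imgs = torch.rand(32, 3, 224, 224)             # stand-in: concept images
random_imgs = torch.rand(32, 3, 224, 224)              # stand-in: random counterexamples
with torch.no_grad():
    A = torch.cat([layer_acts(concept_imgs), layer_acts(random_imgs)]).numpy()
labels = np.array([1] * 32 + [0] * 32)
cav = LogisticRegression(max_iter=1000).fit(A, labels).coef_[0]   # concept activation vector

# TCAV score: fraction of class inputs whose class logit increases along the CAV direction.
class_idx = 340                                        # assumed ImageNet class index
class_imgs = torch.rand(16, 3, 224, 224)               # stand-in: images of that class
positive = 0
for img in class_imgs:
    logit = model(img[None])[0, class_idx]
    grad = torch.autograd.grad(logit, acts["out"])[0].flatten()   # d logit / d activations
    positive += int(grad.numpy() @ cav > 0)
print("TCAV score:", positive / len(class_imgs))
```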

A5. Network surgery

  • How Important Is a Neuron? Dhamdhere et al. 2018 pdf

A6. Sensitivity analysis

  • NLIZE: A Perturbation-Driven Visual Interrogation Tool for Analyzing and Interpreting Natural Language Inference Models. Liu et al. 2018 pdf

B. Explaining model decisions

B1. Attribution maps

B1.0 Surveys

  • Feature Removal Is A Unifying Principle For Model Explanation Methods. Covert et al. 2020 pdf

B1.1 White-box / Gradient-based

  • A Taxonomy and Library for Visualizing Learned Features in Convolutional Neural Networks pdf

Gradient

  • Gradient: Deep inside convolutional networks: Visualising image classification models and saliency maps. Simonyan et al. 2013 pdf
  • Deconvnet: Visualizing and understanding convolutional networks. Zeiler et al. 2014 pdf
  • Guided-backprop: Striving for simplicity: The all convolutional net. Springenberg et al. 2015 pdf
  • SmoothGrad: removing noise by adding noise. Smilkov et al. 2017 pdf
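
A minimal sketch of the two simplest recipes above, vanilla gradient saliency and SmoothGrad averaging, is given below; the model, input, sample count, and noise level are illustrative assumptions.

```python
# Vanilla gradient saliency (Simonyan et al. 2013) plus SmoothGrad averaging (Smilkov et al. 2017).
import torch
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
x = torch.rand(1, 3, 224, 224)                      # stand-in for a preprocessed image
target = model(x).argmax(1).item()

def gradient_saliency(inp):
    inp = inp.clone().requires_grad_(True)
    model(inp)[0, target].backward()
    return inp.grad.abs().max(dim=1)[0]             # (1, H, W): max |gradient| over channels

saliency = gradient_saliency(x)

# SmoothGrad: average the saliency of several noisy copies of the input.
n_samples, sigma = 25, 0.15
smoothgrad = torch.stack([gradient_saliency(x + sigma * torch.randn_like(x))
                          for _ in range(n_samples)]).mean(0)
```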

Input x Gradient

  • DeepLIFT: Learning important features through propagating activation differences. Shrikumar et al. 2017 pdf
  • IG: Axiomatic Attribution for Deep Networks. Sundararajan et al. 2018 pdf | code
    • EG: Learning Explainable Models Using Attribution Priors. Erion et al. 2019 pdf | code
    • I-GOS: Visualizing Deep Networks by Optimizing with Integrated Gradients. Qi et al. 2019 pdf
    • BlurIG: Attribution in Scale and Space. Xu et al. CVPR 2020 pdf | code
    • XRAI: Better Attributions Through Regions. Kapishnikov et al. ICCV 2019 pdf | code
  • LRP: Beyond saliency: understanding convolutional neural networks from saliency prediction on layer-wise relevance propagation pdf
    • DTD: Explaining NonLinear Classification Decisions With Deep Taylor Decomposition pdf
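
Below is a hedged sketch of the core Integrated Gradients computation (a Riemann approximation of the path integral from a baseline to the input). The model, the all-zero baseline, and the 50-step discretization are assumptions; the LRP/DTD entries above use a different, propagation-based rule not shown here.

```python
# Integrated Gradients sketch (Sundararajan et al.): average input gradients along a straight
# path from a baseline to the input, then scale by (input - baseline).
import torch
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
x = torch.rand(1, 3, 224, 224)                      # stand-in for a preprocessed image
baseline = torch.zeros_like(x)                      # common, but contested, baseline choice
target = model(x).argmax(1).item()

steps = 50
grads = []
for alpha in torch.linspace(0.0, 1.0, steps):
    point = (baseline + alpha * (x - baseline)).requires_grad_(True)
    model(point)[0, target].backward()
    grads.append(point.grad)
ig = (x - baseline) * torch.stack(grads).mean(0)    # Riemann approximation of the path integral
heatmap = ig.sum(1)                                 # sum attribution over color channels
```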

Activation map

  • CAM: Learning Deep Features for Discriminative Localization. Zhou et al. 2016 code | web

  • Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. Selvaraju et al. 2017 pdf

  • Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks. Chattopadhyay et al. 2017 pdf | code

  • Smooth Grad-CAM++: An Enhanced Inference Level Visualization Technique for Deep Convolutional Neural Network Models. Omeiza et al. 2019 pdf

  • NormGrad: There and Back Again: Revisiting Backpropagation Saliency Methods. Rebuffi et al. CVPR 2020 pdf | code

  • Score-CAM: Score-Weighted Visual Explanations for Convolutional Neural Networks. Wang et al. CVPR 2020 workshop pdf | code

  • Relevance-CAM: Your Model Already Knows Where to Look. Lee et al. CVPR 2021 pdf | code

  • LIFT-CAM: Towards Better Explanations of Class Activation Mapping. Jung & Oh ICCV 2021 pdf
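
The sketch below illustrates the basic Grad-CAM recipe that most of the CAM variants above refine: weight the last convolutional feature maps by their spatially averaged gradients, apply ReLU, and upsample. The model and the choice of layer4 as target layer are assumptions.

```python
# Grad-CAM sketch (Selvaraju et al. 2017) using a forward hook plus a tensor gradient hook.
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
store = {}

def save_activation(module, inputs, output):
    store["act"] = output
    output.register_hook(lambda grad: store.update(grad=grad))   # grad w.r.t. these feature maps

model.layer4.register_forward_hook(save_activation)

x = torch.rand(1, 3, 224, 224)                              # stand-in for a preprocessed image
logits = model(x)
logits[0, logits.argmax(1).item()].backward()               # backprop the top-class logit

weights = store["grad"].mean(dim=(2, 3), keepdim=True)      # global-average-pooled gradients
cam = F.relu((weights * store["act"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)    # normalize to [0, 1] for overlay
```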

Learning the heatmap

  • MP: Interpretable Explanations of Black Boxes by Meaningful Perturbation. Fong et al. 2017 pdf
    • MP-G: Removing input features via a generative model to explain their attributions to classifier's decisions. Agarwal & Nguyen ACCV 2020 pdf | code
    • EP: Understanding Deep Networks via Extremal Perturbations and Smooth Masks. Fong et al. ICCV 2019 pdf | code
  • FIDO: Explaining image classifiers by counterfactual generation. Chang et al. ICLR 2019 pdf
  • FG-Vis: Interpretable and Fine-Grained Visual Explanations for Convolutional Neural Networks. Wagner et al. CVPR 2019 pdf
  • CEM: Explanations based on the Missing: Towards Contrastive Explanations with Pertinent Negatives. Dhurandhar & Chen et al. NeurIPS 2018 pdf | code
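
As a rough illustration of the mask-learning idea behind these methods, the sketch below optimizes a low-resolution mask so that blurring the masked region drops the class score while keeping the mask small. The model, the cheap blur stand-in, the mask resolution, and the sparsity weight are assumptions; the papers above add important regularizers (e.g., total variation, generative in-filling) omitted here.

```python
# Mask-learning sketch in the spirit of Meaningful Perturbation (Fong et al. 2017).
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
for p in model.parameters():
    p.requires_grad_(False)

x = torch.rand(1, 3, 224, 224)                       # stand-in for a preprocessed image
target = model(x).argmax(1).item()
blurred = F.avg_pool2d(x, 11, stride=1, padding=5)   # cheap stand-in for a Gaussian-blurred copy

mask = torch.full((1, 1, 28, 28), 0.5, requires_grad=True)   # low-res mask logits, upsampled below
optimizer = torch.optim.Adam([mask], lr=0.1)

for step in range(150):
    optimizer.zero_grad()
    m = F.interpolate(mask.sigmoid(), size=x.shape[-2:], mode="bilinear", align_corners=False)
    perturbed = m * blurred + (1 - m) * x            # convention here: m = 1 means "delete" (blur)
    score = model(perturbed).softmax(1)[0, target]
    loss = score + 0.05 * m.mean()                   # drop the score with as small a mask as possible
    loss.backward()
    optimizer.step()

heatmap = F.interpolate(mask.sigmoid(), size=x.shape[-2:],
                        mode="bilinear", align_corners=False).detach()
```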

Attributions of network biases

  • FullGrad: Full-Gradient Representation for Neural Network Visualization. Srinivas et al. NeurIPS 2019 pdf
  • Bias also matters: Bias attribution for deep neural network explanation. Wang et al. ICML 2019 pdf

Others

  • Visual explanation by interpretation: Improving visual feedback capabilities of deep neural networks. Oramas et al. 2019 pdf
  • Regional Multi-scale Approach for Visually Pleasing Explanations of Deep Neural Networks. Seo et al. 2018 pdf

B1.2 Attention as Explanation

Computer Vision

  • Multimodal explanations: Justifying decisions and pointing to the evidence. Park et al. CVPR 2018 pdf
  • IA-RED2: Interpretability-Aware Redundancy Reduction for Vision Transformers. Pan et al. NeurIPS 2021 pdf
  • Transformer Interpretability Beyond Attention Visualization. Chefer et al. CVPR 2021 pdf | code
  • Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers. Chefer et al. ICCV 2021 pdf | code

NLP

  • Attention is not Explanation. Jain & Wallace. NAACL 2019 pdf
  • Attention is not not Explanation. Wiegreffe & Pinter. EMNLP 2019 pdf
  • Learning to Deceive with Attention-Based Explanations. Pruthi et al. ACL 2020 pdf
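
For concreteness, the sketch below shows how attention weights are typically read out as a (contested) explanation. It assumes the Hugging Face transformers library and a public SST-2 sentiment checkpoint; both are assumptions, and the papers above debate whether such weights are faithful explanations at all.

```python
# Reading attention weights as a candidate explanation of a sentiment prediction.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"     # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, output_attentions=True).eval()

inputs = tokenizer("The plot was thin but the acting was wonderful", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

pred = out.logits.argmax(-1).item()
# out.attentions: one (batch, heads, seq, seq) tensor per layer.
last_layer = out.attentions[-1][0]                 # (heads, seq, seq) for the single example
cls_attention = last_layer.mean(0)[0]              # head-averaged attention from the [CLS] token
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, w in sorted(zip(tokens, cls_attention.tolist()), key=lambda t: -t[1])[:5]:
    print(f"{tok:>12s}  {w:.3f}")                  # tokens the [CLS] position attends to most
```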

B1.3 Black-box / Perturbation-based

  • Sliding-Patch: Visualizing and understanding convolutional networks. Zeiler et al. 2014 pdf
  • PDA: Visualizing deep neural network decisions: Prediction difference analysis. Zintgraf et al. ICLR 2017 pdf
  • RISE: Randomized Input Sampling for Explanation of Black-box Models. Petsiuk et al. BMVC 2018 pdf
  • LIME: Why should I trust you?: Explaining the predictions of any classifier. Ribeiro et al. 2016 pdf | blog
    • LIME-G: Removing input features via a generative model to explain their attributions to classifier's decisions. Agarwal & Nguyen. ACCV 2020 pdf | code
  • SHAP: A Unified Approach to Interpreting Model Predictions. Lundberg et al. 2017 pdf | code
  • OSFT: Interpreting Black Box Models via Hypothesis Testing. Burns et al. 2019 pdf
  • IM: Interpretation of NLP models through input marginalization. Kim et al. EMNLP 2020 pdf
    • Considering Likelihood in NLP Classification Explanations with Occlusion and Language Modeling. Harbecke et al. 2020 pdf
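
A minimal sketch of the sliding-patch occlusion idea that the perturbation-based methods above generalize: cover each image region with a gray patch and record how much the class probability drops. The model, patch size, stride, and fill value are illustrative assumptions.

```python
# Sliding-patch occlusion sketch (Zeiler et al. 2014).
import torch
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
x = torch.rand(1, 3, 224, 224)                      # stand-in for a preprocessed image
with torch.no_grad():
    probs = model(x).softmax(1)
target = probs.argmax(1).item()
base_prob = probs[0, target].item()

patch, stride = 32, 16
h, w = x.shape[-2:]
heatmap = torch.zeros((h - patch) // stride + 1, (w - patch) // stride + 1)

with torch.no_grad():
    for i in range(heatmap.shape[0]):
        for j in range(heatmap.shape[1]):
            occluded = x.clone()
            occluded[..., i*stride:i*stride+patch, j*stride:j*stride+patch] = 0.5  # gray patch
            drop = base_prob - model(occluded).softmax(1)[0, target].item()
            heatmap[i, j] = drop                    # large drop = region was important
```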

B1.4 Evaluating feature importance/attribution heatmaps

Metrics

  • Deletion & Insertion: Randomized Input Sampling for Explanation of Black-box Models. Petsiuk et al. BMVC 2018 pdf
  • ROAD: A Consistent and Efficient Evaluation Strategy for Attribution Methods. Rong & Leemann, et al. ICML 2022 pdf | code
  • ROAR: A Benchmark for Interpretability Methods in Deep Neural Networks. Hooker et al. NeurIPS 2019 pdf | code
    • DiffROAR: Do Input Gradients Highlight Discriminative Features? Shah et al. NeurIPS 2021 pdf | code
  • Sanity Checks for Saliency Maps. Adebayo et al. 2018 pdf
  • BIM: Towards Quantitative Evaluation of Attribution Methods with Ground Truth. Yang et al. 2019 pdf
  • SAM: The Sensitivity of Attribution Methods to Hyperparameters. Bansal, Agarwal, Nguyen. CVPR 2020 pdf | code
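
The sketch below illustrates the deletion-style metric used by several papers above: remove pixels in order of decreasing attribution and track how quickly the class probability falls (a lower area under that curve suggests a more faithful heatmap). The model, the random stand-in attribution map, the step size, and the zero fill value are assumptions; ROAD and ROAR refine exactly this removal step.

```python
# Deletion metric sketch (Petsiuk et al. 2018).
import torch
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
x = torch.rand(1, 3, 224, 224)                      # stand-in for a preprocessed image
attribution = torch.rand(224, 224)                  # stand-in for any attribution heatmap
target = model(x).argmax(1).item()

order = attribution.flatten().argsort(descending=True)   # most-important pixels first
step = 224 * 224 // 50                                   # delete ~2% of the pixels per step
scores = []
x_del = x.clone()
with torch.no_grad():
    for k in range(0, order.numel(), step):
        idx = order[k:k + step]
        x_del.view(1, 3, -1)[..., idx] = 0               # "delete" by zeroing all channels
        scores.append(model(x_del).softmax(1)[0, target].item())
deletion_auc = sum(scores) / len(scores)                 # crude AUC: mean of the probability curve
```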

Evaluating heatmaps on humans

  • The effectiveness of feature attribution methods and its correlation with automatic evaluation scores. Nguyen, Kim, Nguyen 2021 pdf
  • Debugging Tests for Model Explanations. Adebayo et al. NeurIPS 2020 pdf
  • In Search of Verifiability: Explanations Rarely Enable Complementary Performance in AI-Advised Decision Making. Fok & Weld. 2023 pdf

Computer Vision

  • The (Un)reliability of saliency methods. Kindermans et al. 2018 pdf
  • A Theoretical Explanation for Perplexing Behaviors of Backpropagation-based Visualizations. Nie et al. 2018 pdf
  • On the (In)fidelity and Sensitivity for Explanations. Yeh et al. 2019 pdf

NLP

  • Deletion_BERT: Double Trouble: How to not explain a text classifier’s decisions using counterfactuals synthesized by masked language models. Pham et al. 2022 pdf | code

  • Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior? Hase & Bansal ACL 2020 pdf | code

  • Teach Me to Explain: A Review of Datasets for Explainable NLP. Wiegreffe & Marasović 2021 pdf | web

Tabular data

  • Challenging common interpretability assumptions in feature attribution explanations? Dinu et al. NeurIPS workshop 2020 pdf

Many domains

  • How Can I Explain This to You? An Empirical Study of Deep Neural Network Explanation Methods. Jeyakumar et al. NeurIPS 2020 pdf | code

B1.5 Explaining image-image similarity

  • BiLRP: Building and Interpreting Deep Similarity Models. Eberle et al. TPAMI 2020 pdf
  • SANE: Why do These Match? Explaining the Behavior of Image Similarity Models. Plummer et al. ECCV 2020 pdf
  • Visualizing Deep Similarity Networks. Stylianou et al. WACV 2019 pdf | code
  • Visual Explanation for Deep Metric Learning. Zhu et al. 2019 pdf | code

Face verification

  • DISE: Explainable Face Recognition. Williford et al. ECCV 2020 pdf | code
  • xCos: An explainable cosine metric for face verification task. Lin et al. 2021 pdf | code
  • DeepFace-EMD: Re-ranking Using Patch-wise Earth Mover's Distance Improves Out-Of-Distribution Face Identification. Phan & Nguyen. CVPR 2022 pdf | code

B2. Learning to explain

B2.1 Regularizing attribution maps

  • Right for the Right Reasons: Training Differentiable Models by Constraining their Explanations. Ross et al. IJCAI 2017 pdf
  • Learning Explainable Models Using Attribution Priors. Erion et al. 2019 pdf
  • Interpretations are useful: penalizing explanations to align neural networks with prior knowledge. Rieger et al. 2019 pdf

B2.2 Training deep nets to approximate expensive, posthoc attribution methods

  • L2E: Learning to Explain: Generating Stable Explanations Fast. Situ et al. ACL 2021 pdf | code
  • Efficient Explanations from Empirical Explainers. Schwarzenberg et al. 2021 pdf

B2.3 Explaining by prototypes

  • ProtoPNet: This Looks Like That: Deep Learning for Interpretable Image Recognition. Chen et al. NeurIPS 2019 pdf | code
    • This Looks Like That, Because ... Explaining Prototypes for Interpretable Image Recognition. Nauta et al. 2020 pdf | code
    • NP-ProtoPNet: These do not Look Like Those. Singh et al. 2021 pdf
  • ProtoTree: Neural Prototype Trees for Interpretable Fine-grained Image Recognition. Nauta et al. CVPR 2021 pdf | code

B2.4 Explaining by retrieving supporting examples

  • EMD-Corr & CHM-Corr: Visual correspondence-based explanations improve AI robustness and human-AI team accuracy. Nguyen, Taesiri, Nguyen 2022. pdf | code

B2.5 Adversarial attacks on XAI systems with humans in the loop

  • When and How to Fool Explainable Models (and Humans) with Adversarial Examples. Vadillo et al. 2021 pdf
  • The effectiveness of feature attribution methods and its correlation with automatic evaluation scores. Nguyen, Kim, Nguyen 2021 pdf

B2.6 Others

  • Learning how to explain neural networks: PatternNet and PatternAttribution pdf
  • Deep Learning for Case-Based Reasoning through Prototypes pdf
  • Unsupervised Learning of Neural Networks to Explain Neural Networks pdf
  • Automated Rationale Generation: A Technique for Explainable AI and its Effects on Human Perceptions pdf
    • Rationalization: A Neural Machine Translation Approach to Generating Natural Language Explanations pdf
  • Towards robust interpretability with self-explaining neural networks. Alvarez-Melis & Jaakkola 2018 pdf

C. Counterfactual explanations

  • Counterfactual Explanations for Machine Learning: A Review. Verma et al. 2020 pdf
  • Interpreting Neural Network Judgments via Minimal, Stable, and Symbolic Corrections. Zhang et al. 2018 pdf
  • Counterfactual Visual Explanations. Goyal et al. 2019 pdf
  • Generative Counterfactual Introspection for Explainable Deep Learning. Liu et al. 2019 pdf
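
As a rough illustration of the simplest counterfactual recipe, the sketch below takes gradient steps on the input toward a chosen target class while penalizing the distance from the original input; the model, target class, and penalty weight are assumptions, and the papers above (and the generative approaches below) impose much stronger realism constraints on the edit.

```python
# Naive input-space counterfactual search: minimally edit the input until the decision flips.
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
for p in model.parameters():
    p.requires_grad_(False)

x = torch.rand(1, 3, 224, 224)              # stand-in for a preprocessed image
target_class = 207                          # assumed counterfactual class index
x_cf = x.clone().requires_grad_(True)
optimizer = torch.optim.Adam([x_cf], lr=0.01)

for step in range(300):
    optimizer.zero_grad()
    logits = model(x_cf)
    if logits.argmax(1).item() == target_class:       # stop once the decision has flipped
        break
    loss = F.cross_entropy(logits, torch.tensor([target_class])) \
           + 0.1 * (x_cf - x).abs().mean()            # stay close to the original (sparse edits)
    loss.backward()
    optimizer.step()

counterfactual = x_cf.detach()
delta = counterfactual - x                  # the "what changed" evidence to show a user
```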

Generative models

  • Generative causal explanations of black-box classifiers. O’Shaughnessy et al. 2020 pdf
  • Removing input features via a generative model to explain their attributions to classifier's decisions. Agarwal et al. 2019 pdf | code

D. Explainable AI in the real world

Medical domains

  • A systematic review on the use of explainability in deep learning systems for computer aided diagnosis in radiology: Limited use of explainable AI? Groen et al. European Journal of Radiology 2022 pdf
  • “Help Me Help the AI”: Understanding How Explainability Can Support Human-AI Interaction. Kim et al. 2022 [pdf](https://arxiv.org/abs/2210.03735 "Practical recommendations and feedback for human-AI explanation designs from interviews with 20 end-users of Merlin, a bird-identification app.")

E. Human-AI collaboration

Computer vision

  • Human-AI Collaboration: The Effect of AI Delegation on Human Task Performance and Task Satisfaction. Hemmer et al. IUI 2023 [pdf](https://arxiv.org/abs/2303.09224 "Letting AIs handle most images in image classification and leaving the harder ones to humans results in higher overall classification accuracy than humans alone.")

F. Others

  • Explainable Artificial Intelligence via Bayesian Teaching. Yang & Shafto. NIPS 2017 pdf
  • Explainable AI for Designers: A Human-Centered Perspective on Mixed-Initiative Co-Creation pdf
  • ICADx: Interpretable computer aided diagnosis of breast masses. Kim et al. 2018 pdf
  • Neural Network Interpretation via Fine Grained Textual Summarization. Guo et al. 2018 pdf
  • LS-Tree: Model Interpretation When the Data Are Linguistic. Chen et al. 2019 pdf

License: MIT