jacobandreas / nl-explanations

NL explanations

Intro (15 minutes)

  • Sometimes an attention map or a collection of examples isn't enough: we need more explicit information about the abstractions in a learned model

  • Examples: relational features, a black dot in an image classifier

  • Challenge: it is hard to hold on to interpretability techniques with a precise formal characterization while generating more complex explanations.

  • Today: work through a series of models for generating natural language explanations (some of which are terrible as tools for interpretability in their own right) to try to get the best of both worlds.

Captions (10 minutes)

  • Train a captioning model on representations from a learned image classifier. In what sense should we expect the resulting captions to be truthful? Do they actually tell the truth?
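
Below is a minimal sketch of the kind of setup this section refers to: a small decoder trained to generate text conditioned on features taken from a frozen, pre-trained image classifier. The backbone, vocabulary size, and dimensions are illustrative assumptions, not the actual notebook code.

```python
import torch
import torch.nn as nn
from torchvision import models

class CaptionDecoder(nn.Module):
    """Tiny GRU decoder conditioned on pooled classifier features."""
    def __init__(self, feat_dim=512, vocab_size=1000, emb_dim=128, hid_dim=256):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hid_dim)   # image features -> initial GRU state
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, feats, tokens):
        # feats: (batch, feat_dim); tokens: (batch, length) of word ids
        h0 = torch.tanh(self.init_h(feats)).unsqueeze(0)
        hidden, _ = self.gru(self.embed(tokens), h0)
        return self.out(hidden)                      # (batch, length, vocab_size) logits

# Frozen classifier backbone used only as a feature extractor.
backbone = models.resnet18(weights=None)
backbone.fc = nn.Identity()
for p in backbone.parameters():
    p.requires_grad = False

decoder = CaptionDecoder()
images = torch.randn(4, 3, 224, 224)                 # stand-in image batch
captions = torch.randint(0, 1000, (4, 12))           # stand-in caption token ids
with torch.no_grad():
    feats = backbone(images)

logits = decoder(feats, captions[:, :-1])             # teacher forcing
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 1000), captions[:, 1:].reshape(-1))
```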

Rationalizations (10 minutes)

  • Train the model with an extra loss term requiring that the explanation make the image in question discriminable. Does this fix the problem? What about doctored datasets?
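
A rough sketch of one way to add such a term: a proxy "listener" scores how well a caption picks the target image out of a set of distractors, and that score enters the loss alongside the usual captioning objective. The listener architecture, the contrastive form of the term, and its weight are all assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Listener(nn.Module):
    """Scores compatibility between a caption and each candidate image's features."""
    def __init__(self, feat_dim=512, vocab_size=1000, emb_dim=128):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, emb_dim)   # bag-of-words caption encoder
        self.img_proj = nn.Linear(feat_dim, emb_dim)

    def forward(self, captions, feats):
        # captions: (batch, length); feats: (batch, n_candidates, feat_dim)
        c = F.normalize(self.embed(captions), dim=-1)        # (batch, emb_dim)
        v = F.normalize(self.img_proj(feats), dim=-1)        # (batch, n_candidates, emb_dim)
        return torch.einsum("be,bne->bn", c, v)              # similarity scores

listener = Listener()
captions = torch.randint(0, 1000, (4, 12))      # stand-in generated captions
feats = torch.randn(4, 5, 512)                  # target image at index 0 plus 4 distractors
target = torch.zeros(4, dtype=torch.long)

scores = listener(captions, feats)
caption_loss = torch.tensor(0.0)                # placeholder for the usual captioning loss
discrim_loss = F.cross_entropy(scores, target)  # caption must identify the target image
loss = caption_loss + 0.5 * discrim_loss        # illustrative weighting of the extra term
```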

Explanations (10 minutes)

  • Neuralese-style objective. Where do the predictions differ?
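
One way to read the neuralese-style idea is to score candidate explanations by how closely the predictions they induce (through a proxy listener) match the model's own predictions, rather than by fluency alone. The toy code below illustrates that comparison with stand-in linear models; it is not the objective used in the notebooks.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
W_model = torch.randn(8, 3)       # stand-in weights for the classifier being explained
W_listener = torch.randn(8, 3)    # stand-in weights for the proxy listener

def model_predictions(inputs):
    """Predictions of the model we want to explain (a toy linear classifier)."""
    return F.softmax(inputs @ W_model, dim=-1)

def listener_predictions(explanation_emb, inputs):
    """Predictions a listener would make given the input plus an explanation embedding."""
    return F.softmax(inputs @ W_listener + explanation_emb, dim=-1)

inputs = torch.randn(16, 8)
p_model = model_predictions(inputs)

candidates = [torch.randn(3) for _ in range(5)]   # embeddings of candidate explanations
divergences = []
for emb in candidates:
    p_listener = listener_predictions(emb, inputs)
    # How far the listener's induced predictions are from the model's own predictions:
    kl = F.kl_div(p_listener.log(), p_model, reduction="batchmean")
    divergences.append(kl.item())

best = min(range(len(candidates)), key=lambda i: divergences[i])
print("best candidate:", best, "divergence:", round(divergences[best], 4))
```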

Unit-level (10 minutes)

  • How do we apply this technique at the level of individual units?
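
At the unit level, a natural first step is to collect the inputs that most strongly activate a single channel and hand those to the description model. The sketch below does this with a forward hook on a frozen ResNet; the layer, channel index, and use of peak activation are illustrative choices.

```python
import torch
from torchvision import models

backbone = models.resnet18(weights=None).eval()    # frozen classifier being inspected
activations = {}

def hook(_module, _inputs, output):
    activations["layer4"] = output                 # shape (batch, channels, h, w)

backbone.layer4.register_forward_hook(hook)

images = torch.randn(32, 3, 224, 224)              # stand-in image batch
with torch.no_grad():
    backbone(images)

channel = 7                                        # the individual unit to explain
unit_act = activations["layer4"][:, channel]       # (batch, h, w)
per_image = unit_act.amax(dim=(1, 2))              # peak activation of this unit per image
top_images = per_image.topk(5).indices             # candidates to feed the description model
print("images that most activate unit", channel, ":", top_images.tolist())
```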

Composition operations? (10 minutes)

  • Can we use this technique to identify "primitive" representations for high-level features like color or shape? Do these have nice geometric interpretations?
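
One simple geometric test of "primitive" attribute representations: estimate a direction per attribute as a difference of mean feature vectors, then check whether the directions compose additively. The sketch below runs this test on synthetic features only, as a stand-in for real classifier representations.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def fake_representation(color, shape):
    """Stand-in for classifier features of an image with the given attributes."""
    noise = torch.randn(64) * 0.1
    color_vec = {"red": torch.ones(64), "blue": -torch.ones(64)}[color]
    shape_vec = {"square": torch.arange(64.0) / 64, "circle": -torch.arange(64.0) / 64}[shape]
    return noise + color_vec + shape_vec

def mean_rep(color, shape, n=50):
    return torch.stack([fake_representation(color, shape) for _ in range(n)]).mean(0)

# Attribute directions estimated by contrasting mean representations.
red_dir = mean_rep("red", "circle") - mean_rep("blue", "circle")
square_dir = mean_rep("blue", "square") - mean_rep("blue", "circle")

# Geometric test: does "blue circle" + red_dir + square_dir land near "red square"?
composed = mean_rep("blue", "circle") + red_dir + square_dir
actual = mean_rep("red", "square")
similarity = F.cosine_similarity(composed, actual, dim=0)
print("cosine similarity between composed and actual representation:", similarity.item())
```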

Conclusions (15 minutes)

  • Examples of things that are hard to get at with less powerful interpretation modalities, especially composition.

  • Future work: large-scale studies like those in the dissection papers
