Manipulated Geometry on Adversarial Attacks

This repository covers the following papers:

Overcoming Adversarial Attacks for HITL Applications (pdf, arxiv)
Explanations can be manipulated and Geometry is to blame

Overcoming Adversarial Attacks for Human-in-the-Loop Applications

This repository was used to produce Figure 1 in Overcoming Adversarial Attacks for Human-in-the-Loop Applications. It was originally the reference implementation for Explanations can be manipulated and geometry is to blame. The Overcoming paper seeks to show how adversarial samples can be manipulated to display a valid explanation while remaining adversarial. See the paper for more details.

Install

Install dependencies using

    pip install -r requirements.txt

Usage

Manipulate an image to reproduce a given target explanation using

    python run_attack.py --cuda

For explanation methods other than lrp, enable --beta_growth so that the second derivative of the activations is not zero:

    python run_attack.py --cuda --method gradient --beta_growth
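The reason a nonzero second derivative matters can be illustrated with a small, self-contained sketch. It does not reproduce what --beta_growth does inside run_attack.py; it only compares a finite-difference estimate of the second derivative of ReLU with that of a softplus activation. The helper names, the evaluation point, and the value beta = 10 are assumptions chosen for illustration.

```python
import math


def relu(x):
    return max(0.0, x)


def softplus(x, beta):
    # (1 / beta) * log(1 + exp(beta * x)), written to avoid overflow
    z = beta * x
    return (max(z, 0.0) + math.log1p(math.exp(-abs(z)))) / beta


def second_derivative(f, x, h=1e-3):
    # Central finite-difference estimate of f''(x)
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)


x = 0.5  # hypothetical pre-activation value, away from the ReLU kink
relu_curv = second_derivative(relu, x)
softplus_curv = second_derivative(lambda t: softplus(t, beta=10.0), x)

print("ReLU     f''(0.5) ~", relu_curv)
print("softplus f''(0.5) ~", softplus_curv)
```

Away from the kink at zero, ReLU is linear, so its second derivative vanishes and a gradient-based explanation is locally constant in the input. An optimizer that tries to change the explanation by following derivatives with respect to the input then receives no signal, which is why a smooth activation is needed for such methods.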

Plot softplus explanations for various values of beta using

    python plot_expl.py --cuda

To download patterns for pattern attribution, please use the following link:

https://drive.google.com/open?id=1RdvAiUZgfhSE8sVF2JOyURpnk1HQ_hZk

Copy the downloaded file into the models subdirectory.

Explanations can be manipulated and Geometry is to blame

Explanation methods aim to make neural networks more trustworthy and interpretable. In this paper, we demonstrate a property of explanation methods which is disconcerting for both of these purposes. Namely, we show that explanations can be manipulated arbitrarily by applying visually hardly perceptible perturbations to the input that keep the network's output approximately constant. We establish theoretically that this phenomenon can be related to certain geometrical properties of neural networks. This allows us to derive an upper bound on the susceptibility of explanations to manipulations. Based on this result, we propose effective mechanisms to enhance the robustness of explanations.

What we do

We manipulate images so their explanation resembles an arbitrary target map. Below you can see our algorithm in action:

In our paper we show how to achieve such manipulations. We discuss their nature and derive an upper bound on how much the explanation can change. Based on this bound we propose β-smoothing, a method that can be applied to any of the considered explanation methods to increase robustness against manipulations.
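The manipulation objective described above has two parts: the explanation of the perturbed input should match the target map, and the network output should stay close to the original output. A minimal sketch of such an objective follows, assuming a squared-error penalty for both parts. The function name manipulation_loss, the flat-list representation, and the weight gamma are hypothetical and are not taken from run_attack.py.

```python
def manipulation_loss(expl_adv, expl_target, out_adv, out_orig, gamma=1.0):
    """Squared-error objective for explanation manipulation (illustrative).

    expl_adv, expl_target: flattened explanation maps (sequences of floats)
    out_adv, out_orig:     network outputs for the perturbed and original input
    gamma:                 assumed weight of the output-preservation term
    """
    # Pull the explanation of the perturbed input towards the target map
    expl_term = sum((a - t) ** 2 for a, t in zip(expl_adv, expl_target))
    # Penalize any drift of the network output away from the original
    out_term = sum((a - o) ** 2 for a, o in zip(out_adv, out_orig))
    return expl_term + gamma * out_term


# The loss is zero when the target is matched and the output is unchanged
loss_matched = manipulation_loss([1.0, 2.0], [1.0, 2.0], [0.5], [0.5])
# It grows when the explanation is off target or the output moves
loss_off = manipulation_loss([1.0, 0.0], [0.0, 0.0], [1.0], [0.0], gamma=2.0)
print(loss_matched, loss_off)
```

In practice the two terms would be computed from tensors and minimized with respect to the input image by gradient descent; the sketch only shows how the two goals are balanced against each other.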

β-smoothing

We have demonstrated that one can drastically change the explanation map while keeping the output of the neural network constant. We argue that this vulnerability can be related to the large curvature of the output manifold of the neural network. We focus on the gradient method. The fact that the gradient can be drastically changed by slightly perturbing the input along the hypersurface of constant network output suggests that the curvature of this hypersurface is large. If we replace the ReLU activations with softplus activations with parameter β and then reduce β, we reduce the curvature of the lines of equal network output. Below you can see the smoothing in action for a two-layer neural network.
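The effect of β on the gradient explanation can also be checked numerically. The following sketch uses a toy two-layer network with made-up weights (W, V and the two input points are assumptions, not values from this repository) and measures how much the input gradient changes between two nearby inputs, once for a large β and once for a small one. For simplicity the two inputs are merely close to each other; they are not constrained to lie on the same level set of the output.

```python
import math

# Hypothetical weights of a toy network g(x) = sum_j V[j] * softplus(W[j] . x)
W = [[1.0, -1.0], [0.5, 1.0], [-1.0, 0.5]]
V = [1.0, -1.0, 0.5]


def sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gradient_explanation(x, beta):
    # d/dz softplus_beta(z) = sigmoid(beta * z), so by the chain rule
    # dg/dx_i = sum_j V[j] * sigmoid(beta * (W[j] . x)) * W[j][i]
    grad = [0.0, 0.0]
    for w, v in zip(W, V):
        pre = w[0] * x[0] + w[1] * x[1]
        s = sigmoid(beta * pre)
        grad[0] += v * s * w[0]
        grad[1] += v * s * w[1]
    return grad


def explanation_change(x_a, x_b, beta):
    # Euclidean distance between the gradient explanations at two inputs
    g_a = gradient_explanation(x_a, beta)
    g_b = gradient_explanation(x_b, beta)
    return math.hypot(g_a[0] - g_b[0], g_a[1] - g_b[1])


x_a, x_b = [0.05, 0.0], [-0.05, 0.0]  # two nearby inputs
change_large_beta = explanation_change(x_a, x_b, beta=100.0)  # close to ReLU
change_small_beta = explanation_change(x_a, x_b, beta=1.0)    # smoothed

print("beta=100:", change_large_beta)
print("beta=1:  ", change_small_beta)
```

With the large β the softplus units behave almost like ReLUs, so crossing an activation boundary flips the gradient abruptly. With the small β the same step changes the gradient far less, which is the smoothing effect described above.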