marcusGH / adversarial-attacks-on-imagenet-models

Explores adversarial attacks on ImageNet. Part of a Deep Neural Networks assignment at the University of Cambridge.

Adversarial attacks on ImageNet models

This project explores adversarial attacks on ImageNet models such as ResNet. Some of the key results and discoveries are briefly presented below. The full report can be found in the notebook, and can be viewed online here.

FGSM

By using the fast gradient sign method (FGSM) described in (Goodfellow et al., 2014), we can quickly perturb an image of a dog to mislead ResNet-50 into thinking it is something completely different.

[Figure: FGSM-perturbed dog image and the resulting ResNet-50 misclassification]
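
Below is a minimal PyTorch sketch of the single-step FGSM attack, not the notebook's exact code: the function name `fgsm_attack` and the value of `epsilon` are illustrative, and `image` is assumed to be a normalised `1x3x224x224` tensor with `label` a one-element tensor holding the true class index.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Load a pretrained ResNet-50 to attack (requires a recent torchvision).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()

def fgsm_attack(model, image, label, epsilon=0.007):
    """Single-step FGSM: move the image along the sign of the loss gradient."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)  # loss w.r.t. the true label
    loss.backward()
    return (image + epsilon * image.grad.sign()).detach()
```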

Targeted attacks

To perform targeted attacks, we use gradient descent on

$$ \underset{\mathbf{\eta}}{\mathrm{argmin}} \; J(\theta,\mathbf{x}+\mathbf{\eta},\textrm{target label}), \textrm{ subject to 'small' } \mathbf{\eta}, $$

where $J$ is the cost function used to train ResNet-50 and the size of the perturbation $\mathbf{\eta}$ is controlled by a hyperparameter $\epsilon$. With this, we can for instance make the Border Collie seem like a sea slug to the model:

[Figure: Border Collie image perturbed so that ResNet-50 predicts "sea slug"]
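
A sketch of this optimisation under the same assumptions as before (a hedged illustration, not the report's exact implementation): here plain gradient descent is applied to the noise `eta`, and the "small" constraint is enforced by clamping to `[-epsilon, epsilon]`; `target_label` is a one-element class-index tensor, and the hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def targeted_attack(model, image, target_label, epsilon=0.05, lr=0.01, steps=100):
    """Gradient descent on the noise eta so that the model predicts `target_label`."""
    eta = torch.zeros_like(image, requires_grad=True)
    optimiser = torch.optim.SGD([eta], lr=lr)
    for _ in range(steps):
        optimiser.zero_grad()
        loss = F.cross_entropy(model(image + eta), target_label)
        loss.backward()
        optimiser.step()
        with torch.no_grad():
            eta.clamp_(-epsilon, epsilon)  # one way to keep eta 'small'
    return (image + eta).detach()
```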

Blackbox attacks

We can also perform blackbox attacks on the model (i.e. only using the forward pass of the network) by iteratively sampling a random direction $\mathbf{\eta} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$ and updating the noise on the image whenever this increases the model's loss. Doing this for just 200 iterations makes the model misclassify the dog breed, and after 10 000 iterations the dog is perceived as toilet paper:

[Figure: blackbox attack results after 200 and 10 000 iterations]
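
The random-search procedure could look like the following sketch, which only ever calls the model's forward pass; the function name `blackbox_attack` and the values of `sigma` and `iterations` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def blackbox_attack(model, image, label, sigma=0.05, iterations=10_000):
    """Random-search attack that only uses forward passes of the model."""
    eta = torch.zeros_like(image)
    with torch.no_grad():
        best_loss = F.cross_entropy(model(image + eta), label)
        for _ in range(iterations):
            candidate = eta + sigma * torch.randn_like(image)
            loss = F.cross_entropy(model(image + candidate), label)
            if loss > best_loss:  # keep the random direction if it hurts the model
                eta, best_loss = candidate, loss
    return image + eta
```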

Targeted blackbox attacks

To perform targeted blackbox attacks, we use a method similar to the above, but accept an update when it decreases the loss for a target label rather than when it increases the loss for the true label. As an example, I have done this with the dog image and the label "sea slug". On the left, I have plotted the prediction on the final image (20 000 iterations), and on the right we see how the prediction converges to "sea slug" while the probability of the correct class decreases.

| Prediction | Convergence |
| --- | --- |
| [Figure: prediction on the final image after 20 000 iterations] | [Figure: predicted probabilities converging to "sea slug"] |
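
The change to the sketch above is small: accept a proposal when it lowers the target-label loss. As before, this is an illustrative sketch rather than the notebook's exact code.

```python
import torch
import torch.nn.functional as F

def targeted_blackbox_attack(model, image, target_label, sigma=0.05, iterations=20_000):
    """As the untargeted version, but accept noise that lowers the target-label loss."""
    eta = torch.zeros_like(image)
    with torch.no_grad():
        best_loss = F.cross_entropy(model(image + eta), target_label)
        for _ in range(iterations):
            candidate = eta + sigma * torch.randn_like(image)
            loss = F.cross_entropy(model(image + candidate), target_label)
            if loss < best_loss:  # lower loss = closer to the target label
                eta, best_loss = candidate, loss
    return image + eta
```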

Universal targeted attacks

A universal attack is one where we find a single noise vector and apply it to several different images to misguide the model. To achieve this, I downloaded a subset of the ImageNet dataset and optimised the noise $\mathbf{\eta}$ to minimise the following:

$$ \underset{\mathbf{\eta}}{\mathrm{argmin}} \frac{1}{n} \sum_{(\mathbf{x},l) \in \mathbf{X}} J(\theta, \mathbf{x}+\mathbf{\eta},\textrm{target label}), $$

where $\mathbf{X}$ is a minibatch of training data (tuples of an image and its corresponding label). I held out 12 000 test images and applied the noise found during training to all of them. I then checked whether "sea slug" was the top prediction or among the top-5 predictions for these images. As we see below, the model incorrectly classifies over 90% of the 12 000 images as sea slugs, even though the exact same noise is applied to every image.

| Model | Loss | Top-1 success rate | Top-5 success rate |
| --- | --- | --- | --- |
| ResNet-50 | 15.683664 | 0.915417 | 0.958417 |
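
A sketch of how such a universal noise tensor could be optimised, assuming a `data_loader` that yields batches of normalised `3x224x224` images and an integer `target_label`; the function name and hyperparameters are illustrative, not taken from the report.

```python
import torch
import torch.nn.functional as F

def universal_targeted_noise(model, data_loader, target_label, epsilon=0.1, lr=0.01, epochs=1):
    """Optimise one noise tensor over many images so they are all pushed towards `target_label`."""
    eta = torch.zeros(1, 3, 224, 224, requires_grad=True)
    optimiser = torch.optim.SGD([eta], lr=lr)
    for _ in range(epochs):
        for images, _ in data_loader:  # true labels are unused: we always aim for the target
            targets = torch.full((images.size(0),), target_label, dtype=torch.long)
            optimiser.zero_grad()
            loss = F.cross_entropy(model(images + eta), targets)  # eta broadcasts over the batch
            loss.backward()
            optimiser.step()
            with torch.no_grad():
                eta.clamp_(-epsilon, epsilon)  # keep the universal noise 'small'
    return eta.detach()
```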

We can also visualise what this universal noise looks like by applying it to a dog and a panda:

| Dog | Panda |
| --- | --- |
| [Figure: universal noise applied to the dog image] | [Figure: universal noise applied to the panda image] |

References

Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy (2014). "Explaining and Harnessing Adversarial Examples". arXiv. https://doi.org/10.48550/arxiv.1412.6572
