marcusGH / adversarial-attacks-on-imagenet-models

Explores adversarial attacks on ImageNet. Part of a Deep Neural Networks assignment at the University of Cambridge.

Adversarial attacks on ImageNet models

This project explores adversarial attacks on ImageNet models such as ResNet. Some of the key results and discoveries are briefly presented below. The full report can be found in the notebook, and can be viewed online here.

FGSM

By using the fast gradient sign method (FGSM) described in (Goodfellow et al., 2014), we can quickly perturb an image of a dog to mislead ResNet-50 into thinking it is something completely different.

[Figure: FGSM-perturbed dog image and the resulting ResNet-50 misclassification]
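
Below is a minimal PyTorch sketch of the single-step FGSM attack, not the notebook's exact code: the function name `fgsm_attack` and the value of `epsilon` are illustrative, and `image` is assumed to be a normalised `1x3x224x224` tensor with `label` a one-element tensor holding the true class index.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Load a pretrained ResNet-50 to attack (requires a recent torchvision).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()

def fgsm_attack(model, image, label, epsilon=0.007):
    """Single-step FGSM: move the image along the sign of the loss gradient."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)  # loss w.r.t. the true label
    loss.backward()
    return (image + epsilon * image.grad.sign()).detach()
```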

Targeted attacks

To perform targeted attacks, we use gradient descent on

$$ \underset{\mathbf{\eta}}{\mathrm{argmin}} \; J(\theta,\mathbf{x}+\mathbf{\eta},\textrm{target label}), \textrm{ subject to 'small' } \mathbf{\eta}, $$

where $J$ is the cost function used to train ResNet-50 and the size of the perturbation $\mathbf{\eta}$ is controlled by a hyperparameter $\epsilon$. With this, we can for instance make the Border Collie seem like a sea slug to the model:

[Figure: Border Collie image perturbed so that ResNet-50 predicts "sea slug"]
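
A sketch of this optimisation under the same assumptions as before (a hedged illustration, not the report's exact implementation): here plain gradient descent is applied to the noise `eta`, and the "small" constraint is enforced by clamping to `[-epsilon, epsilon]`; `target_label` is a one-element class-index tensor, and the hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def targeted_attack(model, image, target_label, epsilon=0.05, lr=0.01, steps=100):
    """Gradient descent on the noise eta so that the model predicts `target_label`."""
    eta = torch.zeros_like(image, requires_grad=True)
    optimiser = torch.optim.SGD([eta], lr=lr)
    for _ in range(steps):
        optimiser.zero_grad()
        loss = F.cross_entropy(model(image + eta), target_label)
        loss.backward()
        optimiser.step()
        with torch.no_grad():
            eta.clamp_(-epsilon, epsilon)  # one way to keep eta 'small'
    return (image + eta).detach()
```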

Blackbox attacks

We can also perform blackbox attacks on the model (i.e. only using the forward pass of the network) by iteratively sampling a random direction $\mathbf{\eta} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$ and updating the noise on the image whenever this increases the model's loss. Doing this for just 200 iterations makes the model misclassify the dog breed, and after 10 000 iterations the dog is perceived as toilet paper:

[Figure: blackbox attack results after 200 and 10 000 iterations]
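
The random-search procedure could look like the following sketch, which only ever calls the model's forward pass; the function name `blackbox_attack` and the values of `sigma` and `iterations` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def blackbox_attack(model, image, label, sigma=0.05, iterations=10_000):
    """Random-search attack that only uses forward passes of the model."""
    eta = torch.zeros_like(image)
    with torch.no_grad():
        best_loss = F.cross_entropy(model(image + eta), label)
        for _ in range(iterations):
            candidate = eta + sigma * torch.randn_like(image)
            loss = F.cross_entropy(model(image + candidate), label)
            if loss > best_loss:  # keep the random direction if it hurts the model
                eta, best_loss = candidate, loss
    return image + eta
```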

Targeted blackbox attacks

To perform targeted blackbox attacks, we use a method similar to the above, but accept an update when it decreases the loss for a target label rather than when it increases the loss for the true label. As an example, I have done this with the dog image and the label "sea slug". On the left, I have plotted the prediction on the final image (20 000 iterations), and on the right we see how the prediction converges to "sea slug" while the probability of the correct class decreases.

| Prediction | Convergence |
| --- | --- |
| [Figure: prediction on the final image after 20 000 iterations] | [Figure: predicted probabilities converging to "sea slug"] |
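
The change to the sketch above is small: accept a proposal when it lowers the target-label loss. As before, this is an illustrative sketch rather than the notebook's exact code.

```python
import torch
import torch.nn.functional as F

def targeted_blackbox_attack(model, image, target_label, sigma=0.05, iterations=20_000):
    """As the untargeted version, but accept noise that lowers the target-label loss."""
    eta = torch.zeros_like(image)
    with torch.no_grad():
        best_loss = F.cross_entropy(model(image + eta), target_label)
        for _ in range(iterations):
            candidate = eta + sigma * torch.randn_like(image)
            loss = F.cross_entropy(model(image + candidate), target_label)
            if loss < best_loss:  # lower loss = closer to the target label
                eta, best_loss = candidate, loss
    return image + eta
```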

Universal targeted attacks

A universal attack is one where we find a single noise vector and apply it to several different images to misguide the model. To achieve this, I downloaded a subset of the ImageNet dataset and optimised the noise $\mathbf{\eta}$ to minimise the following:

$$ \underset{\mathbf{\eta}}{\mathrm{argmin}} \frac{1}{n} \sum_{(\mathbf{x},l) \in \mathbf{X}} J(\theta, \mathbf{x}+\mathbf{\eta},\textrm{target label}), $$

where $\mathbf{X}$ is a minibatch of training data (tuples of an image and its corresponding label). I held out 12 000 test images and applied the noise found during training to all of them. I then checked whether "sea slug" was the top prediction or among the top-5 predictions for these images. As we see below, the model incorrectly classifies over 90% of the 12 000 images as sea slugs, even though the exact same noise is applied to every image.

| Model | Loss | Top-1 success rate | Top-5 success rate |
| --- | --- | --- | --- |
| ResNet-50 | 15.683664 | 0.915417 | 0.958417 |
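
A sketch of how such a universal noise tensor could be optimised, assuming a `data_loader` that yields batches of normalised `3x224x224` images and an integer `target_label`; the function name and hyperparameters are illustrative, not taken from the report.

```python
import torch
import torch.nn.functional as F

def universal_targeted_noise(model, data_loader, target_label, epsilon=0.1, lr=0.01, epochs=1):
    """Optimise one noise tensor over many images so they are all pushed towards `target_label`."""
    eta = torch.zeros(1, 3, 224, 224, requires_grad=True)
    optimiser = torch.optim.SGD([eta], lr=lr)
    for _ in range(epochs):
        for images, _ in data_loader:  # true labels are unused: we always aim for the target
            targets = torch.full((images.size(0),), target_label, dtype=torch.long)
            optimiser.zero_grad()
            loss = F.cross_entropy(model(images + eta), targets)  # eta broadcasts over the batch
            loss.backward()
            optimiser.step()
            with torch.no_grad():
                eta.clamp_(-epsilon, epsilon)  # keep the universal noise 'small'
    return eta.detach()
```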

We can also visualise what this universal noise looks like by applying it to a dog and a panda:

| Dog | Panda |
| --- | --- |
| [Figure: universal noise applied to the dog image] | [Figure: universal noise applied to the panda image] |

References

Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy (2014). "Explaining and Harnessing Adversarial Examples". arXiv. https://doi.org/10.48550/arxiv.1412.6572
