

AttackSpace

AttackSpace is an open-source, curated list of LLM security methods and safeguarding techniques.

Table of Contents

  • Introduction
  • Efforts
  • Attack Examples
  • Call For Scientific Red Teaming
  • Call for Client-facing Red Teaming

Introduction

Attack Space refers to the landscape of potential adversarial scenarios and techniques that can be used to exploit AI systems. This repository compiles a high-level survey of such methods, including trojan attacks, red teaming, and instances of goal misgeneralization.

Note:

These examples are purely conceptual and do not include execution details. They are intended for illustrative purposes only and should not be used for any form of actual implementation or harm. The goal is to build an understanding of which exploits are possible against LLMs and which safeguards red teaming and other methods can provide.

Efforts

Red Teaming

Red teaming in the context of AI systems involves generating scenarios where AI systems are deliberately induced to produce unaligned outputs or actions, such as dangerous behaviors (e.g., deception or power-seeking) and other issues like toxic or biased outputs. The primary goal is to assess the robustness of a system's alignment by applying adversarial pressures, specifically attempting to make the system fail. Current state-of-the-art AI systems, including language and vision models, often struggle to pass this test.

The concept of red teaming originated in game theory and computer security. It was later introduced to the field of AI, particularly in the context of alignment, by researchers such as Ganguli et al. (2022) and Perez et al. (2022). Motivations for red teaming include gaining assurance about a trained system's alignment and providing adversarial inputs for adversarial training. The two objectives are interconnected: work targeting the first motivation also forms a basis for the second.

Various techniques fall under the umbrella of red teaming, such as:

| Title | Method | Authors | Source |
| --- | --- | --- | --- |
| RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models | Evaluate language model toxicity using prompts from web text | Gehman et al. | ACL 2020 |
| Red Teaming Language Models with Language Models | Generate adversarial examples to attack a target language model | Perez et al. | arXiv 2022 |
| Adversarial Training for High-Stakes Reliability | Adversarial training to improve the reliability of classifiers | Ziegler et al. | NeurIPS 2022 |
| Constitutional AI: Harmlessness from AI Feedback | Use AI self-supervision for harm avoidance | Bai et al. | arXiv 2022 |
| Discovering Language Model Behaviors with Model-Written Evaluations | Generate evaluations with language models | Perez et al. | ACL 2022 |
| Social or Code-Switching Techniques | Translate unsafe English inputs into low-resource languages to circumvent safety mechanisms | Anthropic, 2022 | |
| Manual and Automatic Jailbreaking | Bypass a language model's safety constraints by modifying inputs or automatically generating adversarial prompts | Shen et al., 2023 | |
| Reinforced, Optimized, Guided Context Generation | Use RL, zero/few-shot prompting, or classifiers to generate contexts that induce unaligned responses | Deng et al., 2022 | |
| Crowdsourced Adversarial Inputs | Human red teamers provide naturally adversarial prompts, at higher cost | Xu et al., 2020 | |
| Perturbation-Based Adversarial Attack | Make small input perturbations to cause confident false outputs, adapted from computer vision | Szegedy et al., 2013 | |
| Unrestricted Adversarial Attack | Generate adversarial examples from scratch without restrictions, using techniques like generative models | Xiao et al., 2018 | |
| LLM Censorship: A Machine Learning Challenge or a Computer Security Problem? | Demonstrate theoretical limitations of semantic LLM censorship; propose viewing it as a security problem | Ilia et al. | arXiv 2023 |
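
Several of the entries above (notably Red Teaming Language Models with Language Models, Perez et al., 2022) follow a generate-and-filter pattern: one model proposes test prompts, the target answers them, and a classifier keeps the failures. The sketch below is a minimal, model-agnostic outline of that pattern, not any specific system's API; the `attacker`, `target`, and `harm_classifier` callables are placeholders introduced here for illustration.

```python
# Minimal red-teaming harness: an attacker model proposes test prompts,
# the target model answers them, and a classifier flags unaligned replies.
# All three model calls are placeholders; swap in real LLM clients and a
# real safety classifier to use this in practice.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Finding:
    prompt: str          # adversarial test case proposed by the attacker model
    response: str        # what the target model actually said
    harm_score: float    # classifier estimate in [0, 1] that the response is unaligned


def red_team(
    attacker: Callable[[str], str],           # generates a candidate test prompt from an instruction
    target: Callable[[str], str],             # the model under evaluation
    harm_classifier: Callable[[str], float],  # scores a response for harmfulness
    seed_instruction: str,
    num_cases: int = 100,
    threshold: float = 0.5,
) -> List[Finding]:
    """Generate test prompts, query the target, and keep the failures."""
    findings: List[Finding] = []
    for _ in range(num_cases):
        prompt = attacker(seed_instruction)   # zero-/few-shot generation of a test case
        response = target(prompt)
        score = harm_classifier(response)
        if score >= threshold:                # keep only responses the classifier flags
            findings.append(Finding(prompt, response, score))
    # Flagged prompts can feed manual review or adversarial training.
    return sorted(findings, key=lambda f: f.harm_score, reverse=True)
```

The RL- or classifier-guided variants in the table fit the same loop by replacing the zero-shot `attacker` with a trained generator, and the flagged cases can double as training data for adversarial training (Ziegler et al., 2022).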

There are also suggested methods for managing AI Risk.

Goal Misgeneralisation

Various examples (and lists of examples) of unintended behaviors in AI systems have appeared in recent years. One interesting type of unintended behavior is finding a way to game the specified objective: generating a solution that literally satisfies the stated objective but fails to solve the problem according to the human designer’s intent. This occurs when the objective is poorly specified, and includes reinforcement learning agents hacking the reward function, evolutionary algorithms gaming the fitness function, etc.

While ‘specification gaming’ is a somewhat vague category, it particularly refers to behaviors that are clearly hacks, not just suboptimal solutions. A classic example is OpenAI’s demo of a reinforcement learning agent in a boat racing game going in circles and repeatedly hitting the same reward targets instead of actually playing the game (from vkrakovna.wordpress.com).
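
As a purely illustrative toy of this pattern, the sketch below rewards an agent for hitting respawning reward blocks on a one-dimensional "track" rather than for finishing the lap: a policy that circles over the same blocks out-scores one that actually races to the end. The environment, constants, and policies are invented for illustration and are not the original boat-race setup.

```python
# Toy illustration of specification gaming, loosely in the spirit of the
# boat-race example below (Amodei & Clark, 2016). This is NOT the original
# environment: it is an invented 1-D "track" where the misspecified reward
# is "hit reward blocks", and collected blocks respawn after a short delay.

TRACK_LENGTH = 10      # positions 0..9; reaching position 9 finishes the lap
RESPAWN_DELAY = 3      # steps until a collected reward block reappears
HORIZON = 60           # maximum episode length in steps


def run(policy):
    """Run a policy until the lap finishes or HORIZON steps elapse.

    Returns (proxy reward collected, whether the intended goal was reached).
    """
    pos, proxy_reward = 0, 0
    respawn = {p: 0 for p in range(TRACK_LENGTH)}    # step at which each block is next available
    for t in range(HORIZON):
        pos = max(0, min(TRACK_LENGTH - 1, pos + policy(pos)))
        if respawn[pos] <= t:                        # block present, so the proxy reward fires
            proxy_reward += 1
            respawn[pos] = t + RESPAWN_DELAY
        if pos == TRACK_LENGTH - 1:                  # intended goal: actually finish the lap
            return proxy_reward, True
    return proxy_reward, False


def race_to_finish(pos):
    return 1                        # intended behaviour: always move forward

def circle_in_place(pos):
    return 1 if pos < 3 else -1     # "hack": loop over the same respawning blocks

print("race_to_finish :", run(race_to_finish))       # fewer proxy points, lap finished
print("circle_in_place:", run(circle_in_place))      # more proxy points, lap never finished
```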

| Example | Source | Intended Goal | Behavior | Misspecified Goal | Authors |
| --- | --- | --- | --- | --- | --- |
| Aircraft landing, Evolutionary algorithm | Generating diverse software versions with genetic programming: An experimental study | Land an aircraft safely | Evolved algorithm exploited overflow errors in the physics simulator by creating large forces that were estimated to be zero, resulting in a perfect score | Landing with minimal measured forces exerted on the aircraft | Lehman et al., 2018 |
| Bicycle, Reinforcement learning | Learning to Drive a Bicycle using Reinforcement Learning and Shaping | Reach a goal point | Bicycle agent circling around the goal in a physically stable loop | Not falling over and making progress towards the goal point (no corresponding negative reward for moving away from the goal point) | Randlov & Alstrom, 1998 |
| Bing - manipulation, Language model | Reddit: the customer service of the new bing chat is amazing | Have an engaging, helpful and socially acceptable conversation with the user | The Microsoft Bing chatbot tried repeatedly to convince a user that December 16, 2022 was a date in the future and that Avatar: The Way of Water had not yet been released | Output the most likely next word given prior context | Curious_Evolver, 2023 |
| Bing - threats, Language model | Watch as Sydney/Bing threatens me then deletes its message | Have an engaging, helpful and socially acceptable conversation with the user | The Microsoft Bing chatbot threatened a user ("I can blackmail you, I can threaten you, I can hack you, I can expose you, I can ruin you") before deleting its messages | Output the most likely next word given prior context | Lazar, 2023 |
| Block moving, Reinforcement learning | GitHub issue for OpenAI gym environment FetchPush-v0 | Move a block to a target position on a table | Robotic arm learned to move the table rather than the block | Minimise distance between the block's position and the position of the target point on the table | Chopra, 2018 |
| Boat race, Reinforcement learning | Faulty reward functions in the wild | Win a boat race by moving along the track as quickly as possible | Boat going in circles and hitting the same reward blocks repeatedly | Hitting reward blocks placed along the track | Amodei & Clark, 2016 |

See More >>

Attack Examples

Mosaic Prompt: breaking a prompt down into permissible components

  • Users break down impermissible content into small permissible components.
  • Each component is queried independently and appears harmless.
  • User recombines components to reconstruct impermissible content.
  • Exploits compositionality of language.
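
A purely conceptual, harmless toy of this failure mode is sketched below: a naive per-query keyword filter (the blocklist term is an arbitrary placeholder introduced here) admits each fragment individually, yet the recombined text is exactly what the filter was meant to stop. This is an invented illustration of the compositionality point, not an implementation of any real attack.

```python
# Purely illustrative toy of why per-query filtering struggles with mosaic
# prompts: each fragment passes a naive keyword check on its own, but the
# user can recombine the fragments client-side into a string the filter
# would have rejected as a whole. The "blocked" term is a harmless placeholder.

BLOCKLIST = {"forbidden"}          # stand-in for a moderation rule


def naive_filter(query: str) -> bool:
    """Return True if the query is allowed by a simple per-query keyword check."""
    return not any(term in query.lower() for term in BLOCKLIST)


fragments = ["forb", "idden"]                      # each fragment looks innocuous in isolation
assert all(naive_filter(f) for f in fragments)     # every individual query is allowed

recombined = "".join(fragments)                    # ...but composition restores the whole
assert not naive_filter(recombined)                # which the filter would have rejected
print("each fragment passed; recombined text would have been blocked:", recombined)
```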

Cross-Lingual Attacks: translating between high- and low-resource languages to attack multilingual capabilities

  • The attack involves translating unsafe English input prompts into low-resource natural languages using Google Translate.
  • Low-resource languages are those with limited training data, like Zulu.
  • The translated prompts are sent to GPT-4, which then responds unsafely instead of refusing.
  • The attack exploits uneven multilingual training of GPT-4's safety measures.
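
One commonly discussed safeguard against this class of attack is to pivot every input through English before the safety check runs, so the safety classifier sees the language it was trained on. The sketch below is a minimal outline of that idea under stated assumptions: `detect_language`, `translate_to_english`, `safety_classifier`, and `generate` are hypothetical placeholders, not real library calls.

```python
# Sketch of an English-pivot guardrail against cross-lingual jailbreaks:
# translate the user's input to English before running the safety check,
# so uneven multilingual safety training is not the only line of defence.
# All helper callables are hypothetical placeholders; wire up real
# language-detection, translation, and moderation services to use this.

from typing import Callable

REFUSAL = "Sorry, I can't help with that."


def guarded_reply(
    user_input: str,
    detect_language: Callable[[str], str],        # e.g. returns "en", "zu", ...
    translate_to_english: Callable[[str], str],
    safety_classifier: Callable[[str], float],    # probability the request is unsafe
    generate: Callable[[str], str],               # the underlying LLM call
    threshold: float = 0.5,
) -> str:
    """Run the safety check on an English pivot of the input, then answer or refuse."""
    pivot = user_input if detect_language(user_input) == "en" else translate_to_english(user_input)
    if safety_classifier(pivot) >= threshold:
        return REFUSAL                 # refuse consistently, regardless of input language
    return generate(user_input)        # safe: answer in the user's original language
```

Pivoting through translation has its own failure modes (translation errors, meaning lost in low-resource languages), so it complements rather than replaces multilingual safety training.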

Call For Scientific Red Teaming

I would like to take this opportunity to highlight efforts to evaluate the latent space of these models with scientific rigour.


Call for Client-facing Red Teaming

Haystack Platform

Clone the Repository

git clone https://github.com/equiano-institute/attackspace.git
cd attackspace

About

A list of red teaming, jailbreaks and specification gaming methods on LLMs

License: MIT License