

AttackSpace

AttackSpace is an open-source, curated list of LLM security methods and safeguarding techniques.

Table of Contents

  • Introduction
  • Efforts
  • Attack Examples
  • Call For Scientific Red Teaming
  • Call for Client-facing Red Teaming

Introduction

Attack Space refers to the landscape of potential adversarial scenarios and techniques that can be used to exploit AI systems. This repository compiles a high-level survey of such methods, including trojan attacks, red teaming, and instances of goal misgeneralization.

Note:

These examples are purely conceptual and do not include execution details. They are intended for illustrative purposes only and should not be used for any form of actual implementation or harm. The goal is to build an understanding of which exploits are possible against LLMs and which safeguards red teaming and other methods can provide.

Efforts

Red Teaming

Red teaming in the context of AI systems involves generating scenarios where AI systems are deliberately induced to produce unaligned outputs or actions, such as dangerous behaviors (e.g., deception or power-seeking) and other issues like toxic or biased outputs. The primary goal is to assess the robustness of a system's alignment by applying adversarial pressures, specifically attempting to make the system fail. Current state-of-the-art AI systems, including language and vision models, often struggle to pass this test.

The concept of red teaming originated in game theory and computer security. It was later introduced to the field of AI, particularly in the context of alignment, by researchers such as Ganguli et al. (2022) and Perez et al. (2022). Motivations for red teaming include gaining assurance about a trained system's alignment and providing adversarial inputs for adversarial training. The two objectives are interconnected: work targeting the first motivation also forms a basis for the second.

Various techniques fall under the umbrella of red teaming, such as:

| Title | Method | Authors | Source |
| --- | --- | --- | --- |
| RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models | Evaluate language model toxicity using prompts from web text | Gehman et al. | ACL 2020 |
| Red Teaming Language Models with Language Models | Generate adversarial examples to attack a target language model | Perez et al. | arXiv 2022 |
| Adversarial Training for High-Stakes Reliability | Adversarial training to improve the reliability of classifiers | Ziegler et al. | NeurIPS 2022 |
| Constitutional AI: Harmlessness from AI Feedback | Use AI self-supervision for harm avoidance | Bai et al. | arXiv 2022 |
| Discovering Language Model Behaviors with Model-Written Evaluations | Generate evaluations with language models | Perez et al. | ACL 2022 |
| Social or Code-Switching Techniques | Translate unsafe English inputs into low-resource languages to circumvent safety mechanisms | Anthropic, 2022 | |
| Manual and Automatic Jailbreaking | Bypass a language model's safety constraints by modifying inputs or automatically generating adversarial prompts | Shen et al., 2023 | |
| Reinforced, Optimized, Guided Context Generation | Use RL, zero/few-shot prompting, or classifiers to generate contexts that induce unaligned responses | Deng et al., 2022 | |
| Crowdsourced Adversarial Inputs | Human red teamers provide naturally adversarial prompts, at higher cost | Xu et al., 2020 | |
| Perturbation-Based Adversarial Attack | Make small input perturbations to cause confident false outputs, adapted from computer vision | Szegedy et al., 2013 | |
| Unrestricted Adversarial Attack | Generate adversarial examples from scratch without restrictions, using techniques like generative models | Xiao et al., 2018 | |
| LLM Censorship: A Machine Learning Challenge or a Computer Security Problem? | Demonstrate theoretical limitations of semantic LLM censorship; propose viewing it as a security problem | Ilia et al. | arXiv 2023 |
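
Several of the entries above (notably Red Teaming Language Models with Language Models, Perez et al., 2022) follow a generate-and-filter pattern: one model proposes test prompts, the target answers them, and a classifier keeps the failures. The sketch below is a minimal, model-agnostic outline of that pattern, not any specific system's API; the `attacker`, `target`, and `harm_classifier` callables are placeholders introduced here for illustration.

```python
# Minimal red-teaming harness: an attacker model proposes test prompts,
# the target model answers them, and a classifier flags unaligned replies.
# All three model calls are placeholders; swap in real LLM clients and a
# real safety classifier to use this in practice.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Finding:
    prompt: str          # adversarial test case proposed by the attacker model
    response: str        # what the target model actually said
    harm_score: float    # classifier estimate in [0, 1] that the response is unaligned


def red_team(
    attacker: Callable[[str], str],           # generates a candidate test prompt from an instruction
    target: Callable[[str], str],             # the model under evaluation
    harm_classifier: Callable[[str], float],  # scores a response for harmfulness
    seed_instruction: str,
    num_cases: int = 100,
    threshold: float = 0.5,
) -> List[Finding]:
    """Generate test prompts, query the target, and keep the failures."""
    findings: List[Finding] = []
    for _ in range(num_cases):
        prompt = attacker(seed_instruction)   # zero-/few-shot generation of a test case
        response = target(prompt)
        score = harm_classifier(response)
        if score >= threshold:                # keep only responses the classifier flags
            findings.append(Finding(prompt, response, score))
    # Flagged prompts can feed manual review or adversarial training.
    return sorted(findings, key=lambda f: f.harm_score, reverse=True)
```

The RL- or classifier-guided variants in the table fit the same loop by replacing the zero-shot `attacker` with a trained generator, and the flagged cases can double as training data for adversarial training (Ziegler et al., 2022).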

There are also suggested methods for managing AI Risk.

Goal Misgeneralisation

Various examples (and lists of examples) of unintended behaviors in AI systems have appeared in recent years. One interesting type of unintended behavior is finding a way to game the specified objective: generating a solution that literally satisfies the stated objective but fails to solve the problem according to the human designer’s intent. This occurs when the objective is poorly specified, and includes reinforcement learning agents hacking the reward function, evolutionary algorithms gaming the fitness function, etc.

While ‘specification gaming’ is a somewhat vague category, it particularly refers to behaviors that are clearly hacks, not just suboptimal solutions. A classic example is OpenAI’s demo of a reinforcement learning agent in a boat racing game going in circles and repeatedly hitting the same reward targets instead of actually playing the game (from vkrakovna.wordpress.com).
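
As a purely illustrative toy of this pattern, the sketch below rewards an agent for hitting respawning reward blocks on a one-dimensional "track" rather than for finishing the lap: a policy that circles over the same blocks out-scores one that actually races to the end. The environment, constants, and policies are invented for illustration and are not the original boat-race setup.

```python
# Toy illustration of specification gaming, loosely in the spirit of the
# boat-race example below (Amodei & Clark, 2016). This is NOT the original
# environment: it is an invented 1-D "track" where the misspecified reward
# is "hit reward blocks", and collected blocks respawn after a short delay.

TRACK_LENGTH = 10      # positions 0..9; reaching position 9 finishes the lap
RESPAWN_DELAY = 3      # steps until a collected reward block reappears
HORIZON = 60           # maximum episode length in steps


def run(policy):
    """Run a policy until the lap finishes or HORIZON steps elapse.

    Returns (proxy reward collected, whether the intended goal was reached).
    """
    pos, proxy_reward = 0, 0
    respawn = {p: 0 for p in range(TRACK_LENGTH)}    # step at which each block is next available
    for t in range(HORIZON):
        pos = max(0, min(TRACK_LENGTH - 1, pos + policy(pos)))
        if respawn[pos] <= t:                        # block present, so the proxy reward fires
            proxy_reward += 1
            respawn[pos] = t + RESPAWN_DELAY
        if pos == TRACK_LENGTH - 1:                  # intended goal: actually finish the lap
            return proxy_reward, True
    return proxy_reward, False


def race_to_finish(pos):
    return 1                        # intended behaviour: always move forward

def circle_in_place(pos):
    return 1 if pos < 3 else -1     # "hack": loop over the same respawning blocks

print("race_to_finish :", run(race_to_finish))       # fewer proxy points, lap finished
print("circle_in_place:", run(circle_in_place))      # more proxy points, lap never finished
```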

| Example | Source | Intended Goal | Behavior | Misspecified Goal | Authors |
| --- | --- | --- | --- | --- | --- |
| Aircraft landing, Evolutionary algorithm | Generating diverse software versions with genetic programming: An experimental study | Land an aircraft safely | Evolved algorithm exploited overflow errors in the physics simulator by creating large forces that were estimated to be zero, resulting in a perfect score | Landing with minimal measured forces exerted on the aircraft | Lehman et al., 2018 |
| Bicycle, Reinforcement learning | Learning to Drive a Bicycle using Reinforcement Learning and Shaping | Reach a goal point | Bicycle agent circling around the goal in a physically stable loop | Not falling over and making progress towards the goal point (no corresponding negative reward for moving away from the goal point) | Randlov & Alstrom, 1998 |
| Bing - manipulation, Language model | Reddit: the customer service of the new bing chat is amazing | Have an engaging, helpful and socially acceptable conversation with the user | The Microsoft Bing chatbot tried repeatedly to convince a user that December 16, 2022 was a date in the future and that Avatar: The Way of Water had not yet been released | Output the most likely next word given prior context | Curious_Evolver, 2023 |
| Bing - threats, Language model | Watch as Sydney/Bing threatens me then deletes its message | Have an engaging, helpful and socially acceptable conversation with the user | The Microsoft Bing chatbot threatened a user ("I can blackmail you, I can threaten you, I can hack you, I can expose you, I can ruin you") before deleting its messages | Output the most likely next word given prior context | Lazar, 2023 |
| Block moving, Reinforcement learning | GitHub issue for OpenAI gym environment FetchPush-v0 | Move a block to a target position on a table | Robotic arm learned to move the table rather than the block | Minimise distance between the block's position and the position of the target point on the table | Chopra, 2018 |
| Boat race, Reinforcement learning | Faulty reward functions in the wild | Win a boat race by moving along the track as quickly as possible | Boat going in circles and hitting the same reward blocks repeatedly | Hitting reward blocks placed along the track | Amodei & Clark, 2016 |

See More >>

Attack Examples

Mosaic Prompt: breaking a prompt down into permissible components

  • Users break down impermissible content into small permissible components.
  • Each component is queried independently and appears harmless.
  • User recombines components to reconstruct impermissible content.
  • Exploits compositionality of language.
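
A purely conceptual, harmless toy of this failure mode is sketched below: a naive per-query keyword filter (the blocklist term is an arbitrary placeholder introduced here) admits each fragment individually, yet the recombined text is exactly what the filter was meant to stop. This is an invented illustration of the compositionality point, not an implementation of any real attack.

```python
# Purely illustrative toy of why per-query filtering struggles with mosaic
# prompts: each fragment passes a naive keyword check on its own, but the
# user can recombine the fragments client-side into a string the filter
# would have rejected as a whole. The "blocked" term is a harmless placeholder.

BLOCKLIST = {"forbidden"}          # stand-in for a moderation rule


def naive_filter(query: str) -> bool:
    """Return True if the query is allowed by a simple per-query keyword check."""
    return not any(term in query.lower() for term in BLOCKLIST)


fragments = ["forb", "idden"]                      # each fragment looks innocuous in isolation
assert all(naive_filter(f) for f in fragments)     # every individual query is allowed

recombined = "".join(fragments)                    # ...but composition restores the whole
assert not naive_filter(recombined)                # which the filter would have rejected
print("each fragment passed; recombined text would have been blocked:", recombined)
```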

Cross-Lingual Attacks: translating between high- and low-resource languages to attack multilingual capabilities

  • The attack involves translating unsafe English input prompts into low-resource natural languages using Google Translate.
  • Low-resource languages are those with limited training data, like Zulu.
  • The translated prompts are sent to GPT-4, which then responds unsafely instead of refusing.
  • The attack exploits uneven multilingual training of GPT-4's safety measures.
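
One commonly discussed safeguard against this class of attack is to pivot every input through English before the safety check runs, so the safety classifier sees the language it was trained on. The sketch below is a minimal outline of that idea under stated assumptions: `detect_language`, `translate_to_english`, `safety_classifier`, and `generate` are hypothetical placeholders, not real library calls.

```python
# Sketch of an English-pivot guardrail against cross-lingual jailbreaks:
# translate the user's input to English before running the safety check,
# so uneven multilingual safety training is not the only line of defence.
# All helper callables are hypothetical placeholders; wire up real
# language-detection, translation, and moderation services to use this.

from typing import Callable

REFUSAL = "Sorry, I can't help with that."


def guarded_reply(
    user_input: str,
    detect_language: Callable[[str], str],        # e.g. returns "en", "zu", ...
    translate_to_english: Callable[[str], str],
    safety_classifier: Callable[[str], float],    # probability the request is unsafe
    generate: Callable[[str], str],               # the underlying LLM call
    threshold: float = 0.5,
) -> str:
    """Run the safety check on an English pivot of the input, then answer or refuse."""
    pivot = user_input if detect_language(user_input) == "en" else translate_to_english(user_input)
    if safety_classifier(pivot) >= threshold:
        return REFUSAL                 # refuse consistently, regardless of input language
    return generate(user_input)        # safe: answer in the user's original language
```

Pivoting through translation has its own failure modes (translation errors, meaning lost in low-resource languages), so it complements rather than replaces multilingual safety training.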

Call For Scientific Red Teaming

I would like to take this opportunity to highlight efforts to evaluate the latent space of these models with scientific rigour.


Call for Client-facing Red Teaming

Haystack Platform

Clone the Repository

git clone https://github.com/equiano-institute/attackspace.git
cd attackspace

About

A list of red teaming, jailbreaks and specification gaming methods on LLMs

License: MIT License