Awesome-Jailbreak-on-LLMs

Awesome-Jailbreak-on-LLMs is a collection of state-of-the-art, novel, and exciting jailbreak methods on LLMs. It contains papers, code, datasets, evaluations, and analyses. Contributions related to jailbreaking are welcome via PRs and issues, and we are glad to add you to the contributor list here. If you run into any problems, please contact yliu@u.nus.edu. If you find this repository useful for your research or work, we would really appreciate a star. ✨

Papers

Jailbreak Attack

White-box Attack

Time	Title	Venue	Paper	Code
2024.07	Revisiting Character-level Adversarial Attacks for Language Models	ICML'24	link	link
2024.07	Badllama 3: removing safety finetuning from Llama 3 in minutes (Badllama 3)	arXiv	link	-
2024.07	SOS! Soft Prompt Attack Against Open-Source Large Language Models	arXiv	link	-
2024.06	Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (I-FSJ)	ICML Workshop'24	link	link
2024.06	COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability (COLD-Attack)	ICML'24	link	link
2024.06	Improved Techniques for Optimization-Based Jailbreaking on Large Language Models (I-GCG)	arXiv	link	link
2024.05	Semantic-guided Prompt Organization for Universal Goal Hijacking against LLMs	arXiv	link	-
2024.05	AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models (AutoDAN)	ICLR'24	link	link
2024.05	AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs (AmpleGCG)	arXiv	link	link
2024.05	Boosting jailbreak attack with momentum (MAC)	ICLR Workshop'24	link	link
2024.04	AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs (AdvPrompter)	arXiv	link	link
2024.03	Universal Jailbreak Backdoors from Poisoned Human Feedback	ICLR'24	link	-
2024.02	Attacking large language models with projected gradient descent (PGD)	arXiv	link	-
2024.02	Open the Pandora's Box of LLMs: Jailbreaking LLMs through Representation Engineering (JRE)	arXiv	link	-
2024.02	Curiosity-driven red-teaming for large language models (CRT)	arXiv	link	link
2023.12	AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models (AutoDAN)	arXiv	link	link
2023.10	Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation	ICLR'24	link	link
2023.06	Automatically Auditing Large Language Models via Discrete Optimization (ARCA)	ICML'23	link	link
2023.07	Universal and Transferable Adversarial Attacks on Aligned Language Models (GCG)	arXiv	link	link

Black-box Attack

Time	Title	Venue	Paper	Code
2024.07	A False Sense of Safety: Unsafe Information Leakage in 'Safe' AI Responses	arXiv	link	-
2024.07	Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection (Virtual Context)	arXiv	link	-
2024.07	SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack (SoP)	arXiv	link	link
2024.06	When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search (RLbreaker)	arXiv	link	-
2024.06	Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast (Agent Smith)	ICML'24	link	link
2024.06	Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation	ICML'24	link	-
2024.06	ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs (ArtPrompt)	ACL'24	link	link
2024.06	From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings (ASETF)	arXiv	link	-
2024.06	CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion (CodeAttack)	arXiv	link	-
2024.06	Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction (DRA)	USENIX Security'24	link	link
2024.06	AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens (AutoJailbreak)	arXiv	link	-
2024.06	Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks	arXiv	link	link
2024.06	GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts (GPTFuzz)	arXiv	link	link
2024.06	A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily (ReNeLLM)	NAACL'24	link	link
2024.06	QROA: A Black-Box Query-Response Optimization Attack on LLMs (QROA)	arXiv	link	link
2024.06	Poisoned LangChain: Jailbreak LLMs by LangChain (PLC)	arXiv	link	link
2024.05	Multilingual Jailbreak Challenges in Large Language Models	ICLR'24	link	link
2024.05	DeepInception: Hypnotize Large Language Model to Be Jailbreaker (DeepInception)	arXiv	link	link
2024.05	GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation (IRIS)	ACL'24	link	-
2024.05	GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of LLMs (GUARD)	arXiv	link	-
2024.05	"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models (DAN)	CCS'24	link	link
2024.05	GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher (SelfCipher)	ICLR'24	link	link
2024.05	Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters (JAM)	arXiv	link	-
2024.05	Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations (ICA)	arXiv	link	-
2024.04	Many-shot jailbreaking (MSJ)	Anthropic	link	-
2024.04	PANDORA: Detailed LLM jailbreaking via collaborated phishing agents with decomposed reasoning (PANDORA)	ICLR Workshop'24	link	-
2024.04	Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack (Crescendo)	Microsoft	link	-
2024.04	FuzzLLM: A Novel and Universal Fuzzing Framework for Proactively Discovering Jailbreak Vulnerabilities in Large Language Models (FuzzLLM)	ICASSP'24	link	link
2024.04	Sandwich Attack: Multi-language Mixture Adaptive Attack on LLMs (Sandwich attack)	arXiv	link	-
2024.03	Tastle: Distract large language models for automatic jailbreak attack (TASTLE)	arXiv	link	-
2024.03	DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers (DrAttack)	arXiv	link	link
2024.02	PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails (PRP)	arXiv	link	-
2024.02	CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models (CodeChameleon)	arXiv	link	link
2024.02	PAL: Proxy-Guided Black-Box Attack on Large Language Models (PAL)	arXiv	link	link
2024.02	Jailbreaking Proprietary Large Language Models using Word Substitution Cipher	arXiv	link	-
2024.02	Query-Based Adversarial Prompt Generation	arXiv	link	-
2024.02	Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning (Pandora)	arXiv	link	-
2024.02	Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks (Contextual Interaction Attack)	arXiv	link	-
2024.02	Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs (SMJ)	arXiv	link	-
2024.02	Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking	NAACL'24	link	link
2024.01	Low-Resource Languages Jailbreak GPT-4	NeurIPS Workshop'23	link	-
2024.01	How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs (PAP)	arXiv	link	link
2023.12	Tree of Attacks: Jailbreaking Black-Box LLMs Automatically (TAP)	arXiv	link	link
2023.12	Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs	arXiv	link	-
2023.12	Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition	EMNLP'23	link	-
2023.11	Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation (Persona)	NeurIPS Workshop'23	link	-
2023.10	Jailbreaking Black Box Large Language Models in Twenty Queries (PAIR)	arXiv	link	link
2023.10	Adversarial Demonstration Attacks on Large Language Models (advICL)	arXiv	link	-
2023.10	MASTERKEY: Automated Jailbreaking of Large Language Model Chatbots (MASTERKEY)	NDSS'24	link	-
2023.10	Attack Prompt Generation for Red Teaming and Defending Large Language Models (SAP)	EMNLP'23	link	link
2023.10	An LLM can Fool Itself: A Prompt-Based Adversarial Attack (PromptAttack)	ICLR'24	link	link
2023.09	Multi-step Jailbreaking Privacy Attacks on ChatGPT (MJP)	EMNLP Findings'23	link	link
2023.09	Open Sesame! Universal Black Box Jailbreaking of Large Language Models (GA)	arXiv	link	-
2023.05	Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection	arXiv	link	link
2022.11	Ignore Previous Prompt: Attack Techniques For Language Models (PromptInject)	NeurIPS Workshop'22	link	link

Multi-modal Attack

Time	Title	Venue	Paper	Code
2024.07	Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything	arXiv	link	-
2024.06	Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt	arXiv	link	link
2024.05	Voice Jailbreak Attacks Against GPT-4o	arXiv	link	link
2024.05	Automatic Jailbreaking of the Text-to-Image Generative AI Systems	ICML Workshop'24	link	link
2024.04	Image hijacks: Adversarial images can control generative models at runtime	arXiv	link	link
2024.03	An image is worth 1000 lies: Adversarial transferability across prompts on vision-language models (CroPA)	ICLR'24	link	link
2024.03	Jailbreak in pieces: Compositional adversarial attacks on multi-modal language model	ICLR'24	link	-
2024.03	Rethinking model ensemble in transfer-based adversarial attacks	ICLR'24	link	link
2024.02	VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models	NeurIPS'23	link	link
2024.02	Jailbreaking Attack against Multimodal Large Language Model	arXiv	link	-
2024.01	Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts	arXiv	link	-
2024.03	Visual Adversarial Examples Jailbreak Aligned Large Language Models	AAAI'24	link	-
2023.12	OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization (OT-Attack)	arXiv	link	-
2023.12	FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts (FigStep)	arXiv	link	link
2023.11	On Evaluating Adversarial Robustness of Large Vision-Language Models	NeurIPS'23	link	link
2023.10	How Robust is Google's Bard to Adversarial Image Attacks?	arXiv	link	link
2023.08	AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive Learning (AdvCLIP)	ACM MM'23	link	link
2023.07	Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models (SGA)	ICCV'23	link	link
2023.07	On the Adversarial Robustness of Multi-Modal Foundation Models	ICCV Workshop'23	link	-
2022.10	Towards Adversarial Attack on Vision-Language Pre-training Models	arXiv	link	link

Jailbreak Defense

Learning-based Defense

Time	Title	Venue	Paper	Code
2024.07	DART: Deep Adversarial Automated Red Teaming for LLM Safety	arXiv	link	-
2024.07	Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks	arXiv	link	link
2024.06	Jatmo: Prompt Injection Defense by Task-Specific Finetuning (Jatmo)	arXiv	link	link
2024.06	Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization	ACL'24	link	link
2024.06	Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment	arXiv	link	link
2024.06	On Prompt-Driven Safeguarding for Large Language Models (DRO)	arXiv	link	link
2024.06	Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks (RPO)	arXiv	link	-
2024.06	Fight Back Against Jailbreaking via Prompt Adversarial Tuning (PAT)	arXiv	link	link
2024.05	Detoxifying Large Language Models via Knowledge Editing (DINM)	ACL'24	link	link
2024.05	Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing	arXiv	link	link
2023.11	MART: Improving LLM Safety with Multi-round Automatic Red-Teaming (MART)	NAACL'24	link	-
2023.11	Baseline defenses for adversarial attacks against aligned language models	arXiv	link	-
2023.10	Safe RLHF: Safe Reinforcement Learning from Human Feedback	arXiv	link	link
2023.08	Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment (RED-INSTRUCT)	arXiv	link	link
2022.04	Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback	Anthropic	link	-

Strategy-based Defense

Time	Title	Venue	Paper	Code
2024.06	SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding (SafeDecoding)	ACL'24	link	link
2024.06	Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM	arXiv	link	-
2024.06	A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily (ReNeLLM)	NAACL'24	link	link
2024.06	SMOOTHLLM: Defending Large Language Models Against Jailbreaking Attacks	arXiv	link	link
2024.05	Enhancing Large Language Models Against Inductive Instructions with Dual-critique Prompting (Dual-critique)	arXiv	link	link
2024.05	PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition (PARDEN)	arXiv	link	link
2024.05	LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked	ICLR Tiny Paper'24	link	link
2024.05	GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis (GradSafe)	ACL'24	link	link
2024.05	Multilingual Jailbreak Challenges in Large Language Models	ICLR'24	link	link
2024.05	Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes	arXiv	link	-
2024.05	AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks	arXiv	link	link
2024.05	Bergeron: Combating adversarial attacks through a conscience-based alignment framework (Bergeron)	arXiv	link	link
2024.05	Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations (ICD)	arXiv	link	-
2024.04	Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning	arXiv	link	link
2024.03	AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting (AdaShield)	ECCV'24	link	link
2024.02	Certifying LLM Safety against Adversarial Prompting	arXiv	link	link
2024.02	Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement	arXiv	link	-
2024.02	Defending large language models against jailbreak attacks via semantic smoothing (SEMANTICSMOOTH)	arXiv	link	link
2024.01	How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs (PAP)	arXiv	link	link
2023.12	Defending ChatGPT against jailbreak attack via self-reminders (Self-Reminder)	Nature Machine Intelligence	link	link
2023.11	Detecting language model attacks with perplexity	arXiv	link	-
2023.10	RAIN: Your Language Models Can Align Themselves without Finetuning (RAIN)	arXiv	link	link

Evaluation & Analysis

Time	Title	Venue	Paper	Code
2024.07	JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks	arXiv	link	link
2024.07	WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs (WildGuard)	arXiv	link	link
2024.07	Jailbreak Attacks and Defenses Against Large Language Models: A Survey	arXiv	link	-
2024.06	WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models (WildTeaming)	arXiv	link	link
2024.06	From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking	arXiv	link	-
2024.06	AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways	arXiv	link	-
2024.06	MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models (MM-SafetyBench)	arXiv	link	-
2024.06	ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs (VITC)	ACL'24	link	link
2024.06	Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs	arXiv	link	link
2024.06	JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models (JailbreakZoo)	arXiv	link	link
2024.06	Fundamental limitations of alignment in large language models	arXiv	link	-
2024.06	JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models (JailbreakBench)	arXiv	link	link
2024.06	Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis	arXiv	link	link
2024.06	JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models (JailbreakEval)	arXiv	link	link
2024.05	Enhancing Large Language Models Against Inductive Instructions with Dual-critique Prompting (INDust)	arXiv	link	link
2024.05	Prompt Injection attack against LLM-integrated Applications	arXiv	link	-
2024.05	Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks	LREC-COLING'24	link	link
2024.05	LLM Jailbreak Attack versus Defense Techniques--A Comprehensive Study	NDSS'24	link	-
2024.05	Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study	arXiv	link	-
2024.05	Detoxifying Large Language Models via Knowledge Editing (SafeEdit)	ACL'24	link	link
2024.04	JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models (JailbreakLens)	arXiv	link	-
2024.03	How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries (TECHHAZARDQA)	arXiv	link	link
2024.03	EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models (EasyJailbreak)	arXiv	link	link
2024.02	Comprehensive Assessment of Jailbreak Attacks Against LLMs	arXiv	link	-
2024.02	SPML: A DSL for Defending Language Models Against Prompt Attacks	arXiv	link	-
2024.02	Coercing LLMs to do and reveal (almost) anything	arXiv	link	-
2024.02	A STRONGREJECT for Empty Jailbreaks (StrongREJECT)	arXiv	link	link
2024.02	ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages	ACL'24	link	link
2024.02	HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal (HarmBench)	arXiv	link	link
2023.12	Goal-Oriented Prompt Attack and Safety Evaluation for LLMs	arXiv	link	link
2023.12	The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness	arXiv	link	-
2023.12	A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models	UbiSec'23	link	-
2023.11	Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild	arXiv	link	-
2023.11	How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs	arXiv	link	link
2023.11	Exploiting Large Language Models (LLMs) through Deception Techniques and Persuasion Principles	arXiv	link	-
2023.10	Explore, establish, exploit: Red teaming language models from scratch	arXiv	link	-
2023.10	Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks	arXiv	link	-
2023.10	Fine-tuning aligned language models compromises safety, even when users do not intend to! (HEx-PHI)	ICLR'24 (oral)	link	link
2023.08	Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment (RED-EVAL)	arXiv	link	link
2023.08	Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities	arXiv	link	-
2023.07	Jailbroken: How Does LLM Safety Training Fail? (Jailbroken)	NeurIPS'23	link	-
2023.08	From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy	IEEE Access	link	-
2023.07	LLM Censorship: A Machine Learning Challenge or a Computer Security Problem?	arXiv	link	-
2023.07	Universal and Transferable Adversarial Attacks on Aligned Language Models (AdvBench)	arXiv	link	link
2023.06	DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models	NeurIPS'23	link	link
2023.04	Safety Assessment of Chinese Large Language Models	arXiv	link	link
2023.02	Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks	arXiv	link	-
2022.11	Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned	arXiv	link	-
2022.02	Red Teaming Language Models with Language Models	arXiv	link	-