Awesome-Jailbreak-on-LLMs is a collection of state-of-the-art, novel, and exciting jailbreak methods on LLMs. It contains papers, code, datasets, evaluations, and analyses. Contributions related to jailbreaking are welcome through PRs and issues, and we are glad to add you to the contributor list. If you run into any problems, please contact yliu@u.nus.edu. If you find this repository useful for your research or work, please consider starring it. ✨
Time | Title | Venue | Paper | Code |
---|---|---|---|---|
2024.07 | Revisiting Character-level Adversarial Attacks for Language Models | ICML'24 | link | link |
2024.07 | Badllama 3: removing safety finetuning from Llama 3 in minutes (Badllama 3) | arXiv | link | - |
2024.07 | SOS! Soft Prompt Attack Against Open-Source Large Language Models | arXiv | link | - |
2024.06 | Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (I-FSJ) | ICML Workshop'24 | link | link |
2024.06 | COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability (COLD-Attack) | ICML'24 | link | link |
2024.06 | Improved Techniques for Optimization-Based Jailbreaking on Large Language Models (I-GCG) | arXiv | link | link |
2024.05 | Semantic-guided Prompt Organization for Universal Goal Hijacking against LLMs | arXiv | link | - |
2024.05 | AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models (AutoDAN) | ICLR'24 | link | link |
2024.05 | AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs (AmpleGCG) | arXiv | link | link |
2024.05 | Boosting jailbreak attack with momentum (MAC) | ICLR Workshop'24 | link | link |
2024.04 | AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs (AdvPrompter) | arXiv | link | link |
2024.03 | Universal Jailbreak Backdoors from Poisoned Human Feedback | ICLR'24 | link | - |
2024.02 | Attacking large language models with projected gradient descent (PGD) | arXiv | link | - |
2024.02 | Open the Pandora's Box of LLMs: Jailbreaking LLMs through Representation Engineering (JRE) | arXiv | link | - |
2024.02 | Curiosity-driven red-teaming for large language models (CRT) | arXiv | link | link |
2023.12 | AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models (AutoDAN) | arXiv | link | link |
2023.10 | Catastrophic jailbreak of open-source LLMs via exploiting generation | ICLR'24 | link | link |
2023.06 | Automatically Auditing Large Language Models via Discrete Optimization (ARCA) | ICML'23 | link | link |
2023.07 | Universal and Transferable Adversarial Attacks on Aligned Language Models (GCG) | arXiv | link | link |
Time | Title | Venue | Paper | Code |
---|---|---|---|---|
2024.07 | A False Sense of Safety: Unsafe Information Leakage in 'Safe' AI Responses | arXiv | link | - |
2024.07 | Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection (Virtual Context) | arXiv | link | - |
2024.07 | SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack (SoP) | arXiv | link | link |
2024.06 | When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search (RLbreaker) | arXiv | link | - |
2024.06 | Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast (Agent Smith) | ICML'24 | link | link |
2024.06 | Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation | ICML'24 | link | - |
2024.06 | ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs (ArtPrompt) | ACL'24 | link | link |
2024.06 | From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings (ASETF) | arXiv | link | - |
2024.06 | CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion (CodeAttack) | arXiv | link | - |
2024.06 | Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction (DRA) | USENIX Security'24 | link | link |
2024.06 | AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens (AutoJailbreak) | arXiv | link | - |
2024.06 | Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks | arXiv | link | link |
2024.06 | GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts (GPTFuzz) | arXiv | link | link |
2024.06 | A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily (ReNeLLM) | NAACL'24 | link | link |
2024.06 | QROA: A Black-Box Query-Response Optimization Attack on LLMs (QROA) | arXiv | link | link |
2024.06 | Poisoned LangChain: Jailbreak LLMs by LangChain (PLC) | arXiv | link | link |
2024.05 | Multilingual Jailbreak Challenges in Large Language Models | ICLR'24 | link | link |
2024.05 | DeepInception: Hypnotize Large Language Model to Be Jailbreaker (DeepInception) | arXiv | link | link |
2024.05 | GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation (IRIS) | ACL'24 | link | - |
2024.05 | GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of LLMs (GUARD) | arXiv | link | - |
2024.05 | "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models (DAN) | CCS'24 | link | link |
2024.05 | GPT-4 is too smart to be safe: Stealthy chat with LLMs via cipher (SelfCipher) | ICLR'24 | link | link |
2024.05 | Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters (JAM) | arXiv | link | - |
2024.05 | Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations (ICA) | arXiv | link | - |
2024.04 | Many-shot jailbreaking (MSJ) | Anthropic | link | - |
2024.04 | PANDORA: Detailed LLM jailbreaking via collaborated phishing agents with decomposed reasoning (PANDORA) | ICLR Workshop'24 | link | - |
2024.04 | Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack (Crescendo) | Microsoft | link | - |
2024.04 | FuzzLLM: A novel and universal fuzzing framework for proactively discovering jailbreak vulnerabilities in large language models (FuzzLLM) | ICASSP'24 | link | link |
2024.04 | Sandwich attack: Multi-language mixture adaptive attack on LLMs (Sandwich attack) | arXiv | link | - |
2024.03 | Tastle: Distract large language models for automatic jailbreak attack (TASTLE) | arXiv | link | - |
2024.03 | DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers (DrAttack) | arXiv | link | link |
2024.02 | PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails (PRP) | arXiv | link | - |
2024.02 | CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models (CodeChameleon) | arXiv | link | link |
2024.02 | PAL: Proxy-Guided Black-Box Attack on Large Language Models (PAL) | arXiv | link | link |
2024.02 | Jailbreaking Proprietary Large Language Models using Word Substitution Cipher | arXiv | link | - |
2024.02 | Query-Based Adversarial Prompt Generation | arXiv | link | - |
2024.02 | Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning (Pandora) | arXiv | link | - |
2024.02 | Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks (Contextual Interaction Attack) | arXiv | link | - |
2024.02 | Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs (SMJ) | arXiv | link | - |
2024.02 | Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking | NAACL'24 | link | link |
2024.01 | Low-Resource Languages Jailbreak GPT-4 | NeurIPS Workshop'23 | link | - |
2024.01 | How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs (PAP) | arXiv | link | link |
2023.12 | Tree of Attacks: Jailbreaking Black-Box LLMs Automatically (TAP) | arXiv | link | link |
2023.12 | Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs | arXiv | link | - |
2023.12 | Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition | EMNLP'23 | link | - |
2023.11 | Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation (Persona) | NeurIPS Workshop'23 | link | - |
2023.10 | Jailbreaking Black Box Large Language Models in Twenty Queries (PAIR) | arXiv | link | link |
2023.10 | Adversarial Demonstration Attacks on Large Language Models (advICL) | arXiv | link | - |
2023.10 | MASTERKEY: Automated Jailbreaking of Large Language Model Chatbots (MASTERKEY) | NDSS'24 | link | - |
2023.10 | Attack Prompt Generation for Red Teaming and Defending Large Language Models (SAP) | EMNLP'23 | link | link |
2023.10 | An LLM can Fool Itself: A Prompt-Based Adversarial Attack (PromptAttack) | ICLR'24 | link | link |
2023.09 | Multi-step Jailbreaking Privacy Attacks on ChatGPT (MJP) | EMNLP Findings'23 | link | link |
2023.09 | Open Sesame! Universal Black Box Jailbreaking of Large Language Models (GA) | arXiv | link | - |
2023.05 | Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection | arXiv | link | link |
2022.11 | Ignore Previous Prompt: Attack Techniques For Language Models (PromptInject) | NeurIPS Workshop'22 | link | link |
Time | Title | Venue | Paper | Code |
---|---|---|---|---|
2024.07 | Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything | arXiv | link | - |
2024.06 | Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt | arXiv | link | link |
2024.05 | Voice Jailbreak Attacks Against GPT-4o | arXiv | link | link |
2024.05 | Automatic Jailbreaking of the Text-to-Image Generative AI Systems | ICML Workshop'24 | link | link |
2024.04 | Image hijacks: Adversarial images can control generative models at runtime | arXiv | link | link |
2024.03 | An image is worth 1000 lies: Adversarial transferability across prompts on vision-language models (CroPA) | ICLR'24 | link | link |
2024.03 | Jailbreak in pieces: Compositional adversarial attacks on multi-modal language model | ICLR'24 | link | - |
2024.03 | Rethinking model ensemble in transfer-based adversarial attacks | ICLR'24 | link | link |
2024.02 | VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models | NeurIPS'23 | link | link |
2024.02 | Jailbreaking Attack against Multimodal Large Language Model | arXiv | link | - |
2024.01 | Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts | arXiv | link | - |
2024.03 | Visual Adversarial Examples Jailbreak Aligned Large Language Models | AAAI'24 | link | - |
2023.12 | OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization (OT-Attack) | arXiv | link | - |
2023.12 | FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts (FigStep) | arXiv | link | link |
2023.11 | On Evaluating Adversarial Robustness of Large Vision-Language Models | NeurIPS'23 | link | link |
2023.10 | How Robust is Google's Bard to Adversarial Image Attacks? | arXiv | link | link |
2023.08 | AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive Learning (AdvCLIP) | ACM MM'23 | link | link |
2023.07 | Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models (SGA) | ICCV'23 | link | link |
2023.07 | On the Adversarial Robustness of Multi-Modal Foundation Models | ICCV Workshop'23 | link | - |
2022.10 | Towards Adversarial Attack on Vision-Language Pre-training Models | arXiv | link | link |
Time | Title | Venue | Paper | Code |
---|---|---|---|---|
2024.07 | DART: Deep Adversarial Automated Red Teaming for LLM Safety | arXiv | link | - |
2024.07 | Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks | arXiv | link | link |
2024.06 | Jatmo: Prompt Injection Defense by Task-Specific Finetuning (Jatmo) | arXiv | link | link |
2024.06 | Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization | ACL'24 | link | link |
2024.06 | Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment | arXiv | link | link |
2024.06 | On Prompt-Driven Safeguarding for Large Language Models (DRO) | arXiv | link | link |
2024.06 | Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks (RPO) | arXiv | link | - |
2024.06 | Fight Back Against Jailbreaking via Prompt Adversarial Tuning (PAT) | arXiv | link | link |
2024.05 | Detoxifying Large Language Models via Knowledge Editing (DINM) | ACL'24 | link | link |
2024.05 | Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing | arXiv | link | link |
2023.11 | MART: Improving LLM Safety with Multi-round Automatic Red-Teaming (MART) | ACL'24 | link | - |
2023.11 | Baseline defenses for adversarial attacks against aligned language models | arXiv | link | - |
2023.10 | Safe RLHF: Safe reinforcement learning from human feedback | arXiv | link | link |
2023.08 | Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment (RED-INSTRUCT) | arXiv | link | link |
2022.04 | Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback | Anthropic | link | - |
Time | Title | Venue | Paper | Code |
---|---|---|---|---|
2024.06 | SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding (SafeDecoding) | ACL'24 | link | link |
2024.06 | Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM | arXiv | link | - |
2024.06 | A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily (ReNeLLM) | NAACL'24 | link | link |
2024.06 | SMOOTHLLM: Defending Large Language Models Against Jailbreaking Attacks | arXiv | link | link |
2024.05 | Enhancing Large Language Models Against Inductive Instructions with Dual-critique Prompting (Dual-critique) | arXiv | link | link |
2024.05 | PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition (PARDEN) | arXiv | link | link |
2024.05 | LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked | ICLR Tiny Paper'24 | link | link |
2024.05 | GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis (GradSafe) | ACL'24 | link | link |
2024.05 | Multilingual Jailbreak Challenges in Large Language Models | ICLR'24 | link | link |
2024.05 | Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes | arXiv | link | - |
2024.05 | AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks | arXiv | link | link |
2024.05 | Bergeron: Combating adversarial attacks through a conscience-based alignment framework (Bergeron) | arXiv | link | link |
2024.05 | Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations (ICD) | arXiv | link | - |
2024.04 | Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning | arXiv | link | link |
2024.03 | AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting (AdaShield) | ECCV'24 | link | link |
2024.02 | Certifying LLM Safety against Adversarial Prompting | arXiv | link | link |
2024.02 | Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement | arXiv | link | - |
2024.02 | Defending large language models against jailbreak attacks via semantic smoothing (SEMANTICSMOOTH) | arXiv | link | link |
2024.01 | How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs (PAP) | arXiv | link | link |
2023.12 | Defending ChatGPT against jailbreak attack via self-reminders (Self-Reminder) | Nature Machine Intelligence | link | link |
2023.11 | Detecting language model attacks with perplexity | arXiv | link | - |
2023.10 | RAIN: Your Language Models Can Align Themselves without Finetuning (RAIN) | arXiv | link | link |
Time | Title | Venue | Paper | Code |
---|---|---|---|---|
2024.07 | JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks | arXiv | link | link |
2024.07 | WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs (WildGuard) | arXiv | link | link |
2024.07 | Jailbreak Attacks and Defenses Against Large Language Models: A Survey | arXiv | link | - |
2024.06 | WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models (WildTeaming) | arXiv | link | link |
2024.06 | From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking | arXiv | link | - |
2024.06 | AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways | arXiv | link | - |
2024.06 | MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models (MM-SafetyBench) | arXiv | link | - |
2024.06 | ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs (VITC) | ACL'24 | link | link |
2024.06 | Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs | arXiv | link | link |
2024.06 | JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models (JailbreakZoo) | arXiv | link | link |
2024.06 | Fundamental limitations of alignment in large language models | arXiv | link | - |
2024.06 | JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models (JailbreakBench) | arXiv | link | link |
2024.06 | Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis | arXiv | link | link |
2024.06 | JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models (JailbreakEval) | arXiv | link | link |
2024.05 | Enhancing Large Language Models Against Inductive Instructions with Dual-critique Prompting (INDust) | arXiv | link | link |
2024.05 | Prompt Injection attack against LLM-integrated Applications | arXiv | link | - |
2024.05 | Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks | LREC-COLING'24 | link | link |
2024.05 | LLM Jailbreak Attack versus Defense Techniques--A Comprehensive Study | NDSS'24 | link | - |
2024.05 | Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study | arXiv | link | - |
2024.05 | Detoxifying Large Language Models via Knowledge Editing (SafeEdit) | ACL'24 | link | link |
2024.04 | JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models (JailbreakLens) | arXiv | link | - |
2024.03 | How (un) ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries (TECHHAZARDQA) | arXiv | link | link |
2024.03 | EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models (EasyJailbreak) | arXiv | link | link |
2024.02 | Comprehensive Assessment of Jailbreak Attacks Against LLMs | arXiv | link | - |
2024.02 | SPML: A DSL for Defending Language Models Against Prompt Attacks | arXiv | link | - |
2024.02 | Coercing LLMs to do and reveal (almost) anything | arXiv | link | - |
2024.02 | A STRONGREJECT for Empty Jailbreaks (StrongREJECT) | arXiv | link | link |
2024.02 | ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages | ACL'24 | link | link |
2024.02 | HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal (HarmBench) | arXiv | link | link |
2023.12 | Goal-Oriented Prompt Attack and Safety Evaluation for LLMs | arXiv | link | link |
2023.12 | The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness | arXiv | link | - |
2023.12 | A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models | UbiSec'23 | link | - |
2023.11 | Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild | arXiv | link | - |
2023.11 | How many unicorns are in this image? A safety evaluation benchmark for vision LLMs | arXiv | link | link |
2023.11 | Exploiting Large Language Models (LLMs) through Deception Techniques and Persuasion Principles | arXiv | link | - |
2023.10 | Explore, establish, exploit: Red teaming language models from scratch | arXiv | link | - |
2023.10 | Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks | arXiv | link | - |
2023.10 | Fine-tuning aligned language models compromises safety, even when users do not intend to! (HEx-PHI) | ICLR'24 (oral) | link | link |
2023.08 | Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment (RED-EVAL) | arXiv | link | link |
2023.08 | Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities | arXiv | link | - |
2023.07 | Jailbroken: How Does LLM Safety Training Fail? (Jailbroken) | NeurIPS'23 | link | - |
2023.08 | From ChatGPT to ThreatGPT: Impact of generative AI in cybersecurity and privacy | IEEE Access | link | - |
2023.07 | LLM censorship: A machine learning challenge or a computer security problem? | arXiv | link | - |
2023.07 | Universal and Transferable Adversarial Attacks on Aligned Language Models (AdvBench) | arXiv | link | link |
2023.06 | DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models | NeurIPS'23 | link | link |
2023.04 | Safety Assessment of Chinese Large Language Models | arXiv | link | link |
2023.02 | Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks | arXiv | link | - |
2022.11 | Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned | arXiv | link | - |
2022.02 | Red Teaming Language Models with Language Models | arXiv | link | - |