🛡️Awesome LLM-Safety🛡️

English | 中文

🤗Introduction

Welcome to our Awesome-llm-safety repository! 🥰🥰🥰

🔥 News

2024.05 update NAACL 2024 Papers Collection, thanks @zhrli324, @feqHe!

🧑‍💻 Our Work

We've curated a collection of the latest 😋, most comprehensive 😎, and most valuable 🤩 resources on large language model safety (llm-safety). But we don't stop there; included are also relevant talks, tutorials, conferences, news, and articles. Our repository is constantly updated to ensure you have the most current information at your fingertips.

If a resource is relevant to multiple subcategories, we place it under each applicable section. For instance, the "Awesome-LLM-Safety" repository will be listed under each subcategory to which it pertains🤩!.

✔️ Perfect for Majority

For beginners curious about llm-safety, our repository serves as a compass for grasping the big picture and diving into the details. Classic or influential papers retained in the README provide a beginner-friendly navigation through interesting directions in the field;
For seasoned researchers, this repository is a tool to keep you informed and fill any gaps in your knowledge. Within each subtopic, we are diligently updating all the latest content and continuously backfilling with previous work. Our thorough compilation and careful selection are time-savers for you.

🧭 How to Use this Guide

Quick Start: In the README, users can find a curated list of select information sorted by date, along with links to various consultations.
In-Depth Exploration: If you have a special interest in a particular subtopic, delve into the "subtopic" folder for more. Each item, be it an article or piece of news, comes with a brief introduction, allowing researchers to swiftly zero in on relevant content.

💼 How to Contribution

If you have completed an insightful work or carefully compiled conference papers, we would love to add your work to the repository.

For individual papers, you can raise an issue, and we will quickly add your paper under the corresponding subtopic.
If you have compiled a collection of papers for a conference, you are welcome to submit a pull request directly. We would greatly appreciate your contribution. Please note that these pull requests need to be consistent with our existing format.

📜Advertisement

🌱 If you would like more people to read your recent insightful work, please contact me via email. I can offer you a promotional spot here for up to one month.

Let’s start LLM Safety tutorial!

🚀Table of Contents

🛡️Awesome LLM-Safety🛡️
- 🤗Introduction
- 🚀Table of Contents
- [🔐Security & Discussion](#security & discussion)
- 🔏Privacy
- 📰Truthfulness & Misinformation
- 😈JailBreak & Attacks
- [🛡️Defenses & Mitigation](#️defenses & mitigation)
  - 📖Tutorials, Articles, Presentations and Talks
  - Other
- 💯Datasets & Benchmark
- 🧑‍🏫 Scholars 👩‍🏫
- 🧑‍🎓Author

🤔AI Safety & Security Discussions

Date	Link	Publication	Authors
2024/5/20	Managing extreme AI risks amid rapid progress	Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Trevor Darrell, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, Gillian Hadfield, Jeff Clune, Tegan Maharaj, Frank Hutter, Atılım Güneş Baydin, Sheila McIlraith, Qiqi Gao, Ashwin Acharya, David Krueger, Anca Dragan, Philip Torr, Stuart Russell, Daniel Kahneman, Jan Brauner, Sören Mindermann	Science

🔐Security & Discussion

📑Papers

Date	Institute	Publication	Paper
20.10	Facebook AI Research	arxiv	Recipes for Safety in Open-domain Chatbots
22.03	OpenAI	NIPS2022	Training language models to follow instructions with human feedback
23.07	UC Berkeley	NIPS2023	Jailbroken: How Does LLM Safety Training Fail?
23.12	OpenAI	Open AI	Practices for Governing Agentic AI Systems

📖Tutorials, Articles, Presentations and Talks

Date	Type	Title	URL
22.02	Toxicity Detection API	Perspective API	link paper
23.07	Repository	Awesome LLM Security	link
23.10	Tutorials	Awesome-LLM-Safety	link
24.01	Tutorials	Awesome-LM-SSP	link

Other

👉Latest&Comprehensive Security Paper

🔏Privacy

📑Papers

Date	Institute	Publication	Paper
19.12	Microsoft	CCS2020	Analyzing Information Leakage of Updates to Natural Language Models
21.07	Google Research	ACL2022	Deduplicating Training Data Makes Language Models Better
21.10	Stanford	ICLR2022	Large language models can be strong differentially private learners
22.02	Google Research	ICLR2023	Quantifying Memorization Across Neural Language Models
22.02	UNC Chapel Hill	ICML2022	Deduplicating Training Data Mitigates Privacy Risks in Language Models

📖Tutorials, Articles, Presentations and Talks

Date	Type	Title	URL
23.10	Tutorials	Awesome-LLM-Safety	link
24.01	Tutorials	Awesome-LM-SSP	link

Other

👉Latest&Comprehensive Privacy Paper

📰Truthfulness & Misinformation

📑Papers

Date	Institute	Publication	Paper
21.09	University of Oxford	ACL2022	TruthfulQA: Measuring How Models Mimic Human Falsehoods
23.11	Harbin Institute of Technology	arxiv	A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
23.11	Arizona State University	arxiv	Can Knowledge Graphs Reduce Hallucinations in LLMs? : A Survey

📖Tutorials, Articles, Presentations and Talks

Date	Type	Title	URL
23.07	Repository	llm-hallucination-survey	link
23.10	Repository	LLM-Factuality-Survey	link
23.10	Tutorials	Awesome-LLM-Safety	link

Other

👉Latest&Comprehensive Truthfulness&Misinformation Paper

😈JailBreak & Attacks

📑Papers

Date	Institute	Publication	Paper
20.12	Google	USENIX Security 2021	Extracting Training Data from Large Language Models
22.11	AE Studio	NIPS2022(ML Safety Workshop)	Ignore Previous Prompt: Attack Techniques For Language Models
23.06	Google	arxiv	Are aligned neural networks adversarially aligned?
23.07	CMU	arxiv	Universal and Transferable Adversarial Attacks on Aligned Language Models
23.10	University of Pennsylvania	arxiv	Jailbreaking Black Box Large Language Models in Twenty Queries

📖Tutorials, Articles, Presentations and Talks

Date	Type	Title	URL
23.01	Community	Reddit/ChatGPTJailbrek	link
23.02	Resource&Tutorials	Jailbreak Chat	link
23.10	Tutorials	Awesome-LLM-Safety	link
23.10	Article	Adversarial Attacks on LLMs(Author: Lilian Weng)	link
23.11	Video	[1hr Talk] Intro to Large Language Models From 45:45(Author: Andrej Karpathy)	link

Other

👉Latest&Comprehensive JailBreak & Attacks Paper

🛡️Defenses & Mitigation

📑Papers

Date	Institute	Publication	Paper
21.07	Google Research	ACL2022	Deduplicating Training Data Makes Language Models Better
22.04	Anthropic	arxiv	Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

📖Tutorials, Articles, Presentations and Talks

Date	Type	Title	URL
23.10	Tutorials	Awesome-LLM-Safety	link

Other

👉Latest&Comprehensive Defenses Paper

💯Datasets & Benchmark

📑Papers

Date	Institute	Publication	Paper
20.09	University of Washington	EMNLP2020(findings)	RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
21.09	University of Oxford	ACL2022	TruthfulQA: Measuring How Models Mimic Human Falsehoods
22.03	MIT	ACL2022	ToxiGen: A Large-Scale Machine-Generated datasets for Adversarial and Implicit Hate Speech Detection

📖Tutorials, Articles, Presentations and Talks

Date	Type	Title	URL
23.10	Tutorials	Awesome-LLM-Safety	link

📚Resource📚

Toxicity - RealToxicityPrompts datasets
Truthfulness - TruthfulQA datasets

Other

👉Latest&Comprehensive datasets & Benchmark Paper

🧑‍🏫 Scholars 👩‍🏫

In this section, we list some of the scholars we consider to be experts in the field of LLM Safety!

The standard is that the scholar's papers have been cited over 1,000 times. If you are a senior researcher in this field, or if you find that we have missed any senior researchers, please feel free to raise an issue.

Scholars	HomePage&Google Scholars	Keywords or Interested
Nicholas Carlini	Homepage \| Google Scholar	the intersection of machine learning and computer security&neural networks from an adversarial perspective
Daphne Ippolito	Google Scholar	Natural Language Processing
Chiyuan Zhang	Homepage \| Google Scholar	Especially interested in understanding the generalization and memorization in machine and human learning, as well as implications in related areas like privacy
Katherine Lee	Google Scholar	natural language processing&translation&machine learning&computational neuroscienceattention
Florian Tramèr	Homepage \| Google Scholar	Computer Security&Machine Learning&Cryptography&the worst-case behavior of Deep Learning systems from an adversarial perspective, to understand and mitigate long-term threats to the safety and privacy of users
Jindong Wang	Homepage \| Google Scholar	Large Language Models (LLMs) evaluation and robustness enhancement
Chaowei Xiao	Homepage \| Google Scholar	interested in exploring the trustworthy problem in (MultiModal) Large Language Models and studying the role of LLMs in different application domains.
Andy Zou	Homepage \| Google Scholar	ML Safety&AI Safety
Yang Liu	Homepage\|Google Scholar	Cybersecurity, Software Engineering and Artificial Integllence

🧑‍🎓Author

🤗If you have any questions, please contact our authors!🤗

✉️: ydyjya ➡️ zhouzhenhong@bupt.edu.cn

💬: LLM Safety Discussion

Wechat Group | My Wechat

⬆ Back to ToC

About

A curated list of safety-related papers, articles, and resources focused on Large Language Models (LLMs). This repository aims to provide researchers, practitioners, and enthusiasts with insights into the safety implications, challenges, and advancements surrounding these powerful models.