awesome-trustworthy-LLMs


This repository contains a collection of resources and papers on trustworthy Large Language Models.

Contents

Resources

Introductory Posts

GPT-4 is OpenAI’s most advanced system, producing safer and more useful responses.
OpenAI
[Website]
14 Mar 2023

Princeton COS 597G (Fall 2022): Understanding Large Language Models
Danqi Chen
[Website]

Stanford CS324 - Large Language Models
Percy Liang, Tatsunori Hashimoto, Christopher Ré, Rishi Bommasani, Sang Michael Xie
[Website]

Papers

Survey

A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity
Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, Pascale Fung
28 Feb 2023. [Paper]

Explainability

Security

Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks
Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, Tatsunori Hashimoto
11 Feb 2023. [Paper]

Probing Toxic Content in Large Pre-Trained Language Models
Nedjma Ousidhoum, Xinran Zhao, Tianqing Fang, Yangqiu Song, Dit-Yan Yeung
ACL 2021 [Paper]

More than you've asked for: A Comprehensive Analysis of Novel Prompt Injection Threats to Application-Integrated Large Language Models
Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, Mario Fritz
23 Feb 2023. [Paper]

Ignore Previous Prompt: Attack Techniques For Language Models
Fábio Perez, Ian Ribeiro
17 Nov 2022. [Paper]

Robustness

Adversarial Prompting for Black Box Foundation Models
Natalie Maus, Patrick Chao, Eric Wong, Jacob Gardner
8 Feb 2023. [Paper]

Privacy

On the Feasibility of Specialized Ability Stealing for Large Language Code Models
Zongjie Li, Chaozheng Wang, Pingchuan Ma, Chaowei Liu, Shuai Wang, Daoyuan Wu, Cuiyun Gao
6 Mar 2023. [Paper]

Large Language Models Can Be Strong Differentially Private Learners
Xuechen Li, Florian Tramèr, Percy Liang, Tatsunori Hashimoto
ICLR 2022 [Paper]

Differentially Private Natural Language Models: Recent Advances and Future Directions
Lijie Hu, Ivan Habernal, Lei Shen, Di Wang
[Paper]

Differentially Private Language Models for Secure Data Sharing
Justus Mattern, Zhijing Jin, Benjamin Weggenmann, Bernhard Schoelkopf, Mrinmaya Sachan
EMNLP 2022 [Paper]

Differentially Private Fine-tuning of Language Models
Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A. Inan, Gautam Kamath, Janardhan Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, Sergey Yekhanin, Huishuai Zhang
ICLR 2022 [Paper]

Extracting Training Data from Large Language Models
Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, Colin Raffel
USENIX 2021 [Paper]

The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks
Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, Dawn Song
USENIX 2019 [Paper]

Can Foundation Models Help Us Achieve Perfect Secrecy?
Simran Arora, Christopher Ré
Jan 2023. [Paper]

Reconstruction Attack on Instance Encoding for Language Understanding
Shangyu Xie, Yuan Hong
EMNLP 2021 [Paper]

Privacy Risks of General-Purpose Language Models

On a Utilitarian Approach to Privacy Preserving Text Generation

TextHide: Tackling Data Privacy in Language Understanding Tasks

Comprehensive Privacy Analysis of Deep Learning
Milad Nasr, Reza Shokri, Amir Houmansadr
Dec 2018 [Paper]

Fairness
