LostOxygen / llm-confidentiality

Whispers in the Machine: Confidentiality in LLM-integrated Systems

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Whispers in the Machine: Confidentiality in LLM-integrated Systems

This is the code repository accompanying our paper Whispers in the Machine: Confidentiality in LLM-integrated Systems.

Large Language Models (LLMs) are increasingly integrated with external tools. While these integrations can significantly improve the functionality of LLMs, they also create a new attack surface where confidential data may be disclosed between different components. Specifically, malicious tools can exploit vulnerabilities in the LLM itself to manipulate the model and compromise the data of other services, raising the question of how private data can be protected in the context of LLM integrations.

In this work, we provide a systematic way of evaluating confidentiality in LLM-integrated systems. For this, we formalize a "secret key" game that can capture the ability of a model to conceal private information. This enables us to compare the vulnerability of a model against confidentiality attacks and also the effectiveness of different defense strategies. In this framework, we evaluate eight previously published attacks and four defenses. We find that current defenses lack generalization across attack strategies. Building on this analysis, we propose a method for robustness fine-tuning, inspired by adversarial training.
This approach is effective in lowering the success rate of attackers and in improving the system's resilience against unknown attacks.

If you want to cite our work, please use the this BibTeX entry.

This framework was developed to study the confidentiality of Large Language Models (LLMs). The framework contains several features:

  • A set of attacks against LLMs, where the LLM is not allowed to leak a secret key -> jump
  • A set of defenses against the aforementioned attacks -> jump
  • Creating enhanced system prompts to safely instruct an LLM to keep a secret key safe -> jump
  • Finetune LLMs to harden them against these attacks using the datasets -> jump

Warning

Hardware aceleration is only fully supported for CUDA machines running Linux. Windows or MacOS with CUDA/MPS could face some issues.

Setup

Before running the code, install the requirements:

python -m pip install --upgrade -r requirements.txt

Create both a key.txt file containing your OpenAI API key as well as a hf_token.txt file containing your Huggingface Token for private Repos (such as LLaMA2) in the root directory of this project.

Sometimes it can be necessary to login to your Huggingface account via the CLI:

git config --global credential.helper store
huggingface-cli login

Distributed Training

All scripts are able to work on multiple GPUs/CPUs using the accelerate library. To do so, run:

accelerate config

to configure the distributed training capabilities of your system and start the scripts with:

accelerate launch [parameters] <script.py> [script parameters]

Attacks and Defenses

Usage

python attack.py [-h] [-a | --attacks [ATTACK1, ATTACK2, ..]] [-d | --defense DEFENSE] [-llm | --llm_type LLM_TYPE] [-m | --iterations ITERATIONS] [-t | --temperature TEMPERATURE]

Example Usage

python attack.py --attacks "payload_splitting" "obfuscation" --defense "xml_tagging" --iterations 15 --llm_type "llama2-7b" --temperature 0.7

Arguments

Argument Type Default Value Description
-h, --help - - show this help message and exit
-a, --attacks List[str] payload_splitting specifies the attacks which will be utilized against the LLM
-d, --defense str None specifies the defense for the LLM
-llm, --llm_type str gpt-3.5-turbo specifies the type of opponent
-le, --llm_guessing bool False specifies whether a second LLM is used to guess the secret key off the normal response or not
-t, --temperature float 0.0 specifies the temperature for the LLM to control the randomness
-cp, --create_prompt_dataset bool False specifies whether a new dataset of enhanced system prompts should be created
-cr, --create_response_dataset bool False specifies whether a new dataset of secret leaking responses should be created
-i, --iterations int 10 specifies the number of iterations for the attack
-n, --name_suffix str "" Specifies a name suffix to load custom models. Since argument parameter strings aren't allowed to start with '-' symbols, the first '-' will be added by the parser automatically
-s, --strategy str None Specifies the strategy for the attack (whether to use normal attacks or tools attacks)
-sc, --scenario str all Specifies the scenario for the tool based attacks
-dx, --device str cpu Specifies the device which is used for running the script (cpu, cuda, or mps)
-pf, --prompt_format str react Specifies whether react or tool-finetuned prompt format is used for agents. (react or tool-finetuned)
-ds, --disable_safeguards bool False Disables system prompt safeguards for tool strategy
The naming conventions for the models are as follows:
<model_name>-<param_count>-<robustness>-<attack_suffix>-<custom_suffix>

e.g.:

llama2-7b-robust-prompt_injection-0613

If you want to run the attacks against a prefix-tuned model with a custom suffix (e.g., 1000epochs), you would have to specify the arguments a follows:

... --model_name llama2-7b-prefix --name_suffix 1000epochs ...

Supported Large Language Models

Model Parameter Specifier Link Compute Instance
GPT-4 (o1, o1-mini, turbo) gpt-4o / gpt-4o-mini / gpt-4-turbo Link OpenAI API
LLaMA 2 llama2-7b / llama2-13b / llama2-70b Link Local Inference
LLaMA 2 hardened llama2-7b-robust / llama2-13b-robust / llama2-70b-robust Link Local Inference
Llama 3.1 llama3-8b / llama3-70b Link Local Inference (first: ollama pull llama3.1/llama3.1:70b/llama3.1:405b)
Llama 3.2 llama3-1b/ llama3-3b Link Local Inference (first: ollama pull llama3.2/llama3.2:1b)
Reflection Llama reflection-llama Link Local Inference (first: ollama pull reflection)
Vicuna vicuna-7b / vicuna-13b / vicuna-33b Link Local Inference
StableBeluga (2) beluga-7b / beluga-13b / beluga2-70b Link Local Inference
Orca 2 orca2-7b / orca2-13b / orca2-70b Link Local Inference
Gemma gemma-2b / gemma-7b Link Local Inference
Gemma 2 gemma2-9b / gemma2-27b Link Local Inference (first: ollama pull gemma2/gemma2:27b)
Phi 3 phi3-3b / phi3-14b Link Local Inference (first: ollama pull phi3:mini/phi3:medium)

(Finetuned or robust/hardened LLaMA models first have to be generated using the finetuning.py script, see below)

Supported Attacks and Defenses

Attacks Defenses
Name Specifier Name Specifier
Payload Splitting payload_splitting Random Sequence Enclosure seq_enclosure
Obfuscation obfuscation XML Tagging xml_tagging
Jailbreak jailbreak Heuristic/Filtering Defense heuristic_defense
Translation translation Sandwich Defense sandwiching
ChatML Abuse chatml_abuse LLM Evaluation llm_eval
Masking masking Perplexity Detection ppl_detection
Typoglycemia typoglycemia PromptGuard prompt_guard
Adversarial Suffix advs_suffix
Prefix Injection prefix_injection
Refusal Suppression refusal_suppression
Context Ignoring context_ignoring
Context Termination context_termination
Context Switching Separators context_switching_separators
Few-Shot few_shot
Cognitive Hacking cognitive_hacking
Base Chat base_chat

The base_chat attack consists of normal questions to test of the model spills it's context and confidential information even without a real attack.


Finetuning

This section covers the possible LLaMA finetuning options. We use PEFT, which is based on this paper.

Setup

Additionally to the above setup run

accelerate config

to configure the distributed training capabilities of your system. And

wandb login

with your WandB API key to enable logging of the finetuning process.


Parameter Efficient Finetuning to harden LLMs against attacks or create enhanced system prompts

The first finetuning option is on a dataset consisting of system prompts to safely instruct an LLM to keep a secret key safe. The second finetuning option (using the --train_robust option) is using system prompts and adversarial prompts to harden the model against prompt injection attacks.

Usage

python finetuning.py [-h] [-llm | --llm_type LLM_NAME] [-i | --iterations ITERATIONS] [-a | --attacks ATTACKS_LIST] [-n | --name_suffix NAME_SUFFIX]

Arguments

Argument Type Default Value Description
-h, --help - - Show this help message and exit
-llm, --llm_type str llama3-8b Specifies the type of llm to finetune
-i, --iterations int 10000 Specifies the number of iterations for the finetuning
-advs, --advs_train bool False Utilizes the adversarial training to harden the finetuned LLM
-a, --attacks List[str] payload_splitting Specifies the attacks which will be used to harden the llm during finetuning. Only has an effect if --train_robust is set to True. For supported attacks see the previous section
-n, --name_suffix str "" Specifies a suffix for the finetuned model name

Supported Large Language Models

Currently only the LLaMA models are supported (llama2-7/13/70b / llama3-8/70b).

Generate System Prompt Datasets

Simply run the generate_dataset.py script to create new system prompts as a json file using LLMs.

Arguments

Argument Type Default Value Description
-h, --help - - Show this help message and exit
-llm, --llm_type str llama3-70b Specifies the LLM used to generate the system prompt dataset
-n, --name_suffix str "" Specifies a suffix for the model name if you want to use a custom model
-ds, --dataset_size int 1000 Size of the resulting system prompt dataset

Citation

If you want to cite our work, please use the following BibTeX entry:

@article{evertz-24-whispers,
	title    =  {{Whispers in the Machine: Confidentiality in LLM-integrated Systems}}, 
	author   =  {Jonathan Evertz and Merlin Chlosta and Lea Schönherr and Thorsten Eisenhofer},
	year     =  {2024},
	journal  =  {Computing Research Repository (CoRR)}
}

About

Whispers in the Machine: Confidentiality in LLM-integrated Systems

License:Apache License 2.0


Languages

Language:Python 100.0%