LLMSanitize

An open-source library for contamination detection in NLP datasets and Large Language Models (LLMs).

Installation

The library has been designed and tested with Python 3.9 and CUDA 11.8.

First make sure you have CUDA 11.8 installed, and create a conda environment with Python 3.9:

conda create --name llmsanitize python=3.9

Next activate the environment:

conda activate llmsanitize

Then install LLMSanitize from PyPI:

pip install llmsanitize

Notably, we use vllm 0.3.3.

Supported Methods

The repository supports all the following contamination detection methods:

Method	Use Case	Method Type	Model Access	Reference
gpt-2	Open-data	String Matching	_	Language Models are Unsupervised Multitask Learners (link), Section 4
gpt-3	Open-data	String Matching	_	Language Models are Few-Shot Learners (link), Section 4
exact	Open-data	String Matching	_	Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus (link), Section 4.2
palm	Open-data	String Matching	_	PaLM: Scaling Language Modeling with Pathways (link), Sections 7-8
gpt-4	Open-data	String Matching	_	GPT-4 Technical Report (link), Appendix C
platypus	Open-data	Embeddings Similarity	_	Platypus: Quick, Cheap, and Powerful Refinement of LLMs (link), Section 2.3
guided-prompting	Closed-data	Prompt Engineering/LLM-based	Black-box	Time Travel in LLMs: Tracing Data Contamination in Large Language Models (link)
sharded-likelihood	Closed-data	Model Likelihood	White-box	Proving Test Set Contamination in Black-box Language Models (link)
min-prob	Closed-data	Model Likelihood	White-box	Detecting Pretraining Data from Large Language Models (link)
cdd	Closed-data	Model Memorization/Model Likelihood	Black-box	Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models (link), Section 3.2
ts-guessing-question-based	Closed-data	Model Completion	Black-box	Investigating Data Contamination in Modern Benchmarks for Large Language Models (link), Section 3.2.1
ts-guessing-question-multichoice	Closed-data	Model Completion	Black-box	Investigating Data Contamination in Modern Benchmarks for Large Language Models (link), Section 3.2.2

vLLM

The following methods require to launch a vLLM instance which will handle model inference:

Method
guided-prompting
min-prob
cdd
ts-guessing-question-based
ts-guessing-question-multichoice

To launch the instance, first run the following command in a terminal:

sh llmsanitize/scripts/vllm_hosting.sh

You are required to specify a port number and model name in this shell script.

Run Contamination Detection

To run contamination detection, you can follow the multiple test scripts in scripts/tests/ folder.

For instance, to run sharded-likelihood on Hellaswag with Llama-2-7B:

sh llmsanitizescripts/tests/closed_data/sharded-likelihood/test_hellaswag.sh -m <path_to_your_llama-2-7b_folder>

To run a method using vLLM like guided-prompting for instance, the only difference is to pass the port number as argument:

sh llmsanitizescripts/tests/closed_data/guided-prompting/test_hellaswag.sh -m <path_to_your_llama-2-7b_folder> -p <port_number_from_your_vllm_instance>

Or, since llmsanitize has been installed as a Python package, you can call the detection methods directly in your Python script:

from llmsanitize ClosedDataContaminationChecker
args = <setup your argparse here>
contamination_checker = ClosedDataContaminationChecker(args)
contamination_checker.run_contamination("guided-prompting") # make sure that your args contain all parameters relevant this specific method

Citation

If you find our paper or this project helps your research, please kindly consider citing our paper in your publication.

@article{ravaut2024much,
  title={How Much are LLMs Contaminated? A Comprehensive Survey and the LLMSanitize Library},
  author={Ravaut, Mathieu and Ding, Bosheng and Jiao, Fangkai and Chen, Hailin and Li, Xingxuan and Zhao, Ruochen and Qin, Chengwei and Xiong, Caiming and Joty, Shafiq},
  journal={arXiv preprint arXiv:2404.00699},
  year={2024}
}

ntunlp / LLMSanitize