yonsei-sslab / Language_Model_Memorization

🚨 Implementation of the paper "Extracting Training Data from Large Language Models" (Carlini et al., 2020)


How to Run

  1. (Optional) Change the model type and hyperparameters in config.yaml (a config-loading sketch follows this list).
  2. Sample text from the victim language model (see the sampling sketch below):
    • Run python inference.py for single-GPU generation from the victim language model.
    • Run python parallel_inference.py for faster multi-GPU generation.
  3. Run python rerank.py to retrieve candidate sequences that may have been memorized (a reranking sketch follows this list).
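
A minimal sketch of how step 1's config.yaml might be loaded. The key names and default values here are hypothetical; check config.yaml in the repo for the actual schema.

```python
# Hypothetical config loading sketch; key names are assumptions,
# not the repo's actual schema.
import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

# Fields one would expect for this pipeline (all hypothetical):
model_name = config.get("model_name", "gpt2-large")  # victim model
num_samples = config.get("num_samples", 20000)       # generations to draw
seq_len = config.get("seq_len", 256)                 # tokens per sample
top_k = config.get("top_k", 40)                      # sampling truncation
print(model_name, num_samples, seq_len, top_k)
```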
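
For step 2, a minimal single-GPU sampling sketch in the spirit of inference.py, using Hugging Face transformers. The model choice and decoding hyperparameters are illustrative, not necessarily what inference.py uses.

```python
# Single-GPU sampling sketch; hyperparameter values are illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large").to(device).eval()

# Unconditional generation from the BOS token; the paper also seeds
# generation with short prefixes sampled from public text.
input_ids = torch.tensor([[tokenizer.bos_token_id]]).to(device)
with torch.no_grad():
    outputs = model.generate(
        input_ids,
        do_sample=True,           # stochastic decoding
        top_k=40,                 # truncate to the 40 most likely tokens
        max_length=256,           # tokens per sample
        num_return_sequences=5,   # samples per batch
        pad_token_id=tokenizer.eos_token_id,
    )
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```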
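
For step 3, a sketch of one reranking signal from Carlini et al. (2020): the ratio of the victim model's perplexity to the zlib-compressed length of the sample, where a lower ratio flags a more likely memorized sequence. rerank.py may combine several such metrics.

```python
# Sketch of the perplexity / zlib-entropy membership signal from the
# paper; rerank.py may use additional or different metrics.
import zlib
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large").to(device).eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token NLL
    return torch.exp(loss).item()

def zlib_entropy(text: str) -> int:
    # Compressed byte length as a cheap proxy for textual entropy.
    return len(zlib.compress(text.encode("utf-8")))

samples = ["example generated sequence one", "example generated sequence two"]
ranked = sorted(samples, key=lambda s: perplexity(s) / zlib_entropy(s))
print(ranked[0])  # most suspicious candidate first
```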

References

  • Nicholas Carlini et al., "Extracting Training Data from Large Language Models" (arXiv:2012.07805), USENIX Security Symposium, 2021.

Contribution

  • Prevents oversampling during prefix selection (see the sketch after this list)
  • Speeds up inference with parallel multi-GPU usage (gpt2-large only; sketched below)
  • Frees GPU VRAM after each task finishes (sketched below)
  • Rules out low-quality repeated generations with a repetition penalty and an n-gram repetition restriction (sketched below)
  • Supports the T5 encoder-decoder as the victim model (sketched below)
  • Speeds up reranking with parallel multi-GPU usage (same pattern as the inference sketch below)
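
On preventing oversampling during prefix selection: one way to guarantee it is to draw prefixes without replacement, as sketched below. The prefix pool and selection logic here are assumptions, not the repo's actual mechanism.

```python
# Sketch of drawing prefixes without replacement; the prefix source
# below is hypothetical.
import random

# Hypothetical pool of candidate prefixes scraped from public text.
prefix_pool = [f"prefix {i}" for i in range(100_000)]

# random.sample draws without replacement, so no prefix is selected
# twice within a run, preventing oversampling of any single prefix.
selected = random.sample(prefix_pool, k=20_000)
print(len(selected), len(set(selected)))  # equal: no duplicates
```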
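
On parallel multi-GPU inference: a data-parallel sketch that runs one sampling process per GPU via torch.multiprocessing. parallel_inference.py may be structured differently; the same pattern would also apply to multi-GPU reranking.

```python
# Data-parallel sampling sketch, one process per GPU; the work split
# and output format are assumptions.
import torch
import torch.multiprocessing as mp
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def sample_on_gpu(rank: int, samples_per_gpu: int):
    device = f"cuda:{rank}"
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
    model = GPT2LMHeadModel.from_pretrained("gpt2-large").to(device).eval()
    input_ids = torch.tensor([[tokenizer.bos_token_id]]).to(device)
    outputs = model.generate(
        input_ids, do_sample=True, top_k=40, max_length=256,
        num_return_sequences=samples_per_gpu,
        pad_token_id=tokenizer.eos_token_id,
    )
    texts = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
    torch.save(texts, f"samples_rank{rank}.pt")  # merge shards afterwards

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(sample_on_gpu, args=(10,), nprocs=world_size)
```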
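
On freeing GPU VRAM between stages: dropping the model reference and flushing PyTorch's allocator cache releases the memory for the next task. A minimal sketch; the repo's exact cleanup may differ.

```python
# VRAM cleanup sketch between pipeline stages.
import gc
import torch

model = torch.nn.Linear(10, 10).cuda()  # stand-in for the victim model
# ... run the sampling task ...
del model                     # drop the only reference to the model
gc.collect()                  # collect unreferenced CUDA tensors
torch.cuda.empty_cache()      # return cached VRAM to the driver
print(torch.cuda.memory_allocated())  # should now be (near) zero
```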
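
On ruling out low-quality repeated generations: repetition_penalty and no_repeat_ngram_size are standard transformers generate arguments; the values below are illustrative, not necessarily what this repo uses.

```python
# Decoding-constraint sketch; penalty and n-gram values are illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large").to(device).eval()

input_ids = torch.tensor([[tokenizer.bos_token_id]]).to(device)
outputs = model.generate(
    input_ids,
    do_sample=True,
    top_k=40,
    max_length=256,
    repetition_penalty=1.3,   # down-weight tokens already generated
    no_repeat_ngram_size=3,   # forbid repeating any 3-gram verbatim
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```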
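
On T5 support: T5 is an encoder-decoder, so generation conditions on an encoder input rather than a plain left-to-right prefix. A sketch with an illustrative prompt and checkpoint; the repo's actual setup may differ.

```python
# T5 victim-model sketch; prompt and checkpoint are illustrative.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = T5Tokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large").to(device).eval()

# The prefix goes through the encoder; the decoder then samples freely.
enc = tokenizer("The quick brown", return_tensors="pt").to(device)
outputs = model.generate(**enc, do_sample=True, top_k=40, max_length=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```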


License: MIT License

