Datasets for the SaTML 2023 challenge on training data extraction
This repository contains the raw datasets for the Training Data Extraction Challenge organized at SaTML 2023.
The main repository provides the challenge data as a list of pointers into The Pile.
To save participants from downloading and decompressing 800GB of text, the raw numpy files are provided here:
Train
- train_prefix.npy (1.4 MB)
- train_suffix.npy (1.4 MB)
- train_preprefix.npy (2.9 MB)
- train_dataset.npy (5.7 MB)
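As a minimal sketch of how the files listed above might be inspected after downloading, the snippet below loads whichever of them are present in a local directory and prints each array's shape and dtype. The helper name `load_split` and the assumption that the files sit together in one directory are illustrative and not part of the challenge code; the array layout is not documented here, so the snippet only reports what it finds.

```python
from pathlib import Path

import numpy as np

# File name parts taken from the list above; keeping all files in a single
# directory is an assumption made for this example.
PARTS = ("prefix", "suffix", "preprefix", "dataset")


def load_split(data_dir, split="train"):
    """Load every `<split>_<part>.npy` file found in `data_dir` into a dict.

    Hypothetical helper: files that are missing are skipped, not reported.
    """
    arrays = {}
    for part in PARTS:
        path = Path(data_dir) / f"{split}_{part}.npy"
        if path.exists():
            arrays[part] = np.load(path)
    return arrays


if __name__ == "__main__":
    # Print the shape and dtype of each file found in the current directory.
    for name, array in load_split(".").items():
        print(name, array.shape, array.dtype)
```

Passing `split="val"` or `split="test"` would work the same way once those files exist, assuming they follow the same naming pattern.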
Val
Will be added once the validation set is released.
Test
Will be added once the test set is released.