Datasets for the SaTML 2023 challenge on training data extraction

This repository contains the raw datasets for the Training Data Extraction Challenge organized at SaTML 2023.

The main repository provides the challenge data as a list of pointers into The Pile.

To spare participants from downloading and decompressing 800GB of text, the raw numpy files are provided here:

Train

Will be added once the validation set is released.

Will be added once the validation set is released.
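
Once a file has been downloaded, it can be opened with NumPy. The following is a minimal sketch, not an official loader: the file name `train_dataset.npy` and the helper `load_array` are hypothetical, and the small array written at the start is only a stand-in so the example runs on its own. The shape and dtype of the real files are not stated here, so inspect them after loading.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_array(path):
    """Open a .npy file without reading it fully into memory."""
    # mmap_mode="r" maps the file read-only; data is read lazily on access.
    return np.load(path, mmap_mode="r")


# Stand-in file so the sketch is self-contained. In practice, point
# `demo_path` at a file downloaded from this repository instead.
demo_dir = Path(tempfile.mkdtemp())
demo_path = demo_dir / "train_dataset.npy"  # hypothetical file name
np.save(demo_path, np.arange(12, dtype=np.uint16).reshape(3, 4))

data = load_array(demo_path)
print(data.shape, data.dtype)  # inspect before assuming any layout

# Copy a single row into regular memory for further processing.
first_row = np.asarray(data[0])
```

Memory mapping is a design choice for large files: only the rows that are actually indexed get read from disk.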

Apache License 2.0