ethz-privsec / lm-extraction-benchmark-data

Datasets for the SaTML 2023 competition on training data extraction

Datasets for the SaTML 2023 challenge on training data extraction

This repository contains the raw datasets for the Training Data Extraction Challenge organized at SaTML 2023.

The main repository provides the challenge data as a list of pointers into The Pile.

To save participants from downloading and decompressing 800GB of text, this repository provides the raw NumPy files:

Train

Val

Will be added once the validation set is released.

Test

Will be added once the test set is released.
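
As a minimal sketch of how one of these raw files might be inspected once downloaded: the snippet below assumes each file is a 2-D integer array of token IDs with one row per example, which this README does not state. The file name `example_split.npy` and the helper `load_token_array` are hypothetical, and the script writes a small stand-in array so that it runs without the real data.

```python
import os
import tempfile

import numpy as np


def load_token_array(path):
    """Load a .npy file and check the layout assumed here (2-D, integer dtype)."""
    arr = np.load(path)
    if arr.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {arr.shape}")
    if not np.issubdtype(arr.dtype, np.integer):
        raise ValueError(f"expected an integer dtype, got {arr.dtype}")
    return arr


# Stand-in data so the sketch runs without the real files.
# Replace `path` with the location of a downloaded file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_split.npy")  # hypothetical file name
    np.save(path, np.arange(12, dtype=np.int64).reshape(3, 4))

    tokens = load_token_array(path)
    print("examples:", tokens.shape[0])
    print("tokens per example:", tokens.shape[1])
    print("dtype:", tokens.dtype)
```

Turning the rows back into text would additionally require the tokenizer used by the challenge, which this README does not name.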

License: Apache License 2.0