p-lambda / dsir

DSIR large-scale data selection framework for language model training

Home Page: https://arxiv.org/abs/2302.03169

Heuristic classification method selections

mataney opened this issue

Great work!
Thanks for releasing your DSIR-selected dataset through huggingface.

Any plans to release the dataset selected by the heuristic classification method as well?

That would be great for reproduction efforts.

Hi, sorry for the delay. The heuristic classification dataset is at https://huggingface.co/datasets/stanford-crfm/heuristic_classification-filtered-pile-50M.

Note that I've also updated the DSIR-selected dataset to merge the validation and test sets into the train set, which reflects how it was used in the paper. The heuristic classification dataset also has only a train set. If you want to recover the previous validation and test sets, they should be the last two 50k chunks of the train set.
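For anyone who wants the old splits back, here is a minimal sketch of slicing the last two 50k chunks off the merged train set. The helper names `tail_split_ranges` and `recover_splits` are hypothetical, and the sketch assumes the validation chunk comes before the test chunk, which is not stated above, so swap them if that turns out to be wrong. `repo_id` is whatever Hugging Face dataset id you are loading.

```python
from typing import Dict

CHUNK_SIZE = 50_000  # "50k chunks", per the reply above


def tail_split_ranges(num_rows: int, chunk_size: int = CHUNK_SIZE) -> Dict[str, range]:
    """Return index ranges that carve the last two chunks off the train set.

    Assumption: validation precedes test; the thread does not state the order.
    """
    if num_rows < 2 * chunk_size:
        raise ValueError("dataset is smaller than two chunks")
    val_start = num_rows - 2 * chunk_size
    test_start = num_rows - chunk_size
    return {
        "train": range(0, val_start),
        "validation": range(val_start, test_start),
        "test": range(test_start, num_rows),
    }


def recover_splits(repo_id: str):
    """Hypothetical helper: load the merged train split and re-split it."""
    # Imported lazily so the index logic above works without `datasets` installed.
    from datasets import load_dataset

    merged = load_dataset(repo_id, split="train")
    ranges = tail_split_ranges(len(merged))
    # Dataset.select takes an iterable of row indices.
    return {name: merged.select(idx) for name, idx in ranges.items()}
```

Calling `recover_splits(repo_id)` downloads the full train split first, so for a quick check of the boundaries it is cheaper to call `tail_split_ranges` with the known row count.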