Heuristic classification method selections
mataney opened this issue
Great work!
Thanks for releasing your DSIR-selected dataset through Hugging Face.
Any plans to release the dataset selected by the heuristic classification method as well?
That would be great for reproduction efforts.
Hi, sorry for the delay. The heuristic classification dataset is at https://huggingface.co/datasets/stanford-crfm/heuristic_classification-filtered-pile-50M.
Note that I've also updated the DSIR-selected dataset to merge the validation and test sets into the train set, which reflects how it was used in the paper. The heuristic classification dataset likewise has only a train set. If you want to recover the previous validation and test sets, they should be the last two 50k chunks of the train set.
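For anyone who wants to recover the old splits, here is a minimal sketch of the index arithmetic. It makes two assumptions that are not confirmed above: that "50k" means exactly 50,000 rows per chunk, and that the validation chunk comes before the test chunk. The function name `last_two_chunk_ranges` is made up for illustration.

```python
def last_two_chunk_ranges(n_rows, chunk_size=50_000):
    """Split row indices into (train, validation, test) ranges.

    Assumption: the former validation and test sets are the last two
    chunks of the merged train set, in that order, each `chunk_size` rows.
    """
    if n_rows < 2 * chunk_size:
        raise ValueError("dataset has fewer rows than two chunks")
    test_start = n_rows - chunk_size
    val_start = test_start - chunk_size
    return (
        range(0, val_start),            # remaining train rows
        range(val_start, test_start),   # assumed former validation rows
        range(test_start, n_rows),      # assumed former test rows
    )


# Possible usage with the Hugging Face `datasets` library (not run here;
# fill in the repository id of the DSIR-selected dataset yourself):
#
#   from datasets import load_dataset
#   merged = load_dataset("<dataset repo id>", split="train")
#   train_idx, val_idx, test_idx = last_two_chunk_ranges(len(merged))
#   validation = merged.select(val_idx)
#   test = merged.select(test_idx)
```

If the two chunks turn out to be stored in the opposite order, swapping the last two returned ranges is enough.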