This project is part of EleutherAI's quest to create a massive repository of high-quality text data for training language models.
Very briefly, OpenWebText2 is a large, filtered dataset of text documents scraped from URLs found in Reddit submissions.
The plug and play version of OpenWebText2 contains:
- 17,103,059 documents
- 65.86 GB of uncompressed text
Download Dataset / Documentation
For further information please visit our documentation.
researcher2 wrote much of this code, with inspiration from, and some direct copying of, the scraping code found here.
sdtblck kindly put together the Colab notebook, and performed a chunk of the scraping.
leogao2 provided overall design guidance, lm_dataformat, and performed another chunk of scraping.
Colaboratory VMs helped us with about 10% of our overall scraping.
The Eye hosts our processed datasets.
Read The Docs hosts our documentation.