Books & Documents:
https://huggingface.co/datasets/the_pile_books3
Description: This dataset is Shawn Presser's work and is part of EleutherAI's The Pile dataset.
It contains all of Bibliotik in plain .txt form, i.e. 197,000 books processed in exactly the same way as for bookcorpusopen (a.k.a. books1).
On s3: Not yet.
Converted to training format: not yet
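Sketch for the "Converted to training format" step on plain .txt book dumps. The target format is not specified in these notes, so this assumes JSONL with one record per book and a "text" field; the function name, the "source" field and the paths are hypothetical.

```python
import json
from pathlib import Path


def txt_dir_to_jsonl(src_dir, out_path):
    """Write one JSON line per non-empty .txt file under src_dir; return the record count."""
    src = Path(src_dir)
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # Sort so the record order is the same on every run.
        for path in sorted(src.rglob("*.txt")):
            # Replace undecodable bytes instead of aborting on one bad file.
            text = path.read_text(encoding="utf-8", errors="replace").strip()
            if not text:
                continue  # skip empty files
            # Assumed record layout: "text" plus the relative path as "source".
            record = {"text": text, "source": path.relative_to(src).as_posix()}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Usage would look like txt_dir_to_jsonl("books3/", "books3.jsonl"), with both paths being placeholders. Keep the output file outside the source directory or give it a non-.txt extension so it is not picked up as input.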
https://the-eye.eu/libraries.html
Description: libgen & zlib
On s3: yes
Converted to training format: not yet
https://archive.org/details/fanfictiondotnet_repack
https://archive.org/details/Fanfictiondotnet1011dump
To do: fanfiction.net IDs 11M+ should get scraped.
Description: dump of fanfiction.net. Many short stories, books, ...
On s3: Yes
Converted to training format: not yet
https://the-eye.eu/public/Random/torrents/archiveorg_DjVuTXT_Part1.torrent
Description: 16M ebooks from the Internet Archive (IA)
On s3: Not Yet
Converted to training format: not yet
https://the-eye.eu/public/Books/
Description: 5M+ ebooks from different domains
On s3: Not Yet
Converted to training format: not yet
All ebook torrents from piratebay: https://pirate-bays.net/search?q=ebooks
Description: many different ebook torrents
On s3: Not Yet
Converted to training format: not yet
https://the-eye.eu/public/Site-Dumps/campdivision.com/camp/Text%20Files/Miscellaneous%20Texts/
https://the-eye.eu/public/Site-Dumps/campdivision.com/camp/Text%20Files/PDF/
Description: many TV captions / subtitles - need to be checked
On s3: Yes.
Converted to training format: not yet
https://huggingface.co/datasets/bookcorpusopen
https://huggingface.co/datasets/demelin/moral_stories
Description: many TV captions / subtitles - need to be checked
On s3: Yes.
Converted to training format: not yet
Large-scale Webtext:
https://huggingface.co/datasets/oscar
https://huggingface.co/datasets/mc4
https://huggingface.co/datasets/the_pile
https://huggingface.co/datasets/spanish_billion_words
https://huggingface.co/datasets/arabic_billion_words
https://huggingface.co/datasets/olm/wikipedia
https://huggingface.co/datasets/cc100
https://files.pushshift.io/reddit/comments/ (paper: https://arxiv.org/abs/2001.08435)
Description: Reddit comment dumps
On s3: Not yet
Converted to training format: not yet
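Sketch for pulling comment text out of one Reddit dump file. Assumptions: the file is newline-delimited JSON with the comment text in a "body" field, and it is either uncompressed or compressed as .bz2 / .xz. Files compressed as .zst need a third-party decompressor and are not handled here. The function name is hypothetical.

```python
import bz2
import json
import lzma


def iter_comment_bodies(path):
    """Yield the text of each usable comment in one newline-delimited JSON dump file."""
    if path.endswith(".bz2"):
        opener = bz2.open
    elif path.endswith(".xz"):
        opener = lzma.open
    else:
        opener = open  # assume the file is already decompressed
    with opener(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                comment = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines instead of stopping
            if not isinstance(comment, dict):
                continue
            body = comment.get("body", "")
            # Drop placeholders left behind by deleted or removed comments.
            if body and body not in ("[deleted]", "[removed]"):
                yield body
```

The generator streams line by line, so a full dump never has to fit in memory.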
https://the-eye.eu/public/social/twitter/
Code:
https://huggingface.co/datasets/bigcode/the-stack-dedup
https://huggingface.co/datasets/code_search_net
https://huggingface.co/datasets/codeparrot/github-code
Law:
https://openreview.net/forum?id=3HCT3xfNm9r
https://huggingface.co/datasets/pile-of-law/pile-of-law
Scientific papers:
Translation:
https://huggingface.co/datasets/opus100
Public domain ebooks (67k), categorised by LoCC, multilingual collection:
https://www.gutenberg.org/ebooks/