sandorkonya/interesting-text-datasets

Books & Documents:

https://huggingface.co/datasets/the_pile_books3

Description: This dataset is Shawn Presser's work and is part of EleutherAI's The Pile dataset.

It contains all of Bibliotik in plain .txt form, i.e. about 197,000 books processed in exactly the same way as bookcorpusopen (a.k.a. books1).

On s3: Not yet.

Converted to training format: not yet
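The list does not say what the "training format" is. As a minimal sketch, assuming it means one JSON object per document in a JSONL file, a directory of plain .txt books could be converted as follows. The function name `txt_dir_to_jsonl`, the field names `source` and `text`, and the `min_chars` filter are hypothetical choices for illustration, not something taken from this list.

```python
import json
from pathlib import Path


def txt_dir_to_jsonl(src_dir, out_path, min_chars=1):
    """Write one JSON record per .txt file under src_dir to out_path.

    Assumed (hypothetical) record layout: {"source": <relative path>, "text": <file contents>}.
    Returns the number of records written.
    """
    src = Path(src_dir)
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # Sort for a deterministic output order.
        for path in sorted(src.rglob("*.txt")):
            # Replace undecodable bytes instead of failing on a bad file.
            text = path.read_text(encoding="utf-8", errors="replace").strip()
            if len(text) < min_chars:
                continue  # skip empty or near-empty files
            record = {"source": path.relative_to(src).as_posix(), "text": text}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Calling `txt_dir_to_jsonl("books/", "books.jsonl")` would write one line per book; both paths here are placeholders.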

https://the-eye.eu/libraries.html

Description: libgen & zlib

On s3: Yes

Converted to training format: not yet

https://archive.org/details/fanfictiondotnet_repack

https://archive.org/details/Fanfictiondotnet1011dump

Stories on fanfiction.net with IDs above 11M still need to be scraped.

Description: dump of fanfiction.net; many short stories, books, ...

On s3: Yes

Converted to training format: not yet

https://the-eye.eu/public/Random/torrents/archiveorg_DjVuTXT_Part1.torrent

Description: 16M ebooks from the Internet Archive (IA)

On s3: Not Yet

Converted to training format: not yet

https://the-eye.eu/public/Books/

Description: 5M+ ebooks from different domains

On s3: Not Yet

Converted to training format: not yet

All ebook torrents from The Pirate Bay: https://pirate-bays.net/search?q=ebooks

Description: many different ebook torrents

On s3: Not Yet

Converted to training format: not yet

https://the-eye.eu/public/Site-Dumps/campdivision.com/camp/Text%20Files/Miscellaneous%20Texts/

https://the-eye.eu/public/Site-Dumps/campdivision.com/camp/Text%20Files/PDF/

Description: many TV captions / subtitles - need to be checked

On s3: Yes.

Converted to training format: not yet

https://huggingface.co/datasets/bookcorpusopen

https://huggingface.co/datasets/demelin/moral_stories

Subs: https://the-eye.eu/public/Random/archive.org_dumps/archive.org_tvarchive_CaptionProject_December1st2022.tar.zst

Description: many TV captions / subtitles - need to be checked

On s3: Yes.

Converted to training format: not yet

Large-scale web text:

https://huggingface.co/datasets/oscar

https://huggingface.co/datasets/mc4

https://huggingface.co/datasets/the_pile

https://huggingface.co/datasets/spanish_billion_words

https://huggingface.co/datasets/arabic_billion_words

https://huggingface.co/datasets/olm/wikipedia

https://huggingface.co/datasets/cc100

https://files.pushshift.io/reddit/comments/

https://arxiv.org/abs/2001.08435

Description: Reddit comments dumps

On s3: Not yet

Converted to training format: not yet
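As a rough sketch of how the Reddit comment dumps might be turned into plain text: this assumes each dump, once decompressed, is newline-delimited JSON with the comment text in a `body` field. Decompression is left out because the archive format varies between files. The helper name `iter_comment_texts` and the choice to drop `[deleted]` and `[removed]` placeholders are illustrative assumptions.

```python
import json


def iter_comment_texts(lines, skip=("[deleted]", "[removed]")):
    """Yield the comment text from an iterable of JSON lines.

    Assumes one JSON object per line with a string "body" field.
    Malformed lines and placeholder bodies are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate truncated or corrupt lines
        if not isinstance(obj, dict):
            continue
        body = obj.get("body")
        if isinstance(body, str) and body not in skip:
            yield body
```

Any iterable of strings works as input, for example an open text file handle over an already decompressed dump.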

https://the-eye.eu/public/social/twitter/

Code:

https://huggingface.co/datasets/bigcode/the-stack-dedup

https://huggingface.co/datasets/code_search_net

https://huggingface.co/datasets/codeparrot/github-code

Law:

https://openreview.net/forum?id=3HCT3xfNm9r

https://huggingface.co/datasets/pile-of-law/pile-of-law

Scientific papers:

Translation:

https://huggingface.co/datasets/opus100

Public domain ebooks (67k), categorised by LoCC (Library of Congress Classification), multilingual collection: https://www.gutenberg.org/ebooks/
