EleutherAI / the-pile

Suggested corpus: Adult stories

johnflux opened this issue

I have a corpus of ~10GB of adult stories, in English, in plain text, taken primarily from asstr.org and Literotica.
I think it would be interesting to incorporate these into the training set as well.

commented

@johnflux I would look in the Pile paper, page 22, excluded datasets.
https://arxiv.org/abs/2101.00027
https://arxiv.org/pdf/2101.00027.pdf

One of your data sources is directly named and excluded there, and the other one probably follows the same rationale. Their reasons for excluding these were quite different from the reasons I would have had if it were my choice (my rationale is x in, x out, where x = {copyright infringement, nsfw content}), but they had a more scientific rationale that you can read there.