EleutherAI / the-pile

Code generation

6r1d opened this issue · comments

Hello. As far as I understand, and correct me if I'm wrong, it's possible to add more sample texts on this topic (code generation) to new versions of The Pile.

This is a list I can think of right now. It is probably wrong in many respects, since I have no experience in preparing such datasets. If scraping some of these sites is considered useful, I can try to help.

I'm sure I'm missing quite a few good ideas here. There are many algorithm implementations inside the source code of programming languages themselves (Python's "batteries", for example), and there are many libc implementations worth a look as well.

UPD: I'm reading the paper "The Pile: An 800GB Dataset of Diverse Text for Language Modeling" and I've noticed that GitHub and StackExchange were scraped already, though I'll leave the issue to discuss the other sites. It's not much, but I think those will be nice to have.

At the moment, I don't think new additions are being accepted (@StellaAthena would know more). What helped us when we were designing The Pile, though, was determining the size and quality of each dataset before we started scraping. For the sources you listed, a rough estimate of the usable text size (in GB) would be a great first step for evaluation.
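For what it's worth, here is a minimal sketch of one way to get such a rough estimate: download a small sample of pages from a source, measure how much text survives a basic filter, and extrapolate to the full document count. The function names, the `min_chars` threshold, and the whitespace-collapsing filter are all assumptions made for illustration, not how The Pile was actually built.

```python
def usable_text_bytes(text: str, min_chars: int = 200) -> int:
    """Return the UTF-8 size of one document, or 0 if it is too short to keep.

    The `min_chars` cutoff is a hypothetical quality filter, not a value
    taken from The Pile.
    """
    cleaned = " ".join(text.split())  # collapse runs of whitespace
    if len(cleaned) < min_chars:
        return 0
    return len(cleaned.encode("utf-8"))


def estimate_total_gb(sample_texts, total_documents: int, min_chars: int = 200) -> float:
    """Extrapolate usable text size (in GB, 1 GB = 1e9 bytes) from a sample.

    Assumes the sample is representative of the whole source.
    """
    if not sample_texts:
        raise ValueError("need at least one sampled document")
    sample_bytes = sum(usable_text_bytes(t, min_chars) for t in sample_texts)
    mean_bytes = sample_bytes / len(sample_texts)
    return mean_bytes * total_documents / 1e9


if __name__ == "__main__":
    # Toy sample: one long document and one that the filter rejects.
    sample = ["a" * 1000, "short"]
    print(estimate_total_gb(sample, total_documents=2_000_000))
```

The estimate is only as good as the sample, so it would make sense to draw pages at random rather than taking the first few, and to report the sample size alongside the number.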