EleutherAI / the-pile

The Eye

Robbie-chew opened this issue

The Eye is a platform dedicated to archiving any and all kinds of data.

They say they have 140 TB in total in assorted formats, and a good fraction seems to be text.

https://the-eye.eu/public/

Unfortunately, because all of their size estimates seem to be "pending update", it is difficult to give an exact estimate of how much of this is textual.

I believe the team has contacted folks at the Eye. The Bibliotik component is from them. Do they have other big text datasets that you know of?

Indeed, we are in contact with them and have gotten datasets from them. Long term we are working on hosting a copy of all of the data in the Pile on their systems.

Are there any specific datasets you recommend?

Okay. I’m going to tentatively close this issue, but feel free to suggest additional datasets in the future, either on GitHub or on Discord.