EleutherAI / the-pile

Move processing code to this repo

StellaAthena opened this issue

Having a whole bunch of repositories scattered across GitHub for processing code is no bueno. We should really make a directory in this repo for housing them. If people want to keep theirs off-repo that's fine, but I really don't see why we shouldn't house them here.

I've assigned people who have been loud about this in the past to this issue.

A sub-directory for each project would be great. I don't think we should pollute the main EleutherAI github namespace for each data pull, especially since some of the pull codes are rather small. However, it might be nice to have a repo just for data processing -- this way the development of the Pile itself can proceed independent of processing and adding a contributor for commits makes more sense than here.

@thoppe I think there’s something called “subrepositories” on GitHub. To be clear, you just mean making a directory and putting the code in it? I would actually recommend a two-layer system:

the-pile/
— data processing/
— — Wikipedia/
— — — main.py
— — arXiv/
— — — main.py

Some of the processing code is not all in one file, which is why I’m recommending this. We can look at consolidating each script into a single file though, if people dislike the added layer. (I know @leogao2 has strong opinions about the number of clicks to get to things).
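To make the two-layer idea concrete, here is a minimal sketch of how tooling could discover each dataset's entry point under that layout. The function name `find_entry_points` is hypothetical, and it assumes every dataset directory exposes a script named `main.py` as in the tree above; nothing in the repo defines this yet.

```python
from pathlib import Path
from typing import Dict


def find_entry_points(root: Path, entry_name: str = "main.py") -> Dict[str, Path]:
    """Map each dataset subdirectory under `root` to its entry-point script.

    Assumes the two-layer layout: root/<dataset>/<entry_name>.
    Subdirectories without an entry script are skipped.
    """
    entry_points: Dict[str, Path] = {}
    for dataset_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        entry = dataset_dir / entry_name
        if entry.is_file():
            entry_points[dataset_dir.name] = entry
    return entry_points
```

Because each dataset gets its own directory, helper modules can sit next to `main.py` without affecting discovery, which is the main reason for the extra layer.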

> However, it might be nice to have a repo just for data processing -- this way the development of the Pile itself can proceed independent of processing and adding a contributor for commits makes more sense than here.

This is interesting, and something I hadn’t considered. I’m not sure how much sense it makes though... the-pile proper is closely tied in with the data processing code. What would it look like for the-pile and the data processing to diverge? Isn’t this the data processing for the pile?

Sorry that wasn't clear, your statement of a "two-layer system" is what I had envisioned. 👍

Furthermore, I suggest it be moved to its own repo: 1) for consolidation, since right now there are scripts all over the place; 2) if data processing had its own repo, we, the data collectors, could push to it without regard to the final stage of the pipeline (which is what The-Pile looks like). Permissions for the data collection repo can be more permissive than this one. Since @bmk is managing the final stages of the data, it might be useful and less cognitive load to split them up.

We have decided that we will copy processing code into the EleutherAI GitHub but not into this directory specifically. We may make a “data processing” directory in the future that contains each data processing code base as a subdirectory.