Move processing code to this repo

StellaAthena opened this issue 4 years ago

Having a whole bunch of repositories scattered across GitHub for processing code is no bueno. We should really make a directory in this repo for housing them. If people want to keep theirs off-repo that's fine, but I really don't see why we shouldn't house them here.

I've assigned people who have been loud about this in the past to this issue.

Travis Hoppe · Answer 1 · Thu Sep 17 2020 02:56:04 GMT+0800 (China Standard Time)

A sub-directory for each project would be great. I don't think we should pollute the main EleutherAI GitHub namespace for each data pull, especially since some of the pull codes are rather small. However, it might be nice to have a repo just for data processing -- this way the development of the Pile itself can proceed independent of processing and adding a contributor for commits makes more sense than here.

Stella Biderman · Answer 2 · Thu Sep 17 2020 03:22:29 GMT+0800 (China Standard Time)

@thoppe I think there’s something called “subrepositories” on GitHub. To be clear, you just mean making a directory and putting the code in it? I would actually recommend a two-layer system:

the-pile/
— data processing/
— — Wikipedia/
— — — main.py
— — arXiv/
— — — main.py

Some of the processing code is not all in one file, which is why I’m recommending this. We can look at consolidating each script into a single file though, if people dislike the added layer. (I know @leogao2 has strong opinions about the number of clicks to get to things).
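To make the two-layer layout above concrete, here is a minimal sketch, not code from this repo. It assumes each dataset lives in its own sub-directory under a "data processing" directory and exposes a `main.py` entry point, as in the tree above. The function names `build_example_layout` and `find_processing_scripts` are hypothetical and exist only for illustration.

```python
from pathlib import Path
import tempfile


def build_example_layout(repo_root, datasets=("Wikipedia", "arXiv"),
                         processing_dir="data processing"):
    """Create the illustrative two-layer layout with an empty main.py per dataset."""
    for name in datasets:
        dataset_dir = Path(repo_root) / processing_dir / name
        dataset_dir.mkdir(parents=True, exist_ok=True)
        (dataset_dir / "main.py").touch()


def find_processing_scripts(repo_root, processing_dir="data processing"):
    """Map each dataset sub-directory name to its main.py, if one exists."""
    base = Path(repo_root) / processing_dir
    scripts = {}
    for dataset_dir in sorted(p for p in base.iterdir() if p.is_dir()):
        entry = dataset_dir / "main.py"
        # Sub-directories without a main.py are skipped rather than treated as errors.
        if entry.is_file():
            scripts[dataset_dir.name] = entry
    return scripts


if __name__ == "__main__":
    # Build the layout in a throwaway directory and list what was found.
    with tempfile.TemporaryDirectory() as tmp:
        build_example_layout(tmp)
        for dataset, script in find_processing_scripts(tmp).items():
            print(dataset, "->", script.relative_to(tmp))
```

With one entry point per sub-directory, a dataset whose processing code spans several files can keep its helpers next to `main.py` without changing how the scripts are discovered.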

Stella Biderman · Answer 3 · Thu Sep 17 2020 03:26:39 GMT+0800 (China Standard Time)

> However, it might be nice to have a repo just for data processing -- this way the development of the Pile itself can proceed independent of processing and adding a contributor for commits makes more sense than here.

This is interesting, and something I hadn’t considered. I’m not sure how much sense it makes though... the-pile proper is closely tied in with the data processing code. What would it look like for the-pile and the data processing to diverge? Isn’t this the data processing for the pile?

Travis Hoppe · Answer 4 · Thu Sep 17 2020 05:17:13 GMT+0800 (China Standard Time)

Sorry that wasn't clear. Your "two-layer system" is what I had envisioned. 👍

Furthermore, I suggest it be moved to its own repo, for two reasons. 1) Consolidation: right now there are scripts all over the place. 2) If data processing had its own repo, we, the data collectors, could push to it without regard to the final stage of the pipeline (which is what The-Pile looks like). Permissions for the data collection repo can be more permissive than this one. Since @bmk is managing the final stages of the data, it might be useful, and less cognitive load, to split them up.

Stella Biderman · Answer 5 · Sun Nov 15 2020 22:40:20 GMT+0800 (China Standard Time)

We have decided that we will copy processing code into the EleutherAI GitHub but not into this directory specifically. In the future, we may make a “data processing” directory that contains each data processing code base as a subdirectory.