Code generation

Question

6r1d opened this issue 3 years ago

Hello. As far as I understand (correct me if I'm wrong), it's possible to add more sample texts on code generation to new versions of The Pile.

This is the list I can think of right now. It is probably wrong in many respects, since I have no experience in preparing such datasets. If scraping some of these sites is considered useful, I can try to help.

Rosetta Code as a set of algorithm implementation examples
Other sites similar to Rosetta Code
Some examples from freeCodeCamp, potentially
YouMightNotNeedJquery, although it might require adding explanations along the way
To be honest, my understanding of what makes a good dataset for browser code is very vague: GPT-3 has created React stuff 1, 2 already
Linux kernel, maybe just some modules
Graphics — VKGuide and Vulkan Tutorial for Vulkan, NeHE for OpenGL code
Shaders — shadertoy.com as a collection of shaders (if it's allowed), 2D SDFs, 3D SDFs for understanding shapes, ray-surface intersectors to improve it
Sound algorithms — musicdsp.org as a set of references, sndkit and Soundpipe as a set of implementations, VCV Rack's "fundamental" part for more implementations of sound algorithms
OSDev wiki

I'm sure I'm missing quite a few good ideas here. There are many algorithm implementations inside programming language source code (Python's batteries, for example), and there are many libc implementations to look at as well.

UPD: I'm reading the paper "The Pile: An 800GB Dataset of Diverse Text for Language Modeling" and I've noticed that GitHub and StackExchange were scraped already, though I'll leave the issue open to discuss the other sites. They don't add up to much, but I think they would be nice to have.

Travis Hoppe · Answer 1 · Tue Apr 27 2021 00:25:11 GMT+0800 (China Standard Time)

At the moment, I don't think new additions are being accepted (@StellaAthena would know more). What helped us when we were designing The Pile, though, was determining the size and quality of each dataset before we started scraping. For the sources you listed, getting a rough estimate of the usable text size (in GB) would be a great place to start for evaluation.
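
A rough size estimate like that could be sketched as follows. This is only an illustration, not the procedure used for The Pile: it assumes the site has already been downloaded to a local directory, and the function name `estimate_text_size_gb` and the default extension list are hypothetical choices.

```python
import os


def estimate_text_size_gb(root, extensions=(".txt", ".md", ".html")):
    """Roughly estimate usable text under `root`, in gigabytes (1 GB = 1e9 bytes).

    Assumption: "usable" here just means files with a text-like extension
    that decode cleanly as UTF-8. Real filtering would be stricter.
    """
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Skip files whose extension is not in the allow-list.
            if not name.lower().endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", encoding="utf-8") as f:
                    text = f.read()
            except (UnicodeDecodeError, OSError):
                # Skip binary or unreadable files instead of failing.
                continue
            # Count bytes after trimming surrounding whitespace.
            total_bytes += len(text.strip().encode("utf-8"))
    return total_bytes / 1e9
```

Calling `estimate_text_size_gb("path/to/downloaded/site")` on each candidate source gives a first number to compare. For HTML pages it overestimates, since markup is counted as text.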