bigcode-project / the-stack-v2

Code for the curation of The Stack v2 and StarCoder2 training data

The Stack v2 & StarCoder2Data

This repository contains the code for building The Stack v2 dataset, as well as the extra sources used to make StarCoder2Data, the training corpus for the StarCoder2 family of models.

This repository is a follow-up to the work in bigcode-dataset, which was used for The Stack v1 and StarCoderData.

About

License: Apache License 2.0


Languages

Jupyter Notebook 50.0%, Python 47.6%, Shell 2.4%