A dataset for programming language identification.
- Available languages are fetched from github/linguist's languages.yml and acmeism/RosettaCodeData's Lang.yaml.
- For each language, initial samples are fetched from GitHub as follows:
  - The GitHub Search API is used to get a list of repositories.
  - Each repository is cloned and its languages are detected with github/linguist.
  - One sample is added from each repository.
- Samples are later reviewed by humans.
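The repository-search step above can be sketched as follows. The query and sort parameters are illustrative assumptions, not necessarily the ones the harvester uses:

```python
from urllib.parse import urlencode

def search_repos_url(language: str, page: int = 1) -> str:
    """Build a GitHub Search API URL listing repositories for a language.

    The query and sort choices here are assumptions for illustration;
    the actual harvesting tool may use different parameters.
    """
    params = {"q": f"language:{language}", "sort": "stars", "page": page}
    return "https://api.github.com/search/repositories?" + urlencode(params)
```

The returned URL can then be fetched with any HTTP client; authenticated requests get a higher rate limit.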
Rules for sample inclusion are:

- No more than one sample from each repository.
- Samples are at least 500 B and at most 100 kB.
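The inclusion rules can be expressed as a small filter; the function name and structure below are an illustrative sketch, not the dataset's actual tooling:

```python
MIN_SIZE = 500          # bytes
MAX_SIZE = 100 * 1024   # bytes

def eligible(sample: bytes, repo: str, seen_repos: set) -> bool:
    """Check a candidate sample against the inclusion rules (sketch)."""
    if repo in seen_repos:
        return False  # no more than one sample per repository
    return MIN_SIZE <= len(sample) <= MAX_SIZE
```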
The dataset is stored in the `data` directory. It contains:

- `meta.yml`: metadata about the dataset and available languages.
- `dataset.yml`: collection of all samples, with sample paths relative to `data`.
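As a sketch, sample paths can be resolved against the data directory like this. The dictionary shape below is a hypothetical stand-in for the parsed dataset.yml; the real schema may differ:

```python
from pathlib import Path

# Hypothetical parsed content of data/dataset.yml; the real schema may differ.
dataset = {
    "samples": [
        {"language": "Python", "path": "Python/example_1.py"},
    ],
}

def sample_paths(data_dir: str, dataset: dict) -> list:
    """Resolve each sample's path relative to the data directory."""
    return [Path(data_dir) / s["path"] for s in dataset["samples"]]
```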
Check a summary of the dataset in REPORT.md.
See CONTRIBUTING.md for contribution guidelines.
The `tools` directory contains various Python utilities to maintain the dataset:

- `tools/gen_meta.py`: Generates `data/meta.yml`. This is only needed when upgrading to a new github/linguist or acmeism/RosettaCodeData version.
- `tools/harvest.py`: Fetches samples from GitHub.
- `tools/vote.py`: Updates the `vote` annotation.
- `tools/lint.py`: Checks the dataset for potential problems.
- `tools/prepare_commit.py`: Updates generated files; required before any commit.
- `tools/classify_linguist.py`: Updates linguist labels.
- `tools/classify_pygments.py`: Updates pygments labels.
To run the tools, first create the virtual environment:

```shell
pip install poetry
poetry install
```

Then run a tool with `python -m`:

```shell
poetry run python -m tools.gen_meta
```
Each sample in `data` has its own license. Check the origin repository for details.
Everything else is licensed under the MIT License.