smola / language-dataset

Dataset for programming language identification.

Methodology

Samples are included according to these rules:

  • No more than one sample is taken from each repository.
  • Each sample is at least 500 bytes and at most 100 kB in size.
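
The two rules above can be sketched as a small filter. This is only an illustration, not the code used by the repository's tools: the `Candidate` record, its field names, and `select_samples` are hypothetical, and the byte limits assume that "500b" means 500 bytes and "100kb" means 100,000 bytes.

```python
from typing import Iterable, List, NamedTuple

# Assumed interpretation of the size limits (in bytes).
MIN_SIZE = 500
MAX_SIZE = 100 * 1000


class Candidate(NamedTuple):
    """Hypothetical record describing a file that might become a sample."""
    repo: str   # repository identifier, e.g. "owner/name"
    path: str   # file path inside the repository
    size: int   # file size in bytes


def select_samples(candidates: Iterable[Candidate]) -> List[Candidate]:
    """Keep at most one in-range candidate per repository, in input order."""
    seen_repos = set()
    selected = []
    for candidate in candidates:
        # Rule 2: the sample size must fall within the allowed range.
        if not (MIN_SIZE <= candidate.size <= MAX_SIZE):
            continue
        # Rule 1: only the first acceptable file from a repository is kept.
        if candidate.repo in seen_repos:
            continue
        seen_repos.add(candidate.repo)
        selected.append(candidate)
    return selected
```

Keeping the first acceptable file per repository is an arbitrary choice made for this sketch; the rules themselves only require that no repository contributes more than one sample.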

Dataset

The dataset is stored in the data directory. It contains:

  • meta.yml: metadata about the dataset and available languages.
  • dataset.yml: collection of all samples, with sample paths given relative to data.
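
Because sample paths are stored relative to data, a consumer has to join them with the location of that directory before reading a file. A minimal sketch of that step follows, assuming nothing about the layout of dataset.yml itself: the function name `resolve_sample_path` and the example path are hypothetical.

```python
from pathlib import Path


def resolve_sample_path(data_dir: str, relative_path: str) -> Path:
    """Turn a path relative to the data directory into an absolute path.

    Raises ValueError if the result would fall outside the data directory.
    """
    base = Path(data_dir).resolve()
    full = (base / relative_path).resolve()
    # Guard against entries such as "../other" that escape the data directory.
    if full != base and base not in full.parents:
        raise ValueError(f"{relative_path!r} is outside {base}")
    return full


# Hypothetical relative path; real entries come from dataset.yml.
print(resolve_sample_path("data", "samples/Python/1.py"))
```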

See REPORT.md for a summary of the dataset.

Contributing

See CONTRIBUTING.md.

Tooling

The tools directory contains various Python utilities to maintain the dataset:

  • tools/gen_meta.py: Generates data/meta.yml. This is only needed when upgrading to a new github/linguist or acmeism/RosettaCodeData version.
  • tools/harvest.py: Fetches samples from GitHub.
  • tools/vote.py: Updates the vote annotation.
  • tools/lint.py: Checks the dataset for potential problems.
  • tools/prepare_commit.py: Updates generated files, required before any commit.
  • tools/classify_linguist.py: Updates linguist labels.
  • tools/classify_pygments.py: Updates pygments labels.

To run the tools, first create the virtual environment:

pip install poetry
poetry install

Then run a tool with python -m, for example:

poetry run python -m tools.gen_meta

License

Each sample in data has its own license. Check the repository the sample came from for details.

Everything else is licensed under the MIT License.


Languages

Python 97.3%, Ruby 2.7%