EleutherAI / stackexchange-dataset

Python tools for processing the stackexchange data dumps into a text dataset for Language Models

stackexchange_dataset

A Python tool for downloading & processing the Stack Exchange data dumps into a text dataset for language models.

Download the whole processed dataset here

Setup

git clone https://github.com/EleutherAI/stackexchange_dataset/
cd stackexchange_dataset
pip install -r requirements.txt

Usage

To download every Stack Exchange dump and parse it to text, run:

python3 main.py --names all

To download only a single Stack Exchange site, pass its name to --names. For example:

python3 main.py --names security.stackexchange

To download several Stack Exchange sites, pass their names separated by commas. For example:

python3 main.py --names ru.stackoverflow,money.stackexchange

The name should be the URL of the Stack Exchange site, minus http(s):// and .com (for example, security.stackexchange or ru.stackoverflow). You can view all available Stack Exchange dumps here.
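The naming rule above can be sketched as a small helper. This is only an illustration: `site_name_from_url` is a hypothetical function, not part of this repository, and it assumes the site's host ends in .com.

```python
from urllib.parse import urlparse


def site_name_from_url(url: str) -> str:
    """Hypothetical helper: strip http(s):// and .com from a site URL."""
    # urlparse only fills in the host when a scheme is present,
    # so add one for bare inputs such as "money.stackexchange.com".
    if "://" not in url:
        url = "https://" + url
    host = urlparse(url).netloc.lower()
    # Drop the trailing ".com", as the rule above describes.
    if host.endswith(".com"):
        host = host[: -len(".com")]
    return host


print(site_name_from_url("https://security.stackexchange.com"))
print(site_name_from_url("http://ru.stackoverflow.com/"))
```

The resulting strings are what you would pass to --names.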

All Usage Options:

usage: main.py [-h] [--names NAMES]

CLI for stackexchange_dataset - A tool for downloading & processing
stackexchange dumps in xml form to a raw question-answer pair text dataset for
Language Models

optional arguments:
  -h, --help     show this help message and exit
  --names NAMES  names of stackexchanges to download, extract & parse,
                 separated by commas. If "all", will download, extract & parse
                 *every* stackoverflow site
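For readers who want to see how a comma-separated --names value could be handled, here is a minimal sketch using argparse. It is not the code in main.py: `parse_names` is a hypothetical function, and the default of "all" is an assumption made for the example.

```python
import argparse


def parse_names(argv=None):
    """Hypothetical sketch: return "all" or a list of site names."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument(
        "--names",
        default="all",  # assumption: the real default is not documented above
        help="names of stackexchanges to download, separated by commas",
    )
    args = parser.parse_args(argv)
    if args.names.strip().lower() == "all":
        return "all"
    # Split on commas and ignore empty entries or stray whitespace.
    return [name.strip() for name in args.names.split(",") if name.strip()]


print(parse_names(["--names", "ru.stackoverflow,money.stackexchange"]))
```

Passing an explicit argument list, as in the last line, makes the function easy to try without touching the real command line.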

TODO:

  • should we add metadata to the text (i.e. name of stackexchange & tags)?
  • add flags to change min_score / max_responses args.
  • add flags to turn off downloading / extraction
  • add flags to select number of workers for multiprocessing
  • output as lm dataformat
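To make the min_score / max_responses items above more concrete, the sketch below pairs questions with answers from a Posts.xml-style document. It is an illustration only, not this tool's implementation. The attribute names (Id, PostTypeId, ParentId, Score, Title, Body) follow the Stack Exchange dump format, but the exact meaning of min_score and max_responses here (minimum answer score, maximum answers kept per question), the default values, and the "Q:" / "A:" output layout are all assumptions. The sample data is made up.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up sample in the shape of a Posts.xml dump
# (PostTypeId 1 = question, 2 = answer).
SAMPLE = """<posts>
  <row Id="1" PostTypeId="1" Score="5" Title="How do I hash passwords?" Body="Which algorithm?" />
  <row Id="2" PostTypeId="2" ParentId="1" Score="7" Body="Use a slow KDF." />
  <row Id="3" PostTypeId="2" ParentId="1" Score="1" Body="Use MD5." />
  <row Id="4" PostTypeId="2" ParentId="1" Score="4" Body="Use bcrypt." />
</posts>"""


def qa_pairs(xml_text, min_score=3, max_responses=1):
    """Hypothetical sketch: build question-answer texts from Posts.xml rows."""
    root = ET.fromstring(xml_text)
    questions = {}
    answers = defaultdict(list)
    for row in root.iter("row"):
        post_type = row.get("PostTypeId")
        if post_type == "1":
            questions[row.get("Id")] = row
        elif post_type == "2":
            score = int(row.get("Score", "0"))
            # Assumption: min_score filters answers, not questions.
            if score >= min_score:
                answers[row.get("ParentId")].append((score, row.get("Body", "")))
    texts = []
    for qid, question in questions.items():
        # Keep the highest-scoring answers, at most max_responses of them.
        ranked = sorted(answers.get(qid, []), key=lambda a: a[0], reverse=True)
        best = ranked[:max_responses]
        if not best:
            continue  # skip questions with no qualifying answer
        text = "Q: " + question.get("Title", "") + "\n" + question.get("Body", "")
        for _, body in best:
            text += "\nA: " + body
        texts.append(text)
    return texts


for text in qa_pairs(SAMPLE):
    print(text)
```

Real dumps store post bodies as HTML, so a practical version would also need to strip or convert the markup, and would stream the file rather than load it whole.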

License:MIT License


Languages

Language:Python 100.0%