catalogue_data

Scripts to prepare catalogue data.

Setup

Clone this repo.

Install git-lfs: https://github.com/git-lfs/git-lfs/wiki/Installation

sudo apt-get install git-lfs
git lfs install

Install system dependencies:

sudo apt-add-repository non-free
sudo apt-get update
sudo apt-get install unrar

Create a virtual environment, activate it, and install the Python dependencies:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Create a User Access Token (with write access) on the Hugging Face Hub: https://huggingface.co/settings/token and set the following environment variables in the .env file at the root directory of this repo:

HF_USERNAME=<Replace with your Hugging Face username>
HF_USER_ACCESS_TOKEN=<Replace with your Hugging Face API token>
GIT_USER=<Replace with your Git user>
GIT_EMAIL=<Replace with your Git email>
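
Before running the scripts, it can help to confirm that the .env file defines all four variables. The following is a minimal sketch using only the standard library; the helper names (parse_env_text, missing_keys, check_env_file) are hypothetical and not part of this repo, and the parser assumes simple KEY=VALUE lines with optional quotes.

```python
from pathlib import Path
from typing import Dict, List

# The four variables listed above.
REQUIRED_KEYS = ("HF_USERNAME", "HF_USER_ACCESS_TOKEN", "GIT_USER", "GIT_EMAIL")


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: Dict[str, str]) -> List[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


def check_env_file(path: str = ".env") -> List[str]:
    """Read a .env file and report which required keys still need a value."""
    return missing_keys(parse_env_text(Path(path).read_text(encoding="utf-8")))
```

Calling check_env_file() from the root directory returns an empty list when every variable has a value.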

Create metadata

To create dataset metadata (in file dataset_infos.json) run:

python create_metadata.py --repo <repo_id>

where you should replace <repo_id> with the ID of the dataset repository, e.g. bigscience-catalogue-lm-data/lm_ca_viquiquad

Aggregate datasets

To create an aggregated dataset from multiple datasets, and save it as sharded JSON Lines GZIP files, run:

python aggregate_datasets.py --dataset_ratios_path <path_to_file_with_dataset_ratios> --save_path <dir_path_to_save_aggregated_dataset>

where you should replace:

<path_to_file_with_dataset_ratios>: path to a JSON file containing a dict that maps dataset names (keys) to their ratios (values), each between 0 and 1.
<dir_path_to_save_aggregated_dataset>: path to the directory where the aggregated dataset is saved.
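
As an illustration of the ratios file described above, here is a minimal sketch that validates a dict of ratios and writes it as JSON. The helper names are hypothetical, the second dataset name is invented for the example, and two points are assumptions rather than confirmed behaviour of aggregate_datasets.py: that the keys are Hub repository IDs, and that the bounds 0 and 1 are inclusive.

```python
import json
from pathlib import Path
from typing import Dict

# Example mapping of dataset name -> ratio.
example_ratios = {
    "bigscience-catalogue-lm-data/lm_ca_viquiquad": 1.0,
    "bigscience-catalogue-lm-data/hypothetical_dataset": 0.5,  # hypothetical name
}


def validate_ratios(ratios: Dict[str, float]) -> None:
    """Raise ValueError if any ratio falls outside [0, 1] (inclusive bounds assumed)."""
    for name, ratio in ratios.items():
        if not 0 <= ratio <= 1:
            raise ValueError(f"Ratio for {name!r} must be between 0 and 1, got {ratio}")


def ratios_to_json(ratios: Dict[str, float]) -> str:
    """Validate the ratios and serialise them as a JSON object."""
    validate_ratios(ratios)
    return json.dumps(ratios, indent=2)


def write_dataset_ratios(ratios: Dict[str, float], path: str) -> None:
    """Write the validated ratios to a JSON file."""
    Path(path).write_text(ratios_to_json(ratios), encoding="utf-8")
```

Calling write_dataset_ratios(example_ratios, "dataset_ratios.json") produces a file whose path can then be passed as --dataset_ratios_path.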

Downloads for cleaning

Stanza

import stanza

for lang in {"ar", "ca", "eu", "id", "vi", "zh-hans", "zh-hant"}:
    stanza.download(lang, logging_level="WARNING")

Indic NLP library

git clone https://github.com/anoopkunchukuttan/indic_nlp_resources.git
export INDIC_RESOURCES_PATH=<PATH_TO_REPO>
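
A quick way to confirm that the variable is set and points at the cloned resources is sketched below. The function name resolve_indic_resources is hypothetical and not part of this repo; the sketch only checks that the path exists as a directory and makes no assumption about how the cleaning code reads it.

```python
import os
from pathlib import Path
from typing import Mapping


def resolve_indic_resources(env: Mapping[str, str] = os.environ) -> Path:
    """Return the directory named by INDIC_RESOURCES_PATH, or raise if it is unusable."""
    value = env.get("INDIC_RESOURCES_PATH")
    if not value:
        raise RuntimeError("INDIC_RESOURCES_PATH is not set")
    path = Path(value).expanduser()
    if not path.is_dir():
        raise RuntimeError(f"INDIC_RESOURCES_PATH does not point to a directory: {path}")
    return path
```

Calling resolve_indic_resources() in the same shell session where the variable was exported returns the path to the clone.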

NLTK

import nltk

nltk.download("punkt")
