- Get a machine with lots of CPUs and memory. We use an n1-standard-96 Ubuntu 20.04 LTS machine on GCP. Add terabytes of disk space too.
- Install cargo (the Rust package manager) with
    curl https://sh.rustup.rs -sSf | sh
  Then install Ungoliant with
    cargo install ungoliant@1.2.3
  You may need to install gcc and cmake first.
- Set up a Python 3.9 environment, and run
    pip install -r requirements.txt
- Run
    huggingface-cli login
  (it should have been installed from requirements.txt), then paste an access token with write access from your account at https://huggingface.co. This is necessary because the pipeline pushes the finalized datasets to your Hugging Face account.
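Before moving on to the pipeline scripts, it can help to confirm that the tools from the steps above are actually on your PATH. The snippet below is a minimal sketch and not part of this repository: the function name missing_tools is hypothetical, and the list of tools to check is an assumption drawn from the setup steps (including the assumption that the Ungoliant binary is named ungoliant).

```shell
#!/bin/sh
# Hypothetical preflight helper (not part of the repository).
# Prints the name of every command given as an argument that cannot be
# found on PATH, one per line. Prints nothing if all of them are found.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
}
```

With the function defined in your shell, a check along these lines prints only the tools that still need installing, and prints nothing when everything is in place:

    missing_tools cargo ungoliant gcc cmake python3 pip huggingface-cli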
Follow the instructions at olm_pipeline_scripts/common_crawl.
Follow the instructions at olm_pipeline_scripts/wikipedia.