prescod / the-xml-document-stack

Download the portions of The Stack relevant to XML-as-document-markup

the-xml-document-stack

  1. Create a virtual environment (venv) and install the dependencies listed in requirements.txt

  2. Authenticate with Hugging Face by running huggingface-cli login

  3. Download the dataset files from Hugging Face into the local cache and symlink them into the dataset_bin directory like this:

    python download-xml-from-stack.py

    Or you can download a smaller subset like this:

    python download-xml-from-stack.py 100

  4. Extract relevant XML files into a subdirectory like this:

    python find_xml_in_the_stack.py dataset_bin/data/xml/train-00*
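
The README does not say how find_xml_in_the_stack.py decides which XML files count as document markup. Purely as an illustration, and not as a description of that script, one common heuristic is to look for mixed content: elements that contain both text and child elements, as in prose with inline markup. The function name and the 0.2 threshold below are hypothetical.

```python
import xml.etree.ElementTree as ET


def looks_like_document_markup(xml_text: str, threshold: float = 0.2) -> bool:
    """Hypothetical heuristic, NOT the repo's actual logic.

    Returns True when at least `threshold` of the elements have mixed
    content (child elements plus non-whitespace text around them).
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Malformed XML is treated as not relevant.
        return False

    total = 0
    mixed = 0
    for elem in root.iter():
        total += 1
        has_children = len(elem) > 0
        # Text before the first child.
        has_own_text = bool(elem.text and elem.text.strip())
        # Text that follows any child inside this element.
        has_tail_text = any(child.tail and child.tail.strip() for child in elem)
        if has_children and (has_own_text or has_tail_text):
            mixed += 1

    return total > 0 and mixed / total >= threshold


if __name__ == "__main__":
    print(looks_like_document_markup("<p>Hello <b>bold</b> world</p>"))
    print(looks_like_document_markup("<config><a>1</a><b>2</b></config>"))
```

Data-oriented XML such as configuration files rarely mixes text and child elements inside one element, which is why a ratio like this can separate the two styles. The threshold would need tuning against real files.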

NOTE: Deleting the symlinks in dataset_bin does not delete the downloaded files. They remain in ~/.cache/huggingface/ and take up 78 GB until you remove them.
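
To check how much space the cache is actually using, a small standard-library sketch like the following can help. It assumes the cache lives at the default location named in the note above. The helper name dir_size_bytes is made up for this example. Symlinked files are skipped so that the same data is not counted twice.

```python
import os
from pathlib import Path


def dir_size_bytes(root) -> int:
    """Sum the sizes of regular files under `root`, skipping symlinks."""
    total = 0
    # followlinks=False keeps os.walk from descending into symlinked directories.
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                # A symlink points at data stored elsewhere; do not count it.
                continue
            total += os.path.getsize(path)
    return total


if __name__ == "__main__":
    # Assumed default cache location, taken from the note above.
    cache = Path.home() / ".cache" / "huggingface"
    if cache.exists():
        print(f"{cache}: {dir_size_bytes(cache) / 1e9:.1f} GB")
    else:
        print(f"{cache} does not exist")
```

This only reports the size. Nothing is deleted, so it is safe to run before deciding whether to remove the cached files.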

License:MIT License

