- Clone the repository and its submodules
git clone --recurse-submodules -j8 git@github.com:mttk/wiki_preproc.git
- Install the requirements
pip install -r requirements.txt
- Download the spaCy model for the language you want
python -m spacy download en
- Download and store the Wikipedia dump (English dumps are available at https://dumps.wikimedia.org/enwiki/)
- For example, to fetch the latest English Wikipedia dump:
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
- Modify the paths and parameters in clean_wiki.sh, then run the script:
./clean_wiki.sh
The output of this tool is a sentence-level tokenized dump of Wikipedia: each line holds one sentence, and articles are separated by empty lines. This layout is in line with BERT input and can be used with repositories such as pytorch-pretrained-BERT.
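
To illustrate the output layout described above, here is a minimal sketch of reading such a file back into articles. It is not part of this repository: the helper name `iter_articles` and the sample text are hypothetical, and the only assumption taken from the description is that each non-empty line is one sentence and empty lines separate articles.

```python
import io
from typing import Iterable, Iterator, List


def iter_articles(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each article as a list of sentence strings.

    Assumes one sentence per line and empty lines between articles.
    """
    sentences: List[str] = []
    for raw in lines:
        line = raw.strip()
        if line:
            sentences.append(line)
        elif sentences:
            # An empty line closes the current article.
            yield sentences
            sentences = []
    if sentences:
        # The last article may not be followed by an empty line.
        yield sentences


# Hypothetical sample standing in for the tool's real output.
sample = io.StringIO(
    "First sentence of article one .\n"
    "Second sentence of article one .\n"
    "\n"
    "Only sentence of article two .\n"
)
articles = list(iter_articles(sample))
print(len(articles), "articles;", len(articles[0]), "sentences in the first")
```

For a real run, the same function can be given an open file handle instead of the in-memory sample, for example `open(path, encoding="utf-8")`, where `path` is whatever output location you configured in clean_wiki.sh.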