PlainTextWikipedia

Convert Wikipedia database dumps into plain text files (JSON). It can parse essentially all of Wikipedia with high fidelity. A copy of the output is available on Kaggle Datasets.

Instructions

  1. Download all the .bz2 files from a dump: https://dumps.wikimedia.org/enwiki/ The filenames look like enwiki-20201120-pages-articles-multistream1.xml-p1p41242.bz2
  2. Unzip all the .bz2 files into another directory, such as WikipediaArchive (a sketch for this step follows the list)
  3. Install the dependencies in requirements.txt (pip install -r requirements.txt)
  4. Update the source and destination directory variables in jsonify_wikipedia.py
  5. Run jsonify_wikipedia.py
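
If you prefer to script step 2, here is a minimal Python sketch (the directory names are illustrative; point them wherever you downloaded the dumps):

import bz2
import os
import shutil

SOURCE_DIR = "dumps"           # wherever the downloaded .bz2 files live (illustrative)
DEST_DIR = "WikipediaArchive"  # where the unzipped XML should land

os.makedirs(DEST_DIR, exist_ok=True)
for name in os.listdir(SOURCE_DIR):
    if not name.endswith(".bz2"):
        continue
    src = os.path.join(SOURCE_DIR, name)
    dst = os.path.join(DEST_DIR, name[:-len(".bz2")])
    # Stream the decompression so multi-gigabyte dumps never sit fully in memory
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)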

This will deposit ~40MB JSON files into the destination folder. Each filename is effectively guaranteed to be unique, as it is generated from a UUIDv4.
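
The naming scheme amounts to something like the following sketch (the helper name and batch handling are illustrative, not lifted from jsonify_wikipedia.py):

import json
import os
from uuid import uuid4

def save_batch(articles, dest_dir):
    # Write one batch of parsed articles to a uniquely named JSON file
    path = os.path.join(dest_dir, str(uuid4()) + ".json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(articles, f, ensure_ascii=False, indent=1)
    return path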

File Schema

Each file contains a JSON array of objects. Each object has exactly three keys: id, title, and text. The id field is the Wikipedia article ID; title and text are the page title and the plain-text parsed article, respectively. An example follows.

[
 {
  "id": "17279752",
  "text": "Hawthorne Road was a cricket and football ground in Bootle in England...",
  "title": "Hawthorne Road"
 }
]
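
Reading the output back is plain JSON handling; a minimal consumer might look like this (the directory name is whatever you chose as the destination):

import json
import os

DEST_DIR = "jsonfiles"  # your chosen destination directory

for name in sorted(os.listdir(DEST_DIR)):
    if not name.endswith(".json"):
        continue
    with open(os.path.join(DEST_DIR, name), encoding="utf-8") as f:
        articles = json.load(f)  # a list of {"id", "title", "text"} dicts
    for article in articles:
        print(article["id"], article["title"])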

Legal

https://en.wikipedia.org/wiki/Wikipedia:Reusing_Wikipedia_content

Wikipedia is published under the Creative Commons Attribution-ShareAlike license (CC BY-SA).

My script is published under the MIT license, but this does not confer the same privileges to the material you convert with it.

Future Improvements

  1. Maintain some article structure in the JSON output
  2. Better removal of wiki markup
  3. Better retention of link and image context
