fattorib / wtf-wikipedia-python

raw wikipedia XML to LM_Dataformat in under 4 hours

Wikipedia to LM_Dataformat

Python script for creating an LM_Dataformat dataset from a MongoDB database built with wtf_wikipedia and dumpster-dive.

Instructions

(Steps 1 and 2 below are copied from the dumpster-dive README)

  1. Install Node.js (at least v6) and MongoDB (at least v3)

  2. Install dumpster-dive:

# install this script
npm install -g dumpster-dive # (that gives you the global command `dumpster`)
# start mongo up
mongod --config /mypath/to/mongod.conf
  3. Install the Python requirements:
pip install -r requirements
  4. Download and extract your copy of the Wikipedia XML dump. Make sure you have plenty of extra disk space when you do this: for the 20221006 dump, the uncompressed XML file is ~91GB!

  5. Load the XML into MongoDB:

dumpster ./enwiki-latest-pages-articles.xml --plaintext=true --infoboxes=false --citations=false --categories=false --links=false

For our extract, we skip the following content as it usually contains little to no actual text:

  • infoboxes
  • redirects
  • disambiguations
  • citations
  • links

On a modern desktop CPU this process takes around 90 minutes.
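
Before moving on, it can be worth sanity-checking the load from Python. A minimal check with pymongo might look like the sketch below; the database name (enwiki) and collection name (pages) are assumed from dumpster-dive's defaults and may differ on your setup.

# Hypothetical sanity check that the dumpster-dive load succeeded.
# The database name ("enwiki") and collection name ("pages") are assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["enwiki"]

print(db.list_collection_names())              # expect something like ["pages"]
print(db["pages"].estimated_document_count())  # rough number of loaded articles
print(db["pages"].find_one({}, {"title": 1}))  # peek at a single document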

  6. Stream the data from MongoDB to LM_Dataformat:
python stream_db.py

This process takes around 25 minutes.
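
Under the hood, this step is essentially a loop over the Mongo collection that writes each article's plaintext into an lm_dataformat archive. The sketch below is a simplified, hypothetical outline rather than the actual stream_db.py: the database/collection names and the title/text field names are assumptions, and the real script also applies the content filtering described in the next section.

# Hypothetical outline of the Mongo -> lm_dataformat streaming step.
# Database/collection names and the "title"/"text" fields are assumptions.
from pymongo import MongoClient
from lm_dataformat import Archive

client = MongoClient("mongodb://localhost:27017")
pages = client["enwiki"]["pages"]

archive = Archive("wikipedia_lmd")  # output directory for the jsonl.zst shards

for doc in pages.find({}, {"title": 1, "text": 1}, no_cursor_timeout=True):
    text = doc.get("text")
    if not text:
        continue
    archive.add_data(text, meta={"title": doc.get("title")})

archive.commit()  # flush the final shard to disk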

Content Filtration Used

A quick summary of the content filtering used:

Wherever possible, article filtering follows the methodology of Wiki-40B: Multilingual Language Model Dataset by Guo et al.:

  • Sections like 'References', 'See Also', and 'Further Reading' are excluded from the dataset.
  • Lists, Links, Images, Captions and Tables are excluded from the dataset.
  • Disambiguation Pages and Redirect Pages are excluded from the dataset.
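
As a rough illustration, the section-level exclusion amounts to dropping any section whose heading appears on a blacklist before the remaining text is concatenated. The snippet below is a simplified sketch, not the repository's actual code; the blacklist contents and the section layout (title/text keys) are assumptions.

# Simplified sketch of Wiki-40B-style section filtering (not the repo's exact code).
# The blacklist and the {"title": ..., "text": ...} section layout are assumptions.
EXCLUDED_SECTIONS = {
    "references", "see also", "further reading",
    "external links", "bibliography", "notes", "sources",
}

def keep_section(section):
    """Return True if a section's text should be kept in the dataset."""
    return section.get("title", "").strip().lower() not in EXCLUDED_SECTIONS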

There is also some additional filtering to remove content that wtf_wikipedia does not filter out:

  • As a proxy for removing the non-entity sections described in Guo et al., the majority of articles with titles starting with 'List of' are skipped. Most of these articles are lists providing almost no text content (ex: this and this). To catch cases where an article starts with 'List of' but still contains a significant amount of high-quality text (ex: this and this), all 'List of' titles are filtered against Wikipedia's List of Featured Lists, as these articles contain more content than just the list values themselves. See here for the full criteria of a Featured List.

  • Sections containing five or fewer words of text content are skipped, as these sections are usually lists with a short text preamble.

  • At the article level, formatting follows PileV1 Wikipedia. Titles, sections and paragraphs are joined together with \n\n.
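
Put together, the article-level logic is roughly the sketch below: skip most 'List of' titles unless they appear in a set built from Wikipedia's Featured Lists, drop near-empty sections, and join the surviving pieces with \n\n. The function and parameter names here are illustrative assumptions, not the repository's exact implementation.

# Illustrative article-level filtering and formatting (an assumption-laden sketch,
# not the repo's exact code).
def format_article(title, sections, featured_list_titles):
    """Return the formatted article text, or None if the article is skipped."""
    # Skip most "List of ..." pages unless they are Featured Lists.
    if title.startswith("List of") and title not in featured_list_titles:
        return None

    kept = []
    for section in sections:
        text = section.get("text", "").strip()
        # Drop near-empty sections (five words or fewer), usually a list preamble.
        if len(text.split()) <= 5:
            continue
        kept.append(text)

    if not kept:
        return None

    # PileV1-style formatting: title and sections joined with blank lines.
    return "\n\n".join([title] + kept)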

Pile V2 Stats:

  • Wikipedia Dump Date: 2022-10-06
  • Number of Included Articles: 6,100,633
  • Archive Size (Jsonlines): 15.8 GB
