dumpster-dive

wikipedia dump parser
by Spencer Kelly, Devrim Yasar, and others

gets a wikipedia xml dump into mongo,
so you can mess around.

💂 Yup 💂

do it on your laptop.

dumpster-dive is a nodejs script that puts a highly-queryable wikipedia on your computer in a nice afternoon.

It uses worker-nodes to process pages in parallel, and wtf_wikipedia to turn wikiscript into whatever json.

-- en-wikipedia takes about 5-hours, end-to-end --
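
To get a feel for the parsing step, here's a rough sketch of what wtf_wikipedia does to a page's wikiscript (just an illustration, not dumpster-dive's actual internals):

const wtf = require('wtf_wikipedia')

// parse some raw wikiscript into a Doc object
let doc = wtf(`[[Toronto]] is the most populous city in [[Canada]].
[[Category:Populated places established in 1793]]`)

console.log(doc.categories()) // e.g. [ 'Populated places established in 1793' ]
console.log(doc.json())       // the structured data that gets written to mongo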


npm install -g dumpster-dive

😎 API

var dumpster = require('dumpster-dive')
dumpster({ file: './enwiki-latest-pages-articles.xml', db: 'enwiki' }, () => console.log('done!'))

Command-Line:

dumpster /path/to/my-wikipedia-article-dump.xml --citations=false --html=true

then check out the articles in mongo:

$ mongo        #enter the mongo shell
use enwiki     #grab the database
db.pages.count()
# 4,926,056...
db.pages.find({title:"Toronto"})[0].categories
#[ "Former colonial capitals in Canada",
#  "Populated places established in 1793" ...]

Steps:

1️⃣ you can do this.

you can do this. just a few Gb. you can do this.

2️⃣ get ready

Install nodejs (at least v6), mongodb (at least v3)

# install this script
npm install -g dumpster-dive # (that gives you the global command `dumpster`)
# start mongo up
mongod --config /mypath/to/mongod.conf

3️⃣ download a wikipedia

The Afrikaans wikipedia (around 47,000 articles) only takes a few minutes to download, and 5 mins to load into mongo on a macbook:

# download an xml dump (38mb, a couple of minutes)
wget https://dumps.wikimedia.org/afwiki/latest/afwiki-latest-pages-articles.xml.bz2

the english dump is 16Gb. The download page is confusing, but you'll want this file: ${LANG}wiki-latest-pages-articles.xml.bz2

4️⃣ unzip it

i know, this sucks. but it makes the parser so much faster.

bzip2 -d ~/path/afwiki-latest-pages-articles.xml.bz2

On a macbook, unzipping en-wikipedia takes an hour or so. This is the most-boring part. Eat some lunch.

The english wikipedia is around 60Gb.

5️⃣ OK, start it off

#load it into mongo (10-15 minutes)
dumpster ./afwiki-latest-pages-articles.xml
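
If you'd rather kick it off from node instead of the command-line, the same run looks roughly like this with the api shown above (the db name 'afwiki' is just an example):

const dumpster = require('dumpster-dive')

dumpster({
  file: './afwiki-latest-pages-articles.xml',
  db: 'afwiki'
}, () => console.log('done loading afwiki!'))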

6️⃣ take a bath

just put some epsom salts in there, it feels great.

The en-wiki dump should take a few hours. Maybe 8. Should be done before dinner.

The console will update you every couple seconds to let you know where it's at.

7️⃣ done!


go check out the data! to view your data in the mongo console:

$ mongo
use afwiki //your db name

//show a couple of pages
db.pages.find().skip(200).limit(2)

//find a specific page
db.pages.findOne({title:"Toronto"}).categories

//find the last page
db.pages.find().sort({$natural:-1}).limit(1)

// all the governors of Kentucky
db.pages.count({ categories: { $eq: "Governors of Kentucky" } })

//pages without images
db.pages.count({ images: {$size: 0} })

alternatively, you can run dumpster-report afwiki to see a quick spot-check of the records it has created across the database.
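
If you'd rather poke at the data from node instead of the mongo shell, here's a minimal sketch using the official mongodb driver (it assumes mongo is running on localhost and you loaded the 'afwiki' db):

const { MongoClient } = require('mongodb')

async function peek() {
  const client = await MongoClient.connect('mongodb://localhost:27017')
  const pages = client.db('afwiki').collection('pages')

  console.log(await pages.countDocuments())               // how many articles made it in
  const page = await pages.findOne({ title: 'Toronto' })  // any title you know is in this wiki
  console.log(page ? page.categories : 'not found')
  await client.close()
}
peek()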

Same for the English wikipedia:

the english wikipedia will work under the same process, but the download will take an afternoon, and the loading/parsing a couple of hours. The en-wikipedia dump is about 13 GB compressed (for enwiki-20170901-pages-articles.xml.bz2), and becomes a pretty legit mongo collection once it's loaded. It's something like 51GB, but mongo can do it 💪.

Options:

dumpster follows all the conventions of wtf_wikipedia, and you can pass in any fields for it to include in its json.

  • human-readable plaintext --plaintext
dumpster({file:'./myfile.xml.bz2', db: 'enwiki', plaintext:true, categories:false})
/*
[{
  _id:'Toronto',
  title:'Toronto',
  plaintext:'Toronto is the most populous city in Canada and the provincial capital...'
}]
*/
  • disambiguation pages / redirects --skip_disambig, --skip_redirects: by default, dumpster skips entries in the dump that aren't full-on articles. you can keep them in like this:
let obj = {
	file: './path/enwiki-latest-pages-articles.xml.bz2',
	db: 'enwiki',
	skip_redirects: false,
	skip_disambig: false
}
dumpster(obj, () => console.log('done!') )
  • reducing file-size: you can tell wtf_wikipedia what you want it to parse, and which data you don't need:
dumpster ./my-wiki-dump.xml --infoboxes=false --citations=false --categories=false --links=false
  • custom json formatting you can grab whatever data you want, by passing-in a custom function. It takes a wtf_wikipedia Doc object, and you can return your cool data:
let obj={
	file: path,
	db: dbName,
	custom: function(doc) {
		return {
			_id: doc.title(),   //for duplicate-detection
			title: doc.title(), //for the logger..
			sections: doc.sections().map(i => i.json()),
			categories: doc.categories() //whatever you want!
		}
	}
}
dumpster(obj, () => console.log('custom wikipedia!') )
  • non-main namespaces: do you want to parse all the navboxes? change namespace in ./config.js to another number
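
These options can be mixed and matched. Here's a sketch that combines a few of the flags above into one js call (assuming they compose the way you'd expect; the file and db names are just examples):

const dumpster = require('dumpster-dive')

// keep plaintext, skip the heavier fields, and keep redirect pages too
dumpster({
  file: './my-wiki-dump.xml',
  db: 'mywiki',
  plaintext: true,
  infoboxes: false,
  citations: false,
  links: false,
  skip_redirects: false
}, () => console.log('done!'))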

how it works:

this library uses worker-nodes to process pages in parallel, and wtf_wikipedia to turn wikiscript into json.

Addendum:

_ids

since wikimedia makes all pages have globally unique titles, we also use them for the mongo _id fields. The benefit is that if the process crashes half-way through, or you want to run it again, re-running the script will not multiply your data; we do an 'upsert' on each record.
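
The effect is roughly this (a sketch of the idea in the mongo shell, not dumpster-dive's actual code):

// same _id on every run, so the page gets overwritten instead of duplicated
db.pages.updateOne(
  { _id: 'Toronto' },
  { $set: { title: 'Toronto', categories: [ 'Former colonial capitals in Canada' ] } },
  { upsert: true }
)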

encoding special characters

mongo has some opinions on special-characters in some of its data. It is weird, but we're using this standard(ish) form of encoding them:

\  -->  \\
$  -->  \u0024
.  -->  \u002e
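
In plain javascript, that encoding looks roughly like this (the helper name is just for illustration):

// escape the characters mongo is picky about, per the table above
function encodeStr(str) {
  return str
    .replace(/\\/g, '\\\\')
    .replace(/\$/g, '\\u0024')
    .replace(/\./g, '\\u002e')
}

encodeStr('some.field') // 'some\u002efield' (with a literal backslash)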

Non-wikipedias

This library should also work on other wikis with standard xml dumps from MediaWiki. I haven't tested them, but wtf_wikipedia supports all sorts of non-standard wiktionary/wikivoyage templates, and if you can get a bz-compressed xml dump from your wiki, this should work fine. Open an issue if you find something weird.

PRs welcome!

This is an important project, come help us out.
