dumpster-dip is a script that allows you to parse a wikipedia dump into ad-hoc data.
dumpster-dive is a sister script that loads the dump into mongodb, instead.
use whichever you prefer!
1. Download a dump
cruise the wikipedia dump page (dumps.wikimedia.org) and look for ${LANG}wiki-latest-pages-articles.xml.bz2
2. Unzip it
bzip2 -d ./path/to/enwiki-latest-pages-articles.xml.bz2
3. Install dumpster-dip
npm install dumpster-dip
4. Write a parse script
import dip from 'dumpster-dip'
const opts = {
input: '/path/to/my-wikipedia-article-dump.xml',
parse: function(doc) {
return doc.sentences()[0].text() // return the first sentence of each page
}
}
// this promise takes ~4hrs
dip(opts).then(() => {
console.log('done!')
})
en-wikipedia takes about 4hrs on a macbook.
This tool is intended to be a clean way to pull random bits out of wikipedia, like:
'all the birthdays of basketball players'
await dip({
  input: '/path/to/dump.xml', // path to your unzipped dump
  doPage: function (doc) { return doc.categories().find((cat) => cat === `American men's basketball players`) },
  parse: function (doc) { return doc.infobox().get('birth_date') },
})
Under the hood, it uses wtf_wikipedia as the wikitext parser.
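The doc object given to your callbacks is a wtf_wikipedia Document. As a rough sketch of what that object can do (these are standard wtf_wikipedia methods, shown here outside of dumpster-dip):
import wtf from 'wtf_wikipedia'
let doc = wtf(`[[Dennis Rodman]] was a [[basketball]] player.`)
doc.text() // the plain-text rendering of the wikitext
doc.links().length // 2
doc.categories() // [] - this snippet has no categories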
By default, it outputs an individual file for every wikipedia article. Sometimes operating systems don't like having ~6m files in one folder, though - so it nests them 2-deep, using the first 4 characters of the filename's hash:
/BE
/EF
/Dennis_Rodman.txt
/Hillary_Clinton.txt
as a helper, this library exposes a function for navigating this directory scheme:
import getPath from 'dumpster-dip/nested-path'
let file = getPath('Dennis Rodman')
// ./BE/EF/Dennis_Rodman.txt
This is the same scheme that wikipedia uses internally.
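so, to read a result back, join your outputDir with the helper's path. A quick sketch, assuming the default 'nested' output mode and that your parse callback returned plain text:
import fs from 'fs'
import path from 'path'
import getPath from 'dumpster-dip/nested-path'
let file = path.join('./results', getPath('Dennis Rodman'))
let text = fs.readFileSync(file, 'utf8')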
to put files in folders indexed by their first letter, do:
let opts = {
outputDir: './results',
outputMode: 'encyclopedia',
}
this is less ideal, because some directories end up much larger than others. Also remember that titles are UTF-8, so folder names won't always be [A-Z].
For two-letter folders, use outputMode: 'encyclopedia-two'
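following the same shape as the other modes:
let opts = {
  outputDir: './results',
  outputMode: 'encyclopedia-two',
}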
if you want all files in one flat directory, you can do:
let opts = {
outputDir: './results',
outputMode: 'flat',
}
if you want all results in one file, you can do:
let opts = {
outputDir: './results',
outputMode: 'ndjson',
}
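ndjson is just one JSON record per line, so the results are easy to stream back in afterwards. A minimal sketch, assuming your results landed in a single .ndjson file (check outputDir for the actual filename):
import fs from 'fs'
import readline from 'readline'
const lines = readline.createInterface({ input: fs.createReadStream('./results/dump.ndjson') }) // hypothetical filename
lines.on('line', (line) => {
  let record = JSON.parse(line)
  // do something with each page's record
})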
the full list of options:
let opts = {
// directory for all our new files
outputDir: './results', // (default)
// how we should write the results
outputMode: 'nested', // (default)
// which wikipedia namespaces to handle (null will do all)
namespace: 0, //(default article namespace)
// define how many concurrent workers to run
workers: cpuCount, // default is cpu count
//interval to log status
heartbeat: 5000, //every 5 seconds
// parse redirects, too
redirects: false, // (default)
// parse disambiguation pages, too
disambiguation: true, // (default)
// allow a custom wtf_wikipedia parsing library
libPath: 'wtf_wikipedia', // (default)
// should we process this page? return false/null to skip it
doPage: function(doc){ return true}, // (default)
// what to return, for every page
parse: function(doc){return doc.json()}, // (default) - avoid using an arrow-function
}
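putting a few of these together - a sketch with illustrative values, not a prescription:
import os from 'os'
import dip from 'dumpster-dip'
await dip({
  input: '/path/to/enwiki-latest-pages-articles.xml',
  outputDir: './results',
  outputMode: 'ndjson',
  namespace: 0,
  workers: os.cpus().length, // same as the default
  heartbeat: 10000, // log progress every 10 seconds
  parse: function (doc) {
    return doc.json()
  },
})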
Given the parse callback, you're free to return anything you'd like.
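for example, a parse callback can return a plain object per page (doc.title() and doc.categories() are regular wtf_wikipedia methods):
parse: function (doc) {
  let first = doc.sentences()[0]
  return {
    title: doc.title(),
    categories: doc.categories(),
    firstSentence: first ? first.text() : '',
  }
},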
Sometimes though, you may want to parse a page with a custom version of wtf_wikipedia - if you need any extra plugins or functionality.
Here we apply a custom plugin to our wtf lib, and pass it in so it's available to each worker:
in ./myLib.js:
import wtf from 'wtf_wikipedia'
// add custom analysis as a plugin
wtf.extend((models, templates)=>{
// add a new method
models.Doc.prototype.firstSentence = function(){
return this.sentences()[0].text()
}
// support a missing plugin
templates.pingponggame = function(tmpl, list){
let arr = tmpl.split('|')
return arr[1] + ' to '+ arr[2]
}
})
export default wtf
then we can pass this version into dumpster-dip:
import dip from 'dumpster-dip'
dip({
input: '/path/to/dump.xml',
libPath:'./myLib.js', // our version
parse: function(doc) {
return doc.firstSentence() // use custom method
}
})
MIT