
dumpster-dip

wikipedia dump parser
by Spencer Kelly, Devrim Yasar, and others

parses a wikipedia xml dump into tiny json files,
so you can get at a bunch of easy data.

👍 〰〰〰〰〰〰〰〰 👍

dumpster-dip is a script that allows you to parse a wikipedia dump into ad-hoc data.

dumpster-dive is a script that puts it into mongodb, instead.

use whatever you prefer!

1. Download a dump
cruise the wikipedia dump page at https://dumps.wikimedia.org and look for ${LANG}wiki-latest-pages-articles.xml.bz2 - for english, that's enwiki-latest-pages-articles.xml.bz2

2. Unzip the dump

bzip2 -d ./path/to/enwiki-latest-pages-articles.xml.bz2

3. Start the javascript

npm install dumpster-dip

import dip from 'dumpster-dip'

const opts = {
  input: '/path/to/my-wikipedia-article-dump.xml',
  parse: function(doc) {
    return doc.sentences()[0].text() // return the first sentence of each page
  }
}
// this promise takes ~4hrs
dip(opts).then(() => {
  console.log('done!')
})

en-wikipedia takes about 4hrs on a macbook.


This tool is intended to be a clean way to pull random bits out of wikipedia, like:

'all the birthdays of basketball players'

await dip({
  input: '/path/to/dump.xml',
  doPage: function(doc){ return doc.categories().find(cat => cat === `American men's basketball players`) },
  parse: function(doc){ return doc.infobox() ? doc.infobox().get('birth_date') : null }
})

pages where doPage returns falsey are skipped, so parse only runs on the matching articles.

It uses wtf_wikipedia as the wikiscript parser.

Outputs:

By default, it outputs an individual file for every wikipedia article. Sometimes operating systems don't like having ~6m files in one folder, though - so it nests them 2-deep, using the first 4 characters of the filename's hash:

/BE
  /EF
    /Dennis_Rodman.txt
    /Hillary_Clinton.txt

as a helper, this library exposes a function for navigating this directory scheme:

import getPath from 'dumpster-dip/nested-path'
let file = getPath('Dennis Rodman')
// ./BE/EF/Dennis_Rodman.txt

This is the same scheme that wikipedia uses internally.
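
once a dump has been parsed, you can combine this helper with your output directory to read a result back. A minimal sketch, assuming results were written to the default ./results folder:

import fs from 'fs'
import path from 'path'
import getPath from 'dumpster-dip/nested-path'

// resolves './BE/EF/Dennis_Rodman.txt' under the output directory
let file = path.join('./results', getPath('Dennis Rodman'))
let result = fs.readFileSync(file, 'utf8')
console.log(result)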

to put files in folders indexed by their first letter, do:

let opts = {
  outputDir: './results', 
  outputMode: 'encyclopedia', 
}

this is less ideal, because some directories become way larger than others. Also remember that titles are UTF-8.

For two-letter folders, use outputMode: 'encyclopedia-two'
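
in 'encyclopedia' mode, the layout looks something like this (article names are illustrative):

/results
  /A
    /Abraham_Lincoln.txt
  /B
    /Basketball.txt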

Flat results:

if you want all files in one flat directory, you can do:

let opts = {
  outputDir: './results', 
  outputMode: 'flat', 
}
Results in one file:

if you want all results in one file, you can do:

let opts = {
  outputDir: './results', 
  outputMode: 'ndjson', 
}
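
ndjson output is one JSON result per line, so you can stream it back later without loading the whole file into memory. A minimal sketch using node's readline - the filename here is an assumption, check your results folder for the real one:

import fs from 'fs'
import readline from 'readline'

// read the ndjson results one line at a time
// (the filename is an assumption - look in ./results for the actual file)
const rl = readline.createInterface({ input: fs.createReadStream('./results/output.ndjson') })
rl.on('line', (line) => {
  let result = JSON.parse(line)
  console.log(result)
})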

Options

let opts = {
  // directory for all our new files
  outputDir: './results', // (default)
  // how we should write the results
  outputMode: 'nested', // (default)


  // which wikipedia namespaces to handle (null will do all)
  namespace: 0, // (default) article namespace
  // define how many concurrent workers to run
  workers: cpuCount, // default is cpu count
  // interval to log status
  heartbeat: 5000, // every 5 seconds
  
  // parse redirects, too
  redirects: false, // (default)
  // parse disambiguation pages, too
  disambiguation: true, // (default)

  // allow a custom wtf_wikipedia parsing library
  libPath: 'wtf_wikipedia', // (default)

  // filter callback - return falsey to skip a page
  doPage: function(doc){ return true }, // (default)

  // what to return, for every page
  parse: function(doc){ return doc.json() }, // (default) - avoid using an arrow-function

}
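
putting a few of these together - a sketch with illustrative values (the dump path and option choices are just examples):

import dip from 'dumpster-dip'

dip({
  input: '/path/to/dump.xml',
  outputDir: './results',
  outputMode: 'ndjson',
  workers: 4,
  // keep a small json summary of each article
  parse: function(doc) {
    return { title: doc.title(), categories: doc.categories() }
  }
}).then(() => console.log('done!'))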


Customization

Given the parse callback, you're free to return anything you'd like. Sometimes though, you may want to parse pages with a custom version of the wtf_wikipedia parser - if you need any extra plugins or functionality.

Here we apply a custom plugin to our wtf lib, and pass it in, so it's available to each worker:

in ./myLib.js

import wtf from 'wtf_wikipedia'

// add custom analysis as a plugin
wtf.extend((models, templates)=>{
  // add a new method
  models.Doc.prototype.firstSentence = function(){
    return this.sentences()[0].text()
  }
  // support a missing template
  templates.pingponggame = function(tmpl, list){
    let arr = tmpl.split('|')
    return arr[1] + ' to ' + arr[2]
  }
})
export default wtf

then we can pass this version into dumpster-dip:

import dip from 'dumpster-dip'

dip({
  input: '/path/to/dump.xml',
  libPath: './myLib.js', // our version
  parse: function(doc) {
    return doc.firstSentence() // use custom method
  }
})

MIT
