wiki2neo

Produce Neo4j import CSVs from Wikipedia database dumps to build a graph of links between Wikipedia pages.

Installation

$ pip install wiki2neo

Usage

Usage: wiki2neo [OPTIONS] [WIKI_XML_INFILE]

  Parse Wikipedia pages-articles-multistream.xml[.bz2] dump into two Neo4j import
  CSV files:

      Node (Page) import, headers=["title:ID", "id"]
      Relationships (Links) import, headers=[":START_ID", ":END_ID"]

  Reads from stdin by default, pass [WIKI_XML_INFILE] to read from file.

Options:
  -p, --pages-outfile FILENAME  Node (Pages) CSV output file  [default: pages.csv]
  -l, --links-outfile FILENAME  Relationships (Links) CSV output file  [default: links.csv]
  --help                        Show this message and exit.
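For illustration, the two output files take this shape (hypothetical rows; page titles act as node IDs, so the links file references pages by title):

pages.csv
    title:ID,id
    Graph database,1234567
    Neo4j,7654321

links.csv
    :START_ID,:END_ID
    Neo4j,Graph database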

Import the resulting CSVs into Neo4j:

$ neo4j-admin import \
    --nodes=:Page=pages.csv \
    --relationships=:LINKS_TO=links.csv \
    --skip-bad-relationships \
    --skip-duplicate-nodes \
    --multiline-fields
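
After the import, a quick sanity check can be run from Python. This is a minimal sketch using the official neo4j driver (pip install neo4j); the bolt URI and credentials are placeholders for your own instance:

from neo4j import GraphDatabase

# Placeholders: point these at your own Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # The :Page and :LINKS_TO labels match the ones passed to neo4j-admin import above.
    n_pages = session.run("MATCH (p:Page) RETURN count(p) AS n").single()["n"]
    n_links = session.run("MATCH (:Page)-[:LINKS_TO]->(:Page) RETURN count(*) AS n").single()["n"]
    print(f"{n_pages} pages, {n_links} links")

driver.close()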

Wikipedia distributes its dumps as bz2-compressed XML (.xml.bz2). wiki2neo parses either the compressed .bz2 file directly or an uncompressed .xml file:

# compressed
$ wiki2neo pages-articles-multistream.xml.bz2

# uncompressed
$ wiki2neo pages-articles-multistream.xml
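
# from stdin (wiki2neo reads stdin by default)
$ bzcat pages-articles-multistream.xml.bz2 | wiki2neo

The streaming parse this relies on can be illustrated with the standard library. The sketch below is not wiki2neo's actual code, just the general bz2 + iterparse idea, shown here printing the first few page titles of a dump:

import bz2
import xml.etree.ElementTree as ET

def iter_page_titles(path):
    # bz2.open returns a file object that iterparse can consume
    # incrementally, so the dump never has to fit in memory.
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path, "rb") as f:
        for _, elem in ET.iterparse(f):
            # MediaWiki dumps namespace every tag; match on the local name.
            if elem.tag.rsplit("}", 1)[-1] == "title":
                yield elem.text
            elem.clear()  # release parsed elements as we stream

for i, title in enumerate(iter_page_titles("pages-articles-multistream.xml.bz2")):
    if i >= 5:
        break
    print(title)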
