wiki2neo
Produce Neo4j import CSVs from Wikipedia database dumps to build a graph of links between Wikipedia pages.
Installation
$ pip install wiki2neo
Usage
Usage: wiki2neo [OPTIONS] [WIKI_XML_INFILE]
Parse Wikipedia pages-articles-multistream.xml[.bz2] dump into two Neo4j import
CSV files:
Node (Page) import, headers=["title:ID", "id"]
Relationships (Links) import, headers=[":START_ID", ":END_ID"]
Reads from stdin by default, pass [WIKI_XML_INFILE] to read from file.
Options:
-p, --pages-outfile FILENAME Node (Pages) CSV output file [default: pages.csv]
-l, --links-outfile FILENAME Relationships (Links) CSV output file [default: links.csv]
--help Show this message and exit.
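To make the two file layouts concrete, here is a minimal Python sketch that writes CSVs with the headers listed above. It is an illustration only, not wiki2neo's actual implementation: the function name `write_import_csvs` and the sample pages are hypothetical. Because `title` carries the `:ID` marker in the node header, the sketch assumes that `:START_ID` and `:END_ID` hold page titles.

```python
import csv
import io

# Headers as documented in the usage text above.
PAGE_HEADER = ["title:ID", "id"]
LINK_HEADER = [":START_ID", ":END_ID"]


def write_import_csvs(pages, links, pages_out, links_out):
    """Write node and relationship CSVs in the documented layout.

    pages: iterable of (title, page_id) tuples.
    links: iterable of (source_title, target_title) tuples.
    (Hypothetical helper, for illustration only.)
    """
    page_writer = csv.writer(pages_out)
    page_writer.writerow(PAGE_HEADER)
    page_writer.writerows(pages)

    link_writer = csv.writer(links_out)
    link_writer.writerow(LINK_HEADER)
    link_writer.writerows(links)


# Made-up sample data, written to in-memory buffers instead of files.
pages_buf, links_buf = io.StringIO(), io.StringIO()
write_import_csvs(
    pages=[("Graph theory", 1), ("Leonhard Euler", 2)],
    links=[("Graph theory", "Leonhard Euler")],
    pages_out=pages_buf,
    links_out=links_buf,
)
print(pages_buf.getvalue())
print(links_buf.getvalue())
```

The `csv` module quotes titles that contain commas, quotes, or newlines, which matters for real page titles.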
Import the resulting CSVs into Neo4j:
$ neo4j-admin import \
--nodes=:Page=pages.csv \
--relationships=:LINKS_TO=links.csv \
--skip-bad-relationships \
--skip-duplicate-nodes \
--multiline-fields
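The --skip-bad-relationships flag tells the importer to skip relationships that refer to node IDs missing from the node file, and --skip-duplicate-nodes tells it to skip nodes whose ID has already been seen. To estimate how many rows those flags would drop, something like the following Python sketch could be run against a small sample of the CSVs first. The function name `audit_import_csvs` is hypothetical, and the sketch keeps every title in memory, so it is probably not suitable for a full dump.

```python
import csv
import io
from collections import Counter


def audit_import_csvs(pages_file, links_file):
    """Return (duplicate_nodes, bad_relationships) for two open CSV files.

    Hypothetical helper: counts rows that the importer would likely skip,
    assuming the first column of the pages file is the node ID.
    """
    pages = csv.reader(pages_file)
    next(pages, None)  # skip the "title:ID,id" header
    title_counts = Counter(row[0] for row in pages if row)
    # Every occurrence of a title after the first is a duplicate node.
    duplicate_nodes = sum(count - 1 for count in title_counts.values())

    links = csv.reader(links_file)
    next(links, None)  # skip the ":START_ID,:END_ID" header
    # A relationship is "bad" when either endpoint is not a known node.
    bad_relationships = sum(
        1
        for row in links
        if row and (row[0] not in title_counts or row[1] not in title_counts)
    )
    return duplicate_nodes, bad_relationships


# Made-up sample: "A" appears twice, and "C" is linked to but never defined.
sample_pages = io.StringIO("title:ID,id\nA,1\nB,2\nA,3\n")
sample_links = io.StringIO(":START_ID,:END_ID\nA,B\nA,C\n")
print(audit_import_csvs(sample_pages, sample_links))
```

When reading real files, open them with `newline=""` as the `csv` module documentation recommends, so that quoted multi-line fields are parsed correctly.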
Downloads from Wikipedia are in compressed xml.bz2 format. wiki2neo supports parsing either the compressed bz2 file directly or an uncompressed xml file:
# compressed
$ wiki2neo pages-articles-multistream.xml.bz2
# uncompressed
$ wiki2neo pages-articles-multistream.xml
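For readers who want to handle both input forms in their own scripts, the following Python sketch shows one way to open either file transparently, falling back to stdin when no path is given. It is an assumption about how such handling could look, not a description of wiki2neo's internals: the function name `open_dump` and the choice to detect compression from the `.bz2` suffix are both hypothetical.

```python
import bz2
import sys


def open_dump(path=None):
    """Return a binary file object for a Wikipedia dump.

    Hypothetical helper: reads stdin when no path is given, decompresses
    on the fly when the path ends in ".bz2", and otherwise opens the file
    as plain XML.
    """
    if path is None:
        return sys.stdin.buffer
    if path.endswith(".bz2"):
        # bz2.open can read files made of multiple concatenated streams,
        # which is how the multistream dumps are laid out.
        return bz2.open(path, "rb")
    return open(path, "rb")
```

The returned object can be passed to a streaming XML parser such as `xml.etree.ElementTree.iterparse`, which avoids loading the whole dump into memory.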