crunchypi / wikinodes-preprocessing

Visual explorer of Wikipedia

wikinodes-preprocessing

This repo contains code for populating a Neo4j database with nodes representing Wikipedia articles and for connecting those nodes based on the hyperlinks embedded in each article. It is meant as one part of a larger project (Wikinodes). Code for serving content & a related React app is found at:

Usage:

First, install and start Neo4j. The code also depends on two Python modules: 'neo4j' (the driver module) and 'wikipedia' (an interface to the Wikipedia API). The latter module has a known issue which can trigger an exception, see details here. To avoid it, install the module directly from its repository with python -m pip install --upgrade git+https://github.com/goldsmith/Wikipedia.git.


The code is used primarily through the cli.py file in the repo root. The options are as follows (copy-pasted from the help output):

-------------------------------------------------
CLI tool for orchestrating scripts in submodules.
-------------------------------------------------
Basic usage:
    Arguments are partially sensitive to position.
    State between arguments is kept, meaning that
    if arg2 needs data from arg1, then it will work
    as long as arg1 comes before arg2, even with an
    arbitrary amount of args in-between.
Arguments:
    -titles         Specify path where article
                    names are listed.
    -wikiapi        Uses data generated from 
                    <-titles> arg to pull data
                    from wikipedia. This arg
                    expects a value specifying
                    the amount of subsearches for
                    each article. These sub-searches
                    are based on hyperlinks in each
                    article. 0 = None.
    -neo4j          Prepare a neo4j interface obj.
                    Arg vals are expected to be:
                        -neo4j uri,usr,pwd
    
    -createdb       Pushes data created with
                    -wikiapi into the neo4j db.
                    This arg has to come after
                        -wikiapi (for data)
                        -neo4j (for db connection).
                    
    -link           Try linking wiki nodes in neo4j.
                    Note: expects -neo4j arg to be
                    used before this one.
Examples:
    Use data in './data.txt' to fetch article names
    and use that to retrieve data from wikipedia:
    > -titles ./data.txt -wikiapi 0
    Previous example but with pushing data into Neo4j (
    each argument is a new line for formatting purposes):
    >   -titles ./data.txt 
        -wikiapi 0
        -neo4j neo4j://localhost:7687,neo4j,neo4j
        -createdb
    Link nodes in db.
    > -neo4j neo4j://localhost:7687,neo4j,neo4j -link
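
The position-sensitive behaviour described in the help text can be pictured roughly as below. This is a minimal sketch and not the code in cli.py: the handler names, the `state` dictionary and the stubbed handler bodies are all assumptions made for illustration. Only the flag names and the 'uri,usr,pwd' value format come from the help text.

```python
# Illustrative sketch only: every function name here is hypothetical and the
# handlers are stubs. The real logic lives in cli.py and its submodules.

def handle_titles(state, value):
    # Remember the path where article names are listed.
    state["titles_path"] = value

def handle_wikiapi(state, value):
    # Depends on state left behind by an earlier -titles argument.
    if "titles_path" not in state:
        raise ValueError("-wikiapi must come after -titles")
    state["subsearches"] = int(value)

def handle_neo4j(state, value):
    # Value format taken from the help text: uri,usr,pwd
    uri, usr, pwd = value.split(",")
    state["neo4j"] = {"uri": uri, "usr": usr, "pwd": pwd}

def handle_createdb(state, value):
    # Needs both the fetched data and a db connection to be prepared first.
    for key, flag in (("subsearches", "-wikiapi"), ("neo4j", "-neo4j")):
        if key not in state:
            raise ValueError(f"-createdb must come after {flag}")
    state["createdb"] = True

def handle_link(state, value):
    if "neo4j" not in state:
        raise ValueError("-link must come after -neo4j")
    state["link"] = True

HANDLERS = {
    "-titles": handle_titles,
    "-wikiapi": handle_wikiapi,
    "-neo4j": handle_neo4j,
    "-createdb": handle_createdb,
    "-link": handle_link,
}
TAKES_VALUE = {"-titles", "-wikiapi", "-neo4j"}

def run(argv):
    """Process arguments left to right, sharing one state dict between them."""
    state = {}
    i = 0
    while i < len(argv):
        flag = argv[i]
        if flag not in HANDLERS:
            raise ValueError(f"unknown argument: {flag}")
        value = None
        if flag in TAKES_VALUE:
            if i + 1 >= len(argv):
                raise ValueError(f"{flag} expects a value")
            value = argv[i + 1]
            i += 1
        HANDLERS[flag](state, value)
        i += 1
    return state
```

Because the state is shared, unrelated arguments can sit between a producer and its consumer: in this sketch, `-createdb` only checks that `-wikiapi` and `-neo4j` were handled at some earlier position.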


The schema is defined in 'src/typehelpers.py' and will be as follows:

  • Node labels for each article: 'WikiData'
  • Node relationships: 'HYPERLINKS'
  • Property for wiki article title: 'title'
  • Property for wiki article url: 'url'
  • Property for wiki article content (roughly cleaned): 'content'
  • Property for wiki article links (embedded hyperlinks): 'links'
  • Property for wiki article html (raw content): 'html'
  • There is also a final property named 'topic', which is deprecated.
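
To make the schema concrete, the sketch below builds parameterised Cypher statements for one article node and one relationship between two articles. It is an illustration under stated assumptions, not the code in 'src/typehelpers.py': the function names and the choice of MERGE are hypothetical, and only the label, the relationship type and the property names come from the list above.

```python
# Hypothetical helpers: they only build (query, parameters) pairs and do not
# talk to a database. The deprecated 'topic' property is left out on purpose.

NODE_LABEL = "WikiData"
REL_TYPE = "HYPERLINKS"

def create_node_statement(title, url, content, links, html):
    """Return a Cypher statement that creates or updates one article node."""
    query = (
        f"MERGE (n:{NODE_LABEL} {{title: $title}}) "
        "SET n.url = $url, n.content = $content, "
        "n.links = $links, n.html = $html"
    )
    params = {
        "title": title,
        "url": url,
        "content": content,
        "links": list(links),  # titles of articles this one links to
        "html": html,
    }
    return query, params

def link_statement(from_title, to_title):
    """Return a Cypher statement that connects two existing article nodes."""
    query = (
        f"MATCH (a:{NODE_LABEL} {{title: $from_title}}), "
        f"(b:{NODE_LABEL} {{title: $to_title}}) "
        f"MERGE (a)-[:{REL_TYPE}]->(b)"
    )
    return query, {"from_title": from_title, "to_title": to_title}
```

With the 'neo4j' driver, such a pair could be passed to `session.run(query, params)` inside an open session. Whether the repo itself uses MERGE or CREATE is not stated here, so treat that part as a guess.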

This CLI also automatically creates a 'fulltext' index (see the Neo4j documentation) on the 'content' property of 'WikiData' nodes. The server and app repos use that index for their search feature (a search bar for looking up specific articles through their content). The index is named 'ArticleContentIndex' and its creation is started in 'createdb' (func) in 'cli.py'. In addition to the rate limit set by the aforementioned 'wikipedia' module, this repo has a hardcoded rate limit of 1 second per request, which can be adjusted at the top of 'src/data_gen/wikiapi.py'.
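
As a rough illustration of the two mechanisms mentioned above, the sketch below shows what a full-text index statement and a fixed-interval rate limit might look like. Both parts are assumptions rather than the repo's code: the Cypher uses the `CREATE FULLTEXT INDEX` syntax available in Neo4j 4.3 and newer (older versions used a procedure call instead), and the `Throttle` class and constant names are hypothetical.

```python
import time

INDEX_NAME = "ArticleContentIndex"

# Assumed Neo4j 4.3+ syntax; the statement actually issued by cli.py may differ.
CREATE_INDEX = (
    f"CREATE FULLTEXT INDEX {INDEX_NAME} IF NOT EXISTS "
    "FOR (n:WikiData) ON EACH [n.content]"
)

# Querying the index by name; $index and $term are query parameters.
SEARCH_INDEX = (
    "CALL db.index.fulltext.queryNodes($index, $term) "
    "YIELD node, score RETURN node.title AS title, score"
)

class Throttle:
    """Hypothetical helper: enforce a minimum interval between calls to wait()."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon after the previous request: sleep off the difference.
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A fetch loop would call `throttle.wait()` before each Wikipedia request, so that consecutive requests are at least `min_interval` seconds apart. The clock and sleep functions are injectable only to keep the sketch easy to test.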


Finally, the code is fairly well documented, but I've also added a wiki page for this repo as a reference manual, for completeness.

