Wikipedia Graph

Kiana McNellis, Ryan Manski, Rob Russo

Description

This project set out to create a graph containing all of Wikipedia's articles. Articles are stored along with the links they contain. These references are then sent to their respective articles, so that each article is aware of all of the articles it references, as well as all of the articles that reference it.
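A rough sketch of the data model is shown below; the class and field names are illustrative assumptions, not the actual interface defined in article.cpp.

// Illustrative sketch only: names are assumptions, not the real article.cpp interface.
#include <string>
#include <vector>

class Article {
public:
    std::string title;                 // article title, also used to locate its file on disk
    std::vector<std::string> outLinks; // titles of articles this article references
    std::vector<std::string> inLinks;  // titles of articles that reference this article
};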

Parallel Algorithm

To create this graph, the Wikipedia XML dump is first read into the parser program, which splits the dump into individual article files and places them into directories based on their titles. Once parsing is complete, the main program reads these files and creates article objects on each rank. Ranks are ordered so that each covers a certain group of directories. Once the articles are in memory, they are sent to the correct ranks in order to update the articles they reference. When this communication is complete, the graph is created. For more detailed information, check out the formal report!
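The sketch below illustrates the general shape of this rank layout, assuming the standard MPI C API and a simple hash-based owner function; the actual directory-to-rank assignment used by main.cpp is not reproduced here.

#include <mpi.h>
#include <cstdio>
#include <string>

// Hypothetical helper: pick the rank that owns an article, based on the
// first letters of its title (a stand-in for the project's real
// directory-to-rank assignment).
int owner_rank(const std::string &title, int num_ranks) {
    unsigned hash = 0;
    for (std::string::size_type i = 0; i < title.size() && i < 2; ++i)
        hash = hash * 31u + (unsigned char)title[i];
    return (int)(hash % (unsigned)num_ranks);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    std::printf("rank %d of %d\n", rank, size);
    // Each rank would load the articles in its directories, then send each
    // referenced title to owner_rank(title, size) so that the owning rank
    // can record the incoming link.
    MPI_Finalize();
    return 0;
}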

Design Considerations

Written in C++ to take advantage of object-oriented design.

Running

Our program is split into 3 parts:

  • Generating directories
  • Parsing the Wikipedia XML into articles
  • Analyzing the Wikipedia graph in parallel

Part 1: Data gathering

The first two steps can be done serially using:

make download
make unzip
make parse

or

curl -O https://dumps.wikimedia.org/enwiki/20170101/enwiki-20170101-pages-articles-multistream.xml.bz2
bzip2 -kd enwiki-20170101-pages-articles-multistream.xml.bz2

g++ parseFiles.cpp -o parse.out -std=c++98 -g
g++ makeDirs.cpp -o makeDirs.out -std=c++98 -g

./makeDirs.out
./parse.out

NOTE: This is a very large data set (13 GB compressed, 57 GB uncompressed)

Part 2: Parallel Analysis

This is the main part of the project, and the part on which scaling is tested.

Using the makefile:

make compile
make run

Commands locally

mpic++ -Wall main.cpp article.cpp -o main.out -lpthread -g -O0 -fno-inline -stdlib=libstdc++
mpirun -n 9 ./main.out 8

Commands on Blue Gene

make blue
sbatch --partition medium --nodes 256 --time 240 ./running.sh

or

mpixlcxx -O5 main.cpp article.cpp -o main.out -qflag=w -lpthread -qlanglvl=variadictemplates
sbatch --partition medium --nodes 256 --time 240 ./running.sh

Note that -qlanglvl=variadictemplates is used to access the tr1 namespace for unordered maps.
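For reference, the tr1 container is pulled in along these lines (a sketch; the exact header and usage in the project sources may differ by compiler):

#include <tr1/unordered_map>  // header name as in GCC; may differ on other compilers
#include <string>

// Pre-C++11 hash map keyed by article title; on the Blue Gene XL compiler
// this is what requires the -qlanglvl=variadictemplates flag noted above.
std::tr1::unordered_map<std::string, int> linkCounts;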

Input & Output

Parsing

parse.out and makeDirs.out take no arguments.

parse.out expects the Wikipedia dump file to be at the path specified by #define FILEPATH (at the top of parseFiles.cpp)

Both output to the articles/ directory, with subdirectories organized by the first two letters of the article name
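As an illustration of that layout, a title-to-path helper might look like the following; this is a hypothetical sketch, not code taken from makeDirs.cpp or parseFiles.cpp.

#include <string>

// Hypothetical sketch: map an article title to its subdirectory under
// articles/, using the first two letters of the name.
std::string article_dir(const std::string &title) {
    std::string prefix = title.substr(0, 2); // e.g. "Al" for "Albert Einstein"
    return "articles/" + prefix + "/";
}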

Analysis

The main.out program expects one argument and should be invoked as:

./main.out <threads>

where <threads> is a number between 1 and 8.
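A minimal sketch of how the thread-count argument might be consumed is shown below, assuming pthreads (as suggested by the -lpthread flag); the hypothetical worker function stands in for the real per-thread work in main.cpp.

#include <pthread.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical worker; the real per-thread work lives in main.cpp.
void *worker(void *arg) {
    long id = (long)arg;
    std::printf("thread %ld running\n", id);
    return NULL;
}

int main(int argc, char **argv) {
    if (argc != 2) {
        std::fprintf(stderr, "usage: %s <threads>\n", argv[0]);
        return 1;
    }
    int nthreads = std::atoi(argv[1]);
    if (nthreads < 1 || nthreads > 8) {  // argument must be between 1 and 8
        std::fprintf(stderr, "threads must be between 1 and 8\n");
        return 1;
    }
    pthread_t threads[8];
    for (long i = 0; i < nthreads; ++i)
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (int i = 0; i < nthreads; ++i)
        pthread_join(threads[i], NULL);
    return 0;
}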

The recommended number of ranks, and the corresponding number of folders each rank would process, is recorded in ranks.csv.

The program expects that the articles/ directory has already been created and populated. The total number of articles in the Wikipedia dump is 17,180,273.

Output is placed into 3 folders generated by the program:

  • stats (saves runtime and some statistics)
  • topout (saves articles with the most outgoing links per rank)
  • topin (saves articles with the most incoming links per rank)

If using running.sh, logs are placed in the logs/ folder. The program expects this folder to already exist.

