Wikipedia Generator

Flask dev server

Here's how to spin up a local Flask server for development purposes. You should use a virtualenv to ensure that you're running the required versions of each module, and have a clean working environment.

Get Python 2 (from Python.org or your package manager) and make sure python and pip are in your PATH.
Get virtualenv for Python 2, either through your package manager or with pip:
```
 pip install virtualenv
```
Create a new virtualenv stored in wiki-gen/www/venv. Assuming you're in the project root:
```
 cd www
 virtualenv venv
```

Turn on the virtualenv and install the requirements:

    source venv/bin/activate
    pip install -r requirements.txt

On Windows:

    venv\scripts\activate
    pip install -r requirements.txt

Create the seeds logging database:
```
 cd wiki-gen
 python manage.py initdb
```

Move the four-grams/tokens database to the db directory:

    mv /path/to/big/four/grams/database.db db/wiki-gen.db

Start the server:

    cd wiki-gen
    python manage.py runserver

Turn on the virtualenv whenever you're working on the app (i.e. the activate step above, without the install). When you're done, you can turn off the virtualenv and return to normal:

    deactivate

Creating the four-gram database

The app needs a database of four-grams to generate the Markov chains of text. It uses SQLite, which provides a quick, lightweight, easy-to-use interface and performs quite well with read-only databases.
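Since the four-gram database is only read at serving time, one option is to open it in SQLite's read-only mode so nothing can modify it by accident. The following is a minimal sketch, not code from this project: `open_read_only` is a hypothetical helper, and it assumes Python 3.4+ (for the `uri` argument) and a database file that already exists.

```python
import sqlite3
from pathlib import Path


def open_read_only(path):
    """Open an existing SQLite file without permission to modify it."""
    # as_uri() escapes characters that are special in URIs (spaces, '?', '#').
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)
```

For example, `open_read_only("db/wiki-gen.db")` returns a connection on which `SELECT` statements work as usual, while any write raises `sqlite3.OperationalError`.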

Get Wikiforia and use it to extract a Wikipedia XML dump. Then delete the opening and closing <xml> tags (overwriting them in place with whitespace, for example with a hex editor, avoids rewriting the whole file). Make sure Python 3 is installed and in your PATH as python3. Then run:

    ./parse_four_grams.py /path/to/xml/dump /path/to/output/db

The script runs very slowly: random reads and writes on spinning hard drives are slow, and the script isn't particularly clever about caching. Writing to an in-memory database in a tmpfs speeds things up considerably, but you're limited by how much RAM you have. Parsing the first ~600,000 articles of the English Wikipedia produced a ~9 GB database with ~450 million four-grams and ~12 million unique tokens.
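To illustrate why keeping the work in RAM helps, here is a rough sketch that counts four-grams in memory, loads them into an in-memory SQLite database, and copies the result to disk in one pass. It is not how `parse_four_grams.py` works: it uses SQLite's own `:memory:` database instead of a file on a tmpfs, whitespace splitting instead of a real tokenizer, and a simplified one-table layout. The names `four_grams`, `build_in_memory`, and `save` are made up for the example.

```python
import sqlite3
from collections import Counter


def four_grams(tokens):
    """Yield each run of four consecutive tokens as a tuple."""
    for i in range(len(tokens) - 3):
        yield tuple(tokens[i:i + 4])


def build_in_memory(texts):
    """Count four-grams in RAM, then load them into an in-memory database."""
    counts = Counter()
    for text in texts:
        # Whitespace splitting is a stand-in for the real tokenizer.
        counts.update(four_grams(text.split()))

    mem = sqlite3.connect(":memory:")
    # Simplified one-table layout, for illustration only.
    mem.execute(
        "CREATE TABLE grams (t1 TEXT, t2 TEXT, t3 TEXT, t4 TEXT, count INTEGER,"
        " PRIMARY KEY (t1, t2, t3, t4))"
    )
    with mem:  # a single transaction for all rows
        mem.executemany(
            "INSERT INTO grams VALUES (?, ?, ?, ?, ?)",
            [gram + (n,) for gram, n in counts.items()],
        )
    return mem


def save(mem, path):
    """Copy the in-memory database into a file at the given path."""
    disk = sqlite3.connect(path)
    mem.backup(disk)  # Connection.backup requires Python 3.7+
    disk.close()
```

As with the tmpfs approach, the whole data set has to fit in RAM, so a dump of this size would still need to be processed in chunks on most machines.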

The schema for the database is:

##### tokens

| Col | Type |
| --- | --- |
| id | INTEGER PRIMARY KEY (alias of built-in rowid) |
| token | TEXT UNIQUE NOT NULL |

##### four_grams WITHOUT ROWID

| Col | Type |
| --- | --- |
| t1_id | INTEGER NOT NULL PRIMARY KEY FOREIGN KEY REFERENCES tokens(id) |
| t2_id | INTEGER NOT NULL PRIMARY KEY FOREIGN KEY REFERENCES tokens(id) |
| t3_id | INTEGER NOT NULL PRIMARY KEY FOREIGN KEY REFERENCES tokens(id) |
| t4_id | INTEGER NOT NULL PRIMARY KEY FOREIGN KEY REFERENCES tokens(id) |
| count | INTEGER DEFAULT 1 NOT NULL |
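Read as SQL, the two tables above might be declared roughly as follows. This is a sketch inferred from the column listing, not the project's own DDL: it assumes the four `t*_id` columns together form one composite primary key and that the foreign keys point at the `tokens` table. `create_schema` is a hypothetical helper name.

```python
import sqlite3

SCHEMA = """
CREATE TABLE tokens (
    id    INTEGER PRIMARY KEY,  -- alias of the built-in rowid
    token TEXT UNIQUE NOT NULL
);
CREATE TABLE four_grams (
    t1_id INTEGER NOT NULL REFERENCES tokens(id),
    t2_id INTEGER NOT NULL REFERENCES tokens(id),
    t3_id INTEGER NOT NULL REFERENCES tokens(id),
    t4_id INTEGER NOT NULL REFERENCES tokens(id),
    count INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (t1_id, t2_id, t3_id, t4_id)
) WITHOUT ROWID;
"""


def create_schema(conn):
    """Create both tables on an open sqlite3 connection."""
    conn.executescript(SCHEMA)
```

With a composite key starting at `t1_id`, looking up every four-gram that begins with a given three-token prefix is a range scan on the primary key, which is the access pattern text generation needs.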

Since there are a limited number of tokens in any language, the size of the database remains manageable as more four-grams are inserted.
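To show how this schema can drive a Markov chain, here is a minimal sketch of one generation step: given the last three tokens, look up every four-gram that starts with them and pick the fourth token at random, weighted by `count`. This is an illustration under the schema above, not the app's actual code; `next_token` and `generate` are hypothetical names, and `random.choices` requires Python 3.6+.

```python
import random
import sqlite3


def next_token(conn, t1, t2, t3, rng=random):
    """Pick a fourth token for the prefix (t1, t2, t3), weighted by count."""
    rows = conn.execute(
        "SELECT t.token, g.count"
        " FROM four_grams AS g"
        " JOIN tokens AS a ON a.id = g.t1_id"
        " JOIN tokens AS b ON b.id = g.t2_id"
        " JOIN tokens AS c ON c.id = g.t3_id"
        " JOIN tokens AS t ON t.id = g.t4_id"
        " WHERE a.token = ? AND b.token = ? AND c.token = ?",
        (t1, t2, t3),
    ).fetchall()
    if not rows:
        return None  # the prefix never occurred in the parsed text
    candidates = [token for token, _ in rows]
    weights = [count for _, count in rows]
    return rng.choices(candidates, weights=weights, k=1)[0]


def generate(conn, seed, length, rng=random):
    """Extend a seed of at least three tokens up to `length` tokens."""
    words = list(seed)
    if len(words) < 3:
        raise ValueError("seed must contain at least three tokens")
    while len(words) < length:
        token = next_token(conn, *words[-3:], rng=rng)
        if token is None:
            break  # dead end: no known continuation
        words.append(token)
    return " ".join(words)
```

Passing a `random.Random(seed_value)` instance as `rng` makes the output reproducible, which can be handy when debugging.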
