Here's how to spin up a local Flask server for development purposes. You should use a virtualenv to ensure that you're running the required versions of each module, and have a clean working environment.
-
Get Python 2 (from Python.org or your package manager) and make sure
python
andpip
are in your PATH. -
Get virtualenv for Python 2, either through your package manager or
pip
pip install virtualenv
-
Create a new virtualenv stored in
wiki-gen/www/venv
. Assuming you're in the project root:cd www virtualenv venv
-
Turn on the virtualenv and install the requirements
source venv/bin/activate pip install -r requirements.txt
for Windows:
venv\scripts\activate pip install -r requirements.txt
-
Create the seeds logging database:
cd wiki-gen python manage.py initdb
-
Move the four-grams/tokens database to the db directory:
mv /path/to/big/four/grams/database.db db/wiki-gen.db
-
Start the server
cd wiki-gen python manage.py runserver
Turn on the virtualenv whenever you're working on the app (ie. step 4 without the install). When you're done, you can turn off the virtualenv and return to normal:
deactivate
The app needs a database of four-grams to generate the Markov chains of text. It uses sqlite, since it provides a quick, lightweight, easy-to-use interface, and performs quite well with read-only databases.
Get Wikiforia, and use it to extract a Wikipedia XML dump. Then delete the opening and closing <xml>
tags (hexdump
allows you to overwrite them with whitespace to prevent a complete rewrite of the file). Make sure Python 3 is installed and in your PATH as python3
. Then, run:
./parse_four_grams.py /path/to/xml/dump /path/to/output/db
The script runs very slowly, since random reads/writes on spinning hard drives are very slow, and the script isn't particularly clever about caching. Writing to an in-memory database in a tmpfs helps speed things up considerably, but you're limited to however much RAM you have. Parsing the first ~600,000 articles in the English Wikipedia resulted in a ~9 GB database, with ~450 million four-grams, and ~12 million unique tokens.
The schema for the database is:
#####tokens
Col | Type |
---|---|
id | INTEGER PRIMARY KEY (alias of built-in rowid) |
token | TEXT UNIQUE NOT NULL |
#####four_grams WITHOUT ROWID
Col | Type |
---|---|
t1_id | INTEGER NOT NULL PRIMARY KEY FOREIGN KEY REFERENCES token(id) |
t2_id | INTEGER NOT NULL PRIMARY KEY FOREIGN KEY REFERENCES token(id) |
t3_id | INTEGER NOT NULL PRIMARY KEY FOREIGN KEY REFERENCES token(id) |
t4_id | INTEGER NOT NULL PRIMARY KEY FOREIGN KEY REFERENCES token(id) |
count | INTEGER DEFAULT 1 NOT NULL |
Since there are a limted number of tokens in any language, the size of the database remains managable as more four grams are inserted.