Wikimalea
Database Setup
- Create the PostgreSQL database in which the Poisson, word count vectors will
be stored:
$ createdb wikimalea
$ createuser wikimalea
$ psql -d wikimalea -U wikimalea
- Initialize the database schema:
CREATE TABLE terms (
id serial PRIMARY KEY,
term varchar(255) NOT NULL UNIQUE
);
CREATE TABLE documents (
id serial PRIMARY KEY,
title varchar(256) NOT NULL UNIQUE,
redirect varchar(256) REFERENCES documents (title) ON DELETE CASCADE,
text text
);
CREATE TABLE term_frequencies (
document_id integer NOT NULL REFERENCES documents (id) ON DELETE CASCADE,
term_id integer NOT NULL REFERENCES terms (id) ON DELETE CASCADE,
frequency integer NOT NULL DEFAULT 0,
PRIMARY KEY (document_id, term_id),
CHECK (frequency >= 0)
);
Download the English, Wikipedia Corpus
- Ensure that the
resources/
directory is present:
- Download the latest archive of the English, Wikipedia corpus:
$ cd resources/
$ wget -c http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
- Unpack the archive:
$ bunzip2 enwiki-latest-pages-articles.xml.bz2
- That's it!