- Linux (with write access to
/tmp
) - Python 2.7.6
- MIT Language Modeling Toolkit
- I am using SUSE running 3.11.10-17.
- You may need to install python-daemon 1.5.5. See PEP 3143 for example usage.
- You must install the MIT Language Modeling Toolkit, and the tools must be in your path.
See braind -h
for basic usage. The exception handling is limited.
The brain estimates two n-gram language models, fixing the vocabulary for each
model to tokens from a specified file. Optionally, you can set the order
(default=3) and you can specify the smoothing algorithm (default='ModKN').
Assuming braind
is in your path, enter
$ braind path/to/training/data path/to/vocabulary/file
This will estimate a trigram back-off model using Modified Kneser-Ney smoothing
from the training data and write the ARPA file to /tmp/blm
. It will also
construct a statically interpolated model from component n-gram models and write
the ARPA file to /tmp/ilm
.
The vocabulary file must have one token on each line. To build an exemplar vocabulary file from the Shakespeare corpus, enter
$ estimate-ngram -text shakespeare.txt -write-vocab shakespeare.vocab
Tokens in the vocabulary file that are not in the training data will be factored into the models' unigram estimates.
Then the brain will make two named pipes, /tmp/fifo0
and /tmp/fifo1
, to
serve as communication channels to/from the n-gram server before daemonizing the
process.
When the server is started, the first thing it will do is deserialize /tmp/blm
and /tmp/ilm
into distinct key-value stores. Obviously, this is a horribly
inefficient data structure for this purpose, but performance is not the key
concern for our current work. The prompt will return after the server loads the
language models. You can verify the server is running by examining the
output of ps x
.
Clients can communicate with the language model server by reading/writing JSON
documents to the two named pipes. A client should write requests to /tmp/fifo0
and read responses from /tmp/fifo1
.
A request is a JSON document with four fields: model
, flavor
, history
and
length
.
-
model
can take one of two values:backoff
orinterpolate
. -
flavor
can take one of three values:up
,down
orstrange
.up
corresponds to the natural distribution over a set of tokens.down
corresponds to the unnatural distribution.strange
linearly interpolatesup
anddown
where lambda is 0.5. -
history
contains the initial prefix as an array of strings. If you do not have a prefix, then the history field must be the empty string (i.e.,"history" : [""]
). Also, suppose you estimate a language model where--order
is greater than one and you request a stream longer than one token. In this case, if you don't have a history (or the length of the history is suboptimal), then the brain will essentially bootstrap the history to lengthn-1
, wheren
is the order of the model. For instance, if you estimated a trigram and you request a stream four tokens long but you don't have a history, then the brain will query the unigrams for the first token. Then it will query the bigrams, using the first token, for the second token. Then it will query the trigrams, using the first two tokens, for the third token. Finally, it will query the trigrams, using the two most recent tokens, for the fourth token in the stream. -
length
is an integer which governs the length of the sentence to query from the language model.
A minimal request has the form
{ "model" : "backoff", "flavor" : "up", "history" : [""], "length" : 1 }
The server will respond with a unigram sample from the back-off model using the natural distribution over unigrams.
Note that the brain will apply the Markov assumption to your history. In other
words, the history will be sliced according to the order: history[-(n-1):]
.
This means that if you estimate trigram models, but your history is ["Because", "I", "could", "get", "Uromysitisis", "poisoning", "and", "die", "."]
, then the
brain will only consider ["die", "."]
when it queries a trigram.
Because I could get Uromysitisis poisoning and die. That's why! Do you think I enjoy living like this? The shame? The humiliation? You know I have been issued a public urination pass by the city because of my condition. Unfortunately my little brother ran out of the house with it this morning. Him and his friends are probably peeing all over the place. You want to call the Department of Social Services? Oh, it's Saturday. They're closed today. My luck.
The response is a JSON document with one field: stream
.
stream
contains the sampled sentence as an array of strings. Here is an
example using the Seinfeld corpus:
cat request
{ "model" : "backoff", "flavor" : "up", "history" : ["die", "."], "length" : 3 }
cat request > /tmp/fifo0 && cat /tmp/fifo1
{ "stream" : ["That's", "why", "!"] }
The response is a sequence of three tokens. The sequence is seeded by the history, and the tokens are sampled from the back-off model using the natural distribution over tokens.
Send a SIGTERM to kill the daemon: kill PID
. A simple handler has been
implemented to clean up the /tmp
scratch files.