
MRF-LM

Fast Markov Random Field Language Models

[Documentation in progress]

An implementation of a fast variational inference algorithm for Markov Random Field language models as well as other Markov sequence models.

The algorithm implemented in this project is described in the paper:

A Fast Variational Approach for Learning Markov Random Field Language Models
Yacine Jernite, Alexander M. Rush, and David Sontag.
Proceedings of ICML 2015.

Available here.

Building

To build the main C++ library, run

bash build.sh

This will build liblbfgs (needed for optimization) as well as the main executables. The package requires a C++ compiler with support for OpenMP.

Training a Language Model

The training procedure requires two steps.

First, construct a moments file from the text data of interest. We include the standard Penn Treebank language modelling data set as an example; it is located under lm_data/. To extract moments from this data, run:

python Moments.py --K 2 --train lm_data/ptb.train.txt --valid lm_data/ptb.valid.txt

Next, run the main mrflm executable, providing the training moments, the validation moments, and an output file for the model:

./mrflm --train=lm_data/ptb.train.txt_moments_K2.dat --valid=lm_data/ptb.valid.txt_moments_K2.dat --output=lm.model

This command will train a language model, compute the validation log-likelihood, and write the parameters out to lm.model. (These parameter settings correspond to Figure 6 in the paper.)

MRF-LM

The main MRF executable has several options controlling the model used, the training procedure, and the parameters of dual decomposition.

usage: ./mrflm --train=string --valid=string --output=string [options] ...
options:
  --train           Training moments file. (string)
  --valid           Validation moments file. (string)
  -o, --output      Output file to write model to. (string)
  -m, --model       Model to use, one of LM (LM low-rank parameters), LMFull (LM full-rank parameters). (string [=LM])
  -D, --dims        Size of embedding for low-rank MRF. (int [=100])
  -c, --cores       Number of cores to use for OpenMP. (int [=20])
  -d, --dual-rate   Dual decomposition subgradient rate (\alpha_1). (double [=20])
  --dual-iter       Dual decomposition subgradient epochs to run. (int [=500])
  --mult-rate       Dual decomposition subgradient decay rate. (double [=0.5])
  --keep-deltas     Keep dual delta values to hot-start between training epochs. (bool [=0])
  -?, --help        print this message

There is a separate executable for testing the model after it is written.

usage: ./mrflm_test --model-name=string --valid=string [options] ...
options:
  -o, --model-name      Model file to read parameters from. (string)
  -m, --model           Model to use, one of LM (LM low-rank parameters), LMFull (LM full-rank parameters), Tag (POS tagger). (string [=LM])
      --valid           Validation moments file. (string)
      --train           Training moments file. (string [=])
      --embeddings      File to write word-embeddings to. (string [=])
      --vocab           Word vocab file. (string [=])
      --tag-features    Features for the tagging model. (string [=])
      --tag-file        Tag test file. (string [=])
      --tag-vocab       Tag vocab file. (string [=])
  -c, --cores           Number of cores to use for OpenMP. (int [=20])
  -?, --help            print this message

Training a Tagging Model

The tagging model can be trained in a very similar way. We assume that the data is in the CoNLL parsing format and located under the tag_data directory. To construct the moments, run the following command:

python MomentsTag.py tag_data/ptb.train.txt tag_data/ptb.valid.txt tag_data/ptb.test.txt tag

Next, run the main mrflm executable, providing the training moments, the tag features, and the validation data:

./mrflm --train=tag_data/ptb.train.txt.tag.counts  --valid=tag_data/ptb.valid.txt.tag.counts --output=tag.model --model=Tag --tag-features=tag_data/ptb.train.txt.tag.features --valid-tag=tag_data/ptb.valid.txt.tag.words

This command will train a tagging model, evaluate on the validation set by running the Viterbi algorithm, and write the parameters out to tag.model.

Advanced Usage

Word Embeddings

Once a model is trained, mrflm_test can be used to view the embeddings produced by the model.

./mrflm_test --model-name=lm.model --embeddings=embed --vocab=lm_data/ptb.train.txt_vocab_K2.dat

This will output two files. The file embed will contain the word embedding vectors, one per line. The file embed.nn will contain the 10 nearest neighbors for each word in the vocabulary.
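The file embed.nn is convenient for a quick look, but the raw vectors can also be loaded directly. Below is a minimal Python sketch (a hypothetical helper, not part of the repository) that recomputes nearest neighbors by cosine similarity. It assumes embed holds one whitespace-separated numeric vector per line, aligned with the word order of the vocab file, and that the word is the first field on each vocab line; check the actual output files before relying on this.

import numpy as np

# Hypothetical helper: load the embeddings written by mrflm_test and
# find nearest neighbors by cosine similarity. Assumes the rows of
# embed are aligned line-by-line with the words in the vocab file.
words = [line.split()[0] for line in open("lm_data/ptb.train.txt_vocab_K2.dat")]
E = np.loadtxt("embed")
E /= np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows

def neighbors(word, k=10):
    sims = E @ E[words.index(word)]            # cosine similarity to all words
    order = sims.argsort()[::-1]
    return [words[i] for i in order[1:k + 1]]  # skip the word itself

print(neighbors("company"))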

Tagger

The tagging model can also be used after it is trained. To run the tagger on a data set (such as the test set), use the following command:

./mrflm_test --model-name=tag.model --model=Tag --tag-file=tag_data/ptb.test.txt.tag.words --tag-features=tag_data/ptb.train.txt.tag.features --vocab=tag_data/ptb.train.txt.tag.names --tag-vocab=tag_data/ptb.train.txt.tag.tagnames --train=tag_data/ptb.train.txt.tag.counts --cores=1

Moments File Format

The input to the main implementation is a file containing the moments of the lifted MRF. The moments file assumes the lifted graph is star-shaped, with the central variable at index one. The format of the file is:

{N = # of samples}
{L = # of variables}
{# of states of variable 1} {# of states of variable 2} ...(L columns)
{M1 = # of 2->1 pairs}
{State in 2} {State in 1} {Counts}
{State in 2} {State in 1} {Counts}
...(M1 rows)
{M2 = # of 3->1 pairs}
{State in 3} {State in 1} {Counts}
{State in 3} {State in 1} {Counts}
...(M2 rows)

This file format is used for both language modelling and tagging.
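For reference, here is a minimal Python sketch of a reader for this format (a hypothetical helper, not part of the repository); it assumes the file is whitespace-separated integers, as in the examples below.

# Hypothetical helper: parse a moments file into N, the state sizes,
# and one list of (context state, central state, count) rows per edge.
def read_moments(path):
    with open(path) as f:
        tokens = iter(f.read().split())
        nxt = lambda: int(next(tokens))
        N = nxt()                               # number of samples
        L = nxt()                               # number of variables
        states = [nxt() for _ in range(L)]      # states per variable
        blocks = []
        for _ in range(L - 1):                  # one block per edge v -> 1
            M = nxt()                           # number of pair rows
            blocks.append([(nxt(), nxt(), nxt()) for _ in range(M)])
    return N, states, blocks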

LM Moments File

Consider a language modelling setup.

For example, say we are building a language model with the training corpus:

the cat chased the mouse

If our model has context K = 2, then we transform the corpus to:

<S> <S> the cat chased the mouse

After the transformation, each of the 7 token positions is a sample (context indices wrap around the sequence, as the moments below show), so N = 7; the number of variables is L = K+1 = 3; the vocabulary size/number of states is V = 5; and the dictionary is:

<S> 1
the 2
cat 3
chased 4
mouse 5

The corresponding moments file would then look like:

7
3
5 5 5
7
5 1 1
1 1 1
1 2 1
2 3 1
3 4 1
4 2 1
2 5 1
7
2 1 1
5 1 1
1 2 1
1 3 1
2 4 1
3 2 1
4 5 1
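This file can be reproduced mechanically. The sketch below is a simplified stand-in, not the official Moments.py; it encodes the conventions visible in the example: K <S> tokens are prepended, every position is a sample, context indices wrap around the sequence, and each observed pair is written as its own row with count 1 (merging duplicate rows by summing counts would describe the same moments).

# Simplified stand-in for Moments.py; reproduces the example above.
def lm_moments(tokens, K):
    vocab = {"<S>": 1}                   # word -> integer state
    for w in tokens:
        vocab.setdefault(w, len(vocab) + 1)
    ids = [vocab[w] for w in ["<S>"] * K + tokens]
    N = len(ids)
    lines = [str(N), str(K + 1), " ".join([str(len(vocab))] * (K + 1))]
    for k in range(1, K + 1):            # one block per context offset
        lines.append(str(N))
        for i in range(N):
            c = ids[(i - k) % N]         # context wraps around the sequence
            lines.append("%d %d 1" % (c, ids[i]))
    return vocab, "\n".join(lines)

vocab, moments = lm_moments("the cat chased the mouse".split(), K=2)
print(moments)                           # matches the file above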

Tagging Moments File

Now consider a tagging setup. Say we are building a tagging model with the training corpus:

the/D cat/N chased/V the/D mouse/N

If our model has context K=1 and M=3 (roughly corresponding to Figure 7 in the paper), then we transform the corpus to:

<S>/<T> the/D cat/N chased/V the/D mouse/N

After the transformation, the number of samples is N = 6, the number of lifted variables is L = M + K + 1 = 5, the number of tag states is T = 4 (with V = 5 as above), and the tag dictionary is:

<T> 1
D 2
N 3
V 4

The corresponding moments file would then look like:

6
5
4 4 5 5 5
6
3 1 1
1 2 1
2 3 1
3 4 1
4 2 1
2 3 1
...

Code Structure

The code is broken into three main classes:

  • Train.h: Generic L-BFGS training. Implements most of Algorithm 2.

  • Inference.h: Lifted inference on a star-shaped MRF. Implements Algorithm 1.

  • Model.h: Pairwise MRF parameters. Implements likelihood computation, gradient updates, and the lifted structure.

The Model.h class implements a full-rank MRF by default, but it can easily be extended to allow for alternative parameterizations. See LM.h for the low-rank language model with back-propagation (Model 2 in the paper), and Tag.h for a feature-factorized part-of-speech tagging model.

About

License: GNU Lesser General Public License v3.0

