chbrown / slda

Supervised Latent Dirichlet Allocation for Classification

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Supervised Latent Dirichlet Allocation for Classification

This is a C++ implementation of supervised latent Dirichlet allocation (sLDA) for classification.

Note that this code depends on the GNU Scientific Library.

Compiling

git clone https://github.com/chbrown/slda
cd slda
make

You may need to install the gsl first. E.g., on a Mac:

brew install gsl

Estimation

Estimate the model by executing:

slda est <data path> <label path> <settings path> <alpha> <k> <initialization> <output directory>
  • <data path> should point to a single file containing your training data.
    • This should be a file where each line is of the form:

        <M> <term_1>:<count> <term_2>:<count> ... <term_N>:<count>
      
    • where <M> is the number of unique terms in the document, and the [count] associated with each term is how many times that term appeared in the document. (For an example, see test/images/train-data.dat.)

  • <label path> points to a file of labels
    • Each line should consist of a single integer, starting with 0, up to C-1, if we have C classes.
    • This file should have the same number of lines as the file specified by <data path>.
  • <settings path> should point to a file with various settings, e.g., settings.txt
  • <alpha> is a floating point hyperparameter (a prior)
  • <k> is the number of topics
  • <initialization> specifies the initialization method. There are three options:
    • "seeded"
    • "random"
    • <model path> (a path to some pre-existing model)
  • <output directory> should point to a directory where the estimator's output will be stored. This directory will be created if it does not already exist.
    • The estimator outputs models in two types of files:

      • <iteration>.model is the model saved in the binary format, which is easy and fast to use for inference.
      • <iteration>.model.text is the model saved in the text format, which is convenient for printing topics or further analysis using a scripting language.
    • It also produces variational posterior Dirichlets in a file called:

      • <iteration>.gamma
    • Running the estimator on the 8-class image dataset produces the output:

        010.gamma
        010.model
        010.model.text
        020.gamma
        020.model
        020.model.text
        final.gamma
        final.model
        final.model.text
        likelihood.dat
        word-assignments.dat
      

Example usage:

./slda est test/images/train-data.dat test/images/train-label.dat \
    settings.txt 1.0 10 random tmp/

Inference

To perform inference on a different set of data (in the same format as for estimation), execute:

slda inf <data path> <label path> <settings path> <model path> <output directory>
  • <data path>, <label path>, and <settings path> are all the same as in the estimation step.
  • <model path> is the binary final.model file from the estimation step.
  • <output directory> is the output directory, where the predicted labels will be stored.
    • Each output file has one line per input document.
      • inf-gamma.dat describes the variational posterior Dirichlets
      • inf-labels.dat displays the predicted labels
      • inf-likelihood.dat depicts each document's likelihood

Example usage:

./slda inf test/images/test-data.dat test/images/test-label.dat \
    settings.txt tmp/final.model tmp/

This will also produce a final line of output, evaluating against the labels specified in the <label path> argument:

average accuracy: 0.679

Sample data

The sample data in test/images was downloaded from http://www.cs.cmu.edu/~chongw/data/images.tgz on July 12, 2013.

Description of data from original site:

A preprocessed 8-class image dataset from Labelme.

UIUC Sports annotation files: annotations and meta information. The source image files can be found here. (Note: there might be some discrepancies and I don't seem to know why...)

License

Copyright © 2009, Chong Wang, David Blei and Li Fei-Fei

Licensed under both the GPL v2 and GPL v3, as well as any future version of the GNU General Public License.

About

Supervised Latent Dirichlet Allocation for Classification

License:GNU General Public License v2.0


Languages

Language:C++ 89.8%Language:C 10.2%