tector

Build

First create the missing Makefile and configure script by running autoreconf, then run the configure script.

autoreconf --force --install
./configure

Once you executed these commands, you can build the source by running

make
make check

Usage

Tector comes with three programs: filter, vocab and model.

The filter program is just a handy tool to clean text. It performs stemming and stopword removal. You can run filter by passing files as arguments or piping text to its stdin. The cleaned text will be written to stdout.

filter < dirty.txt > clean.txt
filter dirty01.txt dirty02.txt > clean.txt

The vocab program is used to create vocabularies. A vocabulary is basically just a list of words and their frequency. A word that isn't part of the vocabulary won't be recognized by the language model, so make sure to create the vocabulary on a rich set of data.

The model program creates and trains the language model and computes the vector representations of all words in your vocabulary. To outline the basic usage of vocab and model, here's a simple example.

Example

First create an empty directory that will contain the vocabulary and language model files. This directory will be the first argument of all subsequent vocab and model calls.

mkdir example

Prepare to train a vocabulary by calling vocab create. Then train it on input - let's say the directory text/ contains a lot of text files.

vocab create example
vocab train example text/*

That's it. You can call vocab train multiple times. To take a look at the words in your vocabulary, run

vocab print example

Now that the vocabulary is ready, create a language model by calling

model create example

I need to squeeze a warning in here: right now the language model can not deal with vocabulary changes. So if you want to re-train your vocabulary, copy it to another directory and work on it there.

Back to the example. Just like vocabulary, the language model can be trained by calling

model train example text/*

And just like the vocab train command, model train can be called multiple times.

model train example more_text/*
model train example even_more_text/*

After sufficient training, generate the word vectors by calling

model generate example

That's it.

License

tector is released under Apache License, Version 2.0. You can find a copy of the Apache License in the LICENSE file.

slyrz / tector

tector

Build

Usage

Example

License

About

Languages