mzilinec / trainable-tokenizer

Fast and trainable tokenizer for natural languages relying on maximum entropy methods.

trtok

a fast and trainable tokenizer for natural languages

Trtok is a universal, performance-oriented tokenizer for processing natural languages. It reads text and tries to correctly detect sentence boundaries and divide the text into tokens.

Trtok does not implement any specific heuristic to perform these tasks; instead, it lets the user define rules for the potential joining and splitting of words into tokens and sentences. The final decision whether to split or join words and whether to break sentences is left to a conditional probabilistic model trained from user-supplied annotated data. The way the trainer understands the data can be extensively customized: users can define their own features and specify which features are significant for which tokens.

1) Tokenization schemes

The user might want to use trtok for processing more than one language, or for processing a single language in several ways. These different ways of tokenization are described by "tokenization schemes". Their definitions reside in the "schemes" subdirectory of the installation directory. Every folder inside "schemes" defines a single tokenization scheme by way of various configuration files.

Tokenization schemes may be nested to express a form of scheme inheritance: a scheme inherits all the configuration files of its ancestors unless it redefines them with a configuration file of the same name.
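
A nested layout might look like the sketch below; the scheme and file names are hypothetical, chosen only to echo the en/simple/brown scheme path used in the examples further on, and the individual configuration files are described in the following sections.

  Example:

    schemes/
      en/
        features
        maxent.params
        abbreviation.listp
        simple/
          features            # overrides en/features
          brown/
            train.fl          # default file list for "train" mode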

a) Rough tokenization rules

The tokenizer identifies all potential token and sentence boundaries within
the text and uses them and the whitespace to split the text into short
segments called rough tokens. The ambiguous boundaries are placed according
to the tokenization scheme. Files with the .split extension define positions
where a word may be broken into two tokens (called a MAY_SPLIT). Files with
the .join extension define positions where two words may be joined into a
single token (MAY_JOIN). Finally, files with the .break extension define
positions at which there might be a sentence break (MAY_BREAK_SENTENCE).

All of the above-named files contain lines holding pairs of
whitespace-delimited regular expressions. If the text leading up to a
position and the text following it match the two paired regular expressions
respectively, the ambiguous boundary (MAY_SPLIT for .split files, MAY_JOIN
for .join files or MAY_BREAK_SENTENCE for .break files) is placed at that
position.

The grammar of the regular expressions in these files is the one used by
Quex and described in detail at
http://quex.sourceforge.net/doc/html/usage/patterns/context-free.html. Take
particular care with Unicode: Quex does not handle Unicode characters
directly in its regular expression syntax, so be sure to use the \UXXXXXX
escape notation if you need them.

The files may contain comments which are lines that begin with the # symbol.
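
As a purely hypothetical illustration (this rule does not ship with any
scheme), a .split file that allows a trailing period to be split off a
lowercase word could consist of a comment and a single rule:

  Example:

    # place a MAY_SPLIT between a lowercase word and a following period
    [a-z]+    "."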

b) User-defined properties

Files with a .rep extension contain a single regular expression from the
family of expressions supported by PCRE (see pcre.org). A rough token is
marked as having this property if it matches the regular expression.
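
For example, the "initial" property used in the feature-selection example
below could plausibly be defined by a hypothetical initial.rep file holding
one PCRE expression that matches a capital letter followed by a period:

  Example:

    ^[A-Z]\.$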

Files with a .listp extension define properties using lists of token types.
If a rough token's text is exactly the same as a line from a .listp file,
then that rough token is marked as having the property defined by that
.listp file.
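
Similarly, the "abbreviation" property used in the feature-selection example
below could be defined by a hypothetical abbreviation.listp file listing
known abbreviations, one token type per line:

  Example:

    Mr.
    Dr.
    etc.
    e.g.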

c) Feature selection

Every tokenization scheme must have a file named "features". For each rough
token in the vicinity of the potential split/join/sentence break, it
specifies which features are important for the decision.

A typical line starts by declaring a set of interesting offsets (0 is the
rough token preceding the decision point, -1 the one before it, +1 the one
after it, etc...). These offsets are separated by commas and intervals can
be used for convenience (e.g. -4,-2..+2,5 selects -4,-2,-1,0,1,2,5).

After the offsets comes a colon and a comma-separated list of properties.
The property names are the filenames of their definitions without the
extensions and they are limited to the common identifier character set
[a-zA-Z0-9_]. The line is closed with a terminating semicolon.

Apart from these simple features, it is possible to ask for combined
features which bundle the value of different properties of tokens at
different offsets into a single feature value. These are defined on their
own line and are enclosed in parentheses. Inside the parentheses is a "^"
separated list of offset:property pairs. If a combined feature takes
properties from a single token only, the parenthesized expression can
appear on the right-hand side of a typical line instead of a simple
property name and the offsets within its definition are omitted.

Apart from the user-defined properties from the .rep and .listp files, the
tokenizer defines the non-binary property "%length", whose value is the
length of the rough token, and the meta-property "%Word", which generates
a property for each rough token type.

  Example:
    
    -2..+2: %Word;
    -5..5: uppercase, abbreviation, (starts_with_number ^ ends_with_period);
    (0:fullstop ^ 1:initial)

d) Maxent training parameters

More control over the training of the probabilistic model can be gained by
editing the "maxent.params" file. This is an INI-style configuration file
which lets the user set the following parameters; they are passed directly
to the training toolkit.

  event_cutoff=<int>                 All training events which occur fewer
        times than event_cutoff are ignored. Default 1.

  n_iterations=<int>                 The maximum number of iterations the
        iterative method will perform. Default 15.

  method_name=lbfgs|gis              Which of the two methods L-BFGS or GIS
        is to be used. L-BFGS is recommended. Default lbfgs.

  smoothing_coefficient=<double>     Sigma, the coefficient in Gaussian
        smoothing. Default 0 (no smoothing).

  convergence_tolerance=<double>     The model is regarded as convergent
        when the relative difference between the log-likelihoods of
        successive models is < convergence_tolerance. Default 1e-05.

  save_as_binary=false|true          Whether to save the model in a binary
        format, which is faster to load and smaller if Maxent was compiled
        with zlib support. Default false.
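
An illustrative maxent.params file overriding a few of these defaults might
look as follows (the particular values are arbitrary):

  Example:

    method_name=lbfgs
    n_iterations=100
    smoothing_coefficient=1.0
    save_as_binary=true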

e) File lists and filename replacement regular expressions

Files [prepare|train|heldout|tokenize|evaluate].[fl|fnre] are for
convenience only and are described later.

2) Running the tokenizer

a) Different ways of selecting input

The first argument passed to the tokenizer selects its mode, which can be
either "prepare", "train", "tokenize" or "evaluate". The second argument is
a path relative to the directory "schemes" which selects the tokenization
scheme to be used. The rest of the arguments are input files and options.

Input files can be specified explicitly on the command line. More files can
be given using the -l (--file-list) option which takes a path to a file and
adds every line of it as another input file.
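
A file list is simply a plain text file with one input path per line; a
hypothetical data/brown/train.fl, like the one used in the example below,
might contain:

  Example:

    data/brown/ca01.raw
    data/brown/ca02.raw
    data/brown/cb01.raw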

When running in prepare mode or tokenize mode, an output file has to be
specified for each input file, and when running in train mode or evaluate
mode, a file with the annotated version has to be specified. These
secondary files are selected by taking the input file's path and
transforming it using a regular expression/replacement string. The filename
regular expression/replacement string is specified using the -r
(--filename-regexp) option. These strings look like replacement commands in
sed: the first character can be any ASCII character; it separates the
regular expression from the replacement string and also terminates the
entire string. Unlike sed, this special character cannot appear anywhere
else in the string (no escaping). The regular expressions used here are
those supported by PCRE; the replacement strings may contain the
placeholders \0, \1... for the entire matched string, the first captured
group, and so on.
  
  Example:

    trtok train en/simple/brown -l data/brown/train.fl -r "|raw|txt|"
    
In the annotated/tokenized files, sentences are split by newlines and
tokens are split by spaces.
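
To make the format concrete, a fragment of a hypothetical annotated file
could look like this, with one sentence per line and tokens separated by
single spaces:

  Example:

    Mr. Brown arrived at nine .
    He did not stay long .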

If no input file or file lists were given, a default file list named
<mode_name>.fl, which is part of the tokenization scheme, is used. If no
filename regular expression/replacement string is given, the one in the
file named <mode_name>.fnre from the tokenization scheme is used. In both
cases <mode_name> is expanded to either "prepare", "train", "tokenize" or
"evaluate" depending on the current mode.

If no input file or file lists were given and there are no default file
lists defined by the tokenization scheme, then the tokenizer processes the
standard input and writes to the standard output. This is, however, only
possible for the "prepare" and "tokenize" modes. The standard input/output
combo can also be explicitly selected by specifying the input file "-" on
the command line.
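
For instance, assuming a model has already been trained for the
en/simple/brown scheme used above, the following would tokenize standard
input and write the result to standard output:

  Example:

    trtok tokenize en/simple/brown -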

b) Different modes of execution

In "prepare" mode, the tokenizer reads the input, splits it into rough
tokens and then outputs it with all possible splits and sentence breaks
performed. This format might be handy for manual annotators who then only
have to join together parts of tokens and sentences.

In "train" mode, the tokenizer reads both the input and its annotated
version. It uses the annotated data to get pairs of questions (values of
features in a given context surrounding a decision point) and answers
(whether the decision point is to become a joining of tokens, a splitting
of tokens or a sentence break). These pairs are then used to train the
probabilistic model and store it in a file under the "build" directory.

In "tokenize" mode, the tokenizer relies on the presence of an already
trained model and uses it to classify every decision point in the input
file and output the tokenized and segmented text.

In "evaluate" mode, the tokenizer reads both the input and its annotation
as in "train" mode, but now it also queries the trained model for an
opinion and compares it with the one found in the annotated data. The
tokenizer outputs a log of every context and both the predicted and correct
outcomes for later analysis. The "analyze" script provided with trtok will
let you read this output and determine the accuracy of your system.
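
Following the pattern of the training example above, an evaluation run over
a held-out file list might look like this (the test.fl file list is
hypothetical):

  Example:

    trtok evaluate en/simple/brown -l data/brown/test.fl -r "|raw|txt|"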

c) Different options

If you launch trtok with no command line arguments, you will get a summary
of all the supported command line options and their meaning. These include
options for setting the encoding of the input and output files, options for
controlling the output (preserving the original tokenization, segmentation
or paragraph division), options for the preprocessing of the input (whether
entities are expanded for the duration of the tokenization and whether they
are kept expanded in the output; whether XML should be hidden from
tokenization), options for logging the contexts and outcomes to a third
file, and others.

3) Running with Docker

The easiest way to get started without managing dependencies is to use the provided Dockerfile:

    docker build -t trtok .
    docker run -it trtok bash
