proycon / colibri-utils

NLP utilities that rely on Colibri Core: currently only language identification

Colibri Utils

This collection of command-line Natural Language Processing utilities currently contains only a single tool:

  • Colibri Lang: Language identification - colibri-lang - detects the language of (parts of) a document. It works on both FoLiA XML documents and plain text. Given FoLiA XML input, the document is enriched with language annotation, which may be applied at any structural level FoLiA supports (e.g. paragraphs, sentences). Given plain-text input, each input line is classified. The tool currently supports a limited set of languages, but is easily extendable:
    • English, Spanish, Dutch, French, Portuguese, German, Italian, Swedish, Danish (trained on Europarl)
    • Latin (trained on the Clementine Vulgate bible and a few Latin works from Project Gutenberg)
    • Historical Dutch:
      • Middle Dutch - Trained on Corpus van Reenen/Mulder and Corpus Gysseling
      • Early New Dutch - Trained on Brieven als Buit

Installation

Colibri Utils is included in our LaMachine distribution, which is the easiest and recommended way of obtaining it.

Colibri Utils is written in C++. Building from source is also possible if you have the expertise, but it requires various dependencies, including ticcutils, libfolia, and colibri-core, all of which have to be obtained and compiled separately. Once those are installed, build and install as follows:

$ bash bootstrap.sh
$ ./configure
$ make
$ sudo make install

Usage

See colibri-lang --help

Methodology

To identify languages, input tokens are matched against a trained lexicon with token frequencies, which is loaded into memory. No higher-order n-grams are used.

A pseudo-probability is computed for the given sequence of input tokens for each language, and the highest probability wins. A confidence value is computed simply as the ratio of tokens found in the vocabulary to the length of the token sequence. Out-of-vocabulary words are assigned a very low probability.
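The scoring scheme above can be sketched as follows. This is a minimal illustration, not the actual colibri-utils implementation: the lexicon contents, the out-of-vocabulary penalty value, and the function names are all assumptions made for the example; the real tool loads its lexicons from trained Colibri Core pattern models.

```python
import math

# Hypothetical per-language lexicons mapping token -> frequency.
# In colibri-utils these would come from trained pattern models.
LEXICONS = {
    "en": {"the": 500, "cat": 20, "sat": 10},
    "nl": {"de": 480, "kat": 18, "zat": 9},
}

OOV_LOGPROB = math.log(1e-10)  # assumed penalty for out-of-vocabulary tokens


def classify(tokens):
    """Return (best_language, confidence) for a sequence of tokens."""
    best_lang, best_score, best_known = None, float("-inf"), 0
    for lang, lexicon in LEXICONS.items():
        total = sum(lexicon.values())
        score, known = 0.0, 0
        for token in tokens:
            freq = lexicon.get(token)
            if freq is None:
                score += OOV_LOGPROB  # very low pseudo-probability
            else:
                # unigram pseudo-probability: relative frequency in the lexicon
                score += math.log(freq / total)
                known += 1
        if score > best_score:
            best_lang, best_score, best_known = lang, score, known
    # confidence: fraction of tokens found in the winning language's vocabulary
    confidence = best_known / len(tokens) if tokens else 0.0
    return best_lang, confidence
```

For example, `classify(["the", "cat", "sat"])` would pick English with confidence 1.0, since every token is in the English lexicon, while a line mixing known and unknown tokens yields a proportionally lower confidence.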

Training

New models can easily be trained and added, independently of the other models. Simply train an unindexed pattern model with Colibri Core and put the model file and the class file in your data directory. Ensure the data is tokenised and lower-cased before building the pattern model (ucto can do both of these for you). A full example:

$ ucto -n -l -Lgeneric corpus.txt corpus.tok.txt
$ colibri-classencode corpus.tok.txt
$ colibri-patternmodeller -u -t 5 -l 1 -f corpus.tok.colibri.dat -o corpus.colibri.model
$ mv corpus.tok.colibri.cls corpus.colibri.cls
$ sudo cp corpus.colibri.* /usr/local/share/colibri-utils/data/

About

License: GNU General Public License v3.0
