grakic / textcat-sr

Serbian Cyrillic and Latin language models for libexttextcat, a free software n-gram based language guessing library

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Serbian Cyrillic and Latin language models for libtextcat

Copyright (c) 2010  Goran Rakic <grakic@devbase.net>


libtextcat is a free software n-gram based language guessing library. Please
go to http://software.wise-guys.nl/libtextcat/ for source and documentation.

Models are generated from a collection of texts from the Serbian Wikipedia as
saved in serbian.txt using text_cat Perl script by Gertjan van Noord from
http://odur.let.rug.nl/~vannoord/TextCat/

UTF-8 language models are ready for use with a patched libtextcat 2.2 with a
patch available in OpenOffice.org source tree: libtextcat/libtextcat-2.2.patch

ISO-8859-2 and ISO-8859-5 language models can be used with original TextCat and
libtextcat or compatible. Please note that UTF-8 language models will not work
with a non-patched libtextcat because it does not work with multibyte encodings.


Usage example:

$ ./libtextcat-2.2-patched/src/testtextcat 
TextCat 2.2 (out of place)
Zdravo svete
Result == [serbian-Latn][serbian-Latn-iso8859-2]
That took 0 ms.

goran@limun:~/Projects/dict-sr/textcat-sr$ ./libtextcat-2.2-patched/src/testtextcat 
TextCat 2.2 (out of place)
Здраво свете
Result == [serbian]
That took 0 ms.


License:

Serbian language models are released under disjunctive tri-licence GNU
LGPL version 2.1 or later / MPL version 1.1 or later / GNU GPL version 2 or
later giving you the choice of one of the three sets of free software licensing
terms. You can also use the under the terms of Creative Commons BY-SA 3.0
Unpored licence.

About

Serbian Cyrillic and Latin language models for libexttextcat, a free software n-gram based language guessing library