phonetized_ner_srv

Tiny Flask app for phonetization, NE tagging and text distance calculation.

Prerequisites

Python 3 and the PyPI packages flask, mordl, textdistance, toxine and transliterate.

Starting the Server

First, place the storages of the trained MorDL UposTagger, FeatsTagger and NeTagger into the srv/models directory. Then change the emb_path parameter in the ds_config.json file of every storage so that the path is correct. Note that relative paths there are resolved from ner_srv. Thus, if your embeddings are also placed in the srv/models directory, just add 'model/' to the beginning of each emb_path value.
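The emb_path adjustment can also be scripted. The following is a minimal sketch, assuming that each storage is a subdirectory of the models directory and that its ds_config.json is a JSON object with a top-level emb_path key. That layout and the helper name prefix_emb_paths are assumptions made for illustration, not something the project documents.

```python
import json
from pathlib import Path


def prefix_emb_paths(models_dir, prefix):
    """Prepend `prefix` to emb_path in every ds_config.json below models_dir.

    Assumes each config is a JSON object with a top-level "emb_path" key
    (hypothetical layout). Returns the list of files that were changed.
    """
    changed = []
    for cfg_file in sorted(Path(models_dir).glob('*/ds_config.json')):
        config = json.loads(cfg_file.read_text(encoding='utf-8'))
        emb_path = config.get('emb_path')
        # Skip configs without the key and paths that are already prefixed.
        if not isinstance(emb_path, str) or emb_path.startswith(prefix):
            continue
        config['emb_path'] = prefix + emb_path
        cfg_file.write_text(
            json.dumps(config, ensure_ascii=False, indent=4),
            encoding='utf-8',
        )
        changed.append(cfg_file)
    return changed
```

Run from the srv directory, prefix_emb_paths('models', 'model/') would apply the prefix mentioned above. Back up the configs first, since the files are rewritten in place.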

Second, go back to the srv directory and, if necessary, change the port in the main.py script.

After that, ensure that you're still in the srv directory and run

sh ./run.sh prod

Or, if you need debug mode, just run

sh ./run.sh

Usage

All services return data in JSON format.

http://<address>:<port>/api/tokenize/<text>

Returns Parsed CoNLL-U for tokenized text (untagged).

http://<address>:<port>/api/tag/<text>

Returns Parsed CoNLL-U with text tokenized and with UPOS, FEATS and MISC:NE fields filled.
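Because the text travels in the URL path, the client has to percent-encode it. A minimal client sketch using only the Python standard library is given here. The helper names service_url and call_service and the address in the usage note are illustrative assumptions, and how the server treats an escaped '/' inside the text has not been verified.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def service_url(base, service, text):
    """Build the URL for a one-text service such as 'tokenize' or 'tag'."""
    # Percent-encode the text (spaces, non-ASCII letters and '/' included).
    return f"{base.rstrip('/')}/api/{service}/{quote(text, safe='')}"


def call_service(base, service, text):
    """Send a GET request and decode the JSON body of the response."""
    with urlopen(service_url(base, service, text)) as response:
        return json.load(response)
```

With a server listening on, say, http://localhost:5000 (a hypothetical address), call_service('http://localhost:5000', 'tag', 'Москва') would request the tagged version of the text.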

http://<address>:<port>/api/phonetize/<text>?level=3&syllables=false

Returns phonetized version of text. Only texts in Russian are processed correctly.

level: the level of simplification. Allowed values:

0 makes no changes except for removing excess spaces;
1 removes all spaces;
2 the most standard version of phonetization;
3 refined phonetization;
4 rough phonetization;
5 even rougher phonetization.

Default level is 3.

syllables: if true, returns an array of syllables instead of the phonetized text. Default is false.
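The two query parameters can be assembled as in the sketch below. It only builds the request URL and checks the level range described above; the function name phonetize_url is an assumption made for illustration.

```python
from urllib.parse import quote, urlencode


def phonetize_url(base, text, level=3, syllables=False):
    """Build an api/phonetize URL with explicit level and syllables values."""
    if level not in range(6):
        raise ValueError('level must be an integer from 0 to 5')
    query = urlencode({
        'level': level,
        # The service description uses lowercase true/false.
        'syllables': 'true' if syllables else 'false',
    })
    return f"{base.rstrip('/')}/api/phonetize/{quote(text, safe='')}?{query}"
```

Passing both parameters explicitly keeps the request independent of the server-side defaults.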

http://<address>:<port>/api/text-distance/<text1>/<text2>?ner1=&ner2=&level=3&algorithm=damerau_levenshtein&normalize=true&qval=1

Returns the text distance between text1 and text2. Only texts in Russian are processed correctly.

ner1: if specified, text1 is first tokenized and tagged, and then replaced by the FORM fields of the tokens whose MISC:NE field equals ner1.

ner2: if specified, text2 is first tokenized and tagged, and then replaced by the FORM fields of the tokens whose MISC:NE field equals ner2.

level: before the distance is calculated, both text1 and text2 are phonetized at this level (see the api/phonetize service).

algorithm: the method used to calculate the distance. Allowed values are: hamming, levenshtein, damerau_levenshtein (default), jaro, jaro_winkler, gotoh, smith_waterman.

normalize: use normalized distance (default is true).

qval: the q-gram size used to split the texts; use 1 (default), which compares them character by character.
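To make the idea of a normalized distance concrete, here is a self-contained sketch. It is not the server's code: it implements plain Levenshtein distance rather than the default damerau_levenshtein, compares the strings character by character, and assumes that normalization means dividing by the length of the longer string. The function names are illustrative.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                   # deletion
                current[j - 1] + 1,                # insertion
                previous[j - 1] + (ch_a != ch_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Distance scaled to [0, 1] by the longer length (assumed convention)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

Under this convention, identical strings give 0.0 and strings with nothing in common give 1.0, which makes distances between texts of different lengths comparable.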

License

phonetized_ner_srv is released under the Apache License. See the LICENSE file for more details.
