A simple lemmatizer based on PoliMorf, a Polish morphological dictionary, built using OpenFst automata.
The lemmatizer contains several scripts:
- convert_data.rb - Ruby script that prepares data from PoliMorf in a specific format (read more in the Word format section)
- convert_fst.rb - Ruby script that converts the output of convert_data.rb into simple OpenFst automata
- get_symbol.rb - Ruby script that extracts all characters (the alphabet) from PoliMorf
- prepare_main_data.rb - Ruby script that prepares all data and saves it to the data/main_data file
- lemmatizer.rb - Ruby script that runs the lemmatizer (requires data/main_data)
- run.sh - simple Bash script for running the lemmatizer
- demo.sh - simple demo written in Bash (uses the example_input.txt file)
- run the command make (this builds main_data)
- run the lemmatizer: ruby lemmatizer.rb or ./lemmatizer.rb (wait a few seconds for the data to load; CTRL+D stops the lemmatizer)
--data <data_file_path> - path to the main data file; the default path is data/main_data
--only-output - print only the output (also for an unknown word, i.e. one not found in PoliMorf); the default printing format is:
> input_word -> found_words_separated_by_comma
- when the word was found in PoliMorf
- input_word
- when the word was not found
--ignore-unknown - ignore unknown words, i.e. words not found in PoliMorf (use with --only-output)
Word format: [input_word][+][how_char_delete][what_add]
[input_word]
- the input word
[+]
- a static character separating the input word from the rest
[how_char_delete]
- says how many characters to delete from the end of the input word; the mapping is described in the data/counter_map file, e.g. 0 deletes 0 characters, A deletes 1 character, B deletes 2 characters, and so on
[what_add]
- says what to append to the input word (after deleting characters), e.g. abc means append the characters abc
Example:
- komputerka+Bek = delete 2 characters from komputerka (B means 2) and add ek; the result is komputerek
- internetach+C = delete 3 characters from internetach (C means 3) and add nothing; the result is internet
- oprogramowanie+0 = delete nothing from oprogramowanie (0 means 0) and add nothing; the result is oprogramowanie
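The decoding described above can be sketched in a few lines of Ruby. This is only an illustration: the helper name apply_word_format is hypothetical, and the counter map below contains just the entries quoted in this README (the full mapping lives in data/counter_map).

```ruby
# Partial counter map, limited to entries quoted in this README.
# The complete mapping is stored in data/counter_map.
COUNTER_MAP = { "0" => 0, "A" => 1, "B" => 2, "C" => 3, "D" => 4 }.freeze

# Hypothetical helper: decodes "[input_word]+[how_char_delete][what_add]"
# and returns the base word.
def apply_word_format(encoded, counter_map = COUNTER_MAP)
  input_word, rule = encoded.split("+", 2)
  raise ArgumentError, "missing rule after '+'" if rule.nil? || rule.empty?

  delete_count = counter_map.fetch(rule[0]) # first rule character = how many to delete
  what_add = rule[1..-1]                    # the rest is appended ("" when absent)
  input_word[0, input_word.length - delete_count] + what_add
end

puts apply_word_format("komputerka+Bek")
puts apply_word_format("internetach+C")
puts apply_word_format("oprogramowanie+0")
```

The three calls reproduce the examples above and print komputerek, internet and oprogramowanie.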
You can use another morphological dictionary, but save it (in a PoliMorf-1.tab.gz archive file with UTF-8 encoding; the Makefile will choose the latest file, sorting by name) in the format:
[input_word][tabulation = \t][base_word]
example (remember the tab character):
komputerka komputerek
internetach internet
oprogramowanie oprogramowanie
...
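A line in this dictionary format can be turned into the word format by comparing the two words. The sketch below assumes the rule is derived from the longest common prefix, which is consistent with the three examples but is not confirmed to be what convert_data.rb does; encode_line and COUNT_TO_CHAR are hypothetical names, and the map only covers the entries quoted in this README.

```ruby
# Partial count => character map (only entries quoted in this README).
COUNT_TO_CHAR = { 0 => "0", 1 => "A", 2 => "B", 3 => "C", 4 => "D" }.freeze

# Hypothetical helper: converts "input_word\tbase_word" into
# "[input_word]+[how_char_delete][what_add]".
def encode_line(line, count_to_char = COUNT_TO_CHAR)
  input_word, base_word = line.chomp.split("\t", 2)
  raise ArgumentError, "expected two tab-separated fields" if base_word.nil?

  # Length of the longest common prefix of both words.
  shared = 0
  limit = [input_word.length, base_word.length].min
  shared += 1 while shared < limit && input_word[shared] == base_word[shared]

  delete_char = count_to_char.fetch(input_word.length - shared)
  "#{input_word}+#{delete_char}#{base_word[shared..-1]}"
end

puts encode_line("komputerka\tkomputerek")
puts encode_line("internetach\tinternet")
```

For the two sample lines this prints komputerka+Bek and internetach+C.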
Each line says how many characters to delete (read more in the Word format section, under [how_char_delete]); the file format is:
[mapped_char][tabulation = \t][how_char_delete]
example (remember the tab character):
0 0
1 A
2 B
3 C
4 D
...
45 w
46 x
47 y
48 z
Main data for the lemmatizer; it can also be used by a custom program. The file is divided into several sections:
- Section <MAP_SYMBOL> X - says how the characters of the input word are mapped (this depends on the input data, i.e. the morphological dictionary), where X is the number of lines to read
  - The next X lines are in the format [CHARACTER][SPACE][NUMBER], example:
    ' 0
    + 1
    - 2
    . 3
    0 4
    2 5
    A 6
    ...
    š 108
    ū 109
    Ź 110
    ź 111
    Ż 112
    ż 113
    ’ 114
- Section <MAP_COUNTER_CHAR> X - says how [how_char_delete] is mapped (read more in the Word format section, under [how_char_delete]), where X is the number of lines to read
  - The next X lines are in the format [NUMBER][SPACE][CHARACTER], the same content as data/counter_map, example:
    0 0
    1 A
    2 B
    3 C
    4 D
    ...
    45 w
    46 x
    47 y
    48 z
- Section <FINAL> X - says which states are accepting/final (read more in the algorithm section), where X is the number of lines to read
  - The next X lines are in the format [NUMBER], example:
    1787
    9304
    9427
    ...
    269837
    403196
    926817
- Section <MAP_STATE> X Y Z - says how a beginning state goes through a character (characters are mapped by the <MAP_SYMBOL> section) to the next state, where X is the maximum height of the matrix, Y is the maximum width of the matrix and Z is the number of lines to read (you can define an array STATES[X][Y] or use another data structure, e.g. a map/hashmap)
  - The next Z lines are in the format [NUMBER_BEGINNING_STATE][SPACE][NUMBER_CHARACTER][SPACE][NUMBER_NEXT_STATE], example:
    0 6 1
    0 7 2
    0 8 3
    ...
    509503 56 607763
    509503 87 607760
    509504 18 607764
    ...
    928979 54 928980
    928980 32 928981
    928981 89 1787
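For a custom program, the sectioned layout can be read with a small loop. The following is a minimal sketch, assuming each header looks exactly like <NAME> followed by space-separated numbers and that the last number is the line count, as described above. The helper read_sections and the tiny SAMPLE data are made up for illustration and are not taken from the real data/main_data.

```ruby
# Made-up miniature file in the layout described above.
SAMPLE = <<~DATA
  <MAP_SYMBOL> 3
  a 0
  + 1
  A 2
  <MAP_COUNTER_CHAR> 2
  0 0
  1 A
  <FINAL> 1
  3
  <MAP_STATE> 4 3 3
  0 0 1
  1 1 2
  2 2 3
DATA

# Hypothetical reader: returns { "NAME" => { header: [numbers], rows: [raw lines] } }.
def read_sections(text)
  lines = text.lines.map(&:chomp)
  sections = {}
  index = 0
  while index < lines.length
    header = lines[index].match(/\A<(\w+)>((?:\s+\d+)+)\z/)
    raise ArgumentError, "unexpected line: #{lines[index]}" unless header

    numbers = header[2].split.map(&:to_i)
    count = numbers.last # the last header number is the number of lines to read
    sections[header[1]] = { header: numbers, rows: lines[index + 1, count] }
    index += 1 + count
  end
  sections
end

sections = read_sections(SAMPLE)
puts sections.keys.join(", ")
puts sections["MAP_STATE"][:rows].length
```

Rows are kept as raw strings here so that each section can be parsed according to its own line format.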
Lemmatizer structures:
- map (MAP_SYMBOL) for the MAP_SYMBOL section (see point 1 of the data/main_data section), where for a line [CHARACTER][SPACE][NUMBER], [CHARACTER] is the key and [NUMBER] is the value; a reversed map (MAP_SYMBOL_REVERSED) is also needed, where [NUMBER] is the key and [CHARACTER] is the value
- map (MAP_COUNTER_CHAR) for the MAP_COUNTER_CHAR section (see point 2 of the data/main_data section), where for a line [NUMBER][SPACE][CHARACTER], [CHARACTER] is the key and [NUMBER] is the value
- array/set (FINAL) for the FINAL section (see point 3 of the data/main_data section), where for a line [NUMBER], [NUMBER] is an accepting/final state
- map/two-dimensional array/matrix (MAP_STATE) for the MAP_STATE section (see point 4 of the data/main_data section), where for a line [NUMBER_BEGINNING_STATE][SPACE][NUMBER_CHARACTER][SPACE][NUMBER_NEXT_STATE], [NUMBER_BEGINNING_STATE] is the index I and [NUMBER_CHARACTER] is the index J of the array STATES[I][J], and [NUMBER_NEXT_STATE] is the value stored there. Put simply: STATES[NUMBER_BEGINNING_STATE][NUMBER_CHARACTER] = [NUMBER_NEXT_STATE]
- important: an unknown value, i.e. one that was not in the file, should be stored as -1 (the default value is -1)
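One possible way to hold these structures in Ruby is shown below. It is a sketch rather than the layout used by lemmatizer.rb: the row constants are made-up sample lines, MAP_STATE is a hash keyed by [state, character] pairs instead of a two-dimensional array, and the -1 default comes from Hash.new(-1).

```ruby
require "set"

# Made-up sample rows, one array per section of data/main_data.
SYMBOL_ROWS  = ["a 0", "+ 1", "A 2"].freeze
COUNTER_ROWS = ["0 0", "1 A"].freeze
FINAL_ROWS   = ["3"].freeze
STATE_ROWS   = ["0 0 1", "1 1 2", "2 2 3"].freeze

# [CHARACTER][SPACE][NUMBER]: character => number, plus the reversed map.
MAP_SYMBOL = SYMBOL_ROWS.map do |row|
  char, number = row.split(" ")
  [char, number.to_i]
end.to_h
MAP_SYMBOL_REVERSED = MAP_SYMBOL.invert

# [NUMBER][SPACE][CHARACTER]: character => number of characters to delete.
MAP_COUNTER_CHAR = COUNTER_ROWS.map do |row|
  number, char = row.split(" ")
  [char, number.to_i]
end.to_h

# Accepting/final states.
FINAL = Set.new(FINAL_ROWS.map(&:to_i))

# [state, character] => next state; missing transitions read as -1.
MAP_STATE = Hash.new(-1)
STATE_ROWS.each do |row|
  from, character, to = row.split.map(&:to_i)
  MAP_STATE[[from, character]] = to
end

puts MAP_STATE[[0, 0]]
puts MAP_STATE[[9, 9]]
```

A hash with a default value avoids allocating the full X by Y matrix; the two lookups print 1 and -1.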
The lemmatizer uses a simple search algorithm:
- Set the beginning state to 0
- For each character of the input word
  - If the character is not in MAP_SYMBOL, return NOT_FOUND
  - Otherwise save the value of the character from MAP_SYMBOL into character_int
  - Go through MAP_STATE using state_beggining and character_int and save the result into state_ending: state_ending := MAP_STATE[state_beggining][character_int]
  - If state_ending does not exist (equals -1), return NOT_FOUND
  - Otherwise state_ending becomes the new beginning state: state_beggining := state_ending
- Save the value of the character + (plus) from MAP_SYMBOL into character_int
- Go through MAP_STATE using state_beggining and character_int and save the result into state_ending
- If state_ending does not exist (equals -1), return NOT_FOUND
- Otherwise state_beggining := state_ending
- For each character_int from MAP_SYMBOL, with local_state_beggining set to state_beggining
  - Go through MAP_STATE using local_state_beggining and character_int and save the result into local_state_ending
  - If local_state_ending does not exist (equals -1) or local_state_ending is in FINAL, go to the next character_int and ignore this iteration
  - Add the value of character_int from MAP_SYMBOL_REVERSED (a letter) to the array
  - Otherwise local_state_beggining := local_state_ending
  - Go deeper, repeating this whole step for the new local_state_beggining, to get the list of all letter combinations
- For each answer (element) of the array
  - Copy the input word into local_input_word
  - Remove characters from local_input_word based on the first element of the answer; use MAP_COUNTER_CHAR to map the first character to a number
  - Append the remaining elements of the answer to the shortened local_input_word
  - Add local_input_word to the returned list
- Return the returned list
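The search can be sketched end to end on toy data. The code below is one reading of the steps above, not the code from lemmatizer.rb: it builds a small trie from the three README examples as a stand-in for data/main_data, treats a path as an answer when it reaches a state in FINAL, and returns nil for NOT_FOUND. The names collect_answers and lemmatize are hypothetical, and the counter map is partial.

```ruby
require "set"

# Toy stand-in for data/main_data, built from the three README examples.
ENTRIES = ["komputerka+Bek", "internetach+C", "oprogramowanie+0"].freeze
# Partial counter map (character => number of characters to delete).
MAP_COUNTER_CHAR = { "0" => 0, "A" => 1, "B" => 2, "C" => 3 }.freeze

MAP_SYMBOL = {}
ENTRIES.join.each_char { |char| MAP_SYMBOL[char] ||= MAP_SYMBOL.size }
MAP_SYMBOL_REVERSED = MAP_SYMBOL.invert
MAP_STATE = Hash.new(-1) # [state, character_int] => next state, -1 if absent
FINAL = Set.new

# Build a trie: every new transition creates a fresh state number.
ENTRIES.each do |entry|
  state = 0
  entry.each_char do |char|
    key = [state, MAP_SYMBOL[char]]
    MAP_STATE[key] = MAP_STATE.size + 1 if MAP_STATE[key] == -1
    state = MAP_STATE[key]
  end
  FINAL << state
end

# Depth-first walk collecting every letter sequence that ends in a final state.
def collect_answers(state, prefix, answers)
  answers << prefix if FINAL.include?(state)
  MAP_SYMBOL_REVERSED.each do |character_int, char|
    next_state = MAP_STATE[[state, character_int]]
    collect_answers(next_state, prefix + char, answers) unless next_state == -1
  end
  answers
end

# Returns the list of base words, or nil when the word is not found.
def lemmatize(input_word)
  state = 0
  "#{input_word}+".each_char do |char|
    character_int = MAP_SYMBOL[char]
    return nil if character_int.nil? # character outside the alphabet
    state = MAP_STATE[[state, character_int]]
    return nil if state == -1        # no such transition
  end
  collect_answers(state, "", []).map do |answer|
    kept = input_word[0, input_word.length - MAP_COUNTER_CHAR.fetch(answer[0])]
    kept + answer[1..-1]
  end
end

p lemmatize("komputerka")
p lemmatize("komputer")
```

With this toy data the first call prints ["komputerek"] and the second prints nil, because komputer is only a prefix of a known word and has no + transition.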
- Ruby
- Bash
- Makefile
- ~8.5-12 GB RAM (for determinization of the OpenFst automata) - 8 GB RAM if the --ignore-with-prefix flag is used for convert_data.rb (words with the prefix naj or nie are ignored when input_word and base_word do not begin with the same characters, e.g. niemniejszący mniejszyć)
- OpenFst automata are used to show how to pack data into automata
- You can see how the size of the OpenFst automata file changes after determinization and minimization: change FST_PIPE so that the line reads FST_PIPE=0, run the command make data/fst_text.fst and check the file sizes in the data directory