mhulden / foma

Automatically exported from code.google.com/p/foma

Reading a full forms lexicon

arademaker opened this issue

The words command produces all pairs of upper/lower words. Is there a command to read a file with those pairs and produce an FST from them?

You can use read spaced-text for that; however, the required format is a little different. Symbols must be separated by spaces, and each pair takes two consecutive lines, the input string followed by the output string, with a blank line between pairs. Example:

c a t
g a t o

d o g
p e r r o

produces a transducer that maps cat to gato and dog to perro.

Thank you, that can surely help us build a morphological analyzer from our full-forms Portuguese lexicon at https://github.com/LR-POR/MorphoBr/. But, of course, such a transducer is not the perfect solution, since it captures neither the rules of the morphology nor the position classes and their respective morphemes.

a l e t o l o g i n h a s	 
a l e t o l o g i a +N +DIM +F +PL
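
The conversion from lexicon entries to this layout can be sketched in Python. This is only an illustration, not part of foma or MorphoBr: the function names `spaced_pair` and `convert` are made up here, and the sketch assumes each entry is a single tab-separated line of the form `fracota	fracote+N+F+SG`, that `+` occurs only as a tag separator, and that every character of the form and lemma is one symbol (precomposed accented letters).

```python
def spaced_pair(form: str, analysis: str) -> str:
    """Return the two-line spaced-text block for one lexicon entry.

    Assumption: `analysis` looks like 'lemma+TAG1+TAG2' and '+' never
    occurs inside the lemma itself.
    """
    lemma, *tags = analysis.split("+")
    # First line: the surface form, one character per symbol.
    first = " ".join(form)
    # Second line: lemma characters, then each tag kept whole as a
    # multi-character symbol such as '+N' or '+DIM'.
    second = " ".join(list(lemma) + ["+" + tag for tag in tags])
    return first + "\n" + second


def convert(lines) -> str:
    """Convert tab-separated 'form<TAB>analysis' lines to spaced-text."""
    blocks = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines in the input
        form, analysis = line.split("\t")
        blocks.append(spaced_pair(form, analysis))
    # Pairs are separated by one blank line.
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(convert(["aletologinhas\taletologia+N+DIM+F+PL\n"]), end="")
```

Keeping each tag as one token is what makes `+N` a single symbol in the resulting transducer rather than the two symbols `+` and `N`.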

Hi @mhulden,

foma[0]: read spaced-text all.foma
Stack full!

I got a "Stack full!" error while reading a file with 8,027,574 lines. Is there any alternative? Can I increase the stack size? The file was created according to the instructions above:

% head all.foma
a
a +N +M +SG

a s
a +N +M +PL

a z i n h o
a +N +DIM +M +SG

I was able to compile the spaced-text files

% ll -h *.sp
-rw-r--r--  1 ar  staff    32M Mar 20 16:25 adjectives.sp
-rw-r--r--  1 ar  staff   1.4M Mar 20 16:25 adverbs.sp
-rw-r--r--  1 ar  staff    31M Mar 20 16:25 nouns.sp
-rw-r--r--  1 ar  staff   150M Mar 20 16:25 verbs.sp

with the foma script

% cat compile-m.foma
!Copyright (C) 2023 Alexandre Rademaker

read spaced-text nouns.sp
define nouns ;
clear stack

read spaced-text verbs.sp
define verbs ;
clear stack

read spaced-text adjectives.sp
define adjs ;
clear stack

read spaced-text adverbs.sp
define advs ;
clear stack

save defined morphobr.bin

after changing the value at https://github.com/mhulden/foma/blob/master/foma/int_stack.c#L22 to 5097152. Does that make sense?

The only strange behaviour I got is that the adjective analyses are not returned:

% echo "fracota" | flookup -a -i morphobr.bin
fracota	fracote+N+F+SG

ar@tenis morpho-br % rg fracota
nouns/nouns-f.dict
16878:fracota	fracote+N+F+SG
16879:fracotas	fracote+N+F+PL
16880:fracotazinha	fracote+N+DIM+F+SG
16881:fracotazinhas	fracote+N+DIM+F+PL

adjectives/adjectives-f.dict
16046:fracota	fracote+A+F+SG
16047:fracotas	fracote+A+F+PL
16048:fracotazinha	fracote+A+DIM+F+SG
16049:fracotazinhas	fracote+A+DIM+F+PL

Any idea?

Consider doing this instead of save defined:

regex  nouns | verbs | adjs | advs;
save stack morphobr.bin

(save defined saves several FSTs and flookup only loads one. With the above, you should get a single FST on the stack and save that.)
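
The effect can be pictured with a toy Python model in which plain dictionaries stand in for the transducers. This is not how foma or flookup work internally; the names `lookup` and `union` are made up for the illustration, and which of the saved networks flookup actually loads is an assumption here (the first one).

```python
def lookup(network, word):
    """Return all analyses of `word` in one toy 'network' (a dict)."""
    return sorted(network.get(word, []))


def union(*networks):
    """Toy counterpart of `regex nouns | verbs | adjs | advs;`:
    merge the word-to-analyses mappings of all networks into one."""
    merged = {}
    for net in networks:
        for word, analyses in net.items():
            merged.setdefault(word, set()).update(analyses)
    return merged


# Entries taken from the rg output in this thread.
nouns = {"fracota": ["fracote+N+F+SG"]}
adjs = {"fracota": ["fracote+A+F+SG"]}

# 'save defined': several separate networks end up in one file.
saved_defined = [nouns, adjs]
# Assumption: only one of them (here the first) is consulted.
first_only = lookup(saved_defined[0], "fracota")

# 'save stack' after the union: one network holding every analysis.
combined = lookup(union(nouns, adjs), "fracota")

print(first_only)
print(combined)
```

In this model, looking the word up in a single network returns only the noun reading, while the merged network returns both the noun and the adjective reading.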

Thanks, it worked. The strange thing is that I tested it with a word that is ambiguous between noun and verb, and that worked. Maybe the problem is that, without explicitly combining the FSTs by disjunction, we ended up with an FST with multiple start states, and flookup tried only one?! But I was using the -a flag!

Anyway, the explicit disjunction to combine the FSTs worked fine!