giellalt / lang-rus

Finite state and Constraint Grammar based analysers and proofing tools, and language resources for the Russian language

Home Page: https://giellalt.uit.no

hfst-compose-intersect in src/Makefile_L2 leads to HfstFatalException

reynoldsnlp opened this issue

It may be that the best way to solve this problem is to properly integrate the L2 makefile into the automake build (see #10). Maybe @snomos can help determine how difficult that will be.

In the root directory, running $ make && cd src && make -f Makefile_L2 -B throws an HfstFatalException. The problem seems to stem from the number of error tags in L2_ORTH_ERRS. I have tried various combinations to see whether some of the rules conflict with each other, but every small subset I have tried works without error (although I may simply not have tested the right combination yet). I also ran 12 different rotations of the 12 tags, and the build fails on the 10th tag every time.
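The rotation experiment can be reproduced with a small shell helper. This is only a sketch: the tag names are read off the src/orthography/L2_*.regex file names, `rotate_tags` and `run_rotations` are hypothetical helpers that are not part of the repository, and it assumes that `L2_ORTH_ERRS` can be overridden on the make command line, which I have not checked against Makefile_L2.

```shell
#!/bin/sh
# rotate_tags N TAG...: print the tag list rotated left by N positions.
rotate_tags() {
    n=$1; shift
    count=$#
    [ "$count" -eq 0 ] && return 0
    n=$((n % count))
    i=0
    head=""
    # Move the first N tags off the front and remember them.
    while [ "$i" -lt "$n" ]; do
        head="$head $1"
        shift
        i=$((i + 1))
    done
    # Remaining tags first, then the ones taken from the front.
    # Unquoted on purpose: tag names contain no spaces or glob characters.
    echo $* $head
}

# Hypothetical driver, meant to be run from src/ after a normal `make`.
# ASSUMPTION: Makefile_L2 honours L2_ORTH_ERRS given on the command line.
run_rotations() {
    tags="Akn e2je H2S i2j i2y Ikn j2i je2e NoSS sh2shch shch2sh y2i"
    r=0
    while [ "$r" -lt 12 ]; do
        order=$(rotate_tags "$r" $tags)
        echo "rotation $r: $order"
        make -f Makefile_L2 -B L2_ORTH_ERRS="$order" \
            || echo "rotation $r failed"
        r=$((r + 1))
    done
}
```

Calling `run_rotations` rebuilds once per rotation, so it is slow. If the failure always lands on the 10th tag of `order`, that points at the accumulated size of the transducer rather than at one rule.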

The regex files for L2_ORTH_ERRS are shown here (with comments and empty lines removed):

$ tail -n +1 src/orthography/L2_*.regex | grep -v ^# | grep -v ^$
==> src/orthography/L2_Akn.regex <==
а (<-) о ;
==> src/orthography/L2_e2je.regex <==
е (<-) э ;
==> src/orthography/L2_H2S.regex <==
ь (<-) ъ ;
==> src/orthography/L2_i2j.regex <==
й (<-) и ;
==> src/orthography/L2_i2y.regex <==
ы (<-) и ;
==> src/orthography/L2_Ikn.regex <==
и (<-) е ,
и (<-) я ;
==> src/orthography/L2_j2i.regex <==
и (<-) й ;
==> src/orthography/L2_je2e.regex <==
э (<-) е ;
==> src/orthography/L2_NoSS.regex <==
0 (<-) ь ;
==> src/orthography/L2_sh2shch.regex <==
щ (<-) ш ;
==> src/orthography/L2_shch2sh.regex <==
ш (<-) щ ;
==> src/orthography/L2_y2i.regex <==
и (<-) ы ;

The offending code is this loop in Makefile_L2. It appears that hfst-compose-intersect is outputting a bad transducer and hfst-disjunct is choking on it:

	for tag in $(L2_ORTH_ERRS) ; \
	do \
		echo "[ ? -> ... \"\+Err\/L2_$${tag}\" || _ .#. ]" > add-tag-err-L2_$${tag}.regex.tmp ; \
		hfst-regexp2fst  --format=foma --xerox-composition=ON -v  \
			-S add-tag-err-L2_$${tag}.regex.tmp -o add-tag-err-L2_$${tag}.hfst ; \
		printf "read regex @\"orthography/L2_$${tag}.compose.hfst\" \
			.o. @\"analyser-gt-desc.hfst\" \
			;\n \
			save stack err.orth.tmp.hfst\n \
			quit\n" | hfst-xfst -p -v --format=foma ; \
		hfst-subtract -F err.orth.tmp.hfst \
			      analyser-gt-desc-L2.tmp.hfst \
			      > err.uniq.tmp.hfst ; \
		hfst-compose-intersect -v -1 err.uniq.tmp.hfst \
		      -2 add-tag-err-L2_$${tag}.hfst \
		      -o err.tagged.tmp.hfst ; \
		hfst-disjunct -1 analyser-gt-desc-L2.tmp.hfst \
		      -2 err.tagged.tmp.hfst \
		      | hfst-determinize \
		      | hfst-minimize \
		      > err.tmp.hfst ; \
		mv err.tmp.hfst analyser-gt-desc-L2.tmp.hfst ; \
		echo "слово" | hfst-lookup analyser-gt-desc-L2.tmp.hfst ; \
		hfst-summarize --verbose analyser-gt-desc-L2.tmp.hfst ; \
	done
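To find out which intermediate file first goes bad, each temporary transducer could be validated right after it is written. The following is a sketch, not something from the repository: `check_fst` and `FST_CHECK_CMD` are hypothetical names, and it assumes that `hfst-summarize` exits with a non-zero status when it cannot read a file, which should be confirmed before relying on it.

```shell
#!/bin/sh
# Command used to validate a transducer file.
# ASSUMPTION: hfst-summarize returns non-zero for an invalid transducer;
# set FST_CHECK_CMD to something else if that turns out to be wrong.
FST_CHECK_CMD=${FST_CHECK_CMD:-hfst-summarize}

# check_fst FILE: succeed only if FILE exists, is non-empty,
# and is accepted by the validation command.
check_fst() {
    if [ ! -s "$1" ]; then
        echo "BAD (missing or empty): $1" >&2
        return 1
    fi
    if ! $FST_CHECK_CMD "$1" >/dev/null 2>&1; then
        echo "BAD (rejected by $FST_CHECK_CMD): $1" >&2
        return 1
    fi
    echo "ok: $1"
}
```

Inside the loop, something like `check_fst err.uniq.tmp.hfst && check_fst err.tagged.tmp.hfst || exit 1` placed before the hfst-disjunct call would show whether the output of hfst-compose-intersect is already unusable for the failing tag.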

The last relevant bit of output is the following:

Reading from add-tag-err-L2_sh2shch.regex.tmp, writing to add-tag-err-L2_sh2shch.hfst
Compiling expression #1
Using foma as output handler
Reading from standard input...
? bytes. 167693 states, 372271 arcs, ? paths
hfst[1]: hfst[1]: hfst[1]: .
hfst-subtract: warning: Warning: analyser-gt-desc-L2.tmp.hfst contains flag diacritics. The result of subtraction may be incorrect.
hfst-compose-intersect: warning:
Found output multi-char symbols ("+A") in
transducer in file err.uniq.tmp.hfst which are not found on the
input tapes of transducers in file add-tag-err-L2_sh2shch.hfst.
Reading from err.uniq.tmp.hfst and add-tag-err-L2_sh2shch.hfst, writing to err.tagged.tmp.hfst
Reading and minimizing rule xre(?)...
Reading lexicon... subtract(?stdin?, ?stdin?) read
Computing intersecting composition...
Storing result in err.tagged.tmp.hfst...
terminate called after throwing an instance of 'HfstFatalException'
hfst-determinize: Aborted (core dumped)
<stdin> is not a valid transducer file
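Because hfst-disjunct, hfst-determinize and hfst-minimize run in a single pipeline, the log does not make it obvious which stage throws first. One way to narrow it down is to run the stages one at a time through files. This is a sketch under assumptions: `run_step` and `disjunct_staged` are hypothetical helpers, `err.disj.tmp.hfst` and `err.det.tmp.hfst` are made-up file names, and the `-i`/`-o` options are the usual HFST input and output flags as I understand them.

```shell
#!/bin/sh
# run_step LABEL CMD [ARGS...]: run one stage and name it if it fails.
run_step() {
    label=$1; shift
    if "$@"; then
        echo "step ok: $label"
    else
        status=$?
        echo "step FAILED: $label (exit $status)" >&2
        return "$status"
    fi
}

# Hypothetical staged replacement for the disjunct | determinize | minimize
# pipeline in the loop; intermediate file names are invented for illustration.
disjunct_staged() {
    run_step disjunct hfst-disjunct -1 analyser-gt-desc-L2.tmp.hfst \
        -2 err.tagged.tmp.hfst -o err.disj.tmp.hfst &&
    run_step determinize hfst-determinize -i err.disj.tmp.hfst \
        -o err.det.tmp.hfst &&
    run_step minimize hfst-minimize -i err.det.tmp.hfst -o err.tmp.hfst
}
```

Running `disjunct_staged` in place of the pipeline for the failing tag stops at the first stage that returns an error and leaves the earlier intermediate files on disk for inspection.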