rsennrich / SMORLemma

SMOR (Stuttgart Morphology) with alternative lemmatization component

Interset universal tagset mapping

lefterav opened this issue

There is a mapping from the SMOR POS tags to the STTS tagset (which the Interset tagset uses for German). The morphological tags are (mostly) the same, but you may have to reorder them. The mapping script is part of ParZu (ParZu/preprocessor/morphology/morphisto2prolog.py).

SMOR output:

echo "diesen" | fst-infl2 zmorge-20140521-smor_newlemma.ca
> diesen
diese<+DEM><Subst><NoGend><Dat><Pl><St>
diese<+DEM><Subst><Masc><Acc><Sg><St>
diese<+DEM><Attr><NoGend><Dat><Pl><St>
diese<+DEM><Attr><Masc><Acc><Sg><St>
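
If you want to work with the raw analyses directly, a minimal Python sketch along the following lines can split each line into a lemma and its tag list. It assumes the lemma carries no angle-bracket tags of its own, which holds for the sample above but may not hold for other SMOR variants. The name split_analysis is made up for this illustration and is not part of SMOR or ParZu.

```python
import re

# Matches one angle-bracket tag such as <+DEM> or <Masc> and captures its content.
TAG_RE = re.compile(r"<([^<>]+)>")

def split_analysis(analysis):
    """Split e.g. 'diese<+DEM><Subst><Masc><Acc><Sg><St>' into lemma and tags.

    Assumption: everything before the first '<' is the lemma.
    """
    analysis = analysis.strip()
    lemma = analysis.split("<", 1)[0]
    tags = TAG_RE.findall(analysis)
    return lemma, tags

if __name__ == "__main__":
    print(split_analysis("diese<+DEM><Attr><Masc><Acc><Sg><St>"))
```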

mapped output:

echo "diesen" | fst-infl2 zmorge-20140521-smor_newlemma.ca | python ParZu/preprocessor/morphology/morphisto2prolog.py 

gertwol('diesen','diese','PDS',['Masc','Acc','Sg'],'').
gertwol('diesen','diese','PDS',[_,'Dat','Pl'],'').
gertwol('diesen','diese','PDAT',['Masc','Acc','Sg'],'').
gertwol('diesen','diese','PDAT',[_,'Dat','Pl'],'').

Don't worry about the line-initial 'gertwol' and the Prolog-style representation; you should be able to get the information you want from this output. An underscore (_) stands for an unspecified value; map it to *.
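
As a rough sketch of how the mapped lines could be read back into plain values, the Python snippet below parses one gertwol(...) line and replaces _ with *. It assumes every line has exactly the five-argument shape shown in the sample and that word forms contain no quote characters. The name parse_gertwol is hypothetical and not part of ParZu.

```python
import re

# Matches lines such as: gertwol('diesen','diese','PDS',['Masc','Acc','Sg'],'').
# Assumption: five arguments, single-quoted atoms, no embedded quotes.
GERTWOL_RE = re.compile(
    r"^gertwol\('(?P<form>[^']*)','(?P<lemma>[^']*)','(?P<pos>[^']*)',"
    r"\[(?P<feats>[^\]]*)\],'(?P<extra>[^']*)'\)\.$"
)

def parse_gertwol(line):
    """Return (form, lemma, pos, features), or None if the line does not match."""
    match = GERTWOL_RE.match(line.strip())
    if match is None:
        return None
    features = []
    for item in match.group("feats").split(","):
        item = item.strip()
        if not item:
            continue  # empty feature list
        # A bare underscore is an unspecified value; map it to '*'.
        features.append("*" if item == "_" else item.strip("'"))
    return match.group("form"), match.group("lemma"), match.group("pos"), features

if __name__ == "__main__":
    print(parse_gertwol("gertwol('diesen','diese','PDS',[_,'Dat','Pl'],'')."))
```

The result is a tuple of word form, lemma, STTS tag and feature list, which should be enough to build whatever tag format you need downstream.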

There is no mapping to the Universal Dependencies tagset that I know of.

Thanks, this is very useful. Is there a script that would also work for the other direction, i.e. for generation of full word forms, given STTS tags?

Not that I know of.

FYI, Lingua::Interset 2.026 (https://metacpan.org/pod/Lingua::Interset::Tagset::DE::Smor) contains the new driver de::smor that can decode and encode SMOR/Zmorge tags. It should be possible to use it to convert to/from STTS and the UD tagset.