rsennrich / SMORLemma

SMOR (Stuttgart Morphology) with alternative lemmatization component

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

SMORLemma - An alternative lemmatization for the Stuttgart Morphological Analyzer

This project contains a modified version of SMOR, the Stuttgart Morphological Analyzer: http://www.cis.uni-muenchen.de/~schmid/tools/SMOR/

Compiled morphology tools based on the Zmorge lexicon and this grammar can be downloaded from http://kitt.ifi.uzh.ch/kitt/zmorge/

the master branch contains the following modifications:

  • the grammar and lexicon are converted to UTF-8

  • the grammar contains minor additions and fixes

  • the folder lexicon contains lists of irregular words, some closes-class words, and affixes. it is also set-up to compile a lexicon in the SMOR format from a user-provided lexicon in Morphisto's XML format (such as Zmorge).

the branch lemmatiser contains the following additional modifications:

  • the output is no longer a derivational analysis, but defines the following as the base form:

    • nouns: Nom. Sg. (or Nom. Pl. for plural-only nouns)
    • verbs: infinitive
    • adjectives: Pos. Adv./Pred.
  • morpheme boundaries are still explicity marked, but using different labels:

    • : marks hyphenation (same as original SMOR)

    • <#>: marks compound boundary

    • <->: marks joining element (Fugenelement) in compounds

    • <~>: marks other morpheme boundary

    • original SMOR:

    Vermittlungsgespräche
    vermitteln<V>ung<NN><SUFF>Gespräch<+NN><Neut><Nom><Pl>
    
    • modified SMOR:
    Vermittlungsgespräche
    Vermittlung<->s<#>gespräch<+NN><Neut><Nom><Pl>
    

The main purpose of the master branch is to remain compatible with the original SMOR as far as the output format is concerned, and to make changes that could be adopted upstream clearly visible. Any bug fixes that apply to the unmodified SMOR should be committed in the master branch.

DEPENDENCIES

To compile and use morphological transducers, the following software is required

-Linux (32 and 64 bit)
-sfst
-autotools
-xsltproc
-Python 2.6

INSTALLATION

  1. install all dependencies. In Ubuntu Linux, all are available in the repositories: sudo apt-get install build-essential xsltproc sfst

  2. checkout the lemmatiser branch (unless you want SMOR's original output format) git checkout lemmatiser

  3. copy the lexicon obtained with Zmorge (or any lexicon in Morphisto's XML format) to lexicon/wiki-lexicon.xml.

  4. execute make in the main directory

  5. the final transducer is smor.a (and smor.ca for the compacted version); the 'lemmatiser' branch also has a transducer smor-full.a, which accepts some spelling variations (capitalization, ae for ä, ss for ß).

EXAMPLE COMMANDS

After the transducer is compiled, you can use it with the sfst tools:

for an interactive session, type:

fst-mor smor.a

for batch processing, the compact transducer format *.ca is recommended:

echo "Hochzeit" | fst-infl2 smor.ca

LICENSE

SMOR is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License (see LICENSE).

PUBLICATIONS

SMOR is described in:

Helmut Schmid, Arne Fitschen, and Ulrich Heid. 2004: A German computational morphology covering derivation, composition, and inflection. In Proceedings of the IVth International Conference on Language Resources and Evaluation (LREC 2004), pages 1263–1266.

the alternative lemmatisation is described in:

Rico Sennrich and Beat Kunz. 2014: Zmorge: A German Morphological Lexicon Extracted from Wiktionary. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014).

About

SMOR (Stuttgart Morphology) with alternative lemmatization component

License:GNU General Public License v2.0


Languages

Language:Lex 56.5%Language:F* 41.2%Language:Python 1.3%Language:Makefile 0.6%Language:XSLT 0.4%