nytud / panmorph

Tagsets and description of Hungarian morphological analysers.

Tagsets

MSD (Morphosyntactic Description)

MSD provides harmonised lexical specifications for ten languages, including Hungarian.

The description of version 3.0 is available here.

Morphological information is represented as attribute-value pairs: each attribute is identified by its position in the tag, and each value is a single character. A hyphen marks an attribute that does not apply. Position 0 encodes the part of speech; the other positions encode further morphological attributes, such as person, number, and case.

For example, Vmis2s---y is the code of a main verb in the indicative mood, past tense, second person singular, definite conjugation (e.g. adtad).
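The positional encoding can be illustrated with a minimal Python sketch. The function name decode_msd is hypothetical and is not part of this repository; the sketch only splits a tag by position and does not know the attribute names, which are defined in the scheme documentation.

```python
def decode_msd(tag):
    """Split a positional MSD tag into its POS code and attribute values.

    Returns (pos, values): pos is the character at position 0, and values
    maps each later position (1, 2, ...) to its one-character value, or to
    None where the tag has a hyphen (attribute not applicable).
    """
    if not tag:
        raise ValueError("empty MSD tag")
    pos = tag[0]
    values = {
        position: (None if char == "-" else char)
        for position, char in enumerate(tag[1:], start=1)
    }
    return pos, values


pos, values = decode_msd("Vmis2s---y")
print(pos)        # V
print(values[1])  # m
print(values[6])  # None
```

For the verb form adtad, positions 6 to 8 come out as None because the tag has hyphens there.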

This tagset was used in Szeged Treebank 1.0 and 2.0, and it was also the output formalism of versions 1.0 and 2.0 of magyarlanc, a toolkit for linguistic processing of Hungarian. Later, a harmonized MSD-KR tagset was developed, which is a slightly modified version of the original MSD. This tagset is used in Szeged Corpus and Treebank 2.5 and in magyarlanc 2.0. Here, MSD refers to the latter version.

The tagset is available in two formats:

  • msd.tsv: the possible tags of the MSD scheme. A wordlist was extracted from two sources:
      1. the 100000 most frequent words of Webcorpus
      2. the tokens of Szeged Treebank
    These words were morphologically analyzed with magyarlanc 2.0; the list contains only the tags.
  • msd.pdf: documentation of the scheme, with a detailed description of the possible values assigned to each POS tag. The documentation includes co-occurrence matrices.

CoNLL

CoNLL is not a tagset or an annotation scheme but a file format: that of the CoNLL-2009 shared task Syntactic and Semantic Dependencies in Multiple Languages. The original MSD codes are converted into a linearized format of attribute-value pairs. The code at position 0 is split off as the POS tag, and the other morphosyntactic attributes follow in linear order according to their MSD positions. Non-applicable attributes have the value 'none'. For example, the code for the Hungarian verb form adtad mentioned above is: V SubPOS=m|Mood=i|Tense=s|Per=2|Num=s|Def=y
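The linearization described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the converter used for the treebank: the function msd_to_conll is hypothetical, and the position-to-name mapping is read off the single adtad example by aligning the MSD code with its CoNLL form, so it covers verbs only and leaves positions 6 to 8 unmapped.

```python
# Assumed mapping for verbs, inferred from the one example in the text
# (Vmis2s---y <-> SubPOS=m|Mood=i|Tense=s|Per=2|Num=s|Def=y). It is not
# taken from the scheme documentation; positions 6-8 are left unmapped.
VERB_ATTRIBUTES = {1: "SubPOS", 2: "Mood", 3: "Tense", 4: "Per", 5: "Num", 9: "Def"}


def msd_to_conll(tag, attributes):
    """Convert a positional MSD tag into (POS, 'Name=value|...')."""
    pos = tag[0]  # position 0 is separated as the POS tag
    pairs = []
    for position in sorted(attributes):  # keep the MSD position order
        # A tag shorter than the mapping is treated as non-applicable there.
        value = tag[position] if position < len(tag) else "-"
        pairs.append(f"{attributes[position]}={'none' if value == '-' else value}")
    return pos, "|".join(pairs)


pos, feats = msd_to_conll("Vmis2s---y", VERB_ATTRIBUTES)
print(pos, feats)  # V SubPOS=m|Mood=i|Tense=s|Per=2|Num=s|Def=y
```

A hyphen at a mapped position is written out as 'none', following the rule that non-applicable attributes receive that value.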

The tagset is available in two formats:

  • conll.tsv: a sorted list of the tags in Szeged Treebank
  • conll.pdf: documentation of the scheme, with a detailed description of the possible values assigned to each POS tag.

UD

Universal Dependencies (UD) is a framework for cross-linguistically consistent grammatical annotation for over 70 languages, including Hungarian. The morphological specification of a word in the UD scheme consists of three levels of representation: the lemma, the part-of-speech (POS) tag, and feature-value pairs representing morphosyntactic properties of the word. The features are given in a linearized format, in alphabetical order. Every feature has the form Name=Value, and features are separated by a vertical bar, as in Case=Nom|Number=Sing. Non-applicable features must not be present in the list of feature-value pairs. The UD framework focuses on syntax, so its morphological representation encodes only those phenomena that matter for syntax, which are typically the inflectional ones.
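As a minimal sketch of the feature format just described, the following Python functions serialise and parse such a list. The names format_feats and parse_feats are hypothetical helpers, not part of any UD tooling, and sorting the feature names case-insensitively is an assumption of this sketch.

```python
def format_feats(features):
    """Serialise a {name: value} dict as 'Name=Value|Name=Value'.

    Features whose value is None are treated as non-applicable and omitted,
    since they must not appear in the list.
    """
    present = {name: value for name, value in features.items() if value is not None}
    # Alphabetical order by feature name (case-insensitive: an assumption).
    names = sorted(present, key=str.lower)
    return "|".join(f"{name}={present[name]}" for name in names)


def parse_feats(text):
    """Parse 'Name=Value|Name=Value' back into a dict."""
    return dict(pair.split("=", 1) for pair in text.split("|") if pair)


print(format_feats({"Number": "Sing", "Case": "Nom", "Tense": None}))
# Case=Nom|Number=Sing
print(parse_feats("Case=Nom|Number=Sing"))
# {'Case': 'Nom', 'Number': 'Sing'}
```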

At the time of writing, the second version of UD is out. However, neither a UD2 specification for Hungarian nor a conversion of the Hungarian UD1 resources is available yet. Therefore, 'UD' here refers to UD1. This document is based on the documentation of UD1 as applied to Hungarian. The Szeged Dependency Treebank has a version converted to the UD format, and magyarlanc 3.0 also outputs UD morphological annotation.

The tagset is available in two formats:

  • ud.tsv: the possible tags of the UD scheme. A wordlist was extracted from two sources:
      1. the 100000 most frequent words of Webcorpus
      2. the tokens of Szeged Treebank
    These words were morphologically analyzed with magyarlanc 2.0; the list contains only the tags.
  • ud.pdf: documentation of the scheme, with a detailed description of the possible values assigned to each POS tag. The documentation includes co-occurrence matrices.

emMorph

The webpage of e-magyar, a toolchain for processing Hungarian, lists all possible tags of emMorph. That list is not complete; two further possible tags were found:

tag | description | example | analysis
[_VAdjz:nivaló/Adj] | nominalizer suffix ('to be ...-d') -nivaló > adjective | imádnivaló | imád[/V]nivaló[_VAdjz:nivaló/Adj][Nom]
[Inl] | locative case suffix -Ott/-t | Győrött | Győr[/N]ött[Inl]

The locative case suffix tag is given incorrectly on the website: the correct tag is [Inl], not [Loc].
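The analysis strings shown above interleave surface segments with bracketed tags. A minimal Python sketch for splitting them is given below; split_analysis is a hypothetical helper written for the two examples in this section, and it assumes that tags never contain nested square brackets.

```python
import re

# One segment = optional surface text followed by one bracketed tag.
# Assumption: tags contain no nested square brackets.
SEGMENT = re.compile(r"([^\[\]]*)\[([^\[\]]+)\]")


def split_analysis(analysis):
    """Return a list of (surface, tag) pairs from an emMorph analysis string."""
    return SEGMENT.findall(analysis)


for analysis in ("imád[/V]nivaló[_VAdjz:nivaló/Adj][Nom]", "Győr[/N]ött[Inl]"):
    segments = split_analysis(analysis)
    surface = "".join(morph for morph, _ in segments)
    print(surface, segments)
```

A tag with no surface material of its own, such as [Nom] in the first example, is returned with an empty surface string, so joining the surface parts restores the word form.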

The tagset is available in one format:

  • emmorph.tsv: the possible tags of the emMorph scheme. A wordlist was extracted from two sources:
  1. the 100000 most frequent words of Webcorpus
  2. tokens of Szeged Treebank

Morphologically analyzed words of Hungarian

We analyzed the 100000 most frequent words of Webcorpus with emMorph and with two versions of magyarlanc (magyarlanc 2.0 and magyarlanc 3.0). Because of morphological ambiguity, a word may receive multiple analyses. The JSON file (webcorpus_alltags.json) contains the morphological tags that these tools assigned to each word of the wordlist.

No manual corrections were carried out on the tags, so the list may contain errors. The analyses were carried out in November 2018.

Converters

  • emmorph2msd is available here.
  • emmorph2conll is available here.
  • emmorph2ud is available here.

Citation

If you use this tool or any part of its documentation, please cite:

Vadász, Noémi; Simon, Eszter: Konverterek magyar morfológiai címkekészletek között. In: Berend, Gábor; Gosztolya, Gábor; Vincze, Veronika (eds.) XV. Magyar Számítógépes Nyelvészeti Konferencia. Szeged, Hungary: Szegedi Tudományegyetem, Informatikai Intézet (2019), pp. 99-111.

License: Creative Commons Attribution Share Alike 4.0 International