UniversalDependencies / UD_Spanish-GSD

Lemma discrepancy for `cuenta`

AngledLuffa opened this issue

Quite a few instances of `cuenta` tagged as NOUN are given the verb lemma, `contar`, as in the sentence below. I'm not sure whether they should be retagged as VERB or relemmatized... my Spanish is not what it should be.

# sent_id = es-train-002-s409
# text = Al darse cuenta de que se trataba de una estrategia anunciada por sus opositores para mantener al público al tomarlo en serio, Kennedy declaró con franqueza: "No me estoy esforzando para ser vicepresidente, me estoy esforzando para presidente".
1       Al      al      ADP     _       _       2       mark    _       _
2-3     darse   _       _       _       _       _       _       _       _
2       dar     dar     VERB    _       VerbForm=Inf    29      advcl   _       _
3       se      él      PRON    _       Case=Acc,Dat|Person=3|PrepCase=Npr|PronType=Prs|Reflex=Yes      2       expl:pv _       _
4       cuenta  contar  NOUN    _       Number=Sing     2       obj     _       _
5       de      de      ADP     _       _       8       case    _       _
6       que     que     SCONJ   _       _       8       mark    _       _
7       se      él      PRON    _       Case=Acc,Dat|Person=3|PrepCase=Npr|PronType=Prs|Reflex=Yes      8       expl:pv _       _
8       trataba tratar  VERB    _       Mood=Ind|Number=Sing|Person=3|Tense=Imp|VerbForm=Fin    4       advcl   _       _
...

This one is a NOUN and its lemma should be `cuenta`, but we cannot assume that relemmatizing is the correct fix in every instance.

cat *.conllu | udapy util.Eval node='if node.form.lower() == "cuenta": print(node.upos, node.lemma, node.feats)' | sort | uniq -c | sort -rn
     90 VERB contar Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
     31 NOUN cuenta Gender=Fem|Number=Sing
     28 NOUN contar Gender=Fem|Number=Sing
     15 NOUN contar Number=Sing
      1 VERB cuenta Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
      1 VERB contar Gender=Fem|Number=Sing|VerbForm=Fin
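For anyone without udapy installed, roughly the same tally can be sketched in plain Python. This is only an approximation of the one-liner above, not the command that was actually run: the function name `tally_form` and the sample rows are made up for illustration, and it assumes standard tab-separated 10-column CoNLL-U input.

```python
from collections import Counter

def tally_form(conllu_lines, form="cuenta"):
    """Count (UPOS, LEMMA, FEATS) triples for tokens whose FORM matches `form`.

    Hypothetical helper; assumes tab-separated 10-column CoNLL-U lines.
    """
    counts = Counter()
    for line in conllu_lines:
        line = line.rstrip("\n")
        # Skip blank sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        tok_id, tok_form, lemma, upos, _xpos, feats = cols[:6]
        # Skip multiword token ranges (e.g. 2-3) and empty nodes (e.g. 5.1).
        if not tok_id.isdigit():
            continue
        if tok_form.lower() == form:
            counts[(upos, lemma, feats)] += 1
    return counts

# Made-up sample rows, for illustration only.
sample = [
    "# sent_id = demo",
    "1\tcuenta\tcontar\tNOUN\t_\tNumber=Sing\t0\troot\t_\t_",
    "2\tCuenta\tcontar\tVERB\t_\tVerbForm=Fin\t1\tdep\t_\t_",
    "3\tde\tde\tADP\t_\t_\t1\tcase\t_\t_",
]

for (upos, lemma, feats), n in tally_form(sample).most_common():
    print(n, upos, lemma, feats)
```

To run it over the treebank, pass an open file (or a chain of open files) instead of `sample`.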

The lemmas and features in Spanish GSD have been assigned automatically by a model trained on AnCora; I don't think that the model considered the manual UPOS tags as input. Some words were later fixed using heuristic scripts or even arbitrary manual intervention, but many other lemma-related issues are to be expected in this treebank.
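Since the lemmas may disagree with the manual UPOS tags, one cautious option is to list the mismatches for manual review rather than rewrite them automatically. The sketch below is hypothetical and is not one of the heuristic scripts mentioned above: the function name `flag_suspect_lemmas` and its parameters are invented here, and it assumes tab-separated CoNLL-U input where a NOUN `cuenta` should have the lemma `cuenta` and a VERB `cuenta` should have `contar`.

```python
def flag_suspect_lemmas(conllu_lines, form="cuenta",
                        noun_lemma="cuenta", verb_lemma="contar"):
    """Return (sent_id, token_id, upos, lemma) for tokens whose lemma
    disagrees with the lemma expected for their UPOS tag.

    Hypothetical helper: it only reports candidates and changes nothing.
    """
    expected = {"NOUN": noun_lemma, "VERB": verb_lemma}
    prefix = "# sent_id = "
    sent_id = None
    suspects = []
    for line in conllu_lines:
        line = line.rstrip("\n")
        if line.startswith(prefix):
            sent_id = line[len(prefix):]
            continue
        # Skip blank lines and other comments.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Keep only regular word lines (no multiword ranges or empty nodes).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        tok_id, tok_form, lemma, upos = cols[:4]
        if tok_form.lower() != form:
            continue
        if upos in expected and lemma != expected[upos]:
            suspects.append((sent_id, tok_id, upos, lemma))
    return suspects

# Token 4 from the sentence quoted in this issue, with tabs restored.
example = [
    "# sent_id = es-train-002-s409",
    "4\tcuenta\tcontar\tNOUN\t_\tNumber=Sing\t2\tobj\t_\t_",
]
print(flag_suspect_lemmas(example))
```

Each reported token still needs a human decision on whether to retag it as VERB or relemmatize it as `cuenta`.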

Definitely. That's what I've been trying to do.

It was just the lemmas and features that were added this way. But it does not mean that the things that were annotated manually are always correct :-)

Do you need some help distinguishing / fixing this particular case, or is it something you can handle?

     99 VERB contar Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
     67 NOUN cuenta Gender=Fem|Number=Sing

The fix will appear in UD 2.14.