Lemma discrepancy for `cuenta`
AngledLuffa opened this issue
Quite a few instances of `cuenta` tagged as a NOUN are given the verb lemma, `contar`. For example, this sentence. Not sure if they should be retagged as VERB or relemmatized... my Spanish is not what it should be.
# sent_id = es-train-002-s409
# text = Al darse cuenta de que se trataba de una estrategia anunciada por sus opositores para mantener al público al tomarlo en serio, Kennedy declaró con franqueza: "No me estoy esforzando para ser vicepresidente, me estoy esforzando para presidente".
1 Al al ADP _ _ 2 mark _ _
2-3 darse _ _ _ _ _ _ _ _
2 dar dar VERB _ VerbForm=Inf 29 advcl _ _
3 se él PRON _ Case=Acc,Dat|Person=3|PrepCase=Npr|PronType=Prs|Reflex=Yes 2 expl:pv _ _
4 cuenta contar NOUN _ Number=Sing 2 obj _ _
5 de de ADP _ _ 8 case _ _
6 que que SCONJ _ _ 8 mark _ _
7 se él PRON _ Case=Acc,Dat|Person=3|PrepCase=Npr|PronType=Prs|Reflex=Yes 8 expl:pv _ _
8 trataba tratar VERB _ Mood=Ind|Number=Sing|Person=3|Tense=Imp|VerbForm=Fin 4 advcl _ _
...
This one is `NOUN` and the lemma should be `cuenta`, but we cannot say that this will be the correct solution in all instances.
cat *.conllu | udapy util.Eval node='if node.form.lower() == "cuenta": print(node.upos, node.lemma, node.feats)' | sort | uniq -c | sort -rn
90 VERB contar Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
31 NOUN cuenta Gender=Fem|Number=Sing
28 NOUN contar Gender=Fem|Number=Sing
15 NOUN contar Number=Sing
1 VERB cuenta Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
1 VERB contar Gender=Fem|Number=Sing|VerbForm=Fin
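For anyone without udapi installed, the same kind of tally can be approximated in plain Python. This is only a rough sketch, not the command used above: it assumes standard tab-separated CoNLL-U with ten columns, and the function name `count_analyses` and the `SAMPLE` string are made up for illustration.

```python
from collections import Counter


def count_analyses(conllu_text, form):
    """Count (UPOS, LEMMA, FEATS) triples for tokens whose FORM equals `form`."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # blank separators and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges such as 2-3 and empty nodes such as 1.1
        if cols[1].lower() == form:
            counts[(cols[3], cols[2], cols[5])] += 1
    return counts


# Hypothetical sample: a few tokens from es-train-002-s409, tabs restored.
SAMPLE = "\n".join([
    "# sent_id = es-train-002-s409",
    "2-3\tdarse\t_\t_\t_\t_\t_\t_\t_\t_",
    "2\tdar\tdar\tVERB\t_\tVerbForm=Inf\t29\tadvcl\t_\t_",
    "4\tcuenta\tcontar\tNOUN\t_\tNumber=Sing\t2\tobj\t_\t_",
])

for (upos, lemma, feats), n in count_analyses(SAMPLE, "cuenta").most_common():
    print(n, upos, lemma, feats)
```

On real data the text would come from reading the `.conllu` files rather than from an inline string.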
The lemmas and features in Spanish GSD have been assigned automatically by a model trained on AnCora; I don't think that the model considered the manual UPOS tags as input. Some words were later fixed using heuristic scripts or even arbitrary manual intervention, but many other lemma-related issues are to be expected in this treebank.
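Since the lemmas were predicted without regard to the manual UPOS, one way to locate candidates for review is to list every token where a given form, UPOS and lemma co-occur. The sketch below is an illustration only, not one of the heuristic scripts mentioned above; `find_mismatches` and `EXAMPLE` are hypothetical names, and it only reports tokens rather than changing them, because the right fix may differ from instance to instance.

```python
def find_mismatches(conllu_text, form, upos, suspect_lemma):
    """Return (sent_id, token_id) pairs where FORM, UPOS and LEMMA match the query."""
    hits = []
    sent_id = None
    prefix = "# sent_id = "
    for line in conllu_text.splitlines():
        if line.startswith(prefix):
            sent_id = line[len(prefix):]
            continue
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # comments, blank lines, multiword ranges, empty nodes
        if cols[1].lower() == form and cols[3] == upos and cols[2] == suspect_lemma:
            hits.append((sent_id, cols[0]))
    return hits


# Hypothetical sample built from the sentence quoted in this issue.
EXAMPLE = "\n".join([
    "# sent_id = es-train-002-s409",
    "2\tdar\tdar\tVERB\t_\tVerbForm=Inf\t29\tadvcl\t_\t_",
    "4\tcuenta\tcontar\tNOUN\t_\tNumber=Sing\t2\tobj\t_\t_",
])

for sent, tok in find_mismatches(EXAMPLE, "cuenta", "NOUN", "contar"):
    print(sent, tok)
```

The resulting list still needs a human look before any retagging or relemmatizing.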
Definitely. That's what I've been trying to do.
It was just the lemmas and features that were added this way. But it does not mean that the things that were annotated manually are always correct :-)
Do you need some help distinguishing / fixing this particular case, or is it something you can handle?
99 VERB contar Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
67 NOUN cuenta Gender=Fem|Number=Sing
The fix will appear in UD 2.14.