aryamanarora / carmls-hi

Hindi SNACS (Semantic Network of Adposition and Case Supersenses; Schneider et al., 2018) annotation scheme and guidelines.

Validator issues

aryamanarora opened this issue · comments

1_9_5: "छह साल का" Characteristic~Characteristic

15_35 redo sentence

73,727 errors. Summary below. Attached detailed errors in text file. Working to resolve.

| Error | Count |
| --- | --- |
| the full tag at the end of the line is inconsistent with the rest of the line | 16882 |
| MWE lemma is incorrect | 14874 |
| Single word expression lemma doesn't match token lemma | 14602 |
| invalid lexcat | 13585 |
| SWE token must have lexcat. | 13585 |
| Invalid supersense(s) in lexical entry | 107 |
| single-word expression .. has lexcat .. which is incompatible with its upos | 68 |
| lexlemma appears incorrect for smwe | 16 |
| unexpected construal: p.PartPortion ~> p.Characteristic | 4 |
| Token is the beginning of a SMWE, but lexlemma doesn't appear to have multiple tokens in it. | 2 |
| unexpected construal: p.Characteristic ~> p.QuantityValue | 1 |
| unexpected construal: p.Gestalt ~> p.Whole | 1 |
| Total | 73727 |
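
A summary like the one above can be tallied from the detailed error file with a few lines of Python. This is only a sketch: it assumes one message per line in the form `<sentence id>: <message>`, which may not match the real layout of error.txt, and `tally_errors` is a hypothetical helper name.

```python
from collections import Counter

def tally_errors(lines):
    """Count validator messages by error text.

    Assumes (hypothetically) one message per line, written as
    '<sentence id>: <message>'. Lines without that separator are
    counted under their full text.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Drop the sentence id prefix and keep the message text.
        _, _, message = line.partition(": ")
        counts[message or line] += 1
    return counts

# Made-up sample lines, for illustration only.
sample = [
    "lp_hi_1_1: invalid lexcat",
    "lp_hi_1_2: invalid lexcat",
    "lp_hi_1_3: MWE lemma is incorrect",
]
for message, count in tally_errors(sample).most_common():
    print(f"{message},{count}")
```

Messages that embed a token or tag (for example the construal errors) would need normalising before counting if they are meant to collapse into a single row.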

error.txt

All errors of type 'Single word expression lemma doesn't match token lemma' are cleared. Remaining 45,767 errors to clean.

Decisions taken:

  1. Irregular pronouns, for all adpositions, were put in the LEXLEMMA column and get the SNACS labels.
  2. Regular pronouns where the adposition is obvious were split into two tokens, the second one being the adposition. The adposition gets the SNACS label.
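
A minimal sketch of the second decision, assuming a small hand-written list of oblique stems and postpositions. Both lists and the function name `split_pronoun` are illustrative and are not the project's actual inventory or code.

```python
# Hypothetical lists for illustration; not the corpus's real inventory.
OBLIQUE_STEMS = {"उस", "इस", "किस", "जिस", "उन", "इन"}
POSTPOSITIONS = ("को", "से", "में", "पर", "का", "की", "के", "ने")

def split_pronoun(form):
    """Return [stem, adposition] for a regular pronoun, else [form]."""
    for post in POSTPOSITIONS:
        stem = form[: -len(post)]
        # Require a known oblique stem so that irregular forms such as
        # उसे (which happens to end in से) are not split by accident.
        if form.endswith(post) and stem in OBLIQUE_STEMS:
            return [stem, post]
    # Irregular forms (e.g. मुझे, मेरा) stay as a single token.
    return [form]

print(split_pronoun("उसको"))
print(split_pronoun("मेरा"))
```

Checking the stem against a closed list, rather than matching on the suffix alone, is what keeps the irregular forms together as one token.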

Slowly resolving 'Invalid supersense(s) in lexical entry', but there are some general issues to discuss. Roughly 17.5K errors to resolve now.

  1. Some pronouns are irregular, and it's not straightforward to extract the adposition from the irregular pronoun token. This includes irregular genitive, dative, and accusative forms. I could find no theory on whether the irregular genitive pronoun really separates into the oblique pronoun form and the genitive postposition, though there is some theory supporting this split for irregular accusatives / datives. For now, the irregular pronoun (and therefore the PRON lexcat) has been endowed with preposition supersenses for Hindi, so that irregular pronouns across multiple cases can be annotated with SNACS labels.

  2. Some PARTicles (to, bhi, hi, saa) are annotated with the FOCUS label. These may be revisited in v2.7 guidelines, so for now the PART lexcat (newly created for Hindi) has also been endowed with preposition supersenses. The particle 'saa' was annotated by us with non-focus supersenses; I'm now wondering if this was an error, as this particle is not typically in a governor-object construction (it loosely translates to the suffix -like and attaches to nouns to make them adjectives).

  3. Some tokens marked SCONJ are given preposition supersenses: exceptions were created for these by marking lexcat = P. The tokens are 'तो','जैसे','ताकि'. तो will be revisited in v2.7 as it is a FOCUS marker; other two should be revisited and supersense label potentially removed.

  4. Some tokens marked ADV are given preposition supersenses; exceptions were created by marking lexcat = P. These tokens are: जैसे,सबसे,जैसे ही,फिर से,पहले,आगे,बाद में

  5. Some tokens marked ADJ are given preposition supersenses as exceptions (lexcat = P). These are: जैसे-जैसे. This is the same जैसे that is marked ADV above, and the ADJ token loosely translates in its context as 'as' (as the time passes, my happiness will increase). [lp_hi_21-81]

Some specific issues to discuss with the error: Invalid supersenses in lexical entry:

  1. sabse pehle [lp_hi_13-85]: Previously extracted and annotated 'se pehle' as the lexlemma, but I'm not sure now that the 'se' can be extracted from 'sabse' as a separate token, as 'sabse' has a specific superlative meaning. I would re-think the adposition as just plain 'pehle' and annotate that, leaving 'sabse' unannotated. The alternative is to endow a superlative adjective ('sabse') with preposition supersenses in the validator, which is weird.

  2. ek saath [lp_hi_14-59] : The adposition has been annotated as an MWE expression 'ek saath' which I don't think is correct. I would annotate just the 'saath'.

  3. aisa [lp_hi_14-76]: Has been tagged DET by the UD tagger, which I think is legal; it may also be a pre-determiner 'such'. We may have tagged this with preposition supersenses by mistake.
    EDIT: An exception just for 'aisa' has been created; its lexcat is assigned P. This should be a temporary measure until a decision to remove all tags for 'aisa' is reached.

  4. 'chaaya-sii aakriti' [lp_hi_2-20]: UD has tagged the 'sii' as ADJ; I think it may be a PARTicle following the chaaya. We may also have got the SNACS label wrong (shouldn't be ComparisonRef, maybe Extent?). In general, I'm also unsure whether we should be annotating particles with SNACS labels (except maybe the Focus-related ones).

  5. [lp_hi_20-14] - ताकि is SCONJ but annotated with p.`d. Removed the annotation.

MWE lemma is incorrect: some changes made to resolve these are listed:

  1. For MWE expressions, the lexlemma is checked against the word form instead of the lemma. For single-word expressions, it's still checked against the lemma.
  2. MWE expressions where the irregular pronoun is part of the expression that receives a supersense: it's weird to have the irregular pronoun as part of the MWE, and there's no concrete theory to split the pronoun into oblique and postposition for irregular genitives (accusatives / datives don't form MWE expressions here). Keeping the irregular genitive in the MWE is in line with the earlier decision to endow these irregular genitives with supersenses (for single tokens).
  3. Sometimes the MWE tokens are flipped in order, e.g. [lp_hi_8-53], which has बावजूद के instead of के बावजूद. The LEXLEMMA follows this flipped order.
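
The first and third points can be sketched as follows. This is an illustration under assumptions, not the validator's actual code: `expected_lexlemma` and `lexlemma_ok` are hypothetical names, and tokens are represented as plain `(form, lemma)` pairs.

```python
def expected_lexlemma(tokens):
    """Build the expected lexlemma for one expression.

    tokens: list of (form, lemma) pairs in sentence order.
    Single-word expressions are compared against the lemma;
    multiword expressions against the space-joined word forms.
    """
    if len(tokens) == 1:
        return tokens[0][1]
    return " ".join(form for form, _ in tokens)

def lexlemma_ok(lexlemma, tokens):
    return lexlemma == expected_lexlemma(tokens)

# Flipped-order MWE as in the thread: forms are used, in sentence order.
flipped = [("बावजूद", "बावजूद"), ("के", "का")]
print(expected_lexlemma(flipped))
```

Because the forms are joined in sentence order, a flipped expression such as बावजूद के yields a lexlemma in the flipped order too.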

Suggest removing the supersense labels on these tokens altogether.

lp_hi_14_75: single-word expression 'ऐसा' has lexcat P, which is incompatible with its upos DET
lp_hi_15_36: single-word expression 'ऐसा' has lexcat P, which is incompatible with its upos DET
lp_hi_26_72: single-word expression 'जैसे' has lexcat P, which is incompatible with its upos ADJ
lp_hi_21_80: single-word expression 'जैसे-जैसे' has lexcat P, which is incompatible with its upos ADJ

lp_hi_13_84: pehle marked with Time but is ADV.
lp_hi_14_33: same as previous (pehle)

Missing supersense annotation in lexical entry. 30 entries to discuss, attached.
missing_supersense.txt

Sentence ids were updated, e.g. lp_hi_13-98 in the new version is lp_hi_13_97 in the old one.

To be resolved here in this sheet, 'missing' tab.

lp_hi_10_74: interesting case because there is an implied argument in a relative clause here which is being marked by the explicit postposition 'ke'.

This is not an adposition, it's an alternative spelling of the complementiser कि (borrowed from Persian ke).

Hmm, I don't think so. This can be thought of as:

हर किसी से उसी बात की अपेक्षा रखनी चाहिए जिस (बात) के वह लायक हो

Which is in line with Koul's examples in his grammar book, where the head noun is elided from the relative clause when it follows the main clause. The adposition is marking the implied argument here.

Conllulex file passes validation. Some decisions on irregular pronouns need to be taken (one of the options below):

  • Annotate all pronouns (incl. NOM) and nonpronominal case markers. Inconsistency: nominative nouns are not annotated but nominative pronouns are.
  • Annotate all non-NOM pronouns and nonpronominal case markers. Inconsistency: some pronouns aren’t annotated—lexcat PRON.NOM.
  • Status quo: Split off case suffixes from pronouns where possible, and annotate irregular non-NOM pronouns as well—lexcat PRON.IRREG. Inconsistency: some pronouns are split and others aren’t, but they are annotated either way so the splitting is extra complexity with no clear benefit.

New Causer label needs a review of these cases

Decisions:

  • Leave all nominative nouns/pronouns unannotated (at least until version 2). Oblique pronouns are also not annotated. PRON.NOM, PRON.OBL lexcats to signal that there is no SNACS label
  • Pronouns should be single tokens, preserving the UD tokenization. PRON lexcat for non-nominatives. Lexlemma should reflect the case marker only, so as to semantically group with nouns with that case marker.
    • Genitive: ki/ka/ke -> ka for the lexlemma
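
The genitive rule can be sketched as a small normalisation step. This assumes the case marker is available in Devanagari; `GENITIVE_FORMS` and `case_lexlemma` are hypothetical names, and markers other than the genitive are passed through unchanged here, which is an assumption rather than a stated decision.

```python
# Agreeing forms of the genitive marker (ka / ki / ke in the thread).
GENITIVE_FORMS = {"का", "की", "के"}

def case_lexlemma(marker):
    """Map a case marker to the lexlemma used for grouping.

    All genitive forms collapse to का; other markers are returned
    unchanged (an assumption made for this sketch).
    """
    return "का" if marker in GENITIVE_FORMS else marker

print(case_lexlemma("की"))
print(case_lexlemma("को"))
```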

2. Some PARTicles (to, bhi, hi, saa) are annotated with the FOCUS label. These may be revisited in v2.7 guidelines, so for now the PART lexcat (newly created for Hindi) has also been endowed with preposition supersenses. The particle 'saa' was annotated by us with non-focus supersenses; I'm now wondering if this was an error, as this particle is not typically in a governor-object construction (it loosely translates to the suffix -like and attaches to nouns to make them adjectives).

Distribution is not quite the same as postpositions, hence the PART tag. Focus is deterministic given lemma, so it's not like we're adding a lot of disambiguation here.

"us-i ki" meaning 'his (emphatic)'. Genitive "ki" gets a supersense. PRON.OBL for "usi", if we don't annotate Focus.

Pros of including Focus:

  • it's what we had in the paper (including stats and model results)
  • it's a form of glossing that is helpful for crosslinguistic analysis
  • Korean has been annotating Focus markers

Cons of including Focus:

  • these particles are not quite adpositions, so it's not clear why we'd need to include them
  • Focus is not part of universal guidelines yet
  • not really disambiguating because (lemma+POS) is enough

Decision:

  • Stick with the status quo (keep them as Focus). Lexcat: PART.FOC
    • "usi" lexcat PART.FOC

3. Some tokens marked SCONJ are given preposition supersenses: exceptions were created for these by marking lexcat = P. The tokens are 'तो','जैसे','ताकि'. तो will be revisited in v2.7 as it is a FOCUS marker; other two should be revisited and supersense label potentially removed.

4. Some tokens marked ADV are given preposition supersenses; exceptions were created by marking lexcat = P. These tokens are: जैसे,सबसे,जैसे ही,फिर से,पहले,आगे,बाद में

5. Some tokens marked ADJ are given preposition supersenses as exceptions (lexcat = P). These are: जैसे-जैसे. This is the same जैसे that is marked ADV above, and the ADJ token loosely translates in its context as 'as' (as the time passes, my happiness will increase). [lp_hi_21-81]

For these, add an exception to the validator allowing the lexcat to not match the UPOS
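
One way such an exception could look, as a sketch only. The compatibility table, the exception set (a subset of the tokens listed in this thread) and the function name are assumptions and do not reflect the validator's real data structures.

```python
# Hypothetical compatibility table; the real validator's rules may differ.
ALLOWED_UPOS = {"P": {"ADP"}}

# (lemma, upos) pairs allowed to carry lexcat P despite the mismatch.
# Illustrative subset of the tokens named in this thread.
P_EXCEPTIONS = {
    ("ऐसा", "DET"),
    ("जैसे", "SCONJ"),
    ("जैसे", "ADV"),
    ("जैसे-जैसे", "ADJ"),
    ("पहले", "ADV"),
}

def lexcat_matches_upos(lemma, lexcat, upos):
    """Return True if the lexcat is acceptable for this token's UPOS."""
    if lexcat == "P" and (lemma, upos) in P_EXCEPTIONS:
        return True
    return upos in ALLOWED_UPOS.get(lexcat, set())

print(lexcat_matches_upos("ऐसा", "P", "DET"))
print(lexcat_matches_upos("घर", "P", "NOUN"))
```

Keying the exceptions on both lemma and UPOS keeps the relaxation narrow, so an unrelated DET or ADV tagged P would still be reported.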

"sab-se" 'than-all/everyone', used to convey superlative meaning: treat as weak MWE

Decisions incorporated in #34 and merged with the master. Pending treatment of sab-se as weak MWE. Pending updating guidelines with vala discussion from #29

All Force/Causer fixes incorporated. Data passes the validator (it already was before fixes too), so seems like this is the finalised version of the corpus and can be ingested into Xposition.