Missing CorrectForm and Typo annotations in multi-word tokens
rhdunn opened this issue · comments
For:
# sent_id = newsgroup-groups.google.com_GuildWars_086f0f64ab633ab3_ENG_20041111_173500-0024
# text = I havn't heard of it.
1 I I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 4 nsubj 4:nsubj _
2-3 havn't _ _ _ _ _ _ _ _
2 hav have AUX VBP Mood=Ind|Number=Sing|Person=1|Tense=Pres|Typo=Yes|VerbForm=Fin 4 aux 4:aux CorrectForm=have
3 n't not PART RB _ 4 advmod 4:advmod _
4 heard hear VERB VBN Tense=Past|VerbForm=Part 0 root 0:root _
5 of of ADP IN _ 6 case 6:case _
6 it it PRON PRP Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs 4 obl 4:obl:of SpaceAfter=No
7 . . PUNCT . _ 4 punct 4:punct _
there is a CorrectForm annotation on the internal word of the multi-word token, but there is no corresponding Typo=Yes + CorrectForm annotation on the multi-word token itself. Is this intentional? This makes it difficult to extract the correct form when only viewing the tokens. It also makes validation of multi-word forms difficult, as the repaired (corrected) text in the word stream differs from the token stream.
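To make the mismatch concrete, here is a minimal sketch of how such cases could be located. It assumes tab-separated CoNLL-U with the standard 10 columns. The helper names (`parse_conllu_sentence`, `parse_misc`, `mwt_with_word_level_corrections`) are made up for illustration and are not part of any UD tooling.

```python
def parse_conllu_sentence(text):
    """Split one CoNLL-U sentence block into rows of 10 tab-separated columns."""
    rows = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        rows.append(line.split("\t"))
    return rows


def parse_misc(field):
    """Turn a MISC value such as 'CorrectForm=have|SpaceAfter=No' into a dict."""
    if field == "_":
        return {}
    return dict(item.split("=", 1) for item in field.split("|") if "=" in item)


def mwt_with_word_level_corrections(rows):
    """Return (token form, {word form: CorrectForm}) for every multi-word token
    whose words carry CorrectForm while the range line itself does not."""
    by_id = {row[0]: row for row in rows}
    found = []
    for row in rows:
        if "-" not in row[0]:
            continue  # not a multi-word token range line
        start, end = (int(part) for part in row[0].split("-"))
        word_fixes = {}
        for word_id in range(start, end + 1):
            word = by_id[str(word_id)]
            fix = parse_misc(word[9]).get("CorrectForm")
            if fix is not None:
                word_fixes[word[1]] = fix
        if word_fixes and "CorrectForm" not in parse_misc(row[9]):
            found.append((row[1], word_fixes))
    return found


# Abbreviated version of the sentence above (FEATS shortened for readability).
SAMPLE = "\n".join([
    "# text = I havn't heard of it.",
    "1\tI\tI\tPRON\tPRP\t_\t4\tnsubj\t4:nsubj\t_",
    "2-3\thavn't\t_\t_\t_\t_\t_\t_\t_\t_",
    "2\thav\thave\tAUX\tVBP\tTypo=Yes\t4\taux\t4:aux\tCorrectForm=have",
    "3\tn't\tnot\tPART\tRB\t_\t4\tadvmod\t4:advmod\t_",
])

print(mwt_with_word_level_corrections(parse_conllu_sentence(SAMPLE)))
```

A tool that only reads the `2-3` range line sees `havn't` with no hint that a correction exists, which is the difficulty described above.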
I've also noticed several missing annotations in the data (token and word) for multi-word tokens, e.g.:
# sent_id = reviews-202709-0002
# newpar id = reviews-202709-p0002
# text = All I can say is that Elmira you are the best Ive experienced, never before has the seamstress done a perfect job until i met you.
1 All all DET DT _ 11 nsubj:outer 11:nsubj:outer _
2 I I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 4 nsubj 4:nsubj _
3 can can AUX MD VerbForm=Fin 4 aux 4:aux _
4 say say VERB VB VerbForm=Inf 1 acl:relcl 1:acl:relcl _
5 is be AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 11 cop 11:cop _
6 that that SCONJ IN _ 11 mark 11:mark _
7 Elmira Elmira PROPN NNP Number=Sing 11 vocative 11:vocative _
8 you you PRON PRP Case=Nom|Person=2|PronType=Prs 11 nsubj 11:nsubj _
9 are be AUX VBP Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin 11 cop 11:cop _
10 the the DET DT Definite=Def|PronType=Art 11 det 11:det _
11 best good ADJ JJS Degree=Sup 0 root 0:root _
12-13 Ive _ _ _ _ _ _ _ _
12 I I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 14 nsubj 14:nsubj _
13 ve have AUX VBP Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin 14 aux 14:aux _
14 experienced experience VERB VBN Tense=Past|VerbForm=Part 11 acl:relcl 11:acl:relcl SpaceAfter=No
15 , , PUNCT , _ 11 punct 11:punct _
16 never never ADV RB _ 17 advmod 17:advmod _
17 before before ADV RB _ 21 advmod 21:advmod _
18 has have AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 21 aux 21:aux _
19 the the DET DT Definite=Def|PronType=Art 20 det 20:det _
20 seamstress seamstress NOUN NN Number=Sing 21 nsubj 21:nsubj _
21 done do VERB VBN Tense=Past|VerbForm=Part 11 parataxis 11:parataxis _
22 a a DET DT Definite=Ind|PronType=Art 24 det 24:det _
23 perfect perfect ADJ JJ Degree=Pos 24 amod 24:amod _
24 job job NOUN NN Number=Sing 21 obj 21:obj _
25 until until SCONJ IN _ 27 mark 27:mark _
26 i I PRON PRP Case=Nom|Number=Sing|Person=1|PronType=Prs 27 nsubj 27:nsubj _
27 met meet VERB VBD Mood=Ind|Number=Sing|Person=1|Tense=Past|VerbForm=Fin 21 advcl 21:advcl:until _
28 you you PRON PRP Case=Nom|Person=2|PronType=Prs 27 obj 27:obj SpaceAfter=No
29 . . PUNCT . _ 11 punct 11:punct _
I can create a full list of sentences with these issues.
there is a CorrectForm annotation on the internal word of the multi-word token, but there is no corresponding Typo=Yes + CorrectForm annotation on the multi-word token itself. Is this intentional?
Yes, per https://universaldependencies.org/u/overview/typos.html#misspelled-multiword-token it should be placed on the internal word if the multiword token is concatenative.
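Under that convention, the corrected surface form of a concatenative multi-word token can still be recovered from the word level. The sketch below assumes that concatenating each word's CorrectForm (or its FORM when no correction is present) yields the repaired token. That holds for the `havn't` example but is an assumption, not something the guidelines promise for every token. The function names are illustrative only.

```python
def correct_form(form, misc):
    """Return the CorrectForm value from a MISC string, or the form itself."""
    if misc != "_":
        for item in misc.split("|"):
            key, _, value = item.partition("=")
            if key == "CorrectForm":
                return value
    return form


def corrected_token(words):
    """Join the corrected forms of the words inside one multi-word token.

    `words` is a list of (FORM, MISC) pairs in word order.
    """
    return "".join(correct_form(form, misc) for form, misc in words)


# Words 2 and 3 of "I havn't heard of it."
print(corrected_token([("hav", "CorrectForm=have"), ("n't", "_")]))
```

For a token such as `Ive`, where neither word currently carries a CorrectForm, the same function simply returns `Ive`, so the missing annotations in the second example remain visible as uncorrected output.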
I can create a full list of sentences with these issues.
Yes please!
Here's the list. There are certainly going to be some valid cases in this list, as I'm using an automated validation check to identify unknown multi-word token values (along with the corresponding words it splits into) and there will be multi-word tokens I don't have entries for.
ERROR: Sentence email-enronsent08_01-0016 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence email-enronsent29_01-0046 token 5-6 -- unrecognized multi-word token form 'your'
ERROR: Sentence newsgroup-groups.google.com_RagnarokOnlineII_acbece2a311cfb3c_ENG_20051119_076100-0002 token 24-25 -- unrecognized multi-word token form 'iwas'
ERROR: Sentence newsgroup-groups.google.com_alt.animals.badgers_2044a3376e5a87a5_ENG_20040529_135300-0002 token 2-3 -- unrecognized multi-word token form 'dont'
ERROR: Sentence newsgroup-groups.google.com_alt.animals.badgers_2044a3376e5a87a5_ENG_20040529_135300-0003 token 33-34 -- unrecognized multi-word token form 'cant'
ERROR: Sentence answers-20111107224336AAxQbzk_ans-0002 token 2-3 -- unrecognized multi-word token form 'ill'
ERROR: Sentence answers-20111108102900AA9qsc8_ans-0004 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108102900AA9qsc8_ans-0006 token 1-2 -- unrecognized multi-word token form 'Thats'
ERROR: Sentence answers-20111105140228AANN2ZV_ans-0004 token 1-2 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20111108084227AAtbjAp_ans-0005 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108084227AAtbjAp_ans-0005 token 15-16 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108083850AAzIsFI_ans-0001 token 1-2 -- unrecognized multi-word token form 'Im'
ERROR: Sentence answers-20111108072305AAPJTjj_ans-0003 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111107154308AAKOZNX_ans-0002 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111107154308AAKOZNX_ans-0007 token 2-3 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111107154308AAKOZNX_ans-0007 token 7-8 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111107080027AA9zCIG_ans-0005 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108104636AAw51HV_ans-0005 token 4-5 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108024148AAO8oFI_ans-0003 token 9-10 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108024148AAO8oFI_ans-0004 token 2-3 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108081748AAkQhGe_ans-0002 token 9-10 -- unrecognized multi-word token form 'thats'
ERROR: Sentence answers-20111108081748AAkQhGe_ans-0003 token 46-47 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20111107115952AAqfsHV_ans-0004 token 1-2 -- unrecognized multi-word token form 'ive'
ERROR: Sentence answers-20111108105146AAtiEx7_ans-0010 token 2-3 -- unrecognized multi-word token form 'cats'
ERROR: Sentence answers-20111108071348AAWu2FU_ans-0002 token 1-2 -- unrecognized multi-word token form 'Ive'
ERROR: Sentence answers-20111108071348AAWu2FU_ans-0004 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108071348AAWu2FU_ans-0004 token 10-11 -- unrecognized multi-word token form 'havent'
ERROR: Sentence answers-20111108071348AAWu2FU_ans-0004 token 16-17 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108071348AAWu2FU_ans-0005 token 3-4 -- unrecognized multi-word token form 'dont'
ERROR: Sentence reviews-389136-0003 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence reviews-332972-0001 token 5-6 -- unrecognized multi-word token form 'im'
ERROR: Sentence reviews-194313-0002 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence reviews-158740-0003 token 14-15 -- unrecognized multi-word token form 'dont'
ERROR: Sentence reviews-202709-0001 token 3-4 -- unrecognized multi-word token form 'your'
ERROR: Sentence reviews-202709-0002 token 12-13 -- unrecognized multi-word token form 'Ive'
ERROR: Sentence email-enronsent23_05-0001 token 1-2 -- unrecognized multi-word token form 'your'
ERROR: Sentence email-enronsent18_02-0031 token 9-10 -- unrecognized multi-word token form 'Cox''
ERROR: Sentence newsgroup-groups.google.com_HarryPotterAppreciationSociety_a3adbf6ac3dc191c_ENG_20050921_061800-0001 token 2-3 -- unrecognized multi-word token form 'I´m'
ERROR: Sentence newsgroup-groups.google.com_HarryPotterAppreciationSociety_a3adbf6ac3dc191c_ENG_20050921_061800-0006 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence newsgroup-groups.google.com_HarryPotterAppreciationSociety_a3adbf6ac3dc191c_ENG_20050921_061800-0009 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108084149AAbQBhq_ans-0003 token 8-9 -- unrecognized multi-word token form 'thats'
ERROR: Sentence answers-20111024202518AA18Sg7_ans-0003 token 7-8 -- unrecognized multi-word token form 'Ive'
ERROR: Sentence answers-20111024202518AA18Sg7_ans-0003 token 19-20 -- unrecognized multi-word token form 'thats'
ERROR: Sentence answers-20111108075412AA4d7Up_ans-0005 token 1-2 -- unrecognized multi-word token form 'Theyre'
ERROR: Sentence answers-20111108084122AAYLqSQ_ans-0003 token 3-4 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111108084122AAYLqSQ_ans-0003 token 10-11 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108081519AAdHz5c_ans-0002 token 11-12 -- unrecognized multi-word token form 'ive'
ERROR: Sentence answers-20111108071652AA8GAZw_ans-0005 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108071652AA8GAZw_ans-0007 token 4-5 -- unrecognized multi-word token form 'theres'
ERROR: Sentence answers-20111107035344AAdi9dS_ans-0002 token 2-3 -- unrecognized multi-word token form 'theres'
ERROR: Sentence answers-20111104115933AA30CRJ_ans-0005 token 18-19 -- unrecognized multi-word token form 'thatd'
ERROR: Sentence answers-20111108103704AAB0G7y_ans-0002 token 11-12 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111108103704AAB0G7y_ans-0003 token 42-43 -- unrecognized multi-word token form 'shes'
ERROR: Sentence answers-20111107221352AAlIioO_ans-0007 token 4-5 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111107221352AAlIioO_ans-0008 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111107221352AAlIioO_ans-0008 token 5-6 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111107221352AAlIioO_ans-0009 token 8-9 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108105137AA9BNtk_ans-0010 token 1-2 -- unrecognized multi-word token form 'heres'
ERROR: Sentence answers-20110320195750AAkPbFG_ans-0003 token 3-4 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111106015552AAj6rCu_ans-0002 token 30 -- unexpected multi-word token 'donalds' part upos 'X', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108111112AAAjhoy_ans-0003 token 9-10 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111107200249AAIyCy5_ans-0004 token 2-3 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20111106230959AAuYQ5Q_ans-0005 token 4-5 -- unrecognized multi-word token form 'dont'
ERROR: Sentence reviews-104703-0002 token 4-5 -- unrecognized multi-word token form 'Ive'
ERROR: Sentence reviews-155050-0002 token 8-9 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-241108-0004 token 6-7 -- unrecognized multi-word token form 'NOTto'
ERROR: Sentence reviews-396046-0002 token 1-2 -- unrecognized multi-word token form 'DONt'
ERROR: Sentence reviews-200566-0003 token 6-7 -- unrecognized multi-word token form 'IVE'
ERROR: Sentence reviews-039173-0001 token 8-9 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-039173-0002 token 3-4 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-229100-0005 token 1-2 -- unrecognized multi-word token form 'Dont'
ERROR: Sentence reviews-103519-0002 token 4-5 -- unrecognized multi-word token form 'dont'
ERROR: Sentence reviews-309258-0003 token 2-3 -- unrecognized multi-word token form 'dont'
ERROR: Sentence reviews-107608-0002 token 1-2 -- unrecognized multi-word token form 'Iv'
ERROR: Sentence reviews-048201-0003 token 11-12 -- unrecognized multi-word token form 'doesnt'
ERROR: Sentence weblog-blogspot.com_dakbangla_20050210141134_ENG_20050210_141134-0038 token 18-19 -- unrecognized multi-word token form 'Inter-Services'
ERROR: Sentence weblog-blogspot.com_rigorousintuition_20060511134300_ENG_20060511_134300-0076 token 5-6 -- unrecognized multi-word token form 'its'
ERROR: Sentence weblog-blogspot.com_rigorousintuition_20060511134300_ENG_20060511_134300-0238 token 7-8 -- unrecognized multi-word token form 'dont'
ERROR: Sentence email-enronsent08_02-0009 token 2-3 -- unrecognized multi-word token form 'Mama`s'
ERROR: Sentence email-enronsent08_02-0017 token 2-3 -- unrecognized multi-word token form 'driver`s'
ERROR: Sentence email-enronsent08_02-0020 token 2-3 -- unrecognized multi-word token form 'mama`s'
ERROR: Sentence email-enronsent08_02-0020 token 28-29 -- unrecognized multi-word token form 'mama`s'
ERROR: Sentence email-enronsent08_02-0022 token 2-3 -- unrecognized multi-word token form 'mama`s'
ERROR: Sentence email-enronsent08_02-0023 token 2-3 -- unrecognized multi-word token form 'mama`s'
ERROR: Sentence email-enronsent08_02-0024 token 2-3 -- unrecognized multi-word token form 'mama`s'
ERROR: Sentence email-enronsent08_02-0024 token 7-8 -- unrecognized multi-word token form 'she`s'
ERROR: Sentence email-enronsent08_02-0025 token 2-3 -- unrecognized multi-word token form 'mama`s'
ERROR: Sentence email-enronsent17_01-0044 token 6-7 -- unrecognized multi-word token form 'wont'
ERROR: Sentence email-enronsent15_01-0034 token 1-2 -- unrecognized multi-word token form 'Your'
ERROR: Sentence email-enronsent10_01-0020 token 15-16 -- unrecognized multi-word token form 'its'
ERROR: Sentence email-enronsent37_01-0056 token 15-16 -- unrecognized multi-word token form 'dont'
ERROR: Sentence email-enronsent37_01-0056 token 18-19 -- unrecognized multi-word token form 'its'
ERROR: Sentence newsgroup-groups.google.com_alt.animals.badgers_172dcd8baf26948f_ENG_20040823_121900-0011 token 14-15 -- unrecognized multi-word token form 'dont'
ERROR: Sentence newsgroup-groups.google.com_alt.animals.badgers_172dcd8baf26948f_ENG_20040823_121900-0020 token 11-12 -- unrecognized multi-word token form 'dont'
ERROR: Sentence newsgroup-groups.google.com_alt.animals.badgers_172dcd8baf26948f_ENG_20040823_121900-0020 token 21-22 -- unrecognized multi-word token form 'thats'
ERROR: Sentence newsgroup-groups.google.com_alt.animals.badgers_172dcd8baf26948f_ENG_20040823_121900-0022 token 7-8 -- unrecognized multi-word token form 'doesnt'
ERROR: Sentence newsgroup-groups.google.com_GuildWars_086f0f64ab633ab3_ENG_20041111_173500-0013 token 1-2 -- unrecognized multi-word token form 'Im'
ERROR: Sentence newsgroup-groups.google.com_humanities.lit.authors.shakespeare_0018a7697318f71f_ENG_20031006_163200-0036 token 4-5 -- unrecognized multi-word token form 'PEREZ''
ERROR: Sentence newsgroup-groups.google.com_humanities.lit.authors.shakespeare_0018a7697318f71f_ENG_20031006_163200-0043 token 27-28 -- unrecognized multi-word token form 'Essex''
ERROR: Sentence answers-20111107152509AA78ktV_ans-0010 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108105559AAkQd38_ans-0003 token 1-2 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20111108105559AAkQd38_ans-0004 token 1-2 -- unrecognized multi-word token form 'ive'
ERROR: Sentence answers-20111108094323AARaBJ5_ans-0001 token 7-8 -- unrecognized multi-word token form 'thats'
ERROR: Sentence answers-20111108083309AAg9jwT_ans-0002 token 12-13 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20110721164531AA3BGSJ_ans-0007 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20110721164531AA3BGSJ_ans-0009 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108100918AATaSIx_ans-0007 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0005 token 1-2 -- unrecognized multi-word token form 'iv'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0005 token 18-19 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0005 token 27-28 -- unrecognized multi-word token form 'arnt'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0006 token 25-26 -- unrecognized multi-word token form 'wouldnt'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0007 token 17-18 -- unrecognized multi-word token form 'ur'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0008 token 1-2 -- unrecognized multi-word token form 'Theyre'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0011 token 2-3 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0011 token 6-7 -- unrecognized multi-word token form 'your'
ERROR: Sentence answers-20111108085734AATXy0E_ans-0002 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108093211AA8bYFE_ans-0002 token 64-65 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108104228AA6z9uZ_ans-0002 token 110-111 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108100523AA1i7no_ans-0002 token 1-2 -- unrecognized multi-word token form 'Im'
ERROR: Sentence answers-20111108100523AA1i7no_ans-0003 token 16-17 -- unrecognized multi-word token form 'cant'
ERROR: Sentence answers-20111107212131AACQ65F_ans-0013 token 4-5 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108083256AAnI6Wt_ans-0005 token 1-2 -- unrecognized multi-word token form 'Whats'
ERROR: Sentence answers-20111108110610AA4bcXX_ans-0021 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108110610AA4bcXX_ans-0021 token 7-8 -- unrecognized multi-word token form 'itll'
ERROR: Sentence answers-20111108085945AAgJhOG_ans-0013 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111107194805AAdINwt_ans-0012 token 12-13 -- unrecognized multi-word token form 'd'Orleans'
ERROR: Sentence answers-20111108080100AAJHNUK_ans-0003 token 5-6 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108080100AAJHNUK_ans-0018 token 9-10 -- unrecognized multi-word token form 'your'
ERROR: Sentence answers-20111108080100AAJHNUK_ans-0018 token 14-15 -- unrecognized multi-word token form 'theres'
ERROR: Sentence answers-20111108110044AA4rs9f_ans-0007 token 6-7 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108110044AA4rs9f_ans-0010 token 7-8 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108094831AAnOjgr_ans-0001 token 1-2 -- unrecognized multi-word token form 'Whats'
ERROR: Sentence answers-20111108103333AA3eSCk_ans-0002 token 23-24 -- unrecognized multi-word token form 'your'
ERROR: Sentence answers-20111108103333AA3eSCk_ans-0004 token 1-2 -- unrecognized multi-word token form 'Im'
ERROR: Sentence answers-20111108103354AAQzdFB_ans-0007 token 3-4 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20111107233110AAmgsVx_ans-0007 token 2-3 -- unrecognized multi-word token form 'Ive'
ERROR: Sentence answers-20111107233110AAmgsVx_ans-0010 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111107233110AAmgsVx_ans-0013 token 13-14 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108084355AAvLpRa_ans-0009 token 18-19 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108084355AAvLpRa_ans-0009 token 42-43 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108084355AAvLpRa_ans-0009 token 44-45 -- unrecognized multi-word base form 'wa' for suffix 'na'
ERROR: Sentence answers-20111108084355AAvLpRa_ans-0009 token 44 -- unexpected multi-word token 'wana' part form 'wan', expected 'wa'
ERROR: Sentence answers-20111108084355AAvLpRa_ans-0009 token 45 -- unexpected multi-word token 'wana' part form 'a', expected 'na'
ERROR: Sentence answers-20111108105919AAHXkZF_ans-0014 token 1-2 -- unrecognized multi-word token form 'Ive'
ERROR: Sentence answers-20111108065707AAj7DaH_ans-0002 token 2-3 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20111108103914AAcdeIt_ans-0016 token 7-8 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108103914AAcdeIt_ans-0018 token 17-18 -- unrecognized multi-word token form 'hes'
ERROR: Sentence answers-20111108103914AAcdeIt_ans-0019 token 40-41 -- unrecognized multi-word token form 'hes'
ERROR: Sentence answers-20111108102133AAwVd7m_ans-0006 token 2-3 -- unrecognized multi-word token form 'cant'
ERROR: Sentence answers-20111108102133AAwVd7m_ans-0025 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108111010AASEk0S_ans-0003 token 1-2 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20111108111010AASEk0S_ans-0004 token 6-7 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108111010AASEk0S_ans-0005 token 2-3 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108111010AASEk0S_ans-0005 token 8-9 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108111010AASEk0S_ans-0005 token 40-41 -- unrecognized multi-word token form 'doesnt'
ERROR: Sentence answers-20111106144630AAadR8l_ans-0005 token 4-5 -- unrecognized multi-word token form 'thes'
ERROR: Sentence answers-20111108094504AAKrc8F_ans-0015 token 15-16 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108094927AA5NjHj_ans-0003 token 1-2 -- unrecognized multi-word token form 'Im'
ERROR: Sentence answers-20111108094927AA5NjHj_ans-0003 token 16-17 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20111108094927AA5NjHj_ans-0003 token 27-28 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111107193044AAvUYBv_ans-0014 token 3-4 -- unrecognized multi-word token form 'your'
ERROR: Sentence answers-20111108111128AAwfype_ans-0009 token 21-22 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108090616AAv6fpU_ans-0006 token 6-7 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108090616AAv6fpU_ans-0010 token 2-3 -- unrecognized multi-word token form 'hes'
ERROR: Sentence answers-20111108090616AAv6fpU_ans-0017 token 19-20 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108102810AAfCh1W_ans-0019 token 1-2 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20111108104206AAygiaE_ans-0004 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108104206AAygiaE_ans-0006 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108104206AAygiaE_ans-0010 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108104206AAygiaE_ans-0011 token 3-4 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108104350AAp4hGP_ans-0009 token 19-20 -- unrecognized multi-word token form 'youre'
ERROR: Sentence answers-20111108100419AAKZvMH_ans-0011 token 47-48 -- unrecognized multi-word token form 'id'
ERROR: Sentence answers-20111108100419AAKZvMH_ans-0011 token 60-61 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108111107AAlrzok_ans-0028 token 2-3 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108064026AA86V9T_ans-0022 token 2-3 -- unrecognized multi-word token form 'Dont'
ERROR: Sentence answers-20111108064026AA86V9T_ans-0025 token 1-2 -- unrecognized multi-word token form 'Ive'
ERROR: Sentence answers-20111108102428AAMzXRG_ans-0006 token 7-8 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108102428AAMzXRG_ans-0009 token 4-5 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108092321AAK0Eqp_ans-0012 token 15-16 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108105749AABv7vx_ans-0004 token 10-11 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108111031AARG57j_ans-0006 token 17-18 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111108111031AARG57j_ans-0007 token 2-3 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111108111031AARG57j_ans-0011 token 2-3 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111108111031AARG57j_ans-0012 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108111031AARG57j_ans-0015 token 48-49 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111108111031AARG57j_ans-0017 token 3-4 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108111031AARG57j_ans-0018 token 6-7 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108103957AAcF3iZ_ans-0009 token 22-23 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108103957AAcF3iZ_ans-0019 token 3-4 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108103957AAcF3iZ_ans-0024 token 25-26 -- unrecognized multi-word token form 'theres'
ERROR: Sentence answers-20111108111312AAq4ETn_ans-0004 token 13-14 -- unrecognized multi-word token form 'wouldnt'
ERROR: Sentence answers-20111108111312AAq4ETn_ans-0011 token 10-11 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111108111312AAq4ETn_ans-0012 token 2-3 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108111312AAq4ETn_ans-0015 token 2-3 -- unrecognized multi-word token form 'didnt'
ERROR: Sentence answers-20111108111312AAq4ETn_ans-0015 token 14-15 -- unrecognized multi-word token form 'couldnt'
ERROR: Sentence answers-20111108111312AAq4ETn_ans-0016 token 16-17 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111108111312AAq4ETn_ans-0021 token 31-32 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108105629AAiZUDY_ans-0022 token 2-3 -- unrecognized multi-word token form 'wont'
ERROR: Sentence answers-20111108105629AAiZUDY_ans-0033 token 5-6 -- unrecognized multi-word token form 'awhile'
ERROR: Sentence answers-20111108110329AAxl1pb_ans-0010 token 22-23 -- unrecognized multi-word token form 'im'
ERROR: Sentence answers-20111108110012AAK8Azy_ans-0030 token 40-41 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108110012AAK8Azy_ans-0037 token 15-16 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0002 token 60-61 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0007 token 1-2 -- unrecognized multi-word token form 'Heres'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0014 token 1-2 -- unrecognized multi-word token form 'Ive'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0015 token 2-3 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0023 token 7-8 -- unrecognized multi-word token form 'arent'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0039 token 3-4 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0055 token 12-13 -- unrecognized multi-word token form 'its'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0062 token 19-20 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0063 token 6-7 -- unrecognized multi-word token form 'arnt'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0067 token 1-2 -- unrecognized multi-word token form 'Dont'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0073 token 2-3 -- unrecognized multi-word token form 'thats'
ERROR: Sentence answers-20111108091921AAaLK4e_ans-0044 token 6-7 -- unrecognized multi-word token form 'Im'
ERROR: Sentence answers-20111108091921AAaLK4e_ans-0053 token 5-6 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108091921AAaLK4e_ans-0055 token 18-19 -- unrecognized multi-word token form 'Its'
ERROR: Sentence answers-20111108091921AAaLK4e_ans-0058 token 1-2 -- unrecognized multi-word base form 'sor' for suffix 'ta'
ERROR: Sentence answers-20111108091921AAaLK4e_ans-0058 token 1 -- unexpected multi-word token 'sorta' part form 'sort', expected 'sor'
ERROR: Sentence answers-20111108091921AAaLK4e_ans-0058 token 2 -- unexpected multi-word token 'sorta' part form 'a', expected 'ta'
ERROR: Sentence answers-20111108091921AAaLK4e_ans-0062 token 1-2 -- unrecognized multi-word token form 'Im'
ERROR: Sentence answers-20111108104724AAuBUR7_ans-0016 token 2-3 -- unrecognized multi-word token form 'CANNOT'
ERROR: Sentence reviews-267793-0003 token 2-3 -- unrecognized multi-word token form 'didnt'
ERROR: Sentence reviews-267793-0005 token 4-5 -- unrecognized multi-word token form 'hes'
ERROR: Sentence reviews-063690-0003 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-034813-0004 token 11-12 -- unrecognized multi-word token form 'c'mon'
ERROR: Sentence reviews-187875-0001 token 4-5 -- unrecognized multi-word token form 'DONT'
ERROR: Sentence reviews-187875-0007 token 10-11 -- unrecognized multi-word token form 'CANT'
ERROR: Sentence reviews-285133-0001 token 30-31 -- unrecognized multi-word token form 'ive'
ERROR: Sentence reviews-063549-0002 token 1-2 -- unrecognized multi-word token form 'Theres'
ERROR: Sentence reviews-020851-0002 token 13-14 -- unrecognized multi-word token form 'Jack-s'
ERROR: Sentence reviews-020851-0005 token 9-10 -- unrecognized multi-word token form 'you-ll'
ERROR: Sentence reviews-215460-0004 token 16-17 -- unrecognized multi-word token form 'dont'
ERROR: Sentence reviews-243799-0003 token 2-3 -- unrecognized multi-word token form 'didnt'
ERROR: Sentence reviews-243799-0004 token 2-3 -- unrecognized multi-word token form 'wont'
ERROR: Sentence reviews-243799-0006 token 5-6 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-100592-0003 token 2-3 -- unrecognized multi-word token form 'wasnt'
ERROR: Sentence reviews-015148-0002 token 12-13 -- unrecognized multi-word token form 'cant'
ERROR: Sentence reviews-015148-0003 token 8-9 -- unrecognized multi-word token form 'wont'
ERROR: Sentence reviews-183172-0004 token 22-23 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-069995-0007 token 7-8 -- unrecognized multi-word token form 'youll'
ERROR: Sentence reviews-360698-0001 token 1-2 -- unrecognized multi-word token form 'Its'
ERROR: Sentence reviews-324337-0001 token 1-2 -- unrecognized multi-word token form 'DONT'
ERROR: Sentence reviews-326439-0005 token 8-9 -- unrecognized multi-word token form 'CANT'
ERROR: Sentence reviews-326439-0008 token 5-6 -- unrecognized multi-word token form 'OUTTA'
ERROR: Sentence reviews-223912-0001 token 12-13 -- unrecognized multi-word token form 'your'
ERROR: Sentence reviews-223912-0001 token 25-26 -- unrecognized multi-word token form 'dont'
ERROR: Sentence reviews-280340-0003 token 3-4 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-317846-0008 token 9-10 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-255261-0010 token 17-18 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-159371-0006 token 9-10 -- unrecognized multi-word token form 'cant'
ERROR: Sentence reviews-121342-0010 token 8-9 -- unrecognized multi-word token form 'wont'
ERROR: Sentence reviews-217359-0008 token 6-7 -- unrecognized multi-word token form 'Im'
ERROR: Sentence reviews-063963-0006 token 5-6 -- unrecognized multi-word token form 'itwill'
ERROR: Sentence reviews-351058-0004 token 32-33 -- unrecognized multi-word token form 'didnt'
ERROR: Sentence reviews-247226-0004 token 21-22 -- unrecognized multi-word token form 'dont'
ERROR: Sentence reviews-247226-0005 token 5-6 -- unrecognized multi-word token form 'your'
ERROR: Sentence reviews-247226-0005 token 16-17 -- unrecognized multi-word token form 'dont'
ERROR: Sentence reviews-280844-0008 token 6-7 -- unrecognized multi-word token form 'awhile'
ERROR: Sentence reviews-295288-0006 token 15-16 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-360937-0005 token 46-47 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-018562-0006 token 4-5 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-093655-0002 token 13-14 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-036753-0009 token 30-31 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-207629-0005 token 2-3 -- unrecognized multi-word token form 'cant'
ERROR: Sentence reviews-207629-0006 token 9-10 -- unrecognized multi-word token form 'youre'
ERROR: Sentence reviews-336049-0002 token 18-19 -- unrecognized multi-word token form 'didnt'
ERROR: Sentence reviews-181771-0007 token 20-21 -- unrecognized multi-word token form 'couldnt'
ERROR: Sentence reviews-079375-0006 token 4-5 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-326649-0007 token 24-25 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-294081-0007 token 1-2 -- unrecognized multi-word token form 'ITS'
ERROR: Sentence reviews-294081-0013 token 21-22 -- unrecognized multi-word token form 'CANT'
ERROR: Sentence reviews-018548-0003 token 1-2 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-018548-0004 token 4-5 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-018548-0006 token 16-17 -- unrecognized multi-word token form 'ur'
ERROR: Sentence reviews-018548-0008 token 11-12 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-338429-0008 token 1-2 -- unrecognized multi-word token form 'Thats'
ERROR: Sentence reviews-338429-0018 token 8-9 -- unrecognized multi-word token form 'couldnt'
ERROR: Sentence reviews-330966-0005 token 36-37 -- unrecognized multi-word token form 'didnt'
ERROR: Sentence reviews-330966-0007 token 8-9 -- unrecognized multi-word token form 'its'
ERROR: Sentence reviews-330966-0007 token 13-14 -- unrecognized multi-word token form 'cant'
ERROR: Sentence reviews-398243-0007 token 30-31 -- unrecognized multi-word token form 'into'
ERROR: Sentence reviews-235423-0012 token 2-3 -- unrecognized multi-word token form 'dont'
ERROR: Sentence reviews-351561-0007 token 30-31 -- unrecognized multi-word token form 'thats'
ERROR: Sentence reviews-351561-0014 token 8-9 -- unrecognized multi-word token form 'your'
ERROR: Sentence reviews-043020-0010 token 10-11 -- unrecognized multi-word token form 'Your'
Great, so it looks like most of these are contractions with missing apostrophes. Is it possible to make a script to autofix these, and then the few miscellaneous ones can be fixed by hand?
It should technically be possible, I think. I don't currently have the bandwidth to implement such a script.
OK, I implemented some regexes to fix most of these. @rhdunn would you mind spot-checking the corrections and rerunning the script to see if there are any remaining issues?
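The regexes themselves are not shown in this thread. As a rough illustration only, a fix of this kind could look like the sketch below. The stem lists, pattern names, and the restore_apostrophe function are all assumptions made for the example, not the code that was actually run on the corpus.

```python
import re

# Hypothetical stem lists; the real fix may have covered different forms.
# Negated auxiliaries: stem + "nt" with the apostrophe missing (dont, CANT, wasnt).
_NT = re.compile(
    r"^(do|does|did|ca|wo|was|were|is|are|has|have|had|could|would|should)(n)(t)$",
    re.IGNORECASE,
)
# Pronoun + clitic with the apostrophe missing (Im, its, youll, Theyre, IVE).
_CLITIC = re.compile(
    r"^(i|you|we|they|he|she|it|that)(m|ve|re|ll|s|d)$",
    re.IGNORECASE,
)


def restore_apostrophe(form):
    """Return the form with its apostrophe restored, or None if no pattern applies.

    Only meant for tokens already annotated as multi-word tokens: strings
    such as "its" or "were" are ambiguous in running text.
    """
    m = _NT.match(form)
    if m:
        # The apostrophe goes between "n" and "t"; original casing is kept.
        return f"{m.group(1)}{m.group(2)}'{m.group(3)}"
    m = _CLITIC.match(form)
    if m:
        return f"{m.group(1)}'{m.group(2)}"
    return None  # e.g. "your", "ur", "awhile", "itwill": left for manual fixing
```

Forms that return None correspond to the "few miscellaneous ones" that would still need fixing by hand.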
Thanks. I've rerun the script on the current dev branch with the following results:
ERROR: Sentence answers-20111108081748AAkQhGe_ans-0003 token 46-47 -- unrecognized multi-word token form 'im'
ERROR: Sentence reviews-332972-0001 token 5-6 -- unrecognized multi-word token form 'im'
ERROR: Sentence newsgroup-groups.google.com_HarryPotterAppreciationSociety_a3adbf6ac3dc191c_ENG_20050921_061800-0001 token 2-3 -- unrecognized multi-word token form 'I´m'
ERROR: Sentence answers-20111108075412AA4d7Up_ans-0005 token 1-2 -- unrecognized multi-word token form 'Theyre'
ERROR: Sentence reviews-200566-0003 token 6-7 -- unrecognized multi-word token form 'IVE'
ERROR: Sentence newsgroup-groups.google.com_GuildWars_086f0f64ab633ab3_ENG_20041111_173500-0013 token 1 -- unexpected multi-word token 'Im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108083309AAg9jwT_ans-0002 token 12 -- unexpected multi-word token 'im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0005 token 1-2 -- unrecognized multi-word token form 'iv'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0005 token 18-19 -- unrecognized multi-word token form 'dont'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0005 token 27-28 -- unrecognized multi-word token form 'arnt'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0006 token 25-26 -- unrecognized multi-word token form 'wouldnt'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0007 token 17-18 -- unrecognized multi-word token form 'ur'
ERROR: Sentence answers-20111108102621AA3hPqj_ans-0008 token 1-2 -- unrecognized multi-word token form 'Theyre'
ERROR: Sentence answers-20111108100523AA1i7no_ans-0002 token 1 -- unexpected multi-word token 'Im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111107194805AAdINwt_ans-0012 token 12-13 -- unrecognized multi-word token form 'd'Orleans'
ERROR: Sentence answers-20111108103333AA3eSCk_ans-0004 token 1 -- unexpected multi-word token 'Im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108103354AAQzdFB_ans-0007 token 3 -- unexpected multi-word token 'im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108084355AAvLpRa_ans-0009 token 44-45 -- unrecognized multi-word base form 'wa' for suffix 'na'
ERROR: Sentence answers-20111108084355AAvLpRa_ans-0009 token 44 -- unexpected multi-word token 'wana' part form 'wan', expected 'wa'
ERROR: Sentence answers-20111108084355AAvLpRa_ans-0009 token 45 -- unexpected multi-word token 'wana' part form 'a', expected 'na'
ERROR: Sentence answers-20111108065707AAj7DaH_ans-0002 token 2 -- unexpected multi-word token 'im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108111010AASEk0S_ans-0003 token 1 -- unexpected multi-word token 'im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111106144630AAadR8l_ans-0005 token 4 -- unexpected multi-word token 'thes' part upos 'DET', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108094927AA5NjHj_ans-0003 token 1 -- unexpected multi-word token 'Im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108094927AA5NjHj_ans-0003 token 16 -- unexpected multi-word token 'im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108102810AAfCh1W_ans-0019 token 1 -- unexpected multi-word token 'im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108100419AAKZvMH_ans-0011 token 47-48 -- unrecognized multi-word token form 'id'
ERROR: Sentence answers-20111108105629AAiZUDY_ans-0033 token 5-6 -- unrecognized multi-word token form 'awhile'
ERROR: Sentence answers-20111108110329AAxl1pb_ans-0010 token 22 -- unexpected multi-word token 'im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108092643AAXe4lD_ans-0063 token 6-7 -- unrecognized multi-word token form 'arnt'
ERROR: Sentence answers-20111108091921AAaLK4e_ans-0044 token 6 -- unexpected multi-word token 'Im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence answers-20111108091921AAaLK4e_ans-0062 token 1 -- unexpected multi-word token 'Im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence reviews-034813-0004 token 11-12 -- unrecognized multi-word token form 'c'mon'
ERROR: Sentence reviews-100592-0003 token 2-3 -- unrecognized multi-word token form 'wasnt'
ERROR: Sentence reviews-217359-0008 token 6 -- unexpected multi-word token 'Im' part upos 'PRON', expected 'NOUN|PROPN|NUM'
ERROR: Sentence reviews-280844-0008 token 6-7 -- unrecognized multi-word token form 'awhile'
ERROR: Sentence reviews-294081-0007 token 1-2 -- unrecognized multi-word token form 'ITS'
ERROR: Sentence reviews-018548-0006 token 16-17 -- unrecognized multi-word token form 'ur'
Note: the "im" issues are due to the token having CorrectForm='s instead of CorrectForm='m. Because my script doesn't have a direct mapping for "I's", it falls back to the general noun case, which matches on UPOS, hence the confusing error message.
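To make the fallback concrete, here is a minimal sketch of how such a check could produce that message. The validation script itself is not shown in this thread, so KNOWN_PAIRS, POSSESSIVE_UPOS, and check_mwt are hypothetical names, and the logic is an assumed reconstruction from the error output above.

```python
# Hypothetical table of known (base word, clitic CorrectForm) pairs, lower-cased.
KNOWN_PAIRS = {("i", "'m"), ("i", "'ve"), ("it", "'s"), ("you", "'re")}

# UPOS tags accepted by the assumed general possessive ('s) fallback.
POSSESSIVE_UPOS = ("NOUN", "PROPN", "NUM")


def check_mwt(token_form, base_form, base_upos, clitic_correct_form):
    """Return an error message, or None if the multi-word token is accepted."""
    pair = (base_form.lower(), clitic_correct_form.lower())
    if pair in KNOWN_PAIRS:
        return None
    if clitic_correct_form.lower() == "'s":
        # No direct mapping: treat the clitic as a possessive marker,
        # which is only expected after a noun-like word.
        if base_upos in POSSESSIVE_UPOS:
            return None
        expected = "|".join(POSSESSIVE_UPOS)
        return (
            f"unexpected multi-word token '{token_form}' "
            f"part upos '{base_upos}', expected '{expected}'"
        )
    return f"unrecognized multi-word token form '{token_form}'"


# "Im" annotated with CorrectForm='s has no direct mapping, so it reaches the fallback.
print(check_mwt("Im", "I", "PRON", "'s"))
```

With CorrectForm='m the pair ("i", "'m") is found in the table and no error is reported.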
Thanks, most of these are now fixed.
Some of these are established colloquial forms marked as Abbr=Yes ("wanna", "c'mon") rather than as typos. It looks like the corpus isn't consistent about providing a CorrectForm on abbreviations: some have it, while a majority do not.
@rhdunn does your script show any issues that still need addressing or should I close this?
I'm still getting the following:
ERROR: Sentence answers-20111108081748AAkQhGe_ans-0003 token 46-47 -- unrecognized multi-word token form 'im'
ERROR: Sentence reviews-332972-0001 token 5-6 -- unrecognized multi-word token form 'im'
The others are the colloquial forms you mentioned earlier, so are fine.