Jacy and ERG divergence
goodmami opened this issue · comments
From the ACE generation (ERG) log files in a translation pipeline:
25841 EP 'ja:_koto_n_nom' is not covered
25008 EP 'ja:neg_x' is not covered
24090 EP 'ja:coord_c' is not covered
21305 EP 'ja:_te_p_adjunct' is not covered
14923 EP 'ja:unspec_adj' is not covered
14923 EP 'ja:degree' is not covered
12529 EP 'ja:_you_n' is not covered
11217 EP 'ja:adversative' is not covered
7539 EP 'ja:_ni_p' is not covered
7482 EP 'ja:udef_q' is not covered
7340 EP 'ja:vv' is not covered
7122 EP 'ja:_suru_v_soc' is not covered
6587 EP 'ja:_kudasaru_v_aux' is not covered
5926 EP 'ja:_no_p' is not covered
4323 EP 'ja:_comma_d' is not covered
4054 EP 'ja:unknown_v' is not covered
3286 EP 'ja:_tokoro_n_2' is not covered
3164 EP 'ja:_ga_d' is not covered
3121 EP 'ja:_hou_n_7' is not covered
3115 EP 'ja:_はやる_v_unk' is not covered
3084 EP 'ja:_sha_a_4' is not covered
3076 EP 'ja:discourse_x' is not covered
3075 EP 'ja:_mo_d' is not covered
2934 EP 'ja:_chuu_n' is not covered
2779 EP 'ja:plus' is not covered
2309 EP 'ja:_made_p' is not covered
2267 EP 'ja:_mato_n' is not covered
2199 EP 'ja:_tame_n_5' is not covered
2190 EP 'ja:dofw' is not covered
This is a partial list, with occurrence counts on the left. It's not surprising that Jacy predicates are not covered by the ERG, but when they are this frequent, JaEn should perhaps have a hand-built rule to catch the cases where the automatically extracted rules fail to transfer something. In some cases there is such a rule, but it has become outdated. For instance, neg_x is not covered because JaEn's rule still targets neg_v. Similarly, JaEn targets coord instead of coord_c.
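For reference, here is a minimal sketch of how a tally like the one above could be produced from the generation logs. It assumes each raw log line contains the text `EP '<predicate>' is not covered` somewhere in it (the exact surrounding ACE log format is an assumption), and the helper names `count_uncovered` and `split_by_side` are hypothetical, not part of ACE or any existing tool.

```python
import re
from collections import Counter

# Assumed message shape, taken from the listing above:
#   EP 'ja:neg_x' is not covered
UNCOVERED = re.compile(r"EP '([^']+)' is not covered")


def count_uncovered(lines):
    """Count how often each predicate is reported as not covered."""
    counts = Counter()
    for line in lines:
        match = UNCOVERED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def split_by_side(counts, prefix="ja:"):
    """Separate Jacy-side predicates (carrying the prefix) from the rest."""
    jacy = Counter({p: n for p, n in counts.items() if p.startswith(prefix)})
    other = Counter({p: n for p, n in counts.items() if not p.startswith(prefix)})
    return jacy, other


if __name__ == "__main__":
    # Tiny made-up sample; real input would be the lines of the log files.
    sample = [
        "EP 'ja:neg_x' is not covered",
        "EP 'ja:neg_x' is not covered",
        "EP 'def_q' is not covered",
        "an unrelated log line",
    ]
    jacy, other = split_by_side(count_uncovered(sample))
    for pred, n in jacy.most_common():
        print(n, pred)
    for pred, n in other.most_common():
        print(n, pred)
```

Splitting on the `ja:` prefix gives the two lists in this issue: predicates that were never transferred out of Jacy's namespace, and predicates that came out of transfer but are unknown to the ERG.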
And here are some of those that aren't covered on the ERG side:
30754 EP 'def_q' is not covered
16051 EP 'implicit_q' is not covered
5386 EP '_good_a_at-for' is not covered
4053 EP 'of_rel_noun_mark' is not covered
3168 EP '_house_n_1' is not covered
2879 EP '_so_c' is not covered
2266 EP 'time_n' is not covered
1540 EP 'place_n' is not covered
1269 EP 'abstr_deg' is not covered
889 EP 'def_implicit_q' is not covered
848 EP '_soon_p' is not covered
794 EP '_home_p' is not covered
654 EP '_late_p' is not covered
555 EP '_here_a_1' is not covered
537 EP 'manner' is not covered
517 EP '_yesterday_a_1' is not covered
502 EP '_tomorrow_a_1' is not covered
435 EP '_bear_v_2' is not covered
383 EP '_there_a_1' is not covered
354 EP 'thing' is not covered
300 EP '_as_p_comp' is not covered
297 EP '_grandmother_n_1' is not covered
264 EP '_of_x_subord' is not covered
259 EP '_i_n_num' is not covered
240 EP 'numbered_hour' is not covered
188 EP 'pron' is not covered
There are some other reasons for these, but generally it's also because the hand-built JaEn rules are out of date. The def_q and implicit_q ones are because the modified SEM-I for the ERG missed them.
_koto_n_nom can perhaps just be dropped, or be added to the auto-include set for my extractor (a predicate in that set gets included in an extracted transfer rule even if it didn't exist in the predicate alignment, as long as it is incorporated into the rest of the MRS fragment).
unspec_adj and degree have the same count because they always co-occur. There should be a general rule or two written for these. Maybe:
;;; e.g. 2 キロ の 水 -- 2 kilograms of water
degree+unspec_adj--noun+of_p := monotonic_mtr &
[ INPUT.RELS < #m [ LBL #h1, ARG0 #x2 ],
               [ PRED "ja:degree", LBL #h3, ARG1 #e4, ARG2 #x2 ],
               [ PRED "ja:unspec_adj", LBL #h3, ARG0 #e4, ARG1 #x3 ],
               [ LBL #h3, ARG0 #x3 ] >,
  OUTPUT.RELS < #m [ LBL #h1, ARG0 #x2 ],
                [ PRED "_of_p_rel", LBL #h1, ARG1 #x2, ARG2 #x3 ],
                [ ARG0 #x3 ] > ].
and a different rule for the generic_entity case... although the rule above might be broken...
_you_n would be tough to write rules for... but maybe it can just be dropped.