delph-in / JaEn

Japanese↔English transfer grammar for machine translation

Jacy and ERG divergence

goodmami opened this issue

From the ACE generation (ERG) log files in a translation pipeline:

  25841 EP 'ja:_koto_n_nom' is not covered
  25008 EP 'ja:neg_x' is not covered
  24090 EP 'ja:coord_c' is not covered
  21305 EP 'ja:_te_p_adjunct' is not covered
  14923 EP 'ja:unspec_adj' is not covered
  14923 EP 'ja:degree' is not covered
  12529 EP 'ja:_you_n' is not covered
  11217 EP 'ja:adversative' is not covered
   7539 EP 'ja:_ni_p' is not covered
   7482 EP 'ja:udef_q' is not covered
   7340 EP 'ja:vv' is not covered
   7122 EP 'ja:_suru_v_soc' is not covered
   6587 EP 'ja:_kudasaru_v_aux' is not covered
   5926 EP 'ja:_no_p' is not covered
   4323 EP 'ja:_comma_d' is not covered
   4054 EP 'ja:unknown_v' is not covered
   3286 EP 'ja:_tokoro_n_2' is not covered
   3164 EP 'ja:_ga_d' is not covered
   3121 EP 'ja:_hou_n_7' is not covered
   3115 EP 'ja:_はやる_v_unk' is not covered
   3084 EP 'ja:_sha_a_4' is not covered
   3076 EP 'ja:discourse_x' is not covered
   3075 EP 'ja:_mo_d' is not covered
   2934 EP 'ja:_chuu_n' is not covered
   2779 EP 'ja:plus' is not covered
   2309 EP 'ja:_made_p' is not covered
   2267 EP 'ja:_mato_n' is not covered
   2199 EP 'ja:_tame_n_5' is not covered
   2190 EP 'ja:dofw' is not covered

This is a partial list; the numbers on the left are occurrence counts. It's not surprising that Jacy predicates are not covered by the ERG, but when they are this frequent it means JaEn should perhaps have a hand-built rule to catch the cases where the automatically extracted rules fail to transfer something. In some cases such a rule exists, but it has become outdated: for instance, neg_x is not covered because JaEn's rule still targets neg_v, and similarly JaEn targets coord instead of coord_c.
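For those, the fix is probably just retargeting the input predicate of the existing hand-built rule. A minimal sketch for the negation case (the rule name and argument structure here are my guesses rather than the actual JaEn rule, and it assumes a scopal ARG1 on both sides):

;;; sketch only: point the negation rule at Jacy's current neg_x
neg_x--neg := monotonic_mtr &
[ INPUT.RELS < [ PRED "ja:neg_x", LBL #h1, ARG0 #e2, ARG1 #h3 ] >,
  OUTPUT.RELS < [ PRED "neg_rel", LBL #h1, ARG0 #e2, ARG1 #h3 ] > ].

The coord_c case would presumably be the analogous renaming in the coordination rule.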

And here are some of the predicates that aren't covered on the ERG side:

  30754 EP 'def_q' is not covered
  16051 EP 'implicit_q' is not covered
   5386 EP '_good_a_at-for' is not covered
   4053 EP 'of_rel_noun_mark' is not covered
   3168 EP '_house_n_1' is not covered
   2879 EP '_so_c' is not covered
   2266 EP 'time_n' is not covered
   1540 EP 'place_n' is not covered
   1269 EP 'abstr_deg' is not covered
    889 EP 'def_implicit_q' is not covered
    848 EP '_soon_p' is not covered
    794 EP '_home_p' is not covered
    654 EP '_late_p' is not covered
    555 EP '_here_a_1' is not covered
    537 EP 'manner' is not covered
    517 EP '_yesterday_a_1' is not covered
    502 EP '_tomorrow_a_1' is not covered
    435 EP '_bear_v_2' is not covered
    383 EP '_there_a_1' is not covered
    354 EP 'thing' is not covered
    300 EP '_as_p_comp' is not covered
    297 EP '_grandmother_n_1' is not covered
    264 EP '_of_x_subord' is not covered
    259 EP '_i_n_num' is not covered
    240 EP 'numbered_hour' is not covered
    188 EP 'pron' is not covered

There are some other reasons for these, but generally it's also because the hand-built JaEn rules are out of date. The def_q and implicit_q ones are there because the modified SEM-I for the ERG missed them.
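Fixing the SEM-I is the real solution, but in the meantime a stopgap cleanup rule could rewrite these into quantifiers the ERG will happily generate from. A sketch, assuming def_q can be approximated by _the_q (the rule name is made up, and whether JaEn stages such cleanup rules after the main transfer is also an assumption):

;;; sketch only: rewrite def_q as a generatable definite quantifier
def_q--the_q := monotonic_mtr &
[ INPUT.RELS < [ PRED "def_q", LBL #h1, ARG0 #x2, RSTR #h3, BODY #h4 ] >,
  OUTPUT.RELS < [ PRED "_the_q_rel", LBL #h1, ARG0 #x2, RSTR #h3, BODY #h4 ] > ].

implicit_q could get the same treatment, perhaps mapping to udef_q.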

_koto_n_nom can perhaps just be dropped, or be added to the auto-include set for my extractor (i.e., it gets included in an extracted transfer rule even if it didn't appear in the predicate alignment, as long as it is incorporated into the rest of the MRS fragment).
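If dropping turns out to be the right call, the mechanical form would be something like the sketch below (hypothetical rule name; it assumes the transfer machinery accepts an empty OUTPUT list for a pure deletion, and any other EPs referencing the dropped ARG0 would still need to end up well-formed):

;;; sketch only: discard ja:_koto_n_nom outright
koto_n_nom--drop := monotonic_mtr &
[ INPUT.RELS < [ PRED "ja:_koto_n_nom" ] >,
  OUTPUT.RELS < > ].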

unspec_adj and degree have the same count because they always co-occur. There should be a general rule or two written for these. Maybe:

;;; e.g. 2 キロ の 水 -- 2 kilograms of water
degree+unspec_adj--noun+of_p := monotonic_mtr &
[ INPUT.RELS < #m [ LBL #h1, ARG0 #x2 ],
               [ PRED "ja:degree", LBL #h3, ARG1 #e4, ARG2 #x2 ],
               [ PRED "ja:unspec_adj", LBL #h3, ARG0 #e4, ARG1 #x3 ],
               [ LBL #h3, ARG0 #x3 ] >,
  OUTPUT.RELS < #m [ LBL #h1, ARG0 #x2 ],
                [ PRED "_of_p_rel", LBL #h1, ARG1 #x2, ARG2 #x3 ],
                [ ARG0 #x3 ] > ].

And a different rule would be needed for the generic_entity case... although the rule above might be broken...

_you_n would be tough to write rules for... but maybe it can just be dropped.