amir-zeldes / xrenner

eXternally configurable REference and Non Named Entity Recognizer

Implement a way of pattern/rule tracing

ftyers opened this issue

It would be useful to be able to trace the output of the program, for example to see which patterns are matched and, for each token, what form/text/lemma/child/agree is set to.

If you hover over the mentions in the HTML output, you'll see a tooltip with most of these (maybe more could be shown; I'm not sure what you mean by child).

I mean on the command line. Sometimes I write a rule and it doesn't work (e.g. nothing appears in the HTML), and it would be good to be able to trace why. It could work by, for example, printing the token's line from the CoNLL dependency tree followed by a list of matched variables:

2       Пушкин  Пушкин  PROPN   _       Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing   3       nsubj   _       _
form = proper
text = Пушкин
lemma = Пушкин
agree = 3sg,male

etc.
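For illustration only, here is a rough Python sketch of a printer that would produce output in the shape requested above. It assumes a standard 10-column CoNLL-U token line. `parse_conllu_token`, `format_trace` and the `matched` dictionary are hypothetical names, not part of xrenner, and the form/agree values are passed in by the caller rather than computed.

```python
# Hypothetical trace printer: none of these names exist in xrenner.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_conllu_token(line):
    """Split one CoNLL-U token line into a dict keyed by column name."""
    # CoNLL-U is tab-separated. split() also tolerates the runs of spaces
    # in the pasted example, but would break on forms that contain spaces.
    cols = line.split()
    if len(cols) != len(CONLLU_FIELDS):
        raise ValueError("expected 10 columns, got %d" % len(cols))
    return dict(zip(CONLLU_FIELDS, cols))


def format_trace(line, matched):
    """Return the token line followed by one 'name = value' line per variable."""
    token = parse_conllu_token(line)  # also validates the column count
    out = ["\t".join(token[f] for f in CONLLU_FIELDS)]
    out.extend("%s = %s" % (name, value) for name, value in matched.items())
    return "\n".join(out)


line = ("2 Пушкин Пушкин PROPN _ "
        "Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing 3 nsubj _ _")
# Values copied from the example in this thread, not computed.
matched = {"form": "proper", "text": "Пушкин",
           "lemma": "Пушкин", "agree": "3sg,male"}
print(format_trace(line, matched))
```

Keeping the variables in a plain dictionary means whatever the system has attested for a token could be dumped without the printer needing to know the variable names in advance.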

What do you mean by not appearing in the HTML? Unless you have singleton detection switched off, all mentions that are not ruled out by a stop list should show up in HTML. If singletons are on and something doesn't show up, it means the system rejected it as a mention very early. Or are you looking to get 'currently attested categories' on all tokens?

Aha! Ok, that was the problem. I had remove_singletons=True in the config.ini.

But even so:

form="proper";form="proper"&lemma=$1;100;nopropagate

I have this rule in coref_rules.tab, and here is what I'm getting from the HTML:

[Screenshot from 2017-11-17 19-19-03]
[Screenshot from 2017-11-17 19-19-14]

My guess is that the agreement information is shooting down the match. Note how one is 'male' and the other is 'Animacy=Anim|....'. As far as xrenner is concerned, the latter is a monolithic value.

There are two main ways of dealing with this. One is to use DepEdit rules to collapse annoying classes, which can be good because you can use syntactic conditions. The other is to fiddle with the 'Agreement Class Detection' section of config.ini, especially morph_rules. Here's an example from my German model, which relies on RFTagger morphological features:

# Edit morphology information - cascade of string replace rules to use on the morph field in conll data if available
morph_rules=.*([12]).*(Sg|Pl).*/\1\2;([12])Sg/\1;^[^0-9].*(Pl).*/\1;^[^0-9].*(Fem|Masc|Neut).*/\1;.*\.\*$/_

This takes tags like these:

  • PRO.Pers.Subst.3.Nom.Pl.*
  • N.Reg.Dat.Sg.Fem

And turns them into these:

  • Pl
  • Fem
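To see how the cascade arrives at those values, the rule string can be replayed with Python's `re` module. This is a minimal sketch, assuming each `pattern/replacement` pair is applied in order as a regex substitution over the whole morph field. `apply_morph_rules` is a hypothetical helper, not xrenner's own code.

```python
import re

# The German morph_rules value quoted above, copied verbatim.
MORPH_RULES = (r".*([12]).*(Sg|Pl).*/\1\2;([12])Sg/\1;"
               r"^[^0-9].*(Pl).*/\1;^[^0-9].*(Fem|Masc|Neut).*/\1;.*\.\*$/_")


def apply_morph_rules(tag, rules=MORPH_RULES):
    """Apply each 'pattern/replacement' rule in order (assumed semantics)."""
    for rule in rules.split(";"):
        # Split on the last '/' to separate the pattern from its replacement.
        pattern, replacement = rule.rsplit("/", 1)
        tag = re.sub(pattern, replacement, tag)
    return tag


for tag in ["PRO.Pers.Subst.3.Nom.Pl.*", "N.Reg.Dat.Sg.Fem"]:
    print(tag, "->", apply_morph_rules(tag))
# PRO.Pers.Subst.3.Nom.Pl.* -> Pl
# N.Reg.Dat.Sg.Fem -> Fem
```

Neither tag contains a 1 or 2, so the first two rules leave them alone. The third rule reduces the pronoun tag to `Pl` and the fourth reduces the noun tag to `Fem`. Because later rules see the output of earlier ones, the order of the rules matters.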

Aha, ok, I added:

morph_rules=[^|]+|Gender=Masc|[^|]+/male

Now I get:
[Screenshot from 2017-11-17 21-05-15]
[Screenshot from 2017-11-17 21-05-27]

And the only two rules I have in the coref_rules.tab are:

$ cat models/rus/coref_rules.tab  | grep -v '^#'
form="proper";form="proper"&text=$1;100;nopropagate
form="proper";form="proper"&lemma=$1;100;nopropagate

They both seem to have the same lemma and the agreement features are the same too.
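For anyone reading along who is unfamiliar with the rule syntax, the two rules can be pulled apart as in the sketch below. It assumes each rule consists of four `;`-separated fields, with conditions inside a field joined by `&`. The field names are illustrative labels chosen here, not identifiers from xrenner, and reading the third field as a numeric distance limit is also an assumption.

```python
# Hypothetical helper for reading a rule line; the field names are
# labels chosen for this sketch, not identifiers from xrenner.
def parse_coref_rule(line):
    """Split a rule into its four ';'-separated fields (assumed layout)."""
    first, second, distance, propagation = line.strip().split(";")
    return {
        "first": first.split("&"),      # conditions on the first mention
        "second": second.split("&"),    # conditions on the second mention
        "max_distance": int(distance),  # assumed to be a distance limit
        "propagation": propagation,
    }


rules = [
    'form="proper";form="proper"&text=$1;100;nopropagate',
    'form="proper";form="proper"&lemma=$1;100;nopropagate',
]
for rule in rules:
    print(parse_coref_rule(rule))
```

Laid out this way, the only difference between the two rules is whether the second mention's `text` or its `lemma` has to equal `$1`.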

OK, that's definitely weird. Did you put proper nouns in lemma_match_pos? Or maybe turned on proper_mod_must_match?

If it's not one of those, could you send me the model and the parse?

# What POS categories should allow lemma matching of heads for coreference? e.g. /^NNS?$/ to allow singular and plural nouns to match based on lemma
lemma_match_pos=/none/
...
# Do proper noun modifiers have to match exactly across mentions? (NB: this may include proper modifiers such as Mr.!! Often leaving this False is better)
proper_mod_must_match=False
...

I'll send over the zip file with the model and the conllu file :)