MantisAI / nervaluate

Full named-entity (i.e., not tag/token) evaluation metrics based on SemEval’13

Is the number of POSSIBLE different from the number of "B-" tokens?

Hans0124SG opened this issue

Thanks for the great library.

Just wondering about the logic for calculating the POSSIBLE count.
If my pred is
B-ORG, I-ORG, B-ORG, I-ORG
and my true label is
B-ORG, I-ORG, I-ORG, I-ORG

I think the current logic will calculate POSSIBLE as 2, but there is only 1 gold-standard annotation.

If 2 is correct, that means POSSIBLE cannot be interpreted as the number of gold-standard entities in the data, am I right?

Hi @Hans0124SG, thanks for raising an issue. Can you write a short reproducible example?

Sure.

from nervaluate import compute_metrics, collect_named_entities

# One gold ORG entity (tokens 1-4) vs. two predicted ORG entities (tokens 1-2 and 3-4)
true = ['O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O']
pred = ['O', 'B-ORG', 'I-ORG', 'B-ORG', 'I-ORG', 'O']

result, entity_level_result = compute_metrics(collect_named_entities(true), collect_named_entities(pred), ['ORG'])
entity_level_result['ORG']

I get the following output:

{'strict': {'correct': 0,
  'incorrect': 2,
  'partial': 0,
  'missed': 0,
  'spurious': 0,
  'precision': 0,
  'recall': 0,
  'f1': 0,
  'actual': 2,
  'possible': 2},
 'ent_type': {'correct': 2,
  'incorrect': 0,
  'partial': 0,
  'missed': 0,
  'spurious': 0,
  'precision': 0,
  'recall': 0,
  'f1': 0,
  'actual': 2,
  'possible': 2},
 'partial': {'correct': 0,
  'incorrect': 0,
  'partial': 2,
  'missed': 0,
  'spurious': 0,
  'precision': 0,
  'recall': 0,
  'f1': 0,
  'actual': 2,
  'possible': 2},
 'exact': {'correct': 0,
  'incorrect': 2,
  'partial': 0,
  'missed': 0,
  'spurious': 0,
  'precision': 0,
  'recall': 0,
  'f1': 0,
  'actual': 2,
  'possible': 2}}

Possible is 2, but there is only 1 entity in the true label sequence.
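To make the mismatch visible, printing the collected spans shows that the true sequence contains a single ORG entity while the prediction contains two (the exact print format may differ between versions):

from nervaluate import collect_named_entities

true = ['O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O']
pred = ['O', 'B-ORG', 'I-ORG', 'B-ORG', 'I-ORG', 'O']

print(collect_named_entities(true))  # one ORG span covering tokens 1-4
print(collect_named_entities(pred))  # two ORG spans: tokens 1-2 and tokens 3-4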

Thanks for this @Hans0124SG. This looks like a bug: possible should be interpretable as the maximum number of matches available in the true data.

Thanks @ivyleavedtoadflax
I suspect that this is not really a bug.

According to the definition:
POS = COR + INC + PAR + MIS = TP + FN
However, since the predicted entities are not necessarily mapped 1-to-1 to the true entities, TP + FN does not equal the total number of positive labels.

That's why I feel POS is just not the total number of gold-standard entities.

What do you think?
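As a rough, self-contained sketch (not nervaluate's actual implementation, just the idea as I understand it): every predicted span that overlaps a gold span is judged once and contributes one unit to COR/INC/PAR/MIS, so a single gold entity overlapped by two predictions gets counted twice in POS.

# Simplified sketch of the "exact" scenario counting, not nervaluate's actual code.
# Spans are (type, start, end) tuples over token indices; overlap = ranges intersect.

def overlaps(a, b):
    return a[1] <= b[2] and b[1] <= a[2]

def count_exact(true_spans, pred_spans):
    correct = incorrect = 0
    for p in pred_spans:
        if p in true_spans:                            # exact boundary and type match
            correct += 1
        elif any(overlaps(p, t) for t in true_spans):  # overlap with wrong boundaries
            incorrect += 1
    return correct, incorrect

true_spans = [('ORG', 1, 4)]                 # the single gold entity
pred_spans = [('ORG', 1, 2), ('ORG', 3, 4)]  # both predictions overlap it

correct, incorrect = count_exact(true_spans, pred_spans)
possible = correct + incorrect  # + partial + missed, both 0 here
print(correct, incorrect, possible)  # 0 2 2 -> POS is 2 despite 1 gold entity

Under this reading, the counts match the output above (incorrect = 2, possible = 2) without POS being the number of gold entities.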

Btw, this phenomenon exists in @davidsbatista's ner_evaluation library as well.

Yes, you're right; I realized this while implementing a test for it. There are a few unexpected results like this, and I think another one was raised in the issues on @davidsbatista's original repo.

The solution is probably just to document them.

Yeah, great, thanks for the confirmation. Hope this is useful for other people who have the same question.