MantisAI / nervaluate

Full named-entity (i.e., not tag/token) evaluation metrics based on SemEval’13

Is the number of POSSIBLE different from the number of "B-" tokens?

Hans0124SG opened this issue

Thanks for the great library.

Just wondering about the logic for calculating the POSSIBLE count.
If my pred is
B-ORG, I-ORG, B-ORG, I-ORG
and my true label is
B-ORG, I-ORG, I-ORG, I-ORG

I think the current logic will calculate POSSIBLE as 2, but there is only 1 gold-standard annotation.

If 2 is correct, that means POSSIBLE cannot be interpreted as the number of gold-standard entities in the data, am I right?

Hi @Hans0124SG, thanks for raising an issue. Can you write a short reproducible example?

Sure.

from nervaluate import compute_metrics, collect_named_entities

# One gold ORG entity (tokens 1-4) vs. two predicted ORG entities (tokens 1-2 and 3-4)
true = ['O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O']
pred = ['O', 'B-ORG', 'I-ORG', 'B-ORG', 'I-ORG', 'O']

result, entity_level_result = compute_metrics(collect_named_entities(true), collect_named_entities(pred), ['ORG'])
entity_level_result['ORG']

I get the following output:

{'strict': {'correct': 0,
  'incorrect': 2,
  'partial': 0,
  'missed': 0,
  'spurious': 0,
  'precision': 0,
  'recall': 0,
  'f1': 0,
  'actual': 2,
  'possible': 2},
 'ent_type': {'correct': 2,
  'incorrect': 0,
  'partial': 0,
  'missed': 0,
  'spurious': 0,
  'precision': 0,
  'recall': 0,
  'f1': 0,
  'actual': 2,
  'possible': 2},
 'partial': {'correct': 0,
  'incorrect': 0,
  'partial': 2,
  'missed': 0,
  'spurious': 0,
  'precision': 0,
  'recall': 0,
  'f1': 0,
  'actual': 2,
  'possible': 2},
 'exact': {'correct': 0,
  'incorrect': 2,
  'partial': 0,
  'missed': 0,
  'spurious': 0,
  'precision': 0,
  'recall': 0,
  'f1': 0,
  'actual': 2,
  'possible': 2}}

Possible is 2, but there is only 1 entity in the true label sequence.
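To make the mismatch visible, printing the collected spans shows that the true sequence contains a single ORG entity while the prediction contains two (the exact print format may differ between versions):

from nervaluate import collect_named_entities

true = ['O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O']
pred = ['O', 'B-ORG', 'I-ORG', 'B-ORG', 'I-ORG', 'O']

print(collect_named_entities(true))  # one ORG span covering tokens 1-4
print(collect_named_entities(pred))  # two ORG spans: tokens 1-2 and tokens 3-4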

Thanks for this @Hans0124SG. This looks like a bug: possible should be interpretable as the maximum number of matches available in the true data.

Thanks @ivyleavedtoadflax
I suspect that this is not really a bug.

According to the definition:
POS = COR + INC + PAR + MIS = TP + FN
However, since the predicted entities are not necessarily mapped 1-to-1 to the true entities, TP + FN does not equal the total number of positive labels.

That's why I feel POS is just not the total number of gold-standard entities.

What do you think?
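As a rough, self-contained sketch (not nervaluate's actual implementation, just the idea as I understand it): every predicted span that overlaps a gold span is judged once and contributes one unit to COR/INC/PAR/MIS, so a single gold entity overlapped by two predictions gets counted twice in POS.

# Simplified sketch of the "exact" scenario counting, not nervaluate's actual code.
# Spans are (type, start, end) tuples over token indices; overlap = ranges intersect.

def overlaps(a, b):
    return a[1] <= b[2] and b[1] <= a[2]

def count_exact(true_spans, pred_spans):
    correct = incorrect = 0
    for p in pred_spans:
        if p in true_spans:                            # exact boundary and type match
            correct += 1
        elif any(overlaps(p, t) for t in true_spans):  # overlap with wrong boundaries
            incorrect += 1
    return correct, incorrect

true_spans = [('ORG', 1, 4)]                 # the single gold entity
pred_spans = [('ORG', 1, 2), ('ORG', 3, 4)]  # both predictions overlap it

correct, incorrect = count_exact(true_spans, pred_spans)
possible = correct + incorrect  # + partial + missed, both 0 here
print(correct, incorrect, possible)  # 0 2 2 -> POS is 2 despite 1 gold entity

Under this reading, the counts match the output above (incorrect = 2, possible = 2) without POS being the number of gold entities.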

Btw, this phenomenon exists in @davidsbatista's ner_evaluation library as well.

Yes, you're right; I realized this while implementing a test for it. There are a few unexpected results like this, and I think another one was raised in the issues on @davidsbatista's original repo.

The solution is probably just to document them.

Yeah, great, thanks for the confirmation. Hope this is useful for other people who have the same question.