MantisAI / nervaluate

Full named-entity (i.e., not tag/token) evaluation metrics based on SemEval’13


exact match and recall values

rzanoli opened this issue · comments

Hello guys,

Regarding the exact match measure: in some cases (nested entities?) the scorer seems to produce incorrect recall values. As an example, consider the results the scorer obtains for the ‘true’ and ‘pred’ sequences below.

For this example I would expect TP = correct = 1, FN = 1, and recall = TP / (TP + FN) = 1/2 = 0.5, because we correctly extracted 1 entity (the one with "start": 1, "end": 2) out of the 2 entities in the gold standard. However, the scorer reports 0.33.

true = [ [{"label": "PER", "start": 1, "end": 2}, {"label": "PER", "start": 3, "end": 10}] ]

pred = [ [{"label": "PER", "start": 1, "end": 2}, {"label": "PER", "start": 3, "end": 5}, {"label": "PER", "start": 6, "end": 10}] ]

from nervaluate import Evaluator

evaluator = Evaluator(true, pred, tags=['PER'])

results, results_per_tag = evaluator.evaluate()

print(results)

'exact': {'correct': 1, 'incorrect': 2, 'partial': 0, 'missed': 0, 'spurious': 0, 'possible': 3, 'actual': 3, 'precision': 0.3333333333333333, 'recall': 0.3333333333333333, 'f1': 0.3333333333333333}

Thanks @rzanoli . I'll try to have a look at this over the weekend 👍

Hi, has this problem been solved yet? I checked the source code and found that it treats POS (possible = correct + incorrect + missed + partial) as TP + FN, which is inconsistent with the original definition, i.e. the total number of entities in the gold labels. See the source code here:

https://github.com/ivyleavedtoadflax/nervaluate/blob/ce37a3a9369c76edbd434bc7ffcdbb45be202f5a/nervaluate/nervaluate.py#L413-L416

https://github.com/ivyleavedtoadflax/nervaluate/blob/ce37a3a9369c76edbd434bc7ffcdbb45be202f5a/nervaluate/nervaluate.py#L428-L449

The problem comes from the counting logic for each scenario, which starts with a loop over all predicted terms:
https://github.com/ivyleavedtoadflax/nervaluate/blob/ce37a3a9369c76edbd434bc7ffcdbb45be202f5a/nervaluate/nervaluate.py#L221-L228
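For illustration only (this is a stripped-down sketch of the counting pattern, not the library's actual code): when you iterate over predictions, a single gold entity can be charged with more than one "incorrect", so a "possible" derived from those counts overshoots the gold total.

# Stripped-down illustration of the counting pattern (not nervaluate's code).
true_spans = [("PER", 1, 2), ("PER", 3, 10)]
pred_spans = [("PER", 1, 2), ("PER", 3, 5), ("PER", 6, 10)]

correct = incorrect = 0
for label, start, end in pred_spans:
    if (label, start, end) in true_spans:
        correct += 1
    elif any(start <= t_end and end >= t_start for _, t_start, t_end in true_spans):
        # Both (3, 5) and (6, 10) overlap the single gold span (3, 10),
        # so that one gold entity is counted as "incorrect" twice.
        incorrect += 1

possible_from_counts = correct + incorrect   # 1 + 2 = 3
possible_from_gold = len(true_spans)         # 2
print(correct / possible_from_counts)        # 0.33, the recall the scorer reports
print(correct / possible_from_gold)          # 0.5, the expected recall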

Hi, I was wondering if there is any news on this issue.

Hey thanks for reminding me. I've got some time off coming up in the next week or so, so I will look into this. Apologies for not fixing it sooner.

@rzanoli this is now fixed by #32. Note that, due to some changes introduced in #32, the evaluation output is now:

true = [
    [{"label": "PER", "start": 1, "end": 2}, {"label": "PER", "start": 3, "end": 10}]
]

pred = [
    [
        {"label": "PER", "start": 1, "end": 2},
        {"label": "PER", "start": 3, "end": 5},
        {"label": "PER", "start": 6, "end": 10},
    ]
]

from nervaluate import Evaluator

evaluator = Evaluator(true, pred, tags=["PER"])

results, results_per_tag = evaluator.evaluate()

print(results["exact"])

{
    'correct': 1, 
    'incorrect': 1, 
    'partial': 0, 
    'missed': 0, 
    'spurious': 1, 
    'possible': 2, 
    'actual': 3, 
    'precision': 0.3333333333333333, 
    'recall': 0.5, 
    'f1': 0.4
}
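For reference, these figures follow directly from the counts shown above: recall = correct / possible = 1/2 = 0.5, precision = correct / actual = 1/3 ≈ 0.333, and f1 (the harmonic mean of the two) = 2 · (1/3) · (1/2) / (1/3 + 1/2) = 0.4.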