MantisAI / nervaluate

Full named-entity (i.e., not tag/token) evaluation metrics based on SemEval’13


exact match and recall values

rzanoli opened this issue · comments

Hello guys,

Regarding the exact match measure: in some cases (nested entities?) the scorer seems to produce incorrect recall values. As an example, consider the results the scorer obtains for the ‘true’ and ‘pred’ sequences below.

For this example I would expect TP = correct = 1, FN = 1, and recall = TP / (TP + FN) = 1/2 = 0.5, because we correctly extracted 1 entity (the one with "start": 1, "end": 2) out of the 2 entities in the gold standard. However, the scorer reports 0.33.

true = [ [{"label": "PER", "start": 1, "end": 2}, {"label": "PER", "start": 3, "end": 10}] ]

pred = [ [{"label": "PER", "start": 1, "end": 2}, {"label": "PER", "start": 3, "end": 5}, {"label": "PER", "start": 6, "end": 10}] ]

from nervaluate import Evaluator

evaluator = Evaluator(true, pred, tags=['PER'])

results, results_per_tag = evaluator.evaluate()

print(results)

'exact': {'correct': 1, 'incorrect': 2, 'partial': 0, 'missed': 0, 'spurious': 0, 'possible': 3, 'actual': 3, 'precision': 0.3333333333333333, 'recall': 0.3333333333333333, 'f1': 0.3333333333333333}

Thanks @rzanoli . I'll try to have a look at this over the weekend 👍

Hi, has this problem been solved yet? I checked the source code and found that it treats POS (possible = correct + incorrect + missed + partial) as TP + FN, which is inconsistent with the original definition, i.e. the total number of entities in the gold labels. See the source code here:

https://github.com/ivyleavedtoadflax/nervaluate/blob/ce37a3a9369c76edbd434bc7ffcdbb45be202f5a/nervaluate/nervaluate.py#L413-L416

https://github.com/ivyleavedtoadflax/nervaluate/blob/ce37a3a9369c76edbd434bc7ffcdbb45be202f5a/nervaluate/nervaluate.py#L428-L449

The problem comes from the counting logic for each scenario, which starts with a loop over all predicted terms:
https://github.com/ivyleavedtoadflax/nervaluate/blob/ce37a3a9369c76edbd434bc7ffcdbb45be202f5a/nervaluate/nervaluate.py#L221-L228
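For illustration only (this is a stripped-down sketch of the counting pattern, not the library's actual code): when you iterate over predictions, a single gold entity can be charged with more than one "incorrect", so a "possible" derived from those counts overshoots the gold total.

# Stripped-down illustration of the counting pattern (not nervaluate's code).
true_spans = [("PER", 1, 2), ("PER", 3, 10)]
pred_spans = [("PER", 1, 2), ("PER", 3, 5), ("PER", 6, 10)]

correct = incorrect = 0
for label, start, end in pred_spans:
    if (label, start, end) in true_spans:
        correct += 1
    elif any(start <= t_end and end >= t_start for _, t_start, t_end in true_spans):
        # Both (3, 5) and (6, 10) overlap the single gold span (3, 10),
        # so that one gold entity is counted as "incorrect" twice.
        incorrect += 1

possible_from_counts = correct + incorrect   # 1 + 2 = 3
possible_from_gold = len(true_spans)         # 2
print(correct / possible_from_counts)        # 0.33, the recall the scorer reports
print(correct / possible_from_gold)          # 0.5, the expected recall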

Hi, I was wondering if there is any news on this issue.

Hey thanks for reminding me. I've got some time off coming up in the next week or so, so I will look into this. Apologies for not fixing it sooner.

@rzanoli this is now fixed by #32. Note that, due to some changes introduced in #32, the evaluation output is now:

true = [
    [{"label": "PER", "start": 1, "end": 2}, {"label": "PER", "start": 3, "end": 10}]
]

pred = [
    [
        {"label": "PER", "start": 1, "end": 2},
        {"label": "PER", "start": 3, "end": 5},
        {"label": "PER", "start": 6, "end": 10},
    ]
]

from nervaluate import Evaluator

evaluator = Evaluator(true, pred, tags=["PER"])

results, results_per_tag = evaluator.evaluate()

print(results["exact"])

{
    'correct': 1, 
    'incorrect': 1, 
    'partial': 0, 
    'missed': 0, 
    'spurious': 1, 
    'possible': 2, 
    'actual': 3, 
    'precision': 0.3333333333333333, 
    'recall': 0.5, 
    'f1': 0.4
}
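For reference, these figures follow directly from the counts shown above: recall = correct / possible = 1/2 = 0.5, precision = correct / actual = 1/3 ≈ 0.333, and f1 (the harmonic mean of the two) = 2 · (1/3) · (1/2) / (1/3 + 1/2) = 0.4.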