huggingface / evaluate

🤗 Evaluate: A library for easily evaluating machine learning models and datasets.

Home Page: https://huggingface.co/docs/evaluate

Combined metrics ignore some arguments

mattjeffryes opened this issue

Using evaluate.combine, some kwargs do not seem to get passed to the sub-metrics, resulting in incorrect outputs.

Using the examples from the precision metric documentation:

import evaluate
metric1 = evaluate.load('precision')
metric2 = evaluate.combine(['precision'])

print(metric1.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], pos_label=0))
print(metric2.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], pos_label=0))

Out:

{'precision': 0.6666666666666666}
{'precision': 0.5}

0.666... is the correct answer

import evaluate
metric1 = evaluate.load('precision')
metric2 = evaluate.combine(['precision'])

print(metric1.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], sample_weight=[0.9, 0.5, 3.9, 1.2, 0.3]))
print(metric2.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], sample_weight=[0.9, 0.5, 3.9, 1.2, 0.3]))

Out:

{'precision': 0.23529411764705882}
{'precision': 0.5}

0.235... is the correct answer
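
For what it's worth, both expected values can be checked directly against scikit-learn's precision_score, which backs the precision metric (just a sanity check):

# Sanity check against scikit-learn, which the precision metric wraps.
from sklearn.metrics import precision_score

references = [0, 1, 0, 1, 0]
predictions = [0, 0, 1, 1, 0]

# pos_label=0: two of the three predicted 0s are correct -> 2/3
print(precision_score(references, predictions, pos_label=0))
# 0.6666666666666666

# sample_weight with the default pos_label=1: 1.2 / (1.2 + 3.9)
print(precision_score(references, predictions,
                      sample_weight=[0.9, 0.5, 3.9, 1.2, 0.3]))
# 0.23529411764705882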

This issue occurred with all metrics I tried (precision, recall and F1).

Perhaps I am using the function incorrectly, but if so this behaviour was very surprising to me.

Mac OS 13.2 on M1, Python 3.10.9, evaluate 0.4.0

You are right, it seems the keyword arguments are overridden in evaluate.combine. I'll send a PR.

Hi! I think there are also a few more cases where the combine method doesn't process arguments correctly, mainly around the average argument. For example:

import evaluate

predictions = [0, 2, 1, 0, 0, 1]
references = [0, 1, 2, 0, 1, 2]

metrics = evaluate.combine(['precision', 'recall'])
metrics.compute(predictions=predictions, references=references, average='micro')

Or even when the metric is initialized with the average argument:

import evaluate

predictions = [0, 2, 1, 0, 0, 1]
references = [0, 1, 2, 0, 1, 2]

metrics = evaluate.combine([evaluate.load('precision', average='micro'), evaluate.load('recall', average='micro')])
metrics.compute(predictions=predictions, references=references)

Both cases result in a ValueError:

ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].

The PR is merged, can you try whether it works now?

I can confirm that my #1 case with the average keyword now works correctly.

Interestingly, since I was using a fresh Python environment, I noticed that evaluate doesn't pull in scikit-learn as a dependency, even though it is necessary to run metrics such as precision or recall. Maybe that's intended behaviour to reduce dependencies, but I just wanted to mention it.

My #2 case still doesn't work; it might be due to the priority of which arguments are used. But I believe that's a separate issue from this one.

I still have the same issue as @bvezilic. Is there a workaround?
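
Until combine forwards these keyword arguments reliably, one workaround is to drop it and call each metric's compute separately with the arguments it needs, merging the result dicts yourself (a sketch reusing the example inputs from earlier in the thread):

import evaluate

predictions = [0, 2, 1, 0, 0, 1]
references = [0, 1, 2, 0, 1, 2]

# Compute each metric on its own so the keyword arguments reach its _compute.
results = {}
for name in ['precision', 'recall']:
    metric = evaluate.load(name)
    results.update(metric.compute(predictions=predictions,
                                  references=references,
                                  average='micro'))

print(results)

This loses the single combined call, but every metric gets exactly the arguments it expects.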

Same here. This is problematic when different metrics require different arguments, like this:

metrics = evaluate.combine(
    [
        evaluate.load("bertscore", lang="en"),
        evaluate.load("bleu"),
        evaluate.load('rouge', use_aggregator=False)
    ]
)

results = metrics.compute(predictions=predictions, references=groundtruths)

This raises a ValueError: Either 'lang' (e.g. 'en') or 'model_type' (e.g. 'microsoft/deberta-xlarge-mnli') must be specified.

If lang is passed to metrics.compute instead, other metrics like bleu throw TypeError: _compute() got an unexpected keyword argument 'lang', because that parameter is not meant for them.
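
One way to cope until this is fixed is to skip combine here too and keep a small per-metric mapping of keyword arguments, computing each metric on its own and keying the results by metric name (a sketch with placeholder inputs; metric_kwargs is just a local dict for this example):

import evaluate

# Placeholder inputs; substitute your own predictions / groundtruths.
predictions = ["hello there general kenobi", "foo bar foobar"]
groundtruths = [["hello there general kenobi"], ["foo bar foobar"]]

# Keyword arguments routed only to the metric that actually accepts them.
metric_kwargs = {
    "bertscore": {"lang": "en"},
    "bleu": {},
    "rouge": {"use_aggregator": False},
}

results = {}
for name, kwargs in metric_kwargs.items():
    metric = evaluate.load(name)
    # Keying by metric name keeps e.g. bertscore's and rouge's outputs apart.
    results[name] = metric.compute(predictions=predictions,
                                   references=groundtruths,
                                   **kwargs)

The tradeoff is that you lose the flattened combined dict, but no metric ever sees an argument it doesn't understand.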