Combined metrics ignore some arguments
mattjeffryes opened this issue · comments
Using `evaluate.combine`, some kwargs seem not to get passed to the sub-metrics, resulting in incorrect outputs.
Using the examples from `precision`:
```python
import evaluate
metric1 = evaluate.load('precision')
metric2 = evaluate.combine(['precision'])
print(metric1.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], pos_label=0))
print(metric2.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], pos_label=0))
```
Out:
```
{'precision': 0.6666666666666666}
{'precision': 0.5}
```
0.666... is the correct answer.
```python
import evaluate
metric1 = evaluate.load('precision')
metric2 = evaluate.combine(['precision'])
print(metric1.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], sample_weight=[0.9, 0.5, 3.9, 1.2, 0.3]))
print(metric2.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], sample_weight=[0.9, 0.5, 3.9, 1.2, 0.3]))
```
Out:
```
{'precision': 0.23529411764705882}
{'precision': 0.5}
```
0.235... is the correct answer.
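As a sanity check, both expected values can be reproduced by hand: precision is TP / (TP + FP) over the predictions equal to `pos_label`, and sample weights simply replace counts with weight sums. A minimal pure-Python sketch (not the library's implementation, no evaluate or scikit-learn needed):

```python
# Pure-Python sanity check of the expected precision values above.
# precision = TP / (TP + FP), restricted to predictions of the positive class;
# with sample_weight, counts become sums of the corresponding weights.
def precision(references, predictions, pos_label=1, sample_weight=None):
    if sample_weight is None:
        sample_weight = [1.0] * len(references)
    tp = sum(w for r, p, w in zip(references, predictions, sample_weight)
             if p == pos_label and r == pos_label)
    fp = sum(w for r, p, w in zip(references, predictions, sample_weight)
             if p == pos_label and r != pos_label)
    return tp / (tp + fp)

refs = [0, 1, 0, 1, 0]
preds = [0, 0, 1, 1, 0]
print(precision(refs, preds, pos_label=0))  # 0.6666666666666666
print(precision(refs, preds,
                sample_weight=[0.9, 0.5, 3.9, 1.2, 0.3]))  # 0.23529411764705882
```

This matches `metric1` in both examples, confirming that `combine` is the one dropping the kwargs.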
This issue occurred with all metrics I tried (precision, recall, and F1).
Perhaps I am using the function incorrectly, but if so this behaviour was very surprising to me.
macOS 13.2 on M1, Python 3.10.9, evaluate 0.4.0
You are right, it seems the keyword arguments are overridden in `evaluate.combine`. I'll send a PR.
Hi! I think there are a few more cases where the `combine` method doesn't process arguments correctly, mainly around the `average` argument. For example:
```python
import evaluate
predictions = [0, 2, 1, 0, 0, 1]
references = [0, 1, 2, 0, 1, 2]
metrics = evaluate.combine(['precision', 'recall'])
metrics.compute(predictions=predictions, references=references, average='micro')
```
Or even when the metric is initialized with the `average` argument:
```python
import evaluate
predictions = [0, 2, 1, 0, 0, 1]
references = [0, 1, 2, 0, 1, 2]
metrics = evaluate.combine([evaluate.load('precision', average='micro'), evaluate.load('recall', average='micro')])
metrics.compute(predictions=predictions, references=references)
```
Both cases result in a ValueError:
```
ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
```
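For reference, the expected output for the example above is easy to check by hand: for single-label multiclass data, micro-averaged precision and recall both reduce to overall accuracy (global TP divided by the total number of predictions). A pure-Python sketch, not the scikit-learn implementation:

```python
# Micro-averaged precision and recall on single-label multiclass data both
# reduce to overall accuracy: every sample contributes exactly one prediction,
# so micro TP = correct predictions and TP + FP = TP + FN = total samples.
def micro_precision_recall(references, predictions):
    correct = sum(r == p for r, p in zip(references, predictions))
    score = correct / len(references)
    return {"precision": score, "recall": score}

print(micro_precision_recall([0, 1, 2, 0, 1, 2], [0, 2, 1, 0, 0, 1]))
# {'precision': 0.3333333333333333, 'recall': 0.3333333333333333}
```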
The PR is merged; can you check whether it works now?
I can confirm that my #1 case with the `average` keyword now works correctly.
Interestingly, since I was using a fresh Python environment, I noticed that evaluate doesn't declare a dependency on scikit-learn, which is necessary to run metrics such as `precision` or `recall`. Maybe that's intended behaviour to keep dependencies light, but I just wanted to mention it.
My #2 case still doesn't work, which might be due to the priority of which arguments are used. But I believe that's a separate issue from this one.
I still have the same issue as @bvezilic. Is there a workaround?
Same here. This is problematic when different metrics require different arguments, like this:
```python
metrics = evaluate.combine(
    [
        evaluate.load("bertscore", lang="en"),
        evaluate.load("bleu"),
        evaluate.load("rouge", use_aggregator=False),
    ]
)
results = metrics.compute(predictions=predictions, references=groundtruths)
```
This returns the error `ValueError: Either 'lang' (e.g. 'en') or 'model_type' (e.g. 'microsoft/deberta-xlarge-mnli') must be specified`.
If `lang` is instead passed to the `metrics.compute` method, then other metrics, like `bleu`, will throw `TypeError: _compute() got an unexpected keyword argument 'lang'`, because this parameter is not meant for them.
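Until `combine` supports per-metric kwargs, one workaround is to skip it and call each metric's `compute` separately with its own kwargs, merging the result dicts. A minimal sketch of that pattern; the stub metric classes below are hypothetical stand-ins for the objects returned by `evaluate.load(...)`, so the snippet runs without evaluate installed:

```python
# Workaround sketch: pair each metric with its own kwargs instead of relying
# on combine() to route them, then merge the per-metric result dicts.
def compute_all(metrics, predictions, references):
    """metrics: list of (metric, extra_kwargs) pairs."""
    results = {}
    for metric, extra in metrics:
        results.update(metric.compute(
            predictions=predictions, references=references, **extra))
    return results

class StubBLEU:  # hypothetical stand-in for evaluate.load("bleu")
    def compute(self, predictions, references):
        return {"bleu": 1.0 if predictions == references else 0.0}

class StubLen:  # hypothetical stand-in for a metric taking an extra kwarg
    def compute(self, predictions, references, normalize=False):
        n = sum(len(p) for p in predictions)
        return {"length": n / len(predictions) if normalize else n}

preds = ["a b", "c"]
refs = ["a b", "c"]
print(compute_all([(StubBLEU(), {}), (StubLen(), {"normalize": True})],
                  preds, refs))
# {'bleu': 1.0, 'length': 2.0}
```

With real metrics the pairs would be e.g. `(evaluate.load("bertscore"), {"lang": "en"})`, so `lang` never leaks into the other metrics' `compute` calls.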