huggingface / evaluate

🤗 Evaluate: A library for easily evaluating machine learning models and datasets.

Home Page: https://huggingface.co/docs/evaluate

Combined metrics ignore some arguments

mattjeffryes opened this issue

Using evaluate.combine, some kwargs do not seem to get passed to the sub-metrics, resulting in incorrect outputs.

Using the examples from the precision metric documentation:

import evaluate
metric1 = evaluate.load('precision')
metric2 = evaluate.combine(['precision'])

print(metric1.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], pos_label=0))
print(metric2.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], pos_label=0))

Out:

{'precision': 0.6666666666666666}
{'precision': 0.5}

0.666... is the correct answer

import evaluate
metric1 = evaluate.load('precision')
metric2 = evaluate.combine(['precision'])

print(metric1.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], sample_weight=[0.9, 0.5, 3.9, 1.2, 0.3]))
print(metric2.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], sample_weight=[0.9, 0.5, 3.9, 1.2, 0.3]))

Out:

{'precision': 0.23529411764705882}
{'precision': 0.5}

0.235... is the correct answer
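
For what it's worth, both expected values can be checked directly against scikit-learn's precision_score, which backs the precision metric (just a sanity check):

# Sanity check against scikit-learn, which the precision metric wraps.
from sklearn.metrics import precision_score

references = [0, 1, 0, 1, 0]
predictions = [0, 0, 1, 1, 0]

# pos_label=0: two of the three predicted 0s are correct -> 2/3
print(precision_score(references, predictions, pos_label=0))
# 0.6666666666666666

# sample_weight with the default pos_label=1: 1.2 / (1.2 + 3.9)
print(precision_score(references, predictions,
                      sample_weight=[0.9, 0.5, 3.9, 1.2, 0.3]))
# 0.23529411764705882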

This issue occurred with all metrics I tried (precision, recall and F1).

Perhaps I am using the function incorrectly, but if so this behaviour was very surprising to me.

Mac OS 13.2 on M1, Python 3.10.9, evaluate 0.4.0

You are right, it seems the keyword arguments are overridden in evaluate.combine. I'll send a PR.

Hi! I think there are also a few more cases where the combine method doesn't process arguments correctly, mainly around the average argument. For example:

import evaluate

predictions = [0, 2, 1, 0, 0, 1]
references = [0, 1, 2, 0, 1, 2]

metrics = evaluate.combine(['precision', 'recall'])
metrics.compute(predictions=predictions, references=references, average='micro')

Or even when the metric is initialized with the average argument:

import evaluate

predictions = [0, 2, 1, 0, 0, 1]
references = [0, 1, 2, 0, 1, 2]

metrics = evaluate.combine([evaluate.load('precision', average='micro'), evaluate.load('recall', average='micro')])
metrics.compute(predictions=predictions, references=references)

Both cases result in a ValueError:

ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].

The PR is merged, can you try whether it works now?

I can confirm that my #1 case with the average keyword now works correctly.

Interestingly, since I was using a fresh Python environment, I noticed that evaluate doesn't pull in scikit-learn as a dependency, even though it is necessary to run metrics such as precision or recall. Maybe that's intended behaviour to reduce dependencies, but I just wanted to mention it.

My #2 case still doesn't work; it might be due to the priority of which arguments are used. But I believe that's a separate issue from this one.

I still have the same issue as @bvezilic. Is there a workaround?
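
Until combine forwards these keyword arguments reliably, one workaround is to drop it and call each metric's compute separately with the arguments it needs, merging the result dicts yourself (a sketch reusing the example inputs from earlier in the thread):

import evaluate

predictions = [0, 2, 1, 0, 0, 1]
references = [0, 1, 2, 0, 1, 2]

# Compute each metric on its own so the keyword arguments reach its _compute.
results = {}
for name in ['precision', 'recall']:
    metric = evaluate.load(name)
    results.update(metric.compute(predictions=predictions,
                                  references=references,
                                  average='micro'))

print(results)

This loses the single combined call, but every metric gets exactly the arguments it expects.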

Same here. This is problematic when different metrics require different arguments, like this:

metrics = evaluate.combine(
    [
        evaluate.load("bertscore", lang="en"),
        evaluate.load("bleu"),
        evaluate.load('rouge', use_aggregator=False)
    ]
)

results = metrics.compute(predictions=predictions, references=groundtruths)

This raises a ValueError: Either 'lang' (e.g. 'en') or 'model_type' (e.g. 'microsoft/deberta-xlarge-mnli') must be specified.

If lang is passed to metrics.compute instead, other metrics like bleu throw TypeError: _compute() got an unexpected keyword argument 'lang', because that parameter is not meant for them.
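
One way to cope until this is fixed is to skip combine here too and keep a small per-metric mapping of keyword arguments, computing each metric on its own and keying the results by metric name (a sketch with placeholder inputs; metric_kwargs is just a local dict for this example):

import evaluate

# Placeholder inputs; substitute your own predictions / groundtruths.
predictions = ["hello there general kenobi", "foo bar foobar"]
groundtruths = [["hello there general kenobi"], ["foo bar foobar"]]

# Keyword arguments routed only to the metric that actually accepts them.
metric_kwargs = {
    "bertscore": {"lang": "en"},
    "bleu": {},
    "rouge": {"use_aggregator": False},
}

results = {}
for name, kwargs in metric_kwargs.items():
    metric = evaluate.load(name)
    # Keying by metric name keeps e.g. bertscore's and rouge's outputs apart.
    results[name] = metric.compute(predictions=predictions,
                                   references=groundtruths,
                                   **kwargs)

The tradeoff is that you lose the flattened combined dict, but no metric ever sees an argument it doesn't understand.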