huggingface / evaluate

🤗 Evaluate: A library for easily evaluating machine learning models and datasets.

Home Page: https://huggingface.co/docs/evaluate

Merged PR #425 doesn't properly address the issue of passing metric specific kwargs to compute or load

christian-storm opened this issue

In the spirit of trying to give back, I offer you this analysis, a call to settle this issue once and for all, and a possible solution. That being said, I'm aware that there are likely unknown unknowns which I'm not considering (I'd be curious to know what they are though!).

Parties involved (as far as I can tell)

@Plutone11011 @lvwerra @NimaBoscarino

Environment

  • OS: macOS 13.3.1
  • Python: 3.10.4
  • PyTorch: 2.0.1
  • Transformers: 4.26.1
  • Evaluate: 0.4 and 0.4.1.dev0

Exec summary

This all started when I wanted to combine leslyarun/fbeta_score with f1, precision and recall. In particular, issues arose when I tried to change the default beta from 0.5 to something else.

After spending way too much time on this, it became evident that there has been some hand-wringing about how to set metric parameters across the different contexts in which they can be set. PR #425 addressed the issue of kwargs being filtered out when using combined metrics. However, I quickly found that it assumes all the combined metrics share the same kwarg keys.

I created this issue in the hopes of spurring a conversation to surface all the use cases (evaluate.combine dispatching kwargs to the intended metric(s), the Evaluator needing to handle metric kwargs, and whether evaluate.load and/or compute should accept kwargs) as well as the requirements (backwards compatibility(?), syncing with the hub, etc.).

History as written by issues and PRs

The initial attempt

Issue #169 "Move kwargs from compute to a config pass during load" fixed by PR #188

The logic and implementation plan

"Currently, there is a mix between configs that are used when the metric is instantiated and configs
that are later passed to compute. This makes life harder in several ways as pointed out in #137 and #138 and also
makes the combine function from #150 harder to use. The two main points:

  • It is hard to know in advance what configs are available (e.g. useful for eval on the hub)
  • It is hard to configure metrics in advance that are used in a wrapped object. One always needs to pass the kwargs
    downstream (affects evaluator and combine).

I think we could solve this by moving all kwargs to the load. If a user wants to run a metric with different configs
they can just load several instances of the metric which is cheap. In addition we could wrap the configs in something
like a dataclass that specifies the types of configs and options when only a limited number of options are
available (e.g. F1 can only be used as binary or multilabel)." - @lvwerra
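
For concreteness, here is a minimal sketch of how that (since-reverted) design might have been used; the load-time average kwarg is an assumption for illustration, not part of the current evaluate API:

import evaluate

# Hypothetical usage under the "all kwargs at load" proposal (since reverted):
# each configuration is just another cheap load() call, so compute() needs no
# metric-specific kwargs and wrappers like combine()/Evaluator have nothing to route.
f1_default = evaluate.load("f1")                 # e.g. average="binary"
f1_macro = evaluate.load("f1", average="macro")  # hypothetical load-time kwarg

print(f1_default.compute(predictions=[0, 1, 1], references=[1, 1, 0]))
print(f1_macro.compute(predictions=[0, 1, 1], references=[1, 1, 0]))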

Reverted because of hub <-> library sync issues

However, the change was reverted in PR #299 because "There is an issue with the sync mechanism between the hub and
the library which is why we had to roll back #169. Merging it will break pre 0.3.0 installs so we need to wait
for sufficient adaptation."

There was also some concern, which I would second, about removing the ability to set parameters at compute time.

Issues keep being raised

Since then, issues #234 "Cannot use f1/recall/precision arguments in CombinedEvaluations.compute", #423 "Combined metrics ignore some arguments", and #338 "Evaluator should handle metric kwargs" have been raised; @NimaBoscarino has assigned the last of these to themself.

Second attempt

@Plutone11011 took the initiative and submitted PR #425, which has since been merged into main.
The solution was to remove the line of code that filters out kwargs that aren't
in a metric module's _feature_names(); see https://github.com/huggingface/evaluate/pull/425/files.

This works for the cases where the combined metrics all expect the same kwarg keys and, for that matter, where each metric
intends to use the same kwarg values. As demonstrated below, this doesn't work when metrics expect different kwargs.
Furthermore, it doesn't cover the case of needing metric-specific kwarg values for the same keys.

Test code

import evaluate
import math

# fbeta_score is the generalized F-score where F1 is a special case of beta=1.
combined_evaluation = evaluate.combine(["recall", "precision", "f1", evaluate.load('leslyarun/fbeta_score')])

predictions = [0, 1, 0]
references = [1, 1, 0]

# fbeta_score defaults to beta = .5 aka the F0.5 score.  The expected value for f_beta_score is F0.5.
expected_result = {"recall": 1.00, "precision": 0.50, "f1": 0.66, "f_beta_score": 0.55}

# 0.4.1.dev0 which includes PR #425 fixed the passing of kwargs to combine, e.g., pos_label.  The default for pos_label
# is 1.  It should be noted that all the metrics in this example use this kwarg.
result = combined_evaluation.compute(predictions=predictions, references=references, pos_label=0)

diff_result = [f"{metric} {expected_result[metric]} != {result[metric]}"
               for metric in ['recall', 'precision', 'f1', 'f_beta_score']
               if not math.isclose(result[metric], expected_result[metric], abs_tol=0.01)]

# This assertion will fail in 0.4, but not 0.4.1.dev0, because the pos_label isn't passed to the metrics' compute functions.
assert not diff_result, f"Expected metrics do not match returned metrics: {diff_result}"

# Let's change beta to 1 or the F1 score to see if we can pass this kwarg to f_beta_score
beta = 1.

# In 0.4.1.dev0 this throws an exception because beta is passed to all the metrics but only f_beta_score expects it
result = combined_evaluation.compute(predictions=predictions, references=references, pos_label=0, beta=beta)
expected_result = {"recall": 1.00, "precision": 0.50, "f1": 0.66, "f_beta_score": 0.66}

diff_result = [f"{metric} {expected_result[metric]} != {result[metric]}"
               for metric in ['recall', 'precision', 'f1', 'f_beta_score']
               if not math.isclose(result[metric], expected_result[metric], abs_tol=0.01)]
assert not diff_result, f"Expected metrics do not match returned metrics: {diff_result}"

Exceptions

Current 0.4.0 version (Reason for PR #425)

All the results are wrong because none of the kwargs make it to the metrics.

File "test_combine.py", line 59, in <module>
    assert not diff_result, f"Expected metrics do not match returned metrics: {diff_result}"
AssertionError: Expected metrics do not match returned metrics: ['recall 1.0 != 0.5', 'precision 0.5 != 1.0', 'f_beta_score 0.55 != 0.8333333333333334']

0.4.1.dev0 (Has PR #425)

This blows up because all the kwargs are being sent to all the metrics.

  File "/evaluate/src/evaluate/module.py", line 462, in compute
    output = self._compute(**inputs, **compute_kwargs)
TypeError: Recall._compute() got an unexpected keyword argument 'beta'

Requirements (as far as I can tell)

  1. Backwards compatibility(?)
  2. No issues syncing the hub with the library
  3. Ability to configure a metric at load time and override it at compute time (?)
  4. Support for named configurations (already supported with config_name)
  5. Ability to pass metric-specific kwargs with metric-specific values when the same kwarg key is used by more than one metric

Possible Solution?

Why not add a compute_kwargs (aka compute_args) field alongside features in the MetricInfo instantiation? compute_kwargs would define the kwarg keys and their default values in one fell swoop.

It could be defined as a dataclass, as in PR #188, or as a dict, as I've done below. Configuration at load time would be handled by adding **init_compute_kwargs to def _info(self, **init_compute_kwargs). Similarly, _compute could use the same logic and accept **kwargs instead of listing the kwargs out individually, as is commonly done now.

What it might look like

import datasets
import evaluate


class FBeta(evaluate.Metric):
    def _info(self, **init_compute_kwargs):
        compute_kwargs = {'beta': 0.5, 'labels': None, 'pos_label': 1, 'average': "binary", 'sample_weight': None}
        # Only accept load-time overrides for kwargs the metric actually declares.
        assert not (set(init_compute_kwargs.keys()) - set(compute_kwargs.keys()))
        compute_kwargs.update(init_compute_kwargs)
        return evaluate.MetricInfo(
            # This is the description that will appear on the modules page.
            module_type="metric",
            description=_DESCRIPTION,
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            # This defines the format of each prediction and reference
            features=datasets.Features({
                'predictions': datasets.Value('int32'),
                'references': datasets.Value('int32')}),
            compute_kwargs=compute_kwargs,
            ....

    def _compute(self, predictions, references, **kwargs):
        # Only accept compute-time overrides for kwargs declared in compute_kwargs.
        assert not (set(kwargs.keys()) - set(self.compute_kwargs.keys()))
        self.compute_kwargs.update(kwargs)
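
Under this sketch, configuring at load and overriding at compute time (requirement 3) might look something like the following; the plumbing that forwards load-time kwargs into _info is assumed here, not existing behavior:

# Hypothetical usage of the FBeta sketch above -- not the current evaluate API.
# Set a default beta at load time...
fbeta = evaluate.load('leslyarun/fbeta_score', beta=0.5)

# ...and override it (or any other declared compute kwarg) at compute time.
result = fbeta.compute(predictions=[0, 1, 0], references=[1, 1, 0], pos_label=0, beta=1.0)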
     

This leaves the issue of passing kwargs to evaluate.combine and, especially, the subsequent compute. It seems that
they would need to accept something to the effect of a dict of dicts, where the key is the name of the metric and the value is that metric's specific kwargs. This would solve the problem of routing metric-specific kwargs while allowing metric-specific kwarg values when there is kwarg name overlap.
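
To illustrate, a call under such a (purely hypothetical) scheme might look like the following; the metric_kwargs argument is an invented name, not an existing parameter of evaluate.combine or compute:

# Hypothetical dict-of-dicts routing -- metric_kwargs does not exist today.
combined = evaluate.combine(["f1", evaluate.load('leslyarun/fbeta_score')])
result = combined.compute(
    predictions=[0, 1, 0],
    references=[1, 1, 0],
    metric_kwargs={
        "f1": {"pos_label": 0},                        # routed only to f1
        "fbeta_score": {"pos_label": 0, "beta": 1.0},  # beta routed only to fbeta_score
    },
)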

I can't speak to library <-> hub syncing which might make all this moot.

I hope this helps?!

A few misdirections I encountered while researching **kwargs