huggingface / evaluate

🤗 Evaluate: A library for easily evaluating machine learning models and datasets.

Home Page: https://huggingface.co/docs/evaluate

Support for disaggregated evaluations in Evaluators

NimaBoscarino opened this issue

For cases where we want to compute metrics across several groups using the Evaluators, the current option is to call `.compute` separately on different splits of the data. This is fine if the groups are disjoint, but for groups that overlap it means we have to recompute inferences (which can be really costly), since there is no caching mechanism.

A working solution might be to add something like a `disaggregate_by` argument to the Evaluators' `.compute` method (or maybe even to the base `Evaluator`), which would calculate the metric on the specified folds of the data after the inferences have been computed once.
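
To illustrate the idea, here's a rough sketch (the `compute_disaggregated` helper is hypothetical, not part of the library): inferences are produced once for the whole dataset, and the metric is then re-run on the rows selected by each boolean group column, so overlapping groups never trigger re-inference.

```python
def compute_disaggregated(metric, predictions, references, dataset, disaggregate_by):
    """Compute `metric` overall and once per boolean group column.

    `predictions` are assumed to have been computed once for the full
    dataset; each group metric only re-slices those cached outputs.
    """
    results = {"overall": metric.compute(predictions=predictions, references=references)}
    for column in disaggregate_by:
        # Boolean mask marking which rows belong to this group.
        mask = dataset[column]
        group_preds = [p for p, m in zip(predictions, mask) if m]
        group_refs = [r for r, m in zip(references, mask) if m]
        results[column] = metric.compute(predictions=group_preds, references=group_refs)
    return results
```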

I made a custom evaluator for something like this while working on Disaggregators here: Google Colab

With the implementation above, the dataset is expected to have boolean columns like `pronouns.he_him`, `pronouns.they_them`, and `pronouns.she_her`. That list of column names is then passed to `disaggregate_by` to compute disaggregated evaluations.
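
For example, using the hypothetical helper sketched above (accuracy is just a stand-in metric here, and the toy data is made up):

```python
import evaluate
from datasets import Dataset

metric = evaluate.load("accuracy")

ds = Dataset.from_dict({
    "pronouns.he_him":    [True, False, True, False],
    "pronouns.they_them": [False, True, False, True],
    "pronouns.she_her":   [False, False, True, True],
})
predictions = [1, 0, 1, 1]  # cached model outputs, computed once
references = [1, 0, 0, 1]

results = compute_disaggregated(
    metric, predictions, references, ds,
    disaggregate_by=["pronouns.he_him", "pronouns.they_them", "pronouns.she_her"],
)
# -> {"overall": {"accuracy": 0.75},
#     "pronouns.he_him": {"accuracy": 0.5}, ...}
```

Note that row 3 belongs to both `pronouns.they_them` and `pronouns.she_her`, so the groups overlap, but the predictions are only computed once.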

Additionally, it's often valuable to combine disaggregations and run metrics on intersectional groups, and I've also got that implemented in that custom evaluator in the notebook above.
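
Roughly, the intersectional part could work on top of the same setup by ANDing the boolean columns together and computing the metric on the combined mask (again just a sketch, with a hypothetical `intersect` helper):

```python
def intersect(dataset, columns):
    """Boolean mask for rows that belong to *all* of the given groups."""
    return [all(vals) for vals in zip(*(dataset[c] for c in columns))]

# Metric for the intersection of two groups, reusing the cached predictions:
mask = intersect(ds, ["pronouns.they_them", "pronouns.she_her"])
group_preds = [p for p, m in zip(predictions, mask) if m]
group_refs = [r for r, m in zip(references, mask) if m]
intersection_result = metric.compute(predictions=group_preds, references=group_refs)
```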

Would there be any interest in something like this?