huggingface / jat

General multi-task deep RL Agent

Aggregate evaluation metrics from different environments/tasks

qgallouedec opened this issue

Training produces evaluation data from several environments/tasks. We want to aggregate all these per-task evaluations into a statistically sound measure of overall performance. To do this, we could use rliable, perhaps wrapped in a general evaluator (see the sketch below).
One difficulty could be the number of independent training runs from scratch needed to make the statistics meaningful.
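
As a rough illustration (not a proposal for the final evaluator API), here is a minimal sketch of what the aggregation could look like with rliable. It assumes per-task scores have already been normalized (e.g. expert-normalized) and collected into an array of shape `(num_runs, num_tasks)`; the `5 x 10` random array and the `"jat"` key are placeholders.

```python
import numpy as np
from rliable import library as rly
from rliable import metrics

# Placeholder data: 5 independent training runs evaluated on 10 tasks,
# with scores assumed to be normalized per task beforehand.
score_dict = {"jat": np.random.rand(5, 10)}

# Aggregate metrics recommended by rliable: median, IQM, mean, optimality gap.
aggregate_func = lambda scores: np.array([
    metrics.aggregate_median(scores),
    metrics.aggregate_iqm(scores),
    metrics.aggregate_mean(scores),
    metrics.aggregate_optimality_gap(scores),
])

# Point estimates plus stratified bootstrap confidence intervals.
point_estimates, interval_estimates = rly.get_interval_estimates(
    score_dict, aggregate_func, reps=2000
)
print(point_estimates)     # {"jat": [median, IQM, mean, optimality gap]}
print(interval_estimates)  # {"jat": lower/upper bounds for each aggregate}
```

The confidence intervals only become informative with several independent runs per task, which is exactly the retraining cost mentioned above.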