deepset-ai / FARM

:house_with_garden: Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.

Home Page: https://farm.deepset.ai


Should be possible to use the proper aggregated loss for early stopping

johann-petrak opened this issue · comments

Currently there is no easy way to use the same aggregated loss that is used for optimization as the early-stopping metric as well.

It is possible to define a function for the ES metric that aggregates the per-head losses from the dev-set evaluation, but the global step and batch number cannot be provided to it, because they do not get passed to the check_stopping method. Also, that function would be applied to the already accumulated per-head losses. If the aggregation function is not linear, then applying it to the accumulated losses is not the same as accumulating the per-batch aggregated losses.
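The ordering problem can be shown with a toy example. Here `accfun` is a hypothetical non-linear aggregation function combining per-head losses (the name and the sum-of-squares choice are illustrative assumptions, not FARM API):

```python
def accfun(losses):
    # Hypothetical non-linear aggregation: sum of squared per-head losses.
    return sum(loss ** 2 for loss in losses)

# Per-head losses for two dev-set batches: (head0, head1)
batch_losses = [(1.0, 2.0), (3.0, 4.0)]

# (a) Accumulate per-head losses over all batches first, then aggregate:
per_head_sums = [sum(h) for h in zip(*batch_losses)]    # [4.0, 6.0]
agg_of_accumulated = accfun(per_head_sums)              # 4^2 + 6^2 = 52.0

# (b) Aggregate per batch, then accumulate over batches:
accumulated_agg = sum(accfun(b) for b in batch_losses)  # (1+4) + (9+16) = 30.0

# For a non-linear accfun the two orderings disagree:
assert agg_of_accumulated != accumulated_agg
```

For a linear aggregation (e.g. a plain sum) both orderings coincide, which is why the problem only surfaces with non-linear combination functions.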

Not sure how to best make this work.

One option could be:

  • check in the Evaluator whether a loss aggregation method is set
  • if so, calculate the aggregated loss over all batches in the dev set, pass the batch
  • store the aggregated loss as "aggregated_loss" in the results for all heads (since there is no place to store metrics spanning all heads)
  • add an optional keyword argument global_step to Evaluator.eval
  • pass the last global step from training when evaluating on the dev set in Trainer.train
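The steps above could be sketched roughly as follows. This is only an illustration with dummy stand-ins; the names `loss_aggregation_fn`, `global_step`, and the `"aggregated_loss"` key are assumptions from this proposal, not FARM's actual Evaluator API:

```python
class Evaluator:
    """Simplified stand-in for FARM's Evaluator (hypothetical sketch)."""

    def __init__(self, data_loader, loss_aggregation_fn=None):
        self.data_loader = data_loader
        # Assumed: the same aggregation function used for optimization.
        self.loss_aggregation_fn = loss_aggregation_fn

    def eval(self, model, global_step=None):
        agg_loss = 0.0
        head_losses = None
        for batch in self.data_loader:
            # model(batch) stands in for a forward pass returning a list
            # of per-head losses for this batch.
            per_head = model(batch)
            if head_losses is None:
                head_losses = [0.0] * len(per_head)
            for i, loss in enumerate(per_head):
                head_losses[i] += loss
            if self.loss_aggregation_fn is not None:
                # Aggregate per batch, then accumulate: the order matters
                # when the aggregation function is non-linear.
                agg_loss += self.loss_aggregation_fn(per_head)
        results = [{"loss": loss} for loss in head_losses]
        if self.loss_aggregation_fn is not None:
            # Copy the aggregated loss (and global step) into every head's
            # result dict, since there is no slot for cross-head metrics.
            for r in results:
                r["aggregated_loss"] = agg_loss
                r["global_step"] = global_step
        return results


# Usage with a dummy "model" that just returns the batch as per-head losses:
ev = Evaluator([(1.0, 2.0), (3.0, 4.0)], loss_aggregation_fn=sum)
results = ev.eval(lambda batch: list(batch), global_step=100)
assert results[0]["loss"] == 4.0            # head 0, summed over batches
assert results[0]["aggregated_loss"] == 10.0  # (1+2) + (3+4)
```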

Hey @johann-petrak, as usual we would be happy about your contributions, but please be patient regarding our feedback; we are currently very busy with topics related to Haystack. If the solution is very encapsulated, I could see it being reviewed and merged rather quickly though : )

I have a solution for this, and for quite a number of other things, in some code I copy-pasted because I needed it quickly for myself.
I can definitely provide a PR for this after I have completed the work for my own deadlines in the next couple of weeks.
The implementation would basically just use a slightly modified Evaluator class, also inside the train method, plus a predefined metric "aggregated-loss" or similar in addition to "loss", which gets copied into the evaluation results for all heads (since the return value is a list of per-head result dicts).
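An early-stopping metric function could then read the duplicated entry back out of the per-head result dicts, for example like this (the key name and the result-list layout are assumptions from this proposal, not FARM's actual API):

```python
def aggregated_loss_metric(eval_results):
    """Hypothetical ES metric reading the cross-head aggregated loss.

    eval_results is assumed to be a list of per-head result dicts; since
    the aggregated loss was copied into every head's dict, reading it
    from the first head is sufficient.
    """
    return eval_results[0]["aggregated_loss"]


# Usage with hand-written per-head results:
per_head_results = [
    {"loss": 4.0, "aggregated_loss": 10.0},
    {"loss": 6.0, "aggregated_loss": 10.0},
]
assert aggregated_loss_metric(per_head_results) == 10.0
```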

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 21 days if no further activity occurs.