Validation Accuracy Aggregation
Adamits opened this issue · comments
Currently, our validation_step method on the BaseEncoderDecoder computes a per-batch accuracy, and these per-batch values are averaged at the end of each epoch. Because of this, we get a macro-averaged accuracy that depends on the batch size.
I noticed something must be wrong when evaluation sets of size 1000 were producing validation accuracies with many decimal places (like 0.9247395...); a true micro accuracy over 1000 samples can only take values in increments of 0.001. I think we instead want to accumulate raw counts of correct/incorrect dev samples per batch, and then aggregate those counts into a single accuracy at the end of each epoch.
The impact should be small, but I still believe the accuracies we report are slightly off from the expected micro accuracy.
Yes, I agree we want micro-accuracy, not macro, even though I think the only way they can differ is when there is a partial batch.
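A minimal sketch of that partial-batch case (not the repo's actual code; the toy batches here are invented for illustration). Averaging per-batch accuracies gives every batch equal weight, so a smaller final batch skews the result, while accumulating raw counts does not:

```python
# Per-sample correctness (1 = correct, 0 = incorrect); batch size 4
# over 10 samples, so the last batch is partial.
batches = [
    [1, 1, 1, 0],  # 3/4 correct
    [1, 1, 0, 0],  # 2/4 correct
    [1, 1],        # partial batch: 2/2 correct
]

# Macro: average the per-batch accuracies (what averaging
# validation_step outputs effectively does).
macro = sum(sum(b) / len(b) for b in batches) / len(batches)

# Micro: accumulate raw correct/total counts, then divide once
# at the end of the epoch (the proposed fix).
correct = sum(sum(b) for b in batches)
total = sum(len(b) for b in batches)
micro = correct / total

print(macro)  # 0.75  ((0.75 + 0.5 + 1.0) / 3)
print(micro)  # 0.7   (7 / 10)
```

The partial batch's 100% accuracy counts for a full third of the macro average but only 2/10 of the micro average, which is what makes the two disagree.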
Oh yeah, good point. I guess we could also get some loss of floating-point precision from averaging. Anyway, micro certainly seems preferable.
Closed in #120.