CUNY-CL / yoyodyne

Small-vocabulary sequence-to-sequence generation with optional feature conditioning


Decoding twice in validation step

Adamits opened this issue

commented

See here: https://github.com/CUNY-CL/yoyodyne/blob/master/yoyodyne/models/base.py#L219

Previously, I did this so I could first predict with greedy search to get an accuracy, and then predict with teacher forcing to compute a loss that is comparable to how the train loss is computed. Currently, we predict twice with whatever is in the batch from the validation dataloader. IIRC this should have the gold targets in it.

This means that we are making identical validation predictions twice, neither of which is greedy -- they both have access to the gold history and use teacher forcing. I think we probably want to compute validation accuracy without the gold history, for which we would need to pass the batch through with no targets on it.

Then, we can either keep predicting a second time to compute a loss with teacher forcing, or just compute the loss directly from the greedy predictions (and thus only decode once during validation).
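For concreteness, here is a rough sketch of the two alternatives; the method names (`decode_greedy`, `decode_teacher_forced`, and so on) are illustrative, not yoyodyne's actual API:

```python
# Illustrative sketch of the two alternatives; none of these method
# names are taken from yoyodyne itself.
def validation_step(self, batch, batch_idx):
    # Greedy pass without gold targets, so accuracy reflects the
    # model's own predicted history.
    greedy_logits = self.decode_greedy(batch.source)
    accuracy = self.accuracy(greedy_logits.argmax(-1), batch.target)
    # Alternative 1: a second, teacher-forced pass for a loss that is
    # directly comparable to the training loss.
    forced_logits = self.decode_teacher_forced(batch.source, batch.target)
    loss = self.loss_func(forced_logits, batch.target)
    # Alternative 2: skip the second pass and compute the loss from
    # greedy_logits instead (this requires aligning prediction and
    # target lengths; see further down the thread).
    return {"val_accuracy": accuracy, "val_loss": loss}
```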

@kylebgorman thoughts on this?

I agree that we don't want to compute validation accuracy with gold history.

Is it common to use teacher forcing in this class of model? If not, we could just do away with it for now and leave re-enabling it as a TODO.

commented

My (possibly dated) understanding was that training with teacher forcing is very common. I recall that in MT, people used to sample the gold target with some probability (or else take the predicted token). I think the best way to train in morphological inflection may still be an open question (e.g. https://aclanthology.org/2020.coling-main.255.pdf).
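The MT trick alluded to here is scheduled sampling (Bengio et al., 2015). A minimal sketch of one decoder step, with all names hypothetical:

```python
import torch

# Scheduled sampling, sketched for a single decoder step; all names
# here are hypothetical. With probability `teacher_forcing_ratio` the
# gold symbol is fed in as the next decoder input; otherwise the
# model's own prediction is fed back ("student forcing").
def next_decoder_input(gold_symbol, step_logits, teacher_forcing_ratio):
    if torch.rand(1).item() < teacher_forcing_ratio:
        return gold_symbol
    return step_logits.argmax(dim=-1)
```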

Probably the simplest solution is to still train with teacher forcing, and to not pass the gold targets to the model forward method during eval, so that it defaults to greedy search. Then we can just compute both accuracy and loss from the greedy output.
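In code, the default might look roughly like this (again a sketch; the attribute and method names are assumptions, not the actual model interface):

```python
# Rough sketch of the proposed dispatch; attribute and method names
# are assumptions, not yoyodyne's actual interface.
def forward(self, batch):
    if batch.target is not None:
        # Targets present (training): teacher forcing, conditioning
        # each decoder step on the gold history.
        return self.decode_teacher_forced(batch.source, batch.target)
    # No targets passed (validation/prediction): greedy search over
    # the model's own predictions.
    return self.decode_greedy(batch.source)
```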

We could consider adding a "student-forcing" option later.

This sounds like a good design to me.

Late edit: reading that paper I think it's imperative we validate and evaluate without teacher forcing; whether or not we support student-forcing training at some rate is less important to me.

commented

Ok. I made the update but realized a few things:

  1. When we added the Batch class, we lost the original mechanism for deciding whether to decode with teacher forcing. It is now a bit clunky, and I think we may be using teacher forcing any time there is a target column in the data config. I think it would be safer to force the validate and predict steps on the model to ignore target tensors, either by a) setting batch.target.padded to None in those steps, or b) adding a teacher_forcing flag that tells the model to use or ignore the targets (setting this flag will require updating all of the decode methods).
  2. Computing eval loss from student-forced predictions is slightly tricky in the current setup, because we get mismatches between predicted and gold length. I think I can use the student-forced predictions (and thus only predict once) by removing the decode code that stops early once everything in the batch has predicted an [EOS] (so predictions are never shorter than the targets), and truncating the final prediction tensors to the target sequence length (so predictions are never longer than the targets). Then I think I can compute the loss as usual; see the sketch below.
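A sketch of (2), under assumed tensor shapes rather than the actual yoyodyne code: `logits` is a B x T_pred x V tensor from a greedy decode with early-[EOS] stopping disabled (so T_pred is at least the gold length), and `target` is a B x T_gold tensor of padded gold indices.

```python
import torch.nn.functional as F

# Sketch of computing eval loss from student-forced predictions; the
# shapes and names are assumptions, not taken from yoyodyne. Assumes
# early-EOS stopping is disabled, so logits.size(1) >= target.size(1).
def greedy_loss(logits, target, pad_idx):
    # Truncate predictions to the gold sequence length so the two
    # tensors line up position by position.
    logits = logits[:, : target.size(1), :]
    # Standard cross-entropy over the aligned positions, ignoring pad.
    return F.cross_entropy(
        logits.transpose(1, 2),  # cross_entropy wants B x V x T.
        target,
        ignore_index=pad_idx,
    )
```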

Thoughts on this?

Re: (1) nulling it out is hackish but it'll do!

(2) sounds awesome.