google-research / FLAN

Held-in validation set details

gahdritz opened this issue

Were examples in the held-in validation set pre- and post-processed in exactly the same way as the corresponding examples in the training set? If so, what were the mixture coefficients for the zero-shot, template_mix, etc. tasks?

@gahdritz Yes, we used the training templates.py for the held-in evaluation tasks as well (usually the first template per task, I believe). Evaluations used zsopt or fsopt, depending on whether the evaluation setting reported was Zero-Shot or Few-Shot.
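For illustration, here is a rough sketch of what "use the first template per task" looks like. The `PATTERNS` structure, task name, and field names below are hypothetical placeholders, not the exact contents of flan/v2/templates.py:

```python
# Illustrative only: a per-task list of (input, target) format strings,
# where eval uses the first entry.
PATTERNS = {
    "my_eval_task": [
        ("{question}\n\nAnswer:", "{answer}"),  # first template -> used for held-in eval
        ("Q: {question}\nA:", "{answer}"),      # additional training templates
    ],
}

def format_for_eval(task_name, example):
    """Apply the first template of a task to one example (zero-shot style)."""
    input_tmpl, target_tmpl = PATTERNS[task_name][0]
    return input_tmpl.format(**example), target_tmpl.format(**example)
```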

If I understand your second question correctly, every training task within a submixture (e.g. flan_zsopt) was given a weighting in this line of code: https://github.com/google-research/FLAN/blob/main/flan/v2/mixtures_utils.py#L133.

Essentially, the rate is the minimum of the submixture cap (defined here) and the number of examples already in the task. For example, three (task, rate) pairs might be [(A, 100), (B, 1000), (C, 10)], meaning we sample 100 B examples for every 10 A examples and every 1 C example. An epoch then finishes when any of these tasks runs out of examples (it should be roughly a tie).
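For concreteness, here is a minimal Python sketch of that capping and proportional-sampling logic. This is not the actual FLAN/seqio code; the cap and task sizes are made-up numbers chosen to reproduce the (A, B, C) rates above:

```python
# Sketch: per-task rate = min(submixture cap, number of examples in the task),
# then tasks are sampled in proportion to their rates.
import random

SUBMIXTURE_CAP = 1000                                  # hypothetical cap
task_sizes = {"A": 100, "B": 250_000, "C": 10}         # made-up example counts

rates = {task: min(SUBMIXTURE_CAP, n) for task, n in task_sizes.items()}
# -> {"A": 100, "B": 1000, "C": 10}

total = sum(rates.values())
probs = {task: r / total for task, r in rates.items()}
# B is drawn 10x as often as A and 100x as often as C.

def sample_task():
    """Pick the next task to draw an example from, proportional to its rate."""
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]
```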

@shayne-longpre Thanks! What do you mean by "depending on whether the evaluation setting reported Zero-Shot or Few-Shot"? What is the "evaluation setting"? Did you run evals separately for both zero-shot and few-shot?

@gahdritz Sure! Yes, for some experiments we ran Zero-Shot eval and for others Few-Shot eval, but that should be clearly noted in the figure/table captions in both papers. These choices were often made to show comparative performance across the two settings.