OATML / RHO-Loss

batch size for comparing gradient steps between the standard and target model

loubnabnl opened this issue

Hi, I have a question about the training steps you use to compare the standard model (trained with plain shuffling) to the target model (trained with RHO-Loss selection). If I understood the codebase correctly, you train the standard model with a batch size of 320, while for the target model each gradient step corresponds to the 32 selected samples.
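
If that reading is right, each target-model step would look roughly like the sketch below (a minimal PyTorch sketch of the selection step as I understand it from the paper; `model`, `optimizer`, and `irreducible_loss` are illustrative names, not identifiers from this repo):

```python
import torch
import torch.nn.functional as F

def rho_loss_step(model, optimizer, x, y, irreducible_loss,
                  selected_batch_size=32):
    """Score a pre-sampled batch (e.g. 320 examples), then take one
    gradient step on only the top `selected_batch_size` of them."""
    # 1. Score all pre-sampled examples without tracking gradients.
    with torch.no_grad():
        training_loss = F.cross_entropy(model(x), y, reduction="none")
        # Reducible holdout loss = training loss - irreducible loss
        # (the precomputed per-example loss of the holdout model).
        reducible_loss = training_loss - irreducible_loss

    # 2. Keep the examples with the highest reducible loss.
    top_idx = torch.topk(reducible_loss, selected_batch_size).indices

    # 3. Gradient step on the selected examples only.
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x[top_idx]), y[top_idx])
    loss.backward()
    optimizer.step()
    return loss.item()
```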

So after N training steps, the standard model will have seen 10x more tokens than the target model? Why not compare the models after a fixed number of seen tokens? In that case, the standard model should also be trained with a batch size of 32.

Ok, I see, thanks! I had assumed it was 320 based on the batch size in this config.

Do you have an idea of how the method generalizes to a higher selection percentage? (In the paper you tested a maximum of 20%.) In LM pre-training, for example, the standard batch size is usually large (e.g., 256 or 512), so at 10% selection it would be computationally expensive to pre-sample 2,560/5,120 examples each time.
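
For concreteness, here is my rough back-of-the-envelope estimate of the per-step overhead as a function of the selection fraction (my own numbers, not from the paper), treating a backward pass as roughly 2x the cost of a forward pass:

```python
# Per gradient step on B selected examples, RHO-Loss adds one scoring
# forward pass over B / f pre-sampled examples, where f is the selection
# fraction. A plain training step costs ~3 forward-equivalents per example
# (1 forward + a backward at ~2x a forward).
def relative_step_cost(selection_fraction: float) -> float:
    scoring = 1.0 / selection_fraction  # scoring forwards per selected example
    training = 3.0                      # 1 forward + ~2x-forward backward
    return (scoring + training) / training

for f in (0.10, 0.20, 0.50):
    print(f"selection {f:.0%}: {relative_step_cost(f):.2f}x plain-training cost")
```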

@loubnabnl

@SoerenMind is right that we didn't experiment much with the selection percentage, but we did run a small ablation; see Appendix F of the paper. I'm also attaching a picture of it below. As you can see, the impact varies and likely depends on, among other things, the dataset size, dataset composition, and batch sizes.
[Image: selection-percentage ablation from Appendix F of the paper]