EleutherAI / pythia

The hub for EleutherAI's work on interpretability and learning dynamics

Fine-tuning recommendations

RainIwakura opened this issue

Hello EleutherAI team! Congrats on your ICML acceptance for this paper. I'm a PhD student planning to study emergent properties of GPT-style decoder models using Pythia for my final paper, and I was wondering whether you have any insight into what learning rate, data format, or optimizer you'd recommend to achieve "optimal" results when fine-tuning these models on custom small datasets (perhaps with rules conditioned on size), at least in your experience.

When I say small, I mean at most 100k training points (pieces of text) of around 60 tokens each, if that helps. Anyway, I'd welcome any information! Apologies if I missed anything obvious in the repo that would point me to this; it seems like most of the information covers pretraining, which I don't intend to do.

Hi! We haven't fine-tuned these models much ourselves yet, so we're not confident about hyperparameter choices for best results. As a starting point, I'd recommend trying a batch size of 128 or 256 examples, setting the initial learning rate to 0.1x the max LR used for pretraining, applying linear LR warmup, and then decaying on a cosine schedule down to 0.1x that starting LR. But I'm not sure how close to optimal this is, or whether it will work on your dataset.
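
In case it helps to see that schedule concretely, here is a minimal sketch in PyTorch with a Hugging Face Pythia checkpoint. This is just an illustration, not something we've validated: the model size, the pretraining max LR value, the step counts, and `train_loader` are all placeholders you'd need to fill in for your own setup.

```python
# Sketch: linear warmup, then cosine decay to 0.1x the starting LR.
# All specific values below are placeholders, not recommendations.
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-410m")

pretrain_max_lr = 3e-4               # look up the max LR from your model's pretraining config
start_lr = 0.1 * pretrain_max_lr     # fine-tuning starting LR: 0.1x the pretraining max LR
min_lr = 0.1 * start_lr              # cosine decays down to 0.1x the starting LR

optimizer = torch.optim.AdamW(model.parameters(), lr=start_lr)

total_steps = 10_000                 # placeholder: total fine-tuning optimizer steps
warmup_steps = 500                   # placeholder: linear warmup length

# Linear warmup from ~0 up to start_lr, then cosine anneal to min_lr.
warmup = LinearLR(optimizer, start_factor=0.01, end_factor=1.0, total_iters=warmup_steps)
cosine = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps, eta_min=min_lr)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps])

# train_loader: your tokenized dataset (e.g. the ~100k examples of ~60 tokens),
# batched at 128 or 256 examples per step.
for batch in train_loader:
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```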

Thank you! Some starting pointers are all I needed.