perceptiveshawty / RankCSE

Implementation of "RankCSE: Unsupervised Sentence Representation Learning via Learning to Rank" (ACL 2023)

Home Page: https://aclanthology.org/2023.acl-long.771/

Training with a Different Student Model?

MohammadVahidiPro opened this issue

Hello,

I wanted to express my gratitude for your contributions to the community. Your work has been incredibly helpful.

I am working on a project that involves training a DistilBERT-base student model with your framework. I have a question regarding the choice of teachers: do I need to select teachers that are also DistilBERT-based, or can I use the same teachers as in the original paper?

Do you believe there are any further adjustments needed when training a different base model? I would greatly appreciate your insights and advice on this matter.

Thank you sincerely for your help.

Best regards,

See "On the Efficacy of Knowledge Distillation": a small-capacity model like DistilBERT will generally need a smaller teacher, or an under-trained large teacher, for distillation to work well. Using another DistilBERT-family model as the teacher will probably give better results.
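
As a rough sanity check on the capacity gap, it can help to compare parameter counts before committing to a teacher. Here is a minimal sketch assuming HuggingFace transformers; the checkpoint names are illustrative (the SimCSE models are just examples of common BERT-base/large sentence-embedding teachers), not required settings:

```python
# Compare the capacity of a DistilBERT student against candidate teachers.
# Checkpoint names are illustrative examples, not prescriptions.
from transformers import AutoModel

def n_params_millions(name: str) -> float:
    """Load a checkpoint and return its parameter count in millions."""
    model = AutoModel.from_pretrained(name)
    return sum(p.numel() for p in model.parameters()) / 1e6

student = "distilbert-base-uncased"                    # ~66M parameters
teachers = [
    "princeton-nlp/unsup-simcse-bert-base-uncased",    # BERT-base teacher, ~110M
    "princeton-nlp/unsup-simcse-bert-large-uncased",   # BERT-large teacher, ~335M (large gap to the student)
]

print(f"student  {student}: {n_params_millions(student):.0f}M params")
for name in teachers:
    print(f"teacher  {name}: {n_params_millions(name):.0f}M params")
```

A DistilBERT-family teacher would sit much closer to the student in this comparison, which is the capacity-gap argument in a nutshell.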

There is also some work on how the sharpness of the teacher's labels affects distillation, so it may be worth sweeping different temperature values (for both the student and the teacher) to see if that helps.
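
To get a feel for what such a sweep changes before launching full training runs, here is a minimal sketch of how the teacher and student temperatures control the sharpness of the listwise (softmax-over-similarities) distributions and the resulting KL term; the similarity values and temperature grids below are made up purely for illustration:

```python
# Illustrate how temperature changes the sharpness of ListNet-style ranking
# distributions built from cosine similarities. All numbers are toy values.
import torch
import torch.nn.functional as F

teacher_sims = torch.tensor([0.82, 0.75, 0.40, 0.10])   # teacher similarities to one query
student_sims = torch.tensor([0.70, 0.78, 0.35, 0.05])   # student similarities to the same query

def listwise_dist(sims: torch.Tensor, tau: float) -> torch.Tensor:
    """Softmax over similarities; smaller tau gives a sharper (more confident) ranking."""
    return F.softmax(sims / tau, dim=-1)

for tau_teacher in (0.025, 0.05, 0.1):        # candidate teacher temperatures (assumed grid)
    for tau_student in (0.025, 0.05, 0.1):    # candidate student temperatures (assumed grid)
        p_teacher = listwise_dist(teacher_sims, tau_teacher)
        p_student = listwise_dist(student_sims, tau_student)
        # KL(teacher || student): the ranking-consistency signal to watch during the sweep.
        kl = F.kl_div(p_student.log(), p_teacher, reduction="sum")
        print(f"tau_teacher={tau_teacher:.3f} tau_student={tau_student:.3f} KL={kl.item():.4f}")
```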

Glad to hear that the repo has been useful to you :) and I'm interested to hear about your results!

Edit: another good discussion, for context https://openreview.net/forum?id=h-z_zqT2yJU&noteId=-FLNfm3Pd8q

Fascinating stuff!
Thank you for the fast reply; I greatly appreciate your guidance. I will certainly try out the ideas you mentioned and hopefully get back to you with promising results. Thanks again for this interesting repo.

Best regards