Why is self-distill-training better?
taintpro98 opened this issue · comments
You trained Cityscapes with the self-distillation mode. I found that the flow was not different from the train-source mode, and I don't understand why it performs better. Can you provide some fundamentals or theory that explain this? Thanks
Hi, I believe this is still an open question that lacks a concrete explanation. I would like to share some opinions, but unfortunately I cannot promise they always hold true. We perform self-distillation mainly inspired by previous works like Born-Again Neural Networks and Label Refinery. Self-distillation produces pseudo labels for the target-domain training images, and a student network trained on these pseudo labels tends to perform better on the target domain than the teacher model.
Personally, I believe the improvement mainly comes from two parts:
- Previous research showed that training a student network can produce labels that are more consistent with the input image, so our student network should outperform the pseudo labels provided by the teacher.
- The self-distillation is performed directly on target-domain training data, which helps the network learn a direct connection between target-domain images and target-domain labels, and thus adapt to the target domain better.
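To make the pseudo-labeling step concrete, here is a minimal sketch of how a teacher's per-pixel class probabilities could be turned into pseudo labels for student training. The confidence threshold of 0.9 and the ignore index 255 are my own assumptions for illustration (255 is the conventional ignore label for Cityscapes), not values taken from this repository:

```python
import numpy as np

def generate_pseudo_labels(teacher_probs, threshold=0.9, ignore_label=255):
    """Take the teacher's argmax class per pixel as the pseudo label.

    Pixels where the teacher's confidence falls below `threshold` are set
    to `ignore_label` so the student loss can skip them. Both parameter
    values here are illustrative assumptions, not the repo's settings.
    """
    labels = teacher_probs.argmax(axis=-1)          # most likely class
    confidence = teacher_probs.max(axis=-1)         # teacher's confidence
    labels[confidence < threshold] = ignore_label   # drop uncertain pixels
    return labels

# Toy example: 2 "pixels", 3 classes (hypothetical probabilities).
probs = np.array([[0.05, 0.92, 0.03],   # confident -> kept as class 1
                  [0.40, 0.35, 0.25]])  # uncertain -> ignored (255)
print(generate_pseudo_labels(probs))
```

The student is then trained on the target-domain images with these labels as the supervision signal, exactly as if they were ground truth, which is what gives it the direct image-to-label connection described above.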