Why is self-distill-training better?
taintpro98 opened this issue · comments
You trained Cityscapes with the self-distillation mode. I found that the flow was not different from the train-source mode, and I don't understand why it performs better. Can you provide some fundamentals or theory that explain this? Thanks
Hi, I believe this is still an open question that lacks a concrete explanation. I would like to share some opinions, but unfortunately I cannot promise they always hold true. We perform self-distillation mainly inspired by previous works like Born-Again Neural Networks and Label Refinery. Self-distillation produces pseudo labels for the target-domain training images, and a student network trained on these pseudo labels tends to perform better on the target domain than the teacher model.
Personally, I believe the improvement mainly comes from two parts:
- Previous research showed that training a student network can produce labels that are more consistent with the input image, so our student network should outperform the pseudo labels provided by the teacher.
- The self-distillation is performed directly on target-domain training data, which helps the network learn a direct connection between target-domain images and target-domain labels, and thus adapt to the target domain better.
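To make the pseudo-labeling step concrete, here is a minimal sketch of how a teacher's per-pixel class probabilities could be turned into pseudo labels for student training. The confidence threshold of 0.9 and the ignore index 255 are my own assumptions for illustration (255 is the conventional ignore label for Cityscapes), not values taken from this repository:

```python
import numpy as np

def generate_pseudo_labels(teacher_probs, threshold=0.9, ignore_label=255):
    """Take the teacher's argmax class per pixel as the pseudo label.

    Pixels where the teacher's confidence falls below `threshold` are set
    to `ignore_label` so the student loss can skip them. Both parameter
    values here are illustrative assumptions, not the repo's settings.
    """
    labels = teacher_probs.argmax(axis=-1)          # most likely class
    confidence = teacher_probs.max(axis=-1)         # teacher's confidence
    labels[confidence < threshold] = ignore_label   # drop uncertain pixels
    return labels

# Toy example: 2 "pixels", 3 classes (hypothetical probabilities).
probs = np.array([[0.05, 0.92, 0.03],   # confident -> kept as class 1
                  [0.40, 0.35, 0.25]])  # uncertain -> ignored (255)
print(generate_pseudo_labels(probs))
```

The student is then trained on the target-domain images with these labels as the supervision signal, exactly as if they were ground truth, which is what gives it the direct image-to-label connection described above.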