Why bad performance without model pre-training?
lfeng1995 opened this issue · comments
It seems that if we use the model directly without pre-training, the resulting performance is extremely poor. Can someone explain why? Thanks!
Hi,
Sorry for the late response. It's a very good question. My guess is that if we skip pre-training and directly apply L_dmi for training, the gradient explodes and it becomes very hard to schedule the learning rate.
To illustrate this, the gradient of the L_dmi loss with respect to the matrix A is: \partial \log(|\det(A)|) / \partial A = (A^{-1})^T. Note that when we randomly initialize a classifier, \det(A) (or the determinant of a submatrix of A) is rather small. This leads to very large entries in (A^{-1})^T, hence the gradient explodes.
If we pre-train the model for a while, \det(A) (or the determinant of a submatrix of A) becomes much better conditioned.
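A minimal numerical sketch of this argument (not the repo's code; matrix sizes and noise scales are made up for illustration): the gradient of log|det(A)| with respect to A is (A^{-1})^T, so when A is nearly singular, as for a randomly initialized classifier whose per-class output distributions are almost identical, the gradient entries blow up; a well-conditioned A, as after pre-training, gives moderate gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_logdet(A):
    # d/dA log|det(A)| = (A^{-1})^T
    return np.linalg.inv(A).T

# Hypothetical "after pre-training" case: rows close to a noisy
# identity, so det(A) is far from zero and A is well conditioned.
A_pretrained = np.eye(4) + 0.05 * rng.standard_normal((4, 4))

# Hypothetical "random init" case: all rows are nearly the same
# distribution, so A is close to rank one and det(A) is tiny.
base = rng.random(4) + 0.1
A_random = np.tile(base, (4, 1)) + 1e-4 * rng.standard_normal((4, 4))

g_pre = np.abs(grad_logdet(A_pretrained)).max()
g_rand = np.abs(grad_logdet(A_random)).max()
print(f"max |grad|, well-conditioned A: {g_pre:.2e}")
print(f"max |grad|, near-singular  A:   {g_rand:.2e}")
```

The near-singular case produces gradient entries several orders of magnitude larger than the well-conditioned one, which is why training with L_dmi from a random initialization is so hard to stabilize with any fixed learning-rate schedule.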
Thanks.
Yilun