Why bad performance without model pre-training?
lfeng1995 opened this issue · comments
It seems that if we use the model directly without pre-training, the resulting performance is extremely poor. Can someone explain why? Thanks!
Hi,
Sorry for the late response. It's a very good question. My guess is that if we skip pre-training and directly apply L_dmi for training, the gradient explodes and it becomes very hard to schedule the learning rate.
To illustrate this, the gradient of the L_dmi loss with respect to the matrix A is: \partial \log(|\det(A)|) / \partial A = (A^{-1})^T. Note that when we randomly initialize a classifier, \det(A) (or the determinant of a submatrix of A) is rather small. This leads to very large entries in (A^{-1})^T, hence the gradient explodes.
If we pre-train the model for a while, \det(A) (or the determinant of a submatrix of A) becomes much better conditioned.
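A minimal numerical sketch of this argument (not the repo's code; matrix sizes and noise scales are made up for illustration): the gradient of log|det(A)| with respect to A is (A^{-1})^T, so when A is nearly singular, as for a randomly initialized classifier whose per-class output distributions are almost identical, the gradient entries blow up; a well-conditioned A, as after pre-training, gives moderate gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_logdet(A):
    # d/dA log|det(A)| = (A^{-1})^T
    return np.linalg.inv(A).T

# Hypothetical "after pre-training" case: rows close to a noisy
# identity, so det(A) is far from zero and A is well conditioned.
A_pretrained = np.eye(4) + 0.05 * rng.standard_normal((4, 4))

# Hypothetical "random init" case: all rows are nearly the same
# distribution, so A is close to rank one and det(A) is tiny.
base = rng.random(4) + 0.1
A_random = np.tile(base, (4, 1)) + 1e-4 * rng.standard_normal((4, 4))

g_pre = np.abs(grad_logdet(A_pretrained)).max()
g_rand = np.abs(grad_logdet(A_random)).max()
print(f"max |grad|, well-conditioned A: {g_pre:.2e}")
print(f"max |grad|, near-singular  A:   {g_rand:.2e}")
```

The near-singular case produces gradient entries several orders of magnitude larger than the well-conditioned one, which is why training with L_dmi from a random initialization is so hard to stabilize with any fixed learning-rate schedule.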
Thanks.
Yilun