Newbeeer / L_DMI

Code for the NeurIPS 2019 paper "L_DMI: An Information-theoretic Noise-robust Loss Function"

Why bad performance without model pre-training?

lfeng1995 opened this issue

It seems that if we train the model directly without pre-training, the resulting performance is extremely poor. Can someone explain why? Thanks!

Hi,

Sorry for the late response. It's a very good question. My guess is that if we skip pre-training and directly apply L_DMI for training, the gradient explodes, which makes the learning rate very hard to schedule.

To illustrate this: the gradient of log(|det(A)|) with respect to the matrix A is ∂ log(|det(A)|)/∂A = (A^{-1})^T. Note that when we randomly initialize a classifier, det(A) (or the determinant of any submatrix of A) is extremely small, because the classifier's near-uniform outputs make the rows of A nearly proportional. That leads to very large elements in (A^{-1})^T, and hence the gradient explodes.
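
To see this concretely, here is a minimal PyTorch sketch (not code from this repo; the class count C, batch size N, and writing the loss as -log|det(A)| up to constants are my assumptions based on the paper's description). It estimates the C x C joint-distribution matrix A from a randomly initialized classifier's softmax outputs and prints the determinant and the gradient norm:

```python
import torch

torch.manual_seed(0)
C, N = 10, 128  # hypothetical: 10 classes, batch of 128

# Randomly initialized "classifier": raw logits, so softmax outputs are near-uniform.
logits = torch.randn(N, C, requires_grad=True)
probs = torch.softmax(logits, dim=1)

labels = torch.randint(0, C, (N,))
y_onehot = torch.nn.functional.one_hot(labels, num_classes=C).float()

# Empirical joint-distribution matrix between labels and predictions (C x C).
A = y_onehot.t() @ probs / N

# L_DMI up to constants: -log|det(A)|.
loss = -torch.log(torch.abs(torch.det(A)))
loss.backward()

print(f"|det(A)|  = {torch.det(A).abs().item():.3e}")   # typically extremely small
print(f"grad norm = {logits.grad.norm().item():.3e}")   # correspondingly large
```

Because the rows of A are nearly proportional for a near-uniform classifier, |det(A)| is vanishingly small, A^{-1} has huge entries, and the gradient norm blows up.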

If we pre-train the model for a while, det(A) (or the determinant of its submatrices) becomes much larger and A becomes well conditioned, so the entries of (A^{-1})^T stay moderate and training is stable.
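
For comparison, here is the same sketch in the "pre-trained" regime, simulated (hypothetically) by confident, mostly-correct logits rather than an actual pre-trained network:

```python
import torch

torch.manual_seed(0)
C, N = 10, 128  # same hypothetical setup as above

labels = torch.randint(0, C, (N,))
y_onehot = torch.nn.functional.one_hot(labels, num_classes=C).float()

# Simulate a pre-trained classifier: confident, mostly-correct logits.
logits = (5.0 * y_onehot + 0.1 * torch.randn(N, C)).requires_grad_(True)
probs = torch.softmax(logits, dim=1)

A = y_onehot.t() @ probs / N        # now close to diagonal, well conditioned
loss = -torch.log(torch.abs(torch.det(A)))
loss.backward()

print(f"|det(A)|  = {torch.det(A).abs().item():.3e}")   # many orders of magnitude larger
print(f"grad norm = {logits.grad.norm().item():.3e}")   # moderate, trainable
```

Since A is now close to diagonal, A^{-1} has moderate entries and the gradient stays bounded, which is why warming up with a standard loss before switching to L_DMI helps.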

Thanks.

Yilun