google / uncertainty-baselines

High-quality implementations of standard and SOTA methods on a variety of tasks.

Questions about prediction of SNGP

JianxiangFENG opened this issue · comments

Hi @jereliu ,

I have a few questions about the inference stage of SNGP:

  1. According to Eq. (9) and Algorithm 1 in the paper, shouldn't there be K precision matrices, one for each output dimension, where K is the number of classes? Each of them would be [batch_size, batch_size], so the full tensor would be [K, batch_size, batch_size]; am I misunderstanding something? In the code I can only find a single covariance matrix of size [batch_size, batch_size]. (My reading of Eq. (9) is sketched right after this list.)
  2. After searching the code for a while, I couldn't find the sampling step, i.e., step 5 of Algorithm 2. Without this sampling step, the prediction is essentially a MAP prediction, differing only in how the model was trained. This sampling step should be essential to the method, right?
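Concretely, what I take Eq. (9) and Algorithm 1 to prescribe is a per-class Laplace update of roughly this form (I may be off on the exact ridge/momentum constants):

```math
\hat{\Sigma}_k^{-1} \;=\; I \;+\; \sum_{i=1}^{N} p_{i,k}\,\bigl(1 - p_{i,k}\bigr)\,\Phi_i \Phi_i^{\top}, \qquad k = 1, \dots, K,
```

where $\Phi_i$ is the final-layer (random-feature) representation of example $i$ and $p_{i,k}$ its predicted probability for class $k$, so that for a test batch the per-class predictive covariance $\Phi \hat{\Sigma}_k \Phi^{\top}$ is [batch_size, batch_size] and the full object is [K, batch_size, batch_size].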

I would appreciate it if you could explain this in more detail.

Best,
Jianxiang

Hi Jianxiang,

Thanks for getting in touch! Sorry for the confusion about the mismatch between the paper and this implementation. Yes, we made two changes for computational feasibility / performance reasons:

  1. After some experimentation, we replaced the Laplace-approximated posterior variance with the posterior variance under a Gaussian likelihood, so that a single covariance matrix is shared across all classes. The two reasons for this change are (1) computational feasibility (especially for ImageNet-scale tasks) and (2) empirically better OOD performance.

  2. We replaced the Monte Carlo approximation with a mean-field approximation for computational feasibility (e.g., here; this is mentioned in Appendix A). A rough sketch of both changes follows below.
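To make the two changes concrete, here is a minimal NumPy sketch; the function names, the ridge handling, and the mean-field constant are illustrative and not the exact ones used in this repo:

```python
import numpy as np

def update_shared_precision(precision, features):
    """Gaussian-likelihood precision update, shared across all K classes.

    `features` has shape [batch_size, num_features]; the precision matrix is
    [num_features, num_features] and is class-independent, unlike the per-class
    Laplace update weighted by p_k * (1 - p_k) in the paper.
    """
    return precision + features.T @ features

def mean_field_logits(logits, covariance, mean_field_factor=np.pi / 8.0):
    """Mean-field approximation to the softmax-Gaussian integral.

    Instead of drawing Monte Carlo samples of the logits (Algorithm 2, step 5),
    the MAP logits are scaled down by the predictive variance before the
    softmax; pi/8 is the usual probit-style mean-field constant.
    """
    variances = np.diag(covariance)                       # [batch_size]
    scale = np.sqrt(1.0 + mean_field_factor * variances)  # [batch_size]
    return logits / scale[:, None]

# Toy usage: 4 examples, 16 random features, 3 classes.
rng = np.random.default_rng(0)
phi = rng.normal(size=(4, 16))
beta = rng.normal(size=(16, 3))
precision = update_shared_precision(np.eye(16), phi)  # [16, 16], shared by all classes
covariance = phi @ np.linalg.inv(precision) @ phi.T   # [4, 4] predictive covariance
adjusted_logits = mean_field_logits(phi @ beta, covariance)  # variance-tempered logits
```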

Thank you for the quick reply!

  1. After some experimentation, we replaced the Laplace-approximated posterior variance with the posterior variance under a Gaussian likelihood, so that a single covariance matrix is shared across all classes. The two reasons for this change are (1) computational feasibility (especially for ImageNet-scale tasks) and (2) empirically better OOD performance.

Ok, that is more computationally efficient. However, I don't get the intuition for why a single variance shared across all classes can lead to better performance; it doesn't seem to make much sense, since it is similar to temperature scaling with a single temperature hyperparameter rather than modelling the uncertainty of each class separately. Maybe in other scenarios different variances for different classes are needed. But thanks for letting me know about this.

  1. We replaced the Monte Carlo approximation with a mean-field approximation for computational feasibility (e.g., here; this is mentioned in Appendix A).

This is a neat and simple approximation. I am wondering how large the difference between sampling and the mean-field approximation is; I assume you have run experiments on that. Are there any systematic comparisons or take-home messages about this?
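If it helps, this is the kind of toy check I have in mind (purely synthetic logits and variances, just comparing the two approximations directly; not code from this repo):

```python
import numpy as np
from scipy.special import softmax

rng = np.random.default_rng(0)
mean_logits = rng.normal(size=(5, 3))      # hypothetical predictive mean logits: 5 examples, 3 classes
variances = rng.uniform(0.1, 4.0, size=5)  # hypothetical predictive variance per example (shared across classes)

# Monte Carlo estimate of E[softmax(g)] with g ~ N(mean_logits, variance * I).
samples = mean_logits[None] + np.sqrt(variances)[None, :, None] * rng.normal(size=(2000, 5, 3))
mc_probs = softmax(samples, axis=-1).mean(axis=0)

# Mean-field approximation: temper the logits by 1 / sqrt(1 + (pi/8) * variance), then softmax.
mf_probs = softmax(mean_logits / np.sqrt(1.0 + (np.pi / 8.0) * variances)[:, None], axis=-1)

print(np.abs(mc_probs - mf_probs).max())  # the gap generally grows with the variance
```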
Thank you in advance!

Hi,
just throwing out a possible explanation for 1: maybe one covariance matrix for all classes is better because it reduces overfitting. On large datasets we might see the opposite (more intuitive) effect, i.e., better performance with a covariance matrix per class, because there would be enough data to approximate each class's covariance matrix well.

@JianxiangFENG Did you get or figure out an answer to your last question? I am wondering this myself :)

@Jordy-VL hey, I did not follow up on it in the end, but the relevant paper (https://arxiv.org/abs/2006.0758) is worth reading.