IntelLabs / bayesian-torch

A library for Bayesian neural network layers and uncertainty estimation in Deep Learning extending the core of PyTorch


The average should be taken over log probability rather than logits

Nebularaid2000 opened this issue · comments

for mc_run in range(args.num_mc):
    output, kl = model(input_var)
    output_.append(output)
    kl_.append(kl)
output = torch.mean(torch.stack(output_), dim=0)

I think the average across the MC runs should be taken over the log probabilities. However, output here holds the logits before the softmax operation. I think we should first apply output = F.log_softmax(output, dim=1) and then take the average.

There are two equivalent ways to take the average that I think are more reasonable.
The first is:

for mc_run in range(args.num_mc):
    output, kl = model(input_var)
    output = F.log_softmax(output, dim=1)
    output_.append(output)
    kl_.append(kl)
output = torch.mean(torch.stack(output_), dim=0)
loss = F.nll_loss(output, target_var)  # this replaces the original cross-entropy loss

Or equivalently, we can first take the cross-entropy loss for each MC run, and average the losses at the end:

loss = 0
for mc_run in range(args.num_mc):
    output, kl = model(input_var)
    loss = loss + F.cross_entropy(output, target_var)
    kl_.append(kl)
loss = loss / args.num_mc  # this is to replace the original cross_entropy_loss
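The equivalence of these two formulations can be checked numerically. Below is a minimal sketch using NumPy instead of torch so it stands alone; the hand-rolled log_softmax mirrors what F.log_softmax computes:

```python
import numpy as np

def log_softmax(z, axis=-1):
    # numerically stable log-softmax, mirroring F.log_softmax
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

rng = np.random.default_rng(0)
num_mc, num_classes = 5, 3
logits = rng.normal(size=(num_mc, num_classes))  # one input, num_mc stochastic forward passes
target = 1

# Way 1: average the log-probabilities across MC runs, then take the NLL of the target class
loss_way1 = -log_softmax(logits).mean(axis=0)[target]

# Way 2: cross-entropy loss per MC run, then average the losses
loss_way2 = np.mean([-log_softmax(logits[i])[target] for i in range(num_mc)])

print(np.isclose(loss_way1, loss_way2))  # True: both compute -(1/M) * sum_i log p_i[target]
```

Both expressions reduce to the same Monte Carlo average of per-sample negative log-likelihoods, which is why either formulation works.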

There is not much of a hard rule for that in training, but I think it would be worth trying it to see if it improves training.

For inference, I'm 100% sure the average should be over probabilities.

@piEsposito Thanks very much for the reply! Actually, I'm not very certain about the exact training and inference process using Bayes-by-backprop.

I have edited my post. The second way is more understandable. The core difference between my suggestion and the original code is that I calculate the cross-entropy loss for each MC run and average the losses at the end, while the original code first averages the outputs (logits) from the MC runs and uses this averaged output to calculate the loss.

According to the paper Weight Uncertainty in Neural Networks (see Eq. (1)), and the derivation in Good Initializations of Variational Bayes for Deep Models (see Eq. (6) and Eq. (7)), I believe that in training we should first calculate the loss for each MC run and then average the losses at the end, instead of averaging the outputs first.
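For reference, the Monte Carlo estimate of the variational objective in Weight Uncertainty in Neural Networks (Eq. (1), my transcription) has the form

$$\mathcal{F}(\mathcal{D},\theta) \approx \sum_{i=1}^{n} \log q(\mathbf{w}^{(i)}\mid\theta) - \log P(\mathbf{w}^{(i)}) - \log P(\mathcal{D}\mid\mathbf{w}^{(i)})$$

where $\mathbf{w}^{(i)}$ is the $i$-th Monte Carlo weight sample. The data term is a sum of per-sample log-likelihoods $\log P(\mathcal{D}\mid\mathbf{w}^{(i)})$, i.e. the losses are computed per MC run and then aggregated, rather than computing one loss from averaged outputs.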

Please correct me if I'm wrong :-)

By the way, I'm a bit confused by your statement "For inference 100% sure average should be over probabilities". It seems that for inference we have three ways to average the outputs over different MC runs:

  1. average over the logits before the softmax operation
  2. average over the probability
  3. average over the log probability

I'm not sure which one to use in inference. The first one seems reasonable since it is what is done in model ensembles. Could you give some hints or suggestions on this? Thanks a lot.
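For concreteness, here is a small sketch (NumPy standing in for the torch calls) showing that the three schemes are genuinely different. One detail worth noting: exponentiating averaged log-probabilities gives a geometric mean that no longer sums to 1, and renormalizing it recovers scheme 1 exactly:

```python
import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax, mirroring F.softmax
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(scale=3.0, size=(10, 4))  # 10 MC runs, 4 classes

p1 = softmax(logits.mean(axis=0))                  # 1. average the logits, then softmax
p2 = softmax(logits).mean(axis=0)                  # 2. average the probabilities
p3 = np.exp(np.log(softmax(logits)).mean(axis=0))  # 3. average the log-probabilities

print(p1.sum(), p2.sum(), p3.sum())   # p1 and p2 sum to 1; p3 sums to less than 1
print(np.allclose(p1, p3 / p3.sum())) # True: renormalized scheme 3 equals scheme 1
```

So schemes 1 and 3 are the same distribution up to normalization (a geometric mean of the per-run probabilities), while scheme 2 is the arithmetic mean and generally differs from both.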

@Nebularaid2000 I would suggest "2. average over the probability" since you need the predictive probabilities for computing the uncertainty metrics e.g. predictive entropy (https://github.com/IntelLabs/bayesian-torch/blob/main/bayesian_torch/utils/util.py#L45) or mutual information (https://github.com/IntelLabs/bayesian-torch/blob/main/bayesian_torch/utils/util.py#L53)
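As a sketch of why the mean probabilities are needed: the linked utilities compute, in essence, the following (NumPy stand-in under my reading of the linked code, not the library's exact implementation):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
probs = softmax(rng.normal(size=(20, 4)))  # (num_mc, num_classes) MC predictive probabilities

mean_p = probs.mean(axis=0)  # scheme 2: average over the probabilities

# total predictive uncertainty: entropy of the mean distribution
predictive_entropy = -(mean_p * np.log(mean_p + 1e-12)).sum()

# aleatoric part: mean of the per-run entropies
expected_entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1).mean()

# epistemic part: mutual information between prediction and weights
mutual_information = predictive_entropy - expected_entropy
```

Both metrics start from mean_p, which is why the probabilities (not the logits or log-probabilities) are what you want to average and keep around at inference time. By concavity of entropy, mutual_information is non-negative.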

@ranganathkrishnan Thanks for your answer!
What about the calculation of the loss in the training process? Is it better if we follow Weight Uncertainty in Neural Networks (Eq. (1)) to first calculate the cross entropy loss for each MC run and then average them, instead of first averaging the output and then calculating the cross entropy loss?

@Nebularaid2000 If multiple MC samples are used during training, I think it should be better to calculate the cross-entropy loss for each MC run and then average them, if that helps with training convergence. The training run script can be modified as in the snippet below. There was no difference in the current example run script since num_mc=1.

# another way of computing gradients with multiple MC samples
cross_entropy_ = []
kl_ = []
output_ = []
for mc_run in range(args.num_mc):
    output, kl = model(input_var)
    cross_entropy_.append(criterion(output, target_var))
    kl_.append(kl)
    output_.append(output)
output = torch.mean(torch.stack(output_), dim=0)
loss = torch.mean(torch.stack(cross_entropy_), dim=0) + torch.mean(torch.stack(kl_), dim=0)/args.batch_size