kazuto1011 / grad-cam-pytorch

PyTorch re-implementation of Grad-CAM (+ vanilla/guided backpropagation, deconvnet, and occlusion sensitivity maps)

How to do GuidedBackPropagation

csyuhao opened this issue · comments

Hi @kazuto1011, thanks a lot for your great work! I have a small question about using GuidedBackPropagation with my own model. In my model, I use a PReLU layer instead of a ReLU module. How do I apply GuidedBackPropagation in this case? Should I do the same thing as with ReLU?

To my understanding, the core idea of guided backpropagation is to block the backward flow of any signal x that decreases the score E. In the ReLU case, the gradient is already zeroed wherever the bottom-up activation is zero, so guided backpropagation can additionally set the negative gradients ∂E/∂x < 0 to zero in the top-down pass. The same approach should work for PReLU, although I am not sure whether the masked, sparse gradients will be visualized as clearly in pixel space. Anyway, it can be done with the following backward hook; give it a try!

def backward_hook(module, grad_in, grad_out):
    # mask out the negative gradients flowing back through PReLU, as with ReLU
    if isinstance(module, nn.PReLU):
        return (F.relu(grad_in[0]),)
commented

Thanks a lot for your quick response!

I tried this backward hook and found an error, which is caused by the weight a of PReLU: grad_in contains two elements. One is the gradient with respect to the input data (specifically, the feature maps); the other is the gradient with respect to the weight a. Since the gradient of a is unrelated to the input data, we can simply pass it through untouched. Here is my backward hook.

import torch.nn as nn
import torch.nn.functional as F

def backward_hook(module, grad_in, grad_out):
    # cut off negative gradients flowing back through the activation
    if isinstance(module, nn.ReLU):
        return (F.relu(grad_in[0]),)
    elif isinstance(module, nn.PReLU):
        # grad_in[1] is the gradient w.r.t. the PReLU weight; pass it through unchanged
        return (F.relu(grad_in[0]), grad_in[1])
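
For reference, here is a minimal registration sketch, assuming the legacy Module.register_backward_hook API (whose grad_input can include parameter gradients, consistent with the two-element grad_in above) and a network instance named model; both names are placeholders for your own code.

# hypothetical registration loop; `model` stands in for your own network
handles = []
for module in model.modules():
    if isinstance(module, (nn.ReLU, nn.PReLU)):
        handles.append(module.register_backward_hook(backward_hook))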

I also have another question, about Grad-CAM with PReLU. In my model, I replaced ReLU with PReLU. In the original Grad-CAM paper, the weighted combination of feature maps ∑ₖ αₖ Aᵏ is passed through ReLU to cut off negative values. However, in my model the negative values also contribute to the final logits.

So I think I should replace ReLU with PReLU when generating Grad-CAM.

# feature maps and gradients cached at the target layer
fmaps = self._find(self.fmap_pool, target_layer)
grads = self._find(self.grad_pool, target_layer)
weights = F.adaptive_avg_pool2d(grads, 1)  # alpha_k: spatial average of the gradients per channel

gcam = torch.mul(fmaps, weights).sum(dim=1, keepdim=True)
# use the learned PReLU slope instead of ReLU so negative responses are scaled, not zeroed
if prelu_weight is not None:
    gcam = F.prelu(gcam, prelu_weight.mean())
else:
    gcam = F.relu(gcam)
gcam = F.interpolate(gcam, self.image_shape, mode='bilinear', align_corners=False)

The architecture of the model is "target_layer (final Conv2d) => PReLU => Linear", so prelu_weight is taken from the PReLU layer that follows the target layer.
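
For illustration only, here is a hypothetical model with that layout (all class and attribute names are made up), showing where such a prelu_weight could come from:

import torch.nn as nn

class TinyNet(nn.Module):
    """Illustrative only: Conv2d (target_layer) => PReLU => Linear."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)  # target_layer
        self.act = nn.PReLU(num_parameters=64)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(64, num_classes)

    def forward(self, x):
        x = self.act(self.conv(x))
        return self.fc(self.pool(x).flatten(1))

model = TinyNet()
prelu_weight = model.act.weight.detach()  # the slope(s) passed to F.prelu above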

Do you think this is the right way to do Grad-CAM? I would appreciate any advice.

Thank you for making the revision!
I think the original ∑ₖ αₖ Aᵏ assumes that the activations satisfy Aᵏ >= 0; the negative responses come only from αₖ, which is why the authors compute ReLU(∑ₖ αₖ Aᵏ) to cut off the negative contributions. The issue in your case is that Aᵏ itself has meaningful negative values, not that the final operation is ReLU. So I propose cutting off the negative gradients first, although the result will be a bit different from Grad-CAM.

grads = F.relu(grads)  # drop negative gradients before pooling
weights = F.adaptive_avg_pool2d(grads, 1)
gcam = torch.mul(fmaps, weights).sum(dim=1, keepdim=True)

I assume that fmaps is the activation map from the last PReLU, as in Grad-CAM.
Grad-CAM++ applies ReLU within the weight computation as a generalized case.
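
As a rough illustration (not code from this repository), here is a sketch of the Grad-CAM++ weighting under the common approximation that the class score is exponentiated, so the higher-order derivatives reduce to powers of the first-order gradient; it reuses the fmaps and grads tensors from above.

# Grad-CAM++-style weights (sketch): alpha coefficients from the exp-score approximation,
# then ReLU applied to the gradients inside the weighted sum
grads_2 = grads ** 2
grads_3 = grads ** 3
sum_fmaps = fmaps.sum(dim=(2, 3), keepdim=True)
alpha = grads_2 / (2 * grads_2 + sum_fmaps * grads_3 + 1e-7)
weights = (alpha * F.relu(grads)).sum(dim=(2, 3), keepdim=True)
gcam = F.relu((weights * fmaps).sum(dim=1, keepdim=True))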

commented

Thanks for your advice. I will read the Grad-CAM++ paper.