SysCV / idisc

iDisc: Internal Discretization for Monocular Depth Estimation [CVPR 2023]

Home Page: https://arxiv.org/abs/2304.06334



Internal Discretization Figure

baolinv opened this issue · comments

Hi, thanks for your great work!
I'm wondering how (d) Internal discretization in Figure 1 of the paper was generated.

I infer that the ID is the maximum value of (QiKi), based on "(QiKi) is the spatial location for which each specific IDR is responsible", as stated at the end of the fourth page of the paper.

Could you provide me with the concrete computation process?

Thank you for your appreciation!
The figure you are referring to was produced with the following recipe. We picked the attention maps of the first iteration/attention layer of the ISD heads (since the second layer is a residual update, it is less meaningful). Then, for each selected attention map (see below), we upsampled it to the output resolution (1/4 of the input image resolution) and equalized it by clamping the values between the 0.5 and 0.98 quantiles and rescaling to [0, 1], e.g., low_q, up_q = torch.quantile(attn_map, 0.5), torch.quantile(attn_map, 0.98); attn_map = torch.clamp(attn_map, low_q, up_q); attn_map = (attn_map - attn_map.min()) / (attn_map.max() - attn_map.min()). The equalization was done to prevent a fog effect in some maps, which is why there are some gaps in the visualization.
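For reference, here is a minimal self-contained sketch of that equalization step, assuming attn_map is a single IDR attention map of shape (1, 1, h, w); the (120, 160) default corresponds to 1/4 of a 480x640 NYU input and is just an example:

import torch
import torch.nn.functional as F

def equalize_attention(attn_map, out_hw=(120, 160), low=0.5, high=0.98):
    # attn_map: (1, 1, h, w) attention of a single IDR over the feature map
    attn_map = F.interpolate(attn_map, size=out_hw, mode="bilinear", align_corners=True)
    # Clamp between the low/high quantiles and rescale to [0, 1] to avoid the fog effect
    low_q, up_q = torch.quantile(attn_map, low), torch.quantile(attn_map, high)
    attn_map = torch.clamp(attn_map, low_q, up_q)
    return (attn_map - attn_map.min()) / (attn_map.max() - attn_map.min())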

The selected attention maps (I think, but I'm not sure, and the code is not straightforward to dig up) came from IDRs number 0, 13, 30, and 31 at the lowest resolution and IDR number 2 at the highest resolution.
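For illustration only, a sketch of how those indices might be pulled out; the attn_maps_per_res structure and its ordering are assumptions, not the repository's actual variables:

# Hypothetical: list of per-resolution attention maps, each a tensor of shape (num_idrs, h, w).
# attn_maps_per_res[0] is assumed to be the highest resolution, [-1] the lowest.
low_res_attn = attn_maps_per_res[-1]
high_res_attn = attn_maps_per_res[0]
selected = [low_res_attn[i] for i in (0, 13, 30, 31)] + [high_res_attn[2]]
# Upsample and equalize each map with the function sketched above
selected = [equalize_attention(a[None, None]) for a in selected]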

Thanks a lot for your detailed reply, but I still don't get similar semantic regions.

Can you help me check where the problem is in the following code?

Attention (depth_attn):

import torch
import torch.nn as nn
from einops import rearrange


class ISDHead(nn.Module):
    # pixel_pe, cross_attn_*, mlp_*, proj_output and self.depth are defined in the
    # original iDisc ISDHead __init__ (omitted here); only forward() was modified.
    def forward(self, feature_map: torch.Tensor, idrs: torch.Tensor, isshow_attn=True):
        b, c, h, w = feature_map.shape
        # Add positional encoding and flatten the spatial dimensions
        feature_map = rearrange(feature_map + self.pixel_pe(feature_map), "b c h w -> b (h w) c")
        depth_attn = None
        for i in range(self.depth):
            # Cross-attention from pixels to IDRs, followed by a residual MLP update
            update = getattr(self, f"cross_attn_{i + 1}")(feature_map.clone(), idrs)
            feature_map = feature_map + update
            feature_map = feature_map + getattr(self, f"mlp_{i + 1}")(feature_map.clone())

            if i == 0:
                # Keep the first layer's output for visualization
                depth_attn = update

        out = getattr(self, "proj_output")(feature_map)
        out = rearrange(out, "b (h w) c -> b c h w", h=h, w=w)
        if isshow_attn:
            return out, depth_attn
        else:
            return out

Generate ID from attention (cls_map):

import numpy as np
import torch
import torch.nn.functional as F

# depth_attn from the head above: reshape (b, h*w, c) -> (1, c, h, w)
attn_map = torch.reshape(attn_map, [1, h, w, -1]).permute(0, 3, 1, 2)
attn_map = F.interpolate(attn_map, size=(120, 160), mode="bilinear", align_corners=True)
low_q, up_q = torch.quantile(attn_map, 0.5), torch.quantile(attn_map, 0.98)
attn_map = torch.clamp(attn_map, low_q, up_q)
attn_map = (attn_map - attn_map.min()) / (attn_map.max() - attn_map.min())
attn = attn_map.squeeze().cpu().numpy()

# Assign each pixel to the channel with the highest attention value
cls_map = np.argmax(attn, axis=0).astype(np.uint8)

Is cls_map the map of IDR assignments? I suspect the issue is here. For test images, I don't get semantic regions similar to the paper's using this code.

I'm looking forward to your response.

The first snippet works fine, and I guess you are also returning depth_attn from the ISD class, as a list with one entry per resolution. The second part should be a bit different.
I believe the pixel-wise argmax operation returns really noisy maps because some attention maps collapse onto each other, as shown in the paper (i.e., some attention maps become almost identical to others).
What we actually did was select a few representative attention maps, i.e., ones that differ from each other, e.g., 5 maps from the list attn, then threshold them at 0.5 (you can skip this depending on what you want to visualize), and then plot them on the image. The indices of the attention maps you can see in the teaser figure should (though I'm not absolutely certain) be the ones in my first comment.
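As a rough illustration of that last step (thresholding at 0.5 and plotting on the image), here is a hedged sketch; the colormaps, alpha, and overlay style are arbitrary choices, not the authors' exact plotting code:

import numpy as np
import matplotlib.pyplot as plt

def overlay_attention(image, attn_maps, thresh=0.5,
                      cmaps=("Reds", "Blues", "Greens", "Purples", "Oranges")):
    # image: (H, W, 3) array in [0, 1]; attn_maps: list of (H, W) maps already rescaled to [0, 1]
    plt.imshow(image)
    for attn, cmap in zip(attn_maps, cmaps):
        masked = np.where(attn > thresh, attn, np.nan)  # NaNs render as transparent
        plt.imshow(masked, cmap=cmap, alpha=0.5, vmin=0.0, vmax=1.0)
    plt.axis("off")
    plt.show()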

Thanks much for your quick response. I'm sorry to disturb you again.

You say:
"What we actually did was select a few representative attention maps, i.e., ones that differ from each other";
however, it is a little difficult to define a uniform rule for "differ from each other".

I have tried rules based on thresholding, variance, clustering, and so on, but I cannot generate an image as clean as the one shown in the paper.

If convenient, could you share your code by email? (my email: xingxx2011@gmail.com)

I really want to reproduce the result and I'm looking forward to your response.

Hi. I have already configured the environment, but I don't know how to use your code to predict the depth of a picture. How can I get the depth map?

I also wonder how (d) Internal discretization in Figure 1 of the paper was generated. Could you share the code? Thanks.