SysCV / idisc

iDisc: Internal Discretization for Monocular Depth Estimation [CVPR 2023]

Home Page: https://arxiv.org/abs/2304.06334



Internal Discretization Figure

baolinv opened this issue · comments

Hi, thanks for your great work!
I'm wondering how (d) Internal discretization in Figure 1 of the paper was generated.

I infer that the ID is the maximum value of (QiKi), based on "(QiKi) is the spatial location for which each specific IDR is responsible", as stated at the end of the fourth page of the paper.

Could you provide me with the concrete computation process?

Thank you for your appreciation!
The figure you are referring to was produced with the following recipe. We picked the attention maps of the first iteration/attention layer of the ISD heads (since the second layer is a residual update, it is less meaningful). Then, for each selected attention map (see below), we upsampled it to the output resolution (1/4 of the input image resolution) and equalized it by clamping the values between the 0.5 and 0.98 quantiles and rescaling to [0, 1], e.g., low_q, up_q = torch.quantile(attn_map, 0.5), torch.quantile(attn_map, 0.98); attn_map = torch.clamp(attn_map, low_q, up_q); attn_map = (attn_map - attn_map.min()) / (attn_map.max() - attn_map.min()). The equalization was done to prevent a fog effect in some maps, which is why there are some gaps in the visualization.
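For reference, here is a minimal self-contained sketch of that equalization step, assuming attn_map is a single IDR attention map of shape (1, 1, h, w); the (120, 160) default corresponds to 1/4 of a 480x640 NYU input and is just an example:

import torch
import torch.nn.functional as F

def equalize_attention(attn_map, out_hw=(120, 160), low=0.5, high=0.98):
    # attn_map: (1, 1, h, w) attention of a single IDR over the feature map
    attn_map = F.interpolate(attn_map, size=out_hw, mode="bilinear", align_corners=True)
    # Clamp between the low/high quantiles and rescale to [0, 1] to avoid the fog effect
    low_q, up_q = torch.quantile(attn_map, low), torch.quantile(attn_map, high)
    attn_map = torch.clamp(attn_map, low_q, up_q)
    return (attn_map - attn_map.min()) / (attn_map.max() - attn_map.min())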

The selected attention maps (I think, but I'm not sure, and the code is not straightforward to dig up) came from IDRs number 0, 13, 30, and 31 at the lowest resolution and IDR number 2 at the highest resolution.
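For illustration only, a sketch of how those indices might be pulled out; the attn_maps_per_res structure and its ordering are assumptions, not the repository's actual variables:

# Hypothetical: list of per-resolution attention maps, each a tensor of shape (num_idrs, h, w).
# attn_maps_per_res[0] is assumed to be the highest resolution, [-1] the lowest.
low_res_attn = attn_maps_per_res[-1]
high_res_attn = attn_maps_per_res[0]
selected = [low_res_attn[i] for i in (0, 13, 30, 31)] + [high_res_attn[2]]
# Upsample and equalize each map with the function sketched above
selected = [equalize_attention(a[None, None]) for a in selected]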

Thanks a lot for your detailed reply, but I still don't get similar semantic regions.

Can you help me check where the problem is in the following code?

Attention (depth_attn):

import torch
import torch.nn as nn
from einops import rearrange


class ISDHead(nn.Module):
    # pixel_pe, cross_attn_*, mlp_*, proj_output and self.depth are defined in the
    # original iDisc ISDHead __init__ (omitted here); only forward() was modified.
    def forward(self, feature_map: torch.Tensor, idrs: torch.Tensor, isshow_attn=True):
        b, c, h, w = feature_map.shape
        # Add positional encoding and flatten the spatial dimensions
        feature_map = rearrange(feature_map + self.pixel_pe(feature_map), "b c h w -> b (h w) c")
        depth_attn = None
        for i in range(self.depth):
            # Cross-attention from pixels to IDRs, followed by a residual MLP update
            update = getattr(self, f"cross_attn_{i + 1}")(feature_map.clone(), idrs)
            feature_map = feature_map + update
            feature_map = feature_map + getattr(self, f"mlp_{i + 1}")(feature_map.clone())

            if i == 0:
                # Keep the first layer's output for visualization
                depth_attn = update

        out = getattr(self, "proj_output")(feature_map)
        out = rearrange(out, "b (h w) c -> b c h w", h=h, w=w)
        if isshow_attn:
            return out, depth_attn
        else:
            return out

Generate ID from attention (cls_map):

import numpy as np
import torch
import torch.nn.functional as F

# depth_attn from the head above: reshape (b, h*w, c) -> (1, c, h, w)
attn_map = torch.reshape(attn_map, [1, h, w, -1]).permute(0, 3, 1, 2)
attn_map = F.interpolate(attn_map, size=(120, 160), mode="bilinear", align_corners=True)
low_q, up_q = torch.quantile(attn_map, 0.5), torch.quantile(attn_map, 0.98)
attn_map = torch.clamp(attn_map, low_q, up_q)
attn_map = (attn_map - attn_map.min()) / (attn_map.max() - attn_map.min())
attn = attn_map.squeeze().cpu().numpy()

# Assign each pixel to the channel with the highest attention value
cls_map = np.argmax(attn, axis=0).astype(np.uint8)

Is cls_map the map of IDR assignments? I suspect the issue is here. For test images, I don't get semantic regions similar to the paper's using this code.

I'm looking forward to your response.

The first snippet works fine, and I guess you are also returning depth_attn from the ISD class, as a list with one entry per resolution. The second part should be a bit different.
I believe the pixel-wise argmax operation returns really noisy maps because some attention maps collapse onto each other, as shown in the paper (i.e., some attention maps become almost identical to others).
What we actually did was select a few representative attention maps, i.e., ones that differ from each other, e.g., 5 maps from the list attn, then threshold them at 0.5 (you can skip this depending on what you want to visualize), and then plot them on the image. The indices of the attention maps you can see in the teaser figure should (though I'm not absolutely certain) be the ones in my first comment.
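As a rough illustration of that last step (thresholding at 0.5 and plotting on the image), here is a hedged sketch; the colormaps, alpha, and overlay style are arbitrary choices, not the authors' exact plotting code:

import numpy as np
import matplotlib.pyplot as plt

def overlay_attention(image, attn_maps, thresh=0.5,
                      cmaps=("Reds", "Blues", "Greens", "Purples", "Oranges")):
    # image: (H, W, 3) array in [0, 1]; attn_maps: list of (H, W) maps already rescaled to [0, 1]
    plt.imshow(image)
    for attn, cmap in zip(attn_maps, cmaps):
        masked = np.where(attn > thresh, attn, np.nan)  # NaNs render as transparent
        plt.imshow(masked, cmap=cmap, alpha=0.5, vmin=0.0, vmax=1.0)
    plt.axis("off")
    plt.show()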

Thanks much for your quick response. I'm sorry to disturb you again.

You say:
"What we actually did was select a few representative attention maps, i.e., ones that differ from each other";
however, it is a little difficult to define a uniform rule for "differ from each other".

I have tried rules based on thresholding, variance, clustering, and so on, but I cannot generate an image as clean as the one shown in the paper.

If convenient, could you share your code by email? (my email: xingxx2011@gmail.com)

I really want to reproduce the result and I'm looking forward to your response.

Hi. I have already configured the environment, but I don't know how to use your code to predict the depth of a picture. How can I get the depth map?

I also wonder how (d) Internal discretization in Figure 1 of the paper was generated. Could you share the code? Thanks.