Questions on DAM creation
DianCh opened this issue · comments
Hi! Thank you for releasing such a wonderful work. How DAM is generated was a bit unclear to me when reading the paper. Assuming there are N tokens in total from the encoder (considering one feature level, so N = H x W) and M object queries:
- Regarding "In the case of the dense attention, DAM can be easily obtained by summing up attention maps from every decoder layer": do you mean the cross-attention map with shape N x M?
- Regarding "produces a single map of the same size as the feature map from the backbone": how is this achieved? Could you help walk through the calculation and the shapes of the tensors?
- Why not directly use the DAM to select the top-k tokens and why have a separate scoring network?
Thanks! I look forward to your reply.
Hi, thanks for your interest in our work!
Let's assume there are N tokens from the encoder output and M object queries from the decoder. Then DAM is constructed as a vector of size N, the same as the encoder token length.
- In the case of dense attention, the (cross-)attention map is calculated by softmax(QK^T / sqrt(d)) and has shape M x N. Summing over the M rows (one per query) then yields a vector of size N, which is the DAM. In the case of deformable attention, please refer to Appendix A.2 "DAM Creation in Deformable Attention" or https://github.com/kakaobrain/sparse-detr/blob/main/util/dam.py#L29-L67.
- Here, let's consider only one decoder layer. (For multiple layers, the total DAM is the sum of each layer's DAM.)
  In the case of dense attention, DAM is the row-wise sum of softmax(QK^T / sqrt(d)).
  In the case of deformable attention:
  a. Obtain `sampling_locations` and `attention_weights`. (Dimensions are in https://github.com/kakaobrain/sparse-detr/blob/main/util/dam.py#L30-31.)
  b. Accumulate `attention_weights` at the positions `sampling_locations`, using bilinear interpolation.
  c. Finally, the result is a vector of size N.
- In the forward path, the decoder cross-attention runs after the encoder. Before the encoder's top-k token selection, DAM has not yet been created, so it cannot be used. Therefore, we let a scoring network predict DAM (before its creation) and select the encoder top-k tokens based on the scoring prediction.
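To make the dense-attention case above concrete, here is a minimal NumPy sketch (the function name `dense_dam` and the toy shapes are mine, for illustration only): each layer contributes an M x N softmax map, rows are summed over queries, and the per-layer vectors are summed over layers.

```python
import numpy as np

def dense_dam(attn_maps):
    """attn_maps: list of (M, N) cross-attention maps, one per decoder layer.

    Each row is a softmax over the N encoder tokens, so it sums to 1.
    Summing over the M query rows gives a length-N vector per layer;
    summing those vectors over layers gives the DAM.
    """
    return sum(a.sum(axis=0) for a in attn_maps)

# Toy example: 2 decoder layers, M = 4 queries, N = 6 encoder tokens.
rng = np.random.default_rng(0)
layers = []
for _ in range(2):
    logits = rng.standard_normal((4, 6))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    layers.append(e / e.sum(axis=1, keepdims=True))  # row-wise softmax

dam = dense_dam(layers)
assert dam.shape == (6,)
# Every softmax row sums to 1, so the DAM entries sum to layers * queries = 8.
assert np.isclose(dam.sum(), 8.0)
```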
Feel free to ask further questions, thanks!
@JWoong-Shin I found that when accumulating `attention_weights` at the positions `sampling_locations`, the `sampling_locations` are float coordinates, and DAM scatters the `attention_weights` to the corresponding token locations according to `sampling_locations`. So this requires an inverse bilinear interpolation rather than a bilinear interpolation.
Line 64 in 1ea7a06
Sorry, I don't get the point. In deformable attention, each query borrows values (V) weighted by attention weights to which bilinear interpolation is applied. Then, from the perspective of a value token (V), the amount it is referenced (a count, in the discrete case) is the sum of the attention weights over every query. Therefore, bilinear interpolation should still be used. The detailed formula is described in Appendix A.2. If I'm missing something, please let me know.
@JWoong-Shin Here is a figure to explain my question:
As the figure shows, assume the sampling location from `sampling_locations` is (x, y) with attention weight from `attention_weights`. The goal of DAM is to accumulate attention at the token locations, i.e. (x1, y1), (x1, y2), (x2, y1), (x2, y2), from the sampling location (x, y). The implementation scatter-adds the `attention_weights` to the token locations according to the distance between the token locations and `sampling_locations`, right?
However, bilinear interpolation is usually the inverse of the above operation: given the values at the token locations and an interpolated location, it calculates the value at the interpolated location (x, y).
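For reference, the "usual" direction described above (a gather: read the value at a float location by blending the four surrounding grid points) can be sketched as follows; the function name `gather_bilinear` is mine, not from the repo.

```python
import numpy as np

def gather_bilinear(values, x, y):
    """Standard bilinear interpolation: compute the value AT float (x, y)
    by blending the four surrounding integer grid points (xi, yi)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    out = 0.0
    for xi in (x0, x0 + 1):
        for yi in (y0, y0 + 1):
            g = (1 - abs(x - xi)) * (1 - abs(y - yi))  # bilinear kernel
            out += g * values[yi, xi]
    return out

# Sanity check: on a constant grid, interpolation returns the constant.
values = np.full((4, 4), 5.0)
assert np.isclose(gather_bilinear(values, 1.25, 2.5), 5.0)
```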
> the goal of DAM is to accumulate the attention of the tokens, i.e. (x1, y1), (x1, y2), (x2, y1), (x2, y2), from sampling_locations, i.e. (x, y).
Yes, for some query q, it obtains a value by A * G((x,y), (x1,y1)) * v(x1,y1) + ..., where A is `attention_weights`, G is the bilinear interpolation kernel, and v is the value at the point (x1, y1).
In other words, for the query q, it references (x1, y1) by A * G((x,y), (x1,y1)). (I think we have different understandings here.)
Therefore, from the perspective of the grid point (x1, y1), the DAM value is accumulated by A * G((x,y), (x1,y1)) for the query q, and summing over every query creates the DAM. (The sum is not conducted inside the `attn_map_to_flat_grid` method. The method obtains the interpolated attention weights in the grid shape, and DAM is then obtained by summing over decoder queries and decoder layers:
sparse-detr/models/deformable_detr.py
Lines 408 to 409 in 1ea7a06
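The accumulation described here (each query's attention weight A split among the four grid points around its sampling location by the kernel G) can be sketched as below. This is a simplified stand-in for what `attn_map_to_flat_grid` computes, not the actual implementation; the function name `scatter_bilinear` is mine.

```python
import numpy as np

def scatter_bilinear(grid, x, y, weight):
    """Accumulate `weight` at float location (x, y) onto the grid,
    splitting it among the four neighbours by the bilinear kernel G."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    for xi in (x0, x0 + 1):
        for yi in (y0, y0 + 1):
            g = (1 - abs(x - xi)) * (1 - abs(y - yi))  # G((x, y), (xi, yi))
            if 0 <= yi < grid.shape[0] and 0 <= xi < grid.shape[1]:
                grid[yi, xi] += weight * g

# One sampling location with attention weight A = 1.0 on a 4x4 grid.
grid = np.zeros((4, 4))
scatter_bilinear(grid, 1.25, 2.5, 1.0)
# The full weight is distributed over the four neighbouring grid points;
# e.g. grid point (1, 2) receives (1 - 0.25) * (1 - 0.5) = 0.375.
assert np.isclose(grid.sum(), 1.0)
assert np.isclose(grid[2, 1], 0.375)
```

Summing such per-query grids over all queries and decoder layers yields the DAM.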
> In other words, for the query q, it references (x1, y1) by A * G((x,y), (x1, y1)). (I think we are having different understanding here).
Yeah, I get it now. The goal is to obtain the `attention_weights` at the reference points for each query, instead of `attention_weights` * value at the reference points.
@JWoong-Shin Thanks for your quick response.