Questions on DAM creation
DianCh opened this issue · comments
Hi! Thank you for releasing such a wonderful work. How DAM is generated was a bit unclear to me when reading the paper. Assuming there are N tokens in total from the encoder (considering one feature level, so N = H x W) and M object queries:
- Regarding "In the case of the dense attention, DAM can be easily obtained by summing up attention maps from every decoder layer": do you mean the cross-attention map with shape N x M?
- Regarding "produces a single map of the same size as the feature map from the backbone": how is this achieved? Could you help walk through the calculation and the shapes of the tensors?
- Why not directly use the DAM to select the top-k tokens and why have a separate scoring network?
Thanks! I look forward to your reply.
Hi, thanks for your interest in our work!
Let's assume there are N tokens from the encoder output and M object queries from the decoder. Then DAM is constructed as a vector of size N, the same as the encoder token length.
- In the case of dense attention, the (cross-)attention map is calculated by softmax(QK^T / sqrt(d)) and has shape M x N. Summing over the M rows (one per query) then yields a vector of size N, which is the DAM. In the case of deformable attention, please refer to Appendix A.2 "DAM Creation in Deformable Attention" or https://github.com/kakaobrain/sparse-detr/blob/main/util/dam.py#L29-L67.
- Here, let's consider only one decoder layer. (For multiple layers, the total DAM is the sum of each layer's DAM.)
  In the case of dense attention, DAM is the row-wise sum of softmax(QK^T / sqrt(d)).
  In the case of deformable attention:
  a. Obtain `sampling_locations` and `attention_weights`. (Dimensions are in https://github.com/kakaobrain/sparse-detr/blob/main/util/dam.py#L30-31.)
  b. Accumulate `attention_weights` at the positions `sampling_locations`, using bilinear interpolation.
  c. Finally, the result is a vector of size N.
- In the forward path, the decoder cross-attention runs after the encoder. Before the encoder's top-k token selection, DAM has not yet been created, so it cannot be used. Therefore, we let a scoring network predict DAM (before its creation) and select the encoder top-k tokens based on the scoring prediction.
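To make the dense-attention case above concrete, here is a minimal NumPy sketch (the function name `dense_dam` and the toy shapes are mine, for illustration only): each layer contributes an M x N softmax map, rows are summed over queries, and the per-layer vectors are summed over layers.

```python
import numpy as np

def dense_dam(attn_maps):
    """attn_maps: list of (M, N) cross-attention maps, one per decoder layer.

    Each row is a softmax over the N encoder tokens, so it sums to 1.
    Summing over the M query rows gives a length-N vector per layer;
    summing those vectors over layers gives the DAM.
    """
    return sum(a.sum(axis=0) for a in attn_maps)

# Toy example: 2 decoder layers, M = 4 queries, N = 6 encoder tokens.
rng = np.random.default_rng(0)
layers = []
for _ in range(2):
    logits = rng.standard_normal((4, 6))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    layers.append(e / e.sum(axis=1, keepdims=True))  # row-wise softmax

dam = dense_dam(layers)
assert dam.shape == (6,)
# Every softmax row sums to 1, so the DAM entries sum to layers * queries = 8.
assert np.isclose(dam.sum(), 8.0)
```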
Feel free to ask further questions, thanks!
@JWoong-Shin I found that when accumulating `attention_weights` at the positions `sampling_locations`, the `sampling_locations` are float coordinates, and DAM scatters the `attention_weights` to the corresponding token locations according to `sampling_locations`. So this requires an inverse bilinear interpolation rather than a bilinear interpolation.
Line 64 in 1ea7a06
Sorry, I don't get the point. In deformable attention, each query borrows values (V) weighted by attention weights to which bilinear interpolation is applied. Then, from the perspective of a value token (V), the amount it is referenced (a count, in the discrete case) is the sum of the attention weights over every query. Therefore, bilinear interpolation should still be used. The detailed formula is described in Appendix A.2. If I'm missing something, please let me know.
@JWoong-Shin Here is a figure to explain my question:
As the figure shows, assume the sampling location from `sampling_locations` is (x, y) with attention weight from `attention_weights`. The goal of DAM is to accumulate attention at the token locations, i.e. (x1, y1), (x1, y2), (x2, y1), (x2, y2), from the sampling location (x, y). The implementation scatter-adds the `attention_weights` to the token locations according to the distance between the token locations and `sampling_locations`, right?
However, bilinear interpolation is usually the inverse of the above operation: given the values at the token locations and an interpolated location, it calculates the value at the interpolated location (x, y).
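For reference, the "usual" direction described above (a gather: read the value at a float location by blending the four surrounding grid points) can be sketched as follows; the function name `gather_bilinear` is mine, not from the repo.

```python
import numpy as np

def gather_bilinear(values, x, y):
    """Standard bilinear interpolation: compute the value AT float (x, y)
    by blending the four surrounding integer grid points (xi, yi)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    out = 0.0
    for xi in (x0, x0 + 1):
        for yi in (y0, y0 + 1):
            g = (1 - abs(x - xi)) * (1 - abs(y - yi))  # bilinear kernel
            out += g * values[yi, xi]
    return out

# Sanity check: on a constant grid, interpolation returns the constant.
values = np.full((4, 4), 5.0)
assert np.isclose(gather_bilinear(values, 1.25, 2.5), 5.0)
```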
> the goal of DAM is to accumulate the attention of the tokens, i.e. (x1, y1), (x1, y2), (x2, y1), (x2, y2), from sampling_locations, i.e. (x, y).
Yes, for some query q, it obtains a value by A * G((x,y), (x1,y1)) * v(x1,y1) + ..., where A is `attention_weights`, G is the bilinear interpolation kernel, and v is the value at the point (x1, y1).
In other words, for the query q, it references (x1, y1) by A * G((x,y), (x1,y1)). (I think we have different understandings here.)
Therefore, from the perspective of the grid point (x1, y1), the DAM value is accumulated by A * G((x,y), (x1,y1)) for the query q, and summing over every query creates the DAM. (The sum is not conducted inside the `attn_map_to_flat_grid` method. The method obtains the interpolated attention weights in the grid shape, and DAM is then obtained by summing over decoder queries and decoder layers:
sparse-detr/models/deformable_detr.py
Lines 408 to 409 in 1ea7a06
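The accumulation described here (each query's attention weight A split among the four grid points around its sampling location by the kernel G) can be sketched as below. This is a simplified stand-in for what `attn_map_to_flat_grid` computes, not the actual implementation; the function name `scatter_bilinear` is mine.

```python
import numpy as np

def scatter_bilinear(grid, x, y, weight):
    """Accumulate `weight` at float location (x, y) onto the grid,
    splitting it among the four neighbours by the bilinear kernel G."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    for xi in (x0, x0 + 1):
        for yi in (y0, y0 + 1):
            g = (1 - abs(x - xi)) * (1 - abs(y - yi))  # G((x, y), (xi, yi))
            if 0 <= yi < grid.shape[0] and 0 <= xi < grid.shape[1]:
                grid[yi, xi] += weight * g

# One sampling location with attention weight A = 1.0 on a 4x4 grid.
grid = np.zeros((4, 4))
scatter_bilinear(grid, 1.25, 2.5, 1.0)
# The full weight is distributed over the four neighbouring grid points;
# e.g. grid point (1, 2) receives (1 - 0.25) * (1 - 0.5) = 0.375.
assert np.isclose(grid.sum(), 1.0)
assert np.isclose(grid[2, 1], 0.375)
```

Summing such per-query grids over all queries and decoder layers yields the DAM.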
> In other words, for the query q, it references (x1, y1) by A * G((x,y), (x1, y1)). (I think we are having different understanding here).
Yeah, I get it now. The goal is to obtain the `attention_weights` at the reference points for each query, instead of `attention_weights` * value at the reference points.
@JWoong-Shin Thanks for your quick response.