yangsenius / TransPose

PyTorch implementation for "TransPose: Keypoint Localization via Transformer", ICCV 2021.

Home Page: https://github.com/yangsenius/TransPose/releases/download/paper/transpose.pdf

position embedding

lingorX opened this issue

Hi, first, thank you for making this work open-source.
I notice that the position embeddings are added to the sequence at every Transformer layer.
But in BERT or ViT, this addition is performed only once, before the sequence is fed to the Transformer encoder.
I wonder what the motivation is for designing it this way.
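
For reference, here is a minimal sketch of the BERT/ViT-style placement being described; the class and parameter names are illustrative and not taken from either codebase. The position embedding is added to the flattened sequence a single time, before the encoder stack, so later layers receive no explicit positional signal.

import torch
import torch.nn as nn

# Illustrative sketch (names are not from the TransPose or ViT code):
# the learned PE is injected into the sequence exactly once, up front.
class OneShotPEEncoder(nn.Module):
    def __init__(self, seq_len=1024, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(seq_len, 1, d_model))  # learned PE
        layer = nn.TransformerEncoderLayer(d_model, nhead)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x):           # x: (seq_len, batch, d_model)
        x = x + self.pos_embed      # PE added a single time
        return self.encoder(x)      # subsequent layers see no explicit PE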

[W TensorIterator.cpp:918] Warning: Mixed memory format inputs detected while calling the operator. The operator will output contiguous tensor even if some of the inputs are in channels_last format. (function operator())
[W TensorIterator.cpp:924] Warning: Mixed memory format inputs detected while calling the operator. The operator will output channels_last tensor even if some of the inputs are not in channels_last format. (function operator())

pos = torch.cat((pos_y, pos_x), dim=3).permute(0, 3, 1, 2)

pos = pos.flatten(2).permute(2, 0, 1)

Maybe .contiguous() is needed here.
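
A sketch of the suggested change, continuing directly from the two lines quoted above and reusing their variable names. The idea is to materialize the position tensor in standard (contiguous) memory format after the permutes, so it no longer mixes memory formats with channels_last feature maps. Whether this silences the warning depends on the caller's memory format, so treat it as a possible fix rather than a confirmed one.

# Suggested (not adopted) variant: force contiguous memory layout after each permute.
pos = torch.cat((pos_y, pos_x), dim=3).permute(0, 3, 1, 2).contiguous()
pos = pos.flatten(2).permute(2, 0, 1).contiguous()   # (H*W, batch, channels) sequence layout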

Hi~ @0liliulei.

From the view of the whole Transformer encoder, injecting the PE into the input sequence only once is enough. But each self-attention layer is itself permutation-equivariant to its input sequence. We think it is beneficial to add a consistent position embedding to all attention layers, since human pose estimation is a localization task rather than an image-classification task (as with ViT). The task may be sensitive to position information, particularly in the last few layers. This is our motivation.

And we empirically find that adding the PE to each layer performs slightly better than adding it only to the initial input. In addition, DETR also runs ablations on this, and its results show that adding the PE to each attention layer is better.
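
For illustration, here is a sketch of this per-layer injection in the DETR style; the class and argument names are illustrative and not the TransPose implementation. The same position embedding is added to the queries and keys inside every encoder layer, while the values and the residual path stay content-only.

import torch
import torch.nn as nn

# Illustrative encoder layer (not the TransPose code): the PE is re-added to
# the queries and keys at every layer, so each attention block keeps an
# explicit positional signal.
class EncoderLayerWithPE(nn.Module):
    def __init__(self, d_model=256, nhead=8, dim_feedforward=1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead)
        self.ffn = nn.Sequential(nn.Linear(d_model, dim_feedforward), nn.ReLU(),
                                 nn.Linear(dim_feedforward, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, src, pos):            # src, pos: (seq_len, batch, d_model)
        q = k = src + pos                   # PE re-injected at this layer
        attn_out, _ = self.self_attn(q, k, value=src)   # values stay content-only
        src = self.norm1(src + attn_out)
        src = self.norm2(src + self.ffn(src))
        return src

Stacking such layers and passing the same pos tensor to each one gives the "add PE at every attention layer" behavior discussed above, as opposed to the one-shot addition sketched earlier.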

Regarding the warning you reported, I haven't encountered this problem myself, so I am not sure what in your setup causes it. And thank you very much for your suggestion!

Thank you for your answer; the PE is indeed important for training a Transformer.