yitu-opensource / T2T-ViT

ICCV2021, Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Question about the 'unfold' operation.

JaminFong opened this issue · comments

Hi Li,
I have just read your paper "Tokens-to-Token ViT", which proposes a very interesting and effective method, and I have a question that puzzles me.
The "unfold" operation followed by a linear layer for generating "qkv" seems equivalent to a k×k convolution. I wonder whether my understanding is correct; please correct me if I'm wrong.
Hoping for your reply.
Best regards.

self.soft_split0 = nn.Unfold(kernel_size=(7, 7), stride=(4, 4), padding=(2, 2))

k, q, v = torch.split(self.kqv(x), self.emb, dim=-1)
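For reference, here is a minimal, self-contained sketch of the observation in the question. It is not code from this repository; the batch size, image size, and projection width are illustrative assumptions. It checks that an nn.Unfold followed by a single nn.Linear produces the same tokens as a k×k nn.Conv2d carrying the reshaped Linear weight.

import torch
import torch.nn as nn

# Sketch (illustrative shapes, not the repo's code): Unfold + Linear vs. a k x k Conv2d.
B, C, H, W = 2, 3, 56, 56          # assumed toy input size
k, s, p = 7, 4, 2                  # kernel, stride, padding as in soft_split0
embed_dim = 64                     # assumed projection width

x = torch.randn(B, C, H, W)

unfold = nn.Unfold(kernel_size=(k, k), stride=(s, s), padding=(p, p))
proj = nn.Linear(C * k * k, embed_dim)

# Path 1: unfold the patches, then project each patch with the linear layer.
tokens = proj(unfold(x).transpose(1, 2))               # (B, L, embed_dim)

# Path 2: a Conv2d carrying the same weights, reshaped to (embed_dim, C, k, k).
conv = nn.Conv2d(C, embed_dim, kernel_size=k, stride=s, padding=p)
with torch.no_grad():
    conv.weight.copy_(proj.weight.view(embed_dim, C, k, k))
    conv.bias.copy_(proj.bias)
conv_tokens = conv(x).flatten(2).transpose(1, 2)        # (B, L, embed_dim)

print(torch.allclose(tokens, conv_tokens, atol=1e-5))   # True, up to float error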

Hi Jamin,

Good question, and your observation is interesting. However, you cannot directly merge the unfold with the qkv projection into a single convolution, because there is a LayerNorm before the qkv operation in self-attention.
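As a sketch of that ordering (a hypothetical simplification, not the exact Token_transformer/Token_performer module), the LayerNorm sits between the unfolded tokens and the kqv projection, so each token is normalized by its own statistics before the linear map; the composite is input-dependent and cannot be folded into one fixed k×k convolution.

import torch
import torch.nn as nn

# Sketch (hypothetical simplification): unfold -> LayerNorm -> linear kqv projection.
B, C, H, W = 2, 3, 56, 56
k, s, p = 7, 4, 2
embed_dim = 64

unfold = nn.Unfold(kernel_size=(k, k), stride=(s, s), padding=(p, p))
norm = nn.LayerNorm(C * k * k)
kqv = nn.Linear(C * k * k, 3 * embed_dim)

x = torch.randn(B, C, H, W)
tokens = unfold(x).transpose(1, 2)            # (B, L, C*k*k), one row per patch
normed = norm(tokens)                         # per-token normalization (input-dependent)
k_, q_, v_ = torch.split(kqv(normed), embed_dim, dim=-1)
print(k_.shape, q_.shape, v_.shape)           # each (B, L, embed_dim)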

Thanks so much for your quick reply. Yes, the LayerNorm does matter and breaks the exact equivalence.