yitu-opensource / T2T-ViT

ICCV2021, Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Question about the 'unfold' operation.

JaminFong opened this issue · comments

Hi Li,
I have just read your paper "Tokens-to-Token ViT", which proposes a very interesting and effective method, and I have a question that puzzles me.
The "unfold" operation followed by a linear layer for generating "qkv" seems equivalent to a k×k convolution. I wonder whether my understanding is correct; please correct me if I'm wrong.
Hoping for your reply.
Best regards.

self.soft_split0 = nn.Unfold(kernel_size=(7, 7), stride=(4, 4), padding=(2, 2))

k, q, v = torch.split(self.kqv(x), self.emb, dim=-1)
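For reference, here is a minimal, self-contained sketch of the observation in the question. It is not code from this repository; the batch size, image size, and projection width are illustrative assumptions. It checks that an nn.Unfold followed by a single nn.Linear produces the same tokens as a k×k nn.Conv2d carrying the reshaped Linear weight.

import torch
import torch.nn as nn

# Sketch (illustrative shapes, not the repo's code): Unfold + Linear vs. a k x k Conv2d.
B, C, H, W = 2, 3, 56, 56          # assumed toy input size
k, s, p = 7, 4, 2                  # kernel, stride, padding as in soft_split0
embed_dim = 64                     # assumed projection width

x = torch.randn(B, C, H, W)

unfold = nn.Unfold(kernel_size=(k, k), stride=(s, s), padding=(p, p))
proj = nn.Linear(C * k * k, embed_dim)

# Path 1: unfold the patches, then project each patch with the linear layer.
tokens = proj(unfold(x).transpose(1, 2))               # (B, L, embed_dim)

# Path 2: a Conv2d carrying the same weights, reshaped to (embed_dim, C, k, k).
conv = nn.Conv2d(C, embed_dim, kernel_size=k, stride=s, padding=p)
with torch.no_grad():
    conv.weight.copy_(proj.weight.view(embed_dim, C, k, k))
    conv.bias.copy_(proj.bias)
conv_tokens = conv(x).flatten(2).transpose(1, 2)        # (B, L, embed_dim)

print(torch.allclose(tokens, conv_tokens, atol=1e-5))   # True, up to float error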

Hi Jamin,

Good question, and your observation is interesting. However, you cannot directly merge the unfold with the qkv projection into a single convolution, because there is a LayerNorm before the qkv operation in self-attention.
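As a sketch of that ordering (a hypothetical simplification, not the exact Token_transformer/Token_performer module), the LayerNorm sits between the unfolded tokens and the kqv projection, so each token is normalized by its own statistics before the linear map; the composite is input-dependent and cannot be folded into one fixed k×k convolution.

import torch
import torch.nn as nn

# Sketch (hypothetical simplification): unfold -> LayerNorm -> linear kqv projection.
B, C, H, W = 2, 3, 56, 56
k, s, p = 7, 4, 2
embed_dim = 64

unfold = nn.Unfold(kernel_size=(k, k), stride=(s, s), padding=(p, p))
norm = nn.LayerNorm(C * k * k)
kqv = nn.Linear(C * k * k, 3 * embed_dim)

x = torch.randn(B, C, H, W)
tokens = unfold(x).transpose(1, 2)            # (B, L, C*k*k), one row per patch
normed = norm(tokens)                         # per-token normalization (input-dependent)
k_, q_, v_ = torch.split(kqv(normed), embed_dim, dim=-1)
print(k_.shape, q_.shape, v_.shape)           # each (B, L, embed_dim)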

Thanks so much for your quick reply. Yes, the LayerNorm does matter and breaks the exact equivalence.