yitu-opensource / T2T-ViT

ICCV2021, Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

position embedding

jingliang95 opened this issue · comments

Hi authors.
I think the sentence "we concatenate a class token to it and then add Sinusoidal Position Embedding (PE) to it, the same as ViT to do classification" in your paper is confusing. In ViT, the position embedding is learnable, while your method fixes the position embedding as sinusoidal (please correct me if I am wrong). So the phrase "the same as ViT to do classification" is confusing. I think you mean that adding a class token and a position embedding is similar to ViT. Maybe you can clarify this in your paper.

Regarding the position embedding: when fine-tuning with a different image size (512×512), does simply changing the length of the position embedding work? If the length is changed, the position embedding will be completely different from pretraining, since the pretrained position embedding can no longer be loaded. Am I correct? Thanks in advance.

Hi,
About "we concatenate a class token to it and then add Sinusoidal Position Embedding (PE) to it, the same as ViT to do classification": the original ViT paper considers several different position embeddings: 1. Sinusoidal Position Embedding (PE); 2. 1D or 2D learned parameters; 3. relative PE. They tried all of these in the original paper, which you can check.

About the position embedding size: we will release our code for interpolating the pretrained position embedding very soon.
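
Until that code is released, a minimal sketch of such an interpolation in PyTorch might look like the following. The helper name `resize_pos_embed`, the class-token-first layout, and the 14×14 → 32×32 grid sizes (224×224 vs. 512×512 input with a 16× downsampling) are assumptions for illustration, not the repository's confirmed implementation:

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, old_grid: int, new_grid: int) -> torch.Tensor:
    """Interpolate a (1, 1 + old_grid**2, dim) position embedding, with the
    class token stored first, to a new patch grid size.

    Hypothetical helper; check the official T2T-ViT release for the exact code.
    """
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]  # class token is not resized
    dim = patch_pe.shape[-1]
    # (1, N, dim) -> (1, dim, H, W) so F.interpolate can treat it as a 2-D map.
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(
        patch_pe, size=(new_grid, new_grid), mode="bicubic", align_corners=False
    )
    # (1, dim, H', W') -> (1, H'*W', dim) and reattach the class token.
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pe, patch_pe], dim=1)

# Example: resize a 224x224 checkpoint's PE (14x14 grid) for 512x512 input (32x32 grid).
pe_224 = torch.randn(1, 1 + 14 * 14, 384)
pe_512 = resize_pos_embed(pe_224, old_grid=14, new_grid=32)
```

Interpolating this way treats the patch position embeddings as a 2-D feature map, which keeps the pretrained spatial structure instead of discarding the embedding and training a new one from scratch.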