Why does the code use a pretrained ViT when the paper does not describe such an implementation at all?

Question

dqj5182 opened this issue 2 years ago · comments

Also, there seems to be only one positional encoding in the code, while the figure shows two (one spatial and one temporal).
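
To make the second point concrete, here is a minimal sketch of the difference as I understand it. It is not taken from the repository or the paper: the function names, the `(T, N, D)` token layout, and the use of NumPy with random arrays in place of learned parameters are all assumptions made for illustration.

```python
import numpy as np

def add_separate_pos_embeddings(tokens, spatial_pos, temporal_pos):
    """Hypothetical 'figure' variant: two positional encodings.

    tokens:       (T, N, D) = frames x patches per frame x embedding dim
    spatial_pos:  (N, D), shared across all frames
    temporal_pos: (T, D), shared across all patches of a frame
    """
    # Broadcasting adds the spatial table to every frame and the
    # temporal table to every patch position.
    return tokens + spatial_pos[None, :, :] + temporal_pos[:, None, :]

def add_single_pos_embedding(tokens, pos):
    """Hypothetical 'code' variant: one table over the flattened sequence.

    tokens: (T, N, D)
    pos:    (T * N, D)
    """
    T, N, D = tokens.shape
    return (tokens.reshape(T * N, D) + pos).reshape(T, N, D)

# Random stand-ins for learned parameters (assumed sizes).
rng = np.random.default_rng(0)
T, N, D = 4, 6, 8
tokens = rng.normal(size=(T, N, D))
spatial_pos = rng.normal(size=(N, D))
temporal_pos = rng.normal(size=(T, D))
single_pos = rng.normal(size=(T * N, D))

out_two = add_separate_pos_embeddings(tokens, spatial_pos, temporal_pos)
out_one = add_single_pos_embedding(tokens, single_pos)
```

Both variants return tokens of the same shape; the difference is whether position is factored into a per-patch and a per-frame term or stored as one table. Which of the two the paper intends is exactly what I am asking.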