Why does the code use a pretrained ViT when the paper does not describe such an implementation at all?
dqj5182 opened this issue
Daniel Sungho Jung commented
Also, there seems to be only one positional encoding in the code, while the figure shows two (one spatial and one temporal).