ziniuwan / maed

[ICCV 2021] Encoder-decoder with Multi-level Attention for 3D Human Shape and Pose Estimation

Why does the code use a pretrained ViT when the paper does not describe this at all?

dqj5182 opened this issue

Also, there seems to be only one positional encoding in the code, while the figure shows two (one spatial and one temporal).
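
To make the distinction concrete, here is a minimal sketch of the two designs being compared. It is not taken from this repository or from the paper. The sizes `T`, `N`, `D` and the helper names `make_table`, `add_joint`, and `add_factorized` are hypothetical, and the random tables only stand in for learned weights. It shows one positional table indexed by every (frame, patch) position next to separate spatial and temporal tables that are summed per token.

```python
import random

def make_table(rows, dim, seed):
    """Return a rows x dim table of small random values (stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.02, 0.02) for _ in range(dim)] for _ in range(rows)]

def add_joint(tokens, table):
    """Single encoding: one row per (frame, patch) position, so the table has T*N rows."""
    T, N = len(tokens), len(tokens[0])
    return [[[x + p for x, p in zip(tokens[t][n], table[t * N + n])]
             for n in range(N)] for t in range(T)]

def add_factorized(tokens, spatial, temporal):
    """Two encodings: a spatial table (N rows) and a temporal table (T rows), summed per token."""
    T, N = len(tokens), len(tokens[0])
    return [[[x + s + e for x, s, e in zip(tokens[t][n], spatial[n], temporal[t])]
             for n in range(N)] for t in range(T)]

# Hypothetical sizes: frames, patches per frame, channels per token.
T, N, D = 4, 6, 8
tokens = [[[0.0] * D for _ in range(N)] for _ in range(T)]

joint = make_table(T * N, D, seed=0)
spatial = make_table(N, D, seed=1)
temporal = make_table(T, D, seed=2)

out_joint = add_joint(tokens, joint)
out_fact = add_factorized(tokens, spatial, temporal)

# Number of positional values each design stores.
joint_params = T * N * D
fact_params = (N + T) * D
print(joint_params, fact_params)
```

Under these assumptions, both variants return tokens of the same shape, so they cannot be told apart from output shapes alone. The difference is in how many positional tables are defined and how they are indexed, which is what to look for in the code.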