deepglint / unicom

[ICLR 2023] Unicom: Universal and Compact Representation Learning for Image Retrieval

Home Page: https://arxiv.org/pdf/2304.05884.pdf

Flattened features of ViT

Alir3za97 opened this issue

The paper states (Section 4.1) that all experiments use the same architecture designs as CLIP, but after checking the code I noticed that:
1 - There is no cls_token embedding in the unicom ViT models.
2 - The output features of the ViTs are neither pooled features of the blocks nor the cls_token; the patch tokens are flattened and then passed to MLP layers.
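
To make the difference concrete, here is a minimal NumPy sketch of the three output heads being contrasted: a CLIP-style class-token head, a mean-pooled head, and a flatten-then-project head. This is not the unicom code. All sizes, the function names, and the single linear projection standing in for the MLP layers are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, for illustration only (not taken from the repo).
batch, num_patches, width, out_dim = 2, 49, 64, 32

# Stand-in for the last transformer block's output: one vector per patch.
patch_tokens = rng.standard_normal((batch, num_patches, width))

# A CLIP-style model also carries a class token at sequence index 0.
cls_token = rng.standard_normal((batch, 1, width))
tokens_with_cls = np.concatenate([cls_token, patch_tokens], axis=1)


def cls_head(tokens, proj):
    """CLIP-style head: keep only the class token, then project it."""
    return tokens[:, 0, :] @ proj


def mean_pool_head(tokens, proj):
    """Pooled head: average over the patch tokens, then project."""
    return tokens.mean(axis=1) @ proj


def flatten_head(tokens, proj):
    """Flattened head: concatenate every patch token, then project."""
    return tokens.reshape(tokens.shape[0], -1) @ proj


# The first two heads project from `width`; the flattened head projects
# from `num_patches * width`, so its weight grows with the patch count.
small_proj = rng.standard_normal((width, out_dim))
flat_proj = rng.standard_normal((num_patches * width, out_dim))

cls_out = cls_head(tokens_with_cls, small_proj)
pooled_out = mean_pool_head(patch_tokens, small_proj)
flat_out = flatten_head(patch_tokens, flat_proj)

print(cls_out.shape, pooled_out.shape, flat_out.shape)
print(small_proj.shape, flat_proj.shape)
```

All three heads produce an embedding of the same size, but the flattened head keeps per-patch spatial information and ties the projection weight to a fixed number of patches, which is why it is not interchangeable with the CLIP head.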