Flattened features of VIT
Alir3za97 opened this issue
The paper states (Section 4.1) that all experiments use the same architecture designs as CLIP, but after checking the code I noticed that:
1 - There is no cls_token embedding in the unicom ViT models,
2 - The output features of the ViTs are neither pooled features of the blocks nor the cls_token; they are flattened and then passed to MLP layers.
We adopted CLIP as the backbone and used the ArcFace design for generating the embeddings.
https://github.com/deepinsight/insightface/blob/master/recognition/arcface_mxnet/symbol/fresnet.py#L1101
https://github.com/deepinsight/insightface/blob/master/recognition/arcface_mxnet/symbol/symbol_utils.py#L78
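For reference, the linked MXNet code is the ArcFace-style embedding head: the backbone output is flattened and then passed through BN, dropout, a fully connected layer, and a final BN, rather than being pooled or read from a cls_token. Below is a minimal PyTorch sketch of how such a head would sit on top of ViT patch tokens; the class name, dimensions, and dropout rate are illustrative assumptions, not the exact unicom implementation.

```python
import torch
import torch.nn as nn


class ArcFaceStyleHead(nn.Module):
    """Sketch of an ArcFace-style 'E' embedding head on ViT patch tokens:
    flatten all token features, then BN -> Dropout -> FC -> BN.
    Hyper-parameters here are assumptions, not the exact unicom code."""

    def __init__(self, num_tokens: int, token_dim: int, embed_dim: int = 512):
        super().__init__()
        flat_dim = num_tokens * token_dim
        self.head = nn.Sequential(
            nn.BatchNorm1d(flat_dim),
            nn.Dropout(p=0.4),
            nn.Linear(flat_dim, embed_dim),
            # final BN without learnable scale, analogous to fix_gamma=True
            nn.BatchNorm1d(embed_dim, affine=False),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, token_dim) from a ViT without a cls_token
        x = tokens.flatten(1)  # flatten instead of pooling or taking a cls_token
        return self.head(x)


if __name__ == "__main__":
    # e.g. a ViT-B/32 on 224x224 inputs yields 7x7 = 49 patch tokens of width 768
    head = ArcFaceStyleHead(num_tokens=49, token_dim=768, embed_dim=512)
    feats = head(torch.randn(4, 49, 768))
    print(feats.shape)  # torch.Size([4, 512])
```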