Flattened features of VIT
Alir3za97 opened this issue
The paper states (Section 4.1) that all experiments use the same architecture designs as CLIP, but after checking the code I noticed that:
1 - There is no cls_token embedding in the unicom ViT models,
2 - The output features of the ViTs are neither pooled features of the blocks nor the cls_token; they are flattened and then passed to MLP layers.
We adopted CLIP as the backbone and used the ArcFace design for generating the embeddings.
https://github.com/deepinsight/insightface/blob/master/recognition/arcface_mxnet/symbol/fresnet.py#L1101
https://github.com/deepinsight/insightface/blob/master/recognition/arcface_mxnet/symbol/symbol_utils.py#L78
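For reference, the linked MXNet code is the ArcFace-style embedding head: the backbone output is flattened and then passed through BN, dropout, a fully connected layer, and a final BN, rather than being pooled or read from a cls_token. Below is a minimal PyTorch sketch of how such a head would sit on top of ViT patch tokens; the class name, dimensions, and dropout rate are illustrative assumptions, not the exact unicom implementation.

```python
import torch
import torch.nn as nn


class ArcFaceStyleHead(nn.Module):
    """Sketch of an ArcFace-style 'E' embedding head on ViT patch tokens:
    flatten all token features, then BN -> Dropout -> FC -> BN.
    Hyper-parameters here are assumptions, not the exact unicom code."""

    def __init__(self, num_tokens: int, token_dim: int, embed_dim: int = 512):
        super().__init__()
        flat_dim = num_tokens * token_dim
        self.head = nn.Sequential(
            nn.BatchNorm1d(flat_dim),
            nn.Dropout(p=0.4),
            nn.Linear(flat_dim, embed_dim),
            # final BN without learnable scale, analogous to fix_gamma=True
            nn.BatchNorm1d(embed_dim, affine=False),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, token_dim) from a ViT without a cls_token
        x = tokens.flatten(1)  # flatten instead of pooling or taking a cls_token
        return self.head(x)


if __name__ == "__main__":
    # e.g. a ViT-B/32 on 224x224 inputs yields 7x7 = 49 patch tokens of width 768
    head = ArcFaceStyleHead(num_tokens=49, token_dim=768, embed_dim=512)
    feats = head(torch.randn(4, 49, 768))
    print(feats.shape)  # torch.Size([4, 512])
```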