deepglint / unicom

[ICLR 2023] Unicom: Universal and Compact Representation Learning for Image Retrieval

Home Page: https://arxiv.org/pdf/2304.05884.pdf

Substantially more parameters than OpenCLIP, SWAG and Timm's ViT models.

yash0307 opened this issue

Hello,

Many thanks for sharing your interesting work. I noticed that the projection head of your models is substantially bigger than those of SWAG (Singh et al., CVPR 2022), the OpenCLIP models, and Timm's ViT implementation used in Recall@k Surrogate (Patel et al., CVPR 2022). I ran a quick parameter count for these models following the RS@k setup, i.e., with a layer norm and a linear projection. Here are the counts (a counting sketch follows the list):

ViT-B/32 Timm: 87850496
ViT-B/32 CLIP: 87849728
ViT-B/32 UNICOM: 117118464
ViT-B/16 Timm: 86193920
ViT-B/16 CLIP: 86193152
ViT-B/16 UNICOM: 202363136
ViT-B/16 SWAG: 86193920
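
For reference, here is a minimal counting sketch in the spirit of the setup described above; the timm model name, the 512-d embedding size, and the exact head layout (LayerNorm followed by a Linear projection) are assumptions for illustration, not the actual RS@k code.

import timm
import torch

# Build a ViT-B/32 backbone without a classifier head (assumed timm model name).
backbone = timm.create_model("vit_base_patch32_224", pretrained=False, num_classes=0)
embed_dim = backbone.num_features  # 768 for ViT-B

# Attach a LayerNorm + Linear projection head (512-d embedding is an assumption).
head = torch.nn.Sequential(
    torch.nn.LayerNorm(embed_dim),
    torch.nn.Linear(embed_dim, 512),
)
model = torch.nn.Sequential(backbone, head)

# Total parameter count.
print(sum(p.numel() for p in model.parameters()))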

It is clear that the UNICOM models have a substantially higher number of parameters than the baselines used for comparison. With this in mind, are the comparisons fair at all?

Greetings, and thank you for your interest in our work.

  1. The projection head structure used in our ViT model follows ArcFace (a rough sketch is given after this list):
    https://github.com/deepinsight/insightface/blob/master/recognition/arcface_mxnet/symbol/fresnet.py#L1101
    https://github.com/deepinsight/insightface/blob/master/recognition/arcface_mxnet/symbol/symbol_utils.py#L78
  2. We will shortly update the experimental results on GitHub with a projection head structure similar to CLIP's.
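
For readers unfamiliar with that head, below is a minimal PyTorch sketch of the ArcFace-style output block the links above point to (BatchNorm, Dropout, fully connected layer, BatchNorm); the input dimension, 512-d embedding size, and dropout rate are assumptions for illustration, not the exact UNICOM configuration.

import torch.nn as nn

class ArcFaceStyleHead(nn.Module):
    # BN -> Dropout -> FC -> BN over flattened backbone features.
    def __init__(self, in_features, embed_dim=512, dropout=0.4):
        super().__init__()
        self.bn1 = nn.BatchNorm1d(in_features)
        self.drop = nn.Dropout(dropout)
        self.fc = nn.Linear(in_features, embed_dim)
        self.bn2 = nn.BatchNorm1d(embed_dim)

    def forward(self, x):  # x: (batch, in_features)
        return self.bn2(self.fc(self.drop(self.bn1(x))))

In the linked ArcFace code the fully connected layer is applied to the flattened feature map rather than a pooled vector, which is why a head of this kind can carry far more parameters than a plain LayerNorm + Linear projection.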

Thank you for the prompt reply. Looking forward to the new results.

This is the performance of the model we trained using the same ViT architecture as CLIP.

Results

        CUB   Cars  SOP   In-Shop  iNat
unicom  83.7  95.9  70.0  72.8     64.6
new     83.4  95.5  71.0  75.0     64.9

Model

This is the model file:
https://drive.google.com/file/d/1dSrWAmoPqr8d9oB1wggnZfgHnBdre2wa/view?usp=sharing

Usage

You can use it like this:

import clip
import torch

# Load the CLIP ViT-B/32 architecture and keep only the visual encoder.
model, transform = clip.load("ViT-B/32", "cpu")
model = model.visual

# Load the released weights into the visual encoder.
state_dict = torch.load("ViT-B-32.pt", "cpu")
model.load_state_dict(state_dict, strict=True)
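
As a follow-up sketch continuing from the snippet above (not from the thread itself), the loaded visual encoder can then be used to extract retrieval embeddings; the image path is a placeholder, and the transform returned by clip.load is assumed to provide the right preprocessing.

from PIL import Image
import torch

# Hypothetical usage: embed a single image and L2-normalize for cosine retrieval.
image = transform(Image.open("query.jpg")).unsqueeze(0)
with torch.no_grad():
    embedding = model(image)  # (1, 512) for ViT-B/32
    embedding = embedding / embedding.norm(dim=-1, keepdim=True)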

Hi, the model file cannot be downloaded due to its permission settings. Could you open the permissions? Thank you very much.

We have updated it.