deepglint / unicom

[ICLR 2023] Unicom: Universal and Compact Representation Learning for Image Retrieval

Home Page: https://arxiv.org/pdf/2304.05884.pdf

Substantially more parameters than OpenCLIP, SWAG and Timm's ViT models.

yash0307 opened this issue

Hello,

Many thanks for sharing your interesting work. I noticed that the projection head of your models is substantially bigger than those of SWAG (Singh et al., CVPR 2022), the OpenCLIP models, and Timm's ViT implementation used in Recall@k Surrogate (Patel et al., CVPR 2022). I ran a quick parameter count for these models following the RS@k setup, i.e., with a layer norm and a linear projection. Here are the counts (a counting sketch follows the list):

ViT-B/32 Timm: 87850496
ViT-B/32 CLIP: 87849728
ViT-B/32 UNICOM: 117118464
ViT-B/16 Timm: 86193920
ViT-B/16 CLIP: 86193152
ViT-B/16 UNICOM: 202363136
ViT-B/16 SWAG: 86193920
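
For reference, here is a minimal counting sketch in the spirit of the setup described above; the timm model name, the 512-d embedding size, and the exact head layout (LayerNorm followed by a Linear projection) are assumptions for illustration, not the actual RS@k code.

import timm
import torch

# Build a ViT-B/32 backbone without a classifier head (assumed timm model name).
backbone = timm.create_model("vit_base_patch32_224", pretrained=False, num_classes=0)
embed_dim = backbone.num_features  # 768 for ViT-B

# Attach a LayerNorm + Linear projection head (512-d embedding is an assumption).
head = torch.nn.Sequential(
    torch.nn.LayerNorm(embed_dim),
    torch.nn.Linear(embed_dim, 512),
)
model = torch.nn.Sequential(backbone, head)

# Total parameter count.
print(sum(p.numel() for p in model.parameters()))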

It is clear that the UNICOM models have a substantially higher number of parameters than the baselines used for comparison. With this in mind, are the comparisons fair at all?

Greetings, and thank you for your interest in our work.

  1. The projection head structure used in our ViT model follows ArcFace (a rough sketch is given after this list):
    https://github.com/deepinsight/insightface/blob/master/recognition/arcface_mxnet/symbol/fresnet.py#L1101
    https://github.com/deepinsight/insightface/blob/master/recognition/arcface_mxnet/symbol/symbol_utils.py#L78
  2. We will shortly update the experimental results on GitHub with a projection head structure similar to CLIP's.
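
For readers unfamiliar with that head, below is a minimal PyTorch sketch of the ArcFace-style output block the links above point to (BatchNorm, Dropout, fully connected layer, BatchNorm); the input dimension, 512-d embedding size, and dropout rate are assumptions for illustration, not the exact UNICOM configuration.

import torch.nn as nn

class ArcFaceStyleHead(nn.Module):
    # BN -> Dropout -> FC -> BN over flattened backbone features.
    def __init__(self, in_features, embed_dim=512, dropout=0.4):
        super().__init__()
        self.bn1 = nn.BatchNorm1d(in_features)
        self.drop = nn.Dropout(dropout)
        self.fc = nn.Linear(in_features, embed_dim)
        self.bn2 = nn.BatchNorm1d(embed_dim)

    def forward(self, x):  # x: (batch, in_features)
        return self.bn2(self.fc(self.drop(self.bn1(x))))

In the linked ArcFace code the fully connected layer is applied to the flattened feature map rather than a pooled vector, which is why a head of this kind can carry far more parameters than a plain LayerNorm + Linear projection.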

Thank you for the prompt reply. Looking forward to the new results.

This is the performance of the model we trained using the same ViT architecture as CLIP.

Results

        CUB   Cars  SOP   In-Shop  iNat
unicom  83.7  95.9  70.0  72.8     64.6
new     83.4  95.5  71.0  75.0     64.9

Model

This is the model file:
https://drive.google.com/file/d/1dSrWAmoPqr8d9oB1wggnZfgHnBdre2wa/view?usp=sharing

Usage

You can use it like this:

import clip
import torch

# Load the CLIP ViT-B/32 architecture and keep only the visual encoder.
model, transform = clip.load("ViT-B/32", "cpu")
model = model.visual

# Load the released weights into the visual encoder.
state_dict = torch.load("ViT-B-32.pt", "cpu")
model.load_state_dict(state_dict, strict=True)
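
As a follow-up sketch continuing from the snippet above (not from the thread itself), the loaded visual encoder can then be used to extract retrieval embeddings; the image path is a placeholder, and the transform returned by clip.load is assumed to provide the right preprocessing.

from PIL import Image
import torch

# Hypothetical usage: embed a single image and L2-normalize for cosine retrieval.
image = transform(Image.open("query.jpg")).unsqueeze(0)
with torch.no_grad():
    embedding = model(image)  # (1, 512) for ViT-B/32
    embedding = embedding / embedding.norm(dim=-1, keepdim=True)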

Hi, the model file cannot be downloaded due to its permission settings. Could you open the permissions? Thank you very much.

We have updated it.