KaiyangZhou / CoOp

Prompt Learning for Vision-Language Models (IJCV'22, CVPR'22)


Important changes made to Dassl's transforms.py

KaiyangZhou opened this issue

You might find that OpenAI's code produces around 59% accuracy for zero-shot CLIP (vision_model=RN50) on ImageNet with prompt ensembling, while CoOp's code gives only 57.81% for the same model (see Table 7 in the paper).

This difference is caused by different transforms: OpenAI's code applies Resize(224) to an image while CoOp's code (the previous version) uses Resize((224, 224)); the former keeps the image's aspect ratio while the latter does not. To make the results produced by CoOp's code comparable to OpenAI's, we have made our transforms consistent with theirs, so the transforms in the config files have been changed from ["random_flip", "random_translation", "center_crop", "normalize"] to ["random_resized_crop", "random_flip", "normalize"].
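
To see the difference concretely, here is a minimal sketch (assuming torchvision's transforms API; the input image is hypothetical) of the two resize behaviors on a non-square image: Resize((224, 224)) forces a 224x224 output and distorts the aspect ratio, whereas Resize(224) scales the shorter side and needs a subsequent CenterCrop(224) to yield a square input.

```python
# Minimal sketch (assumes torchvision is installed) of the two resize variants.
from PIL import Image
from torchvision import transforms

img = Image.new("RGB", (640, 480))  # hypothetical 4:3 input

# Previous CoOp preprocessing: forces a 224x224 output, distorting the aspect ratio.
old_resize = transforms.Resize((224, 224))
print(old_resize(img).size)  # (224, 224)

# OpenAI-style preprocessing: scales the shorter side to 224, keeping the aspect
# ratio, then center-crops to obtain a square input.
new_resize = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
])
print(new_resize(img).size)  # (224, 224), without distortion
```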

If you are using our Dassl-based CoOp code, please update the code to the latest version. If you want to use your own code, you can simply copy CoOp's model code (i.e. CustomCLIP) and make the comparison on equal footing with whatever pipelines you are using.
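
If you go the own-pipeline route, the sketch below shows a test-time transform intended to match OpenAI's CLIP preprocessing (aspect-ratio-preserving resize plus center crop); the normalization statistics are the ones published in OpenAI's CLIP repository, and the torchvision usage is our own illustration rather than a snippet taken from CoOp's code.

```python
# Hedged sketch of a test-time pipeline mirroring OpenAI's CLIP preprocessing.
from torchvision import transforms
from torchvision.transforms import InterpolationMode

clip_eval_transform = transforms.Compose([
    transforms.Resize(224, interpolation=InterpolationMode.BICUBIC),  # keep aspect ratio
    transforms.CenterCrop(224),                                       # square 224x224 crop
    transforms.ToTensor(),
    transforms.Normalize(                                             # CLIP's published stats
        mean=(0.48145466, 0.4578275, 0.40821073),
        std=(0.26862954, 0.26130258, 0.27577711),
    ),
])
```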

For your reference, we have rerun the experiments using the new config files and show below the comparison with Table 7's results.

Previous version

| Method | RN50 | RN101 | ViT-B/32 | ViT-B/16 |
| --- | --- | --- | --- | --- |
| Prompt engineering | 55.41 | 58.72 | 59.88 | 64.71 |
| Prompt ensembling | 57.81 | 60.49 | 62.01 | 67.31 |
| CoOp | 60.46 | 64.39 | 64.92 | 70.13 |

Current version

| Method | RN50 | RN101 | ViT-B/32 | ViT-B/16 |
| --- | --- | --- | --- | --- |
| Prompt engineering | 58.18 | 61.26 | 62.05 | 66.73 |
| Prompt ensembling | 60.41 | 62.54 | 63.71 | 68.74 |
| CoOp | 62.95 | 66.60 | 66.85 | 71.92 |