why not use self.clip.transformer when training?
adda1221 commented
Hi, thank you for your exciting work! I noticed that you use self.clip.transformer for processing images during the inference stage. However, during the training stage, image processing is done with torchvision. Are there any differences between these two methods? Thanks for your reply!
Jiaming Han commented
Hi @adda1221, we do not use clip.transformer (the text encoder of CLIP) during inference. In the training stage, we use simple torchvision transforms, which are the same as CLIP's transforms.