Why not use self.clip.transformer when training?

Question

adda1221 opened this issue 8 months ago

Hi, thank you for your exciting work! I noticed that you use self.clip.transformer for processing images during the inference stage. However, during the training stage, image processing is accomplished using torchvision. Are there any differences between these two methods? Thanks for your reply!

Jiaming Han · Answer 1 · Wed Sep 20 2023 18:53:38 GMT+0800 (China Standard Time)

Hi @adda1221, we do not use clip.transformer (the text encoder of CLIP) during inference. In the training stage, we use simple torchvision transforms, which are the same as CLIP's transforms.