dandelin / ViLT

Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"

Question about image transformation: short edge is still 384 for the fine-tuning task?

Jxu-Thu opened this issue

Thanks for the great code!
I read your paper carefully.

(From your paper) "We resize the shorter edge of input images to 384 and limit the longer edge to under 640 while preserving the aspect ratio. This resizing scheme is also used during object detection in other VLP models, but with a larger size of the shorter edge (800). Patch projection of ViLT-B/32 yields 12 × 20 = 240 patches for an image with a resolution of 384 × 640."
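The arithmetic in the quoted passage can be checked with a small sketch. The helper names `resized_shape` and `num_patches` are hypothetical and are not taken from the ViLT code; the sketch assumes plain aspect-preserving scaling and floor division by the patch size, which may differ from how the repository rounds or pads images.

```python
def resized_shape(height, width, shorter=384, longer=640):
    """Scale so the shorter edge equals `shorter`, then shrink further
    if the longer edge would exceed `longer` (aspect ratio preserved)."""
    scale = shorter / min(height, width)
    new_h, new_w = height * scale, width * scale
    if max(new_h, new_w) > longer:
        scale = longer / max(new_h, new_w)
        new_h, new_w = new_h * scale, new_w * scale
    # Round to the nearest integer pixel size.
    return int(new_h + 0.5), int(new_w + 0.5)


def num_patches(height, width, patch_size=32):
    """Number of non-overlapping patch_size x patch_size patches
    (assumption: partial patches at the border are dropped)."""
    return (height // patch_size) * (width // patch_size)


h, w = resized_shape(480, 800)   # a 480 x 800 input image
print(h, w, num_patches(h, w))   # 384 640 240
```

With a patch size of 32, a 384 × 640 image gives 12 patches along the shorter edge and 20 along the longer edge, matching the 240 patches stated in the paper.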

However, I see that "image_size=384" is set for all downstream tasks in this code. Is that intended?

Would it affect performance on downstream tasks? A shorter edge of 800 would greatly increase the sequence length, so it would also require a smaller batch size.
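To give a rough sense of how much longer the sequence would get, here is a back-of-the-envelope sketch. The longer-edge limit of 1333 for the 800 setting is an assumption (it is a common pairing in detection pipelines, not a value stated in this thread), and `patch_count` is a hypothetical helper that ignores padding and the text tokens.

```python
def patch_count(height, width, patch_size=32):
    # Assumption: only whole patches are counted (floor division).
    return (height // patch_size) * (width // patch_size)


small = patch_count(384, 640)    # shorter edge 384, longer edge 640
large = patch_count(800, 1333)   # shorter edge 800, assumed longer edge 1333

ratio = large / small
print(small, large)              # 240 1025
print(round(ratio, 2))           # 4.27
# Self-attention cost grows roughly with the square of sequence length.
print(round(ratio ** 2, 1))      # 18.2
```

Under these assumptions the image sequence is about four times longer, which is why a smaller batch size would likely be needed to fit in the same memory.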

Yes, we use a shorter edge of 384 for downstream tasks as well.
def config() is the default configuration, and its values are used as-is unless a named config or a command-line modification overrides them.

You can check the final configuration of a run with the print_config option.
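The precedence described in this reply can be illustrated with a plain-Python sketch. It does not use the repository's configuration library; `resolve_config`, the `finetune_example` named config, and the `batch_size` values are all hypothetical, and only `image_size=384` comes from this thread.

```python
def resolve_config(defaults, named_configs=(), cli_overrides=None):
    """Merge settings in order of increasing priority:
    defaults < named configs (in the order given) < command-line overrides."""
    final = dict(defaults)
    for named in named_configs:
        final.update(named)
    final.update(cli_overrides or {})
    return final


# Defaults, standing in for the values defined in config().
defaults = {"image_size": 384, "batch_size": 256}  # batch_size is made up

# A hypothetical named config that does not touch image_size.
finetune_example = {"batch_size": 128}

# image_size stays at the default because nothing overrides it.
print(resolve_config(defaults, [finetune_example]))
# {'image_size': 384, 'batch_size': 128}

# A command-line style override takes priority over everything else.
print(resolve_config(defaults, [finetune_example], {"image_size": 800}))
# {'image_size': 800, 'batch_size': 128}
```

Printing the merged result plays the same role as inspecting the final configuration before a run: it shows which values actually take effect after all overrides.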

Thanks.