dandelin / ViLT

Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"

How to use the modal-type embedding in the output of the encoder?

leyuan-sun opened this issue 2 years ago · comments

leyuan-sun commented 2 years ago

How can I use the modal-type embedding in the output of the encoder?

rginjapan commented 2 years ago

Sorry, my question is: how can I use the modal-type embedding to tell which features in the output belong to which modality? Thanks in advance!
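One possible way to approach this is sketched below. It assumes the encoder input is the text token embeddings followed by the image patch embeddings, concatenated along the sequence axis. It also assumes the modal-type embedding is only added to the inputs, so the output carries no separate modality label. Under those assumptions, position alone identifies the modality. The names `TEXT`, `IMAGE`, `modal_type_ids`, and `split_by_modality` are hypothetical, not part of the ViLT code, and plain Python lists stand in for tensors.

```python
from typing import List, Sequence, Tuple

# Hypothetical modal-type ids: 0 for text tokens, 1 for image patches.
TEXT, IMAGE = 0, 1


def modal_type_ids(num_text: int, num_image: int) -> List[int]:
    """Rebuild the per-position modal-type ids used when the input was assembled.

    Assumes text tokens come first, followed by image patches.
    """
    return [TEXT] * num_text + [IMAGE] * num_image


def split_by_modality(
    features: Sequence[Sequence[float]], type_ids: Sequence[int]
) -> Tuple[List[Sequence[float]], List[Sequence[float]]]:
    """Split encoder output features into (text, image) using the type ids."""
    if len(features) != len(type_ids):
        raise ValueError("features and type_ids must have the same length")
    text = [f for f, t in zip(features, type_ids) if t == TEXT]
    image = [f for f, t in zip(features, type_ids) if t == IMAGE]
    return text, image


# Toy encoder output: 2 text tokens followed by 3 image patches, hidden size 2.
features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
ids = modal_type_ids(num_text=2, num_image=3)
text_feats, image_feats = split_by_modality(features, ids)
```

With real tensors the same split would be a slice along the sequence dimension at the text length. Whether that matches your setup depends on how the inputs were concatenated, so it is worth checking against the model code.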