dandelin / ViLT

Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"

small difference between paper and code about token type embedding

AAbathur opened this issue

Thanks for your paper and code; they have helped me a lot.
There is a small point that confuses me. In Section 3.1 of your paper, the text embedding consists of a word embedding, a position embedding, and a modal-type embedding.
[Screenshot of the text embedding description in Section 3.1 of the paper]

However, in the source code (vilt/modules/vilt_module.py), the text embedding is implemented with:

from transformers.models.bert.modeling_bert import BertConfig, BertEmbeddings
...
  self.text_embeddings = BertEmbeddings(bert_config)

together with an extra token type embedding:
self.token_type_embeddings = nn.Embedding(2, config["hidden_size"])
As far as I know, BertEmbeddings already applies a token type embedding internally, so there are actually two token type embeddings for the text input and only one for the image input.
I understand that self.token_type_embeddings is used as the modal-type embedding to distinguish image from text.
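To make sure I am reading the code correctly, here is a rough sketch of how I understand the two embeddings being combined (simplified; the tensor names and sizes such as text_ids and image_embeds are mine, not from the repo):

import torch
import torch.nn as nn
from transformers.models.bert.modeling_bert import BertConfig, BertEmbeddings

bert_config = BertConfig()  # default hidden_size=768
text_embeddings = BertEmbeddings(bert_config)                      # word + position + BERT token type
token_type_embeddings = nn.Embedding(2, bert_config.hidden_size)   # modal type: 0 = text, 1 = image

text_ids = torch.randint(0, bert_config.vocab_size, (1, 16))
image_embeds = torch.randn(1, 196, bert_config.hidden_size)        # patch embeddings after the linear projection

# Text tokens get BERT's internal token type embedding (always index 0 here) *and* the modal-type embedding.
text_embeds = text_embeddings(text_ids) + token_type_embeddings(torch.zeros_like(text_ids))

# Image tokens get only the modal-type embedding (index 1).
image_embeds = image_embeds + token_type_embeddings(
    torch.ones(image_embeds.shape[:2], dtype=torch.long)
)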
Is this a mistake? Is it OK not to remove the token type embedding inside BertEmbeddings(bert_config)? Does it make any difference?
Hope for your reply, thanks!

Hi @AAbathur

Yes, BertEmbeddings has its own self.token_type_embeddings, which is meant to distinguish the two sentences used for the next sentence prediction (NSP) objective.
However, since we only pass one sentence to BertEmbeddings at a time, every text token receives the same vector inside BertEmbeddings, namely self.token_type_embeddings[0].
Because all tokens are shifted by the same constant vector, you can simply think of the combined vector (self.word_embeddings + self.token_type_embeddings[0]) as a new self.word_embeddings; every word embedding is just shifted by self.token_type_embeddings[0].
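For intuition, here is a small toy check (my own illustration, not code from the repo) showing that the constant shift can be folded into the word embedding table without changing the output:

import torch
import torch.nn as nn

vocab, hidden = 100, 8
word_emb = nn.Embedding(vocab, hidden)
tok_type_emb = nn.Embedding(2, hidden)

ids = torch.randint(0, vocab, (1, 5))
zeros = torch.zeros_like(ids)  # single-sentence input: token type id is 0 everywhere

out_a = word_emb(ids) + tok_type_emb(zeros)

shifted = nn.Embedding(vocab, hidden)  # "new" word embeddings with the constant shift folded in
shifted.weight.data = word_emb.weight.data + tok_type_emb.weight.data[0]
out_b = shifted(ids)

assert torch.allclose(out_a, out_b)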

FYI, we've tested our own text embedder without its internal self.token_type_embeddings, and the result was the same.

OK, now I understand the situation. Thanks very much for your detailed reply!