small difference between paper and code about token type embedding
AAbathur opened this issue · comments
Thanks for your paper and code, it helps me a lot.
There is a small point that confuses me. In Section 3.1 of your paper, the text embedding consists of a word embedding, a position embedding, and a modal-type embedding.
However, in the source code (vilt/modules/vilt_module.py), the text embedding is implemented as:
from transformers.models.bert.modeling_bert import BertConfig, BertEmbeddings
...
self.text_embeddings = BertEmbeddings(bert_config)
plus an extra token-type embedding:
self.token_type_embeddings = nn.Embedding(2, config["hidden_size"])
As far as I know, BertEmbeddings already contains a token-type embedding operation internally, so there are actually two token-type embeddings applied to the text input, but only one token-type embedding applied to the image input.
I know the self.token_type_embeddings is used as the modal_type embedding to distinguish between image and text.
Is this a mistake? Is it OK not to remove the token-type embedding inside BertEmbeddings(bert_config)? Does it cause any difference?
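To make the question concrete, here is a minimal PyTorch sketch (a stand-in, not the actual ViLT or HuggingFace code; the table sizes and names are illustrative) of the two additions a text token receives, one from BERT's internal token-type table and one from ViLT's modal-type table:

```python
import torch
import torch.nn as nn

hidden_size, vocab_size, max_pos = 4, 10, 16

# Stand-in for BertEmbeddings: word + position + internal token-type tables.
word_emb = nn.Embedding(vocab_size, hidden_size)
pos_emb = nn.Embedding(max_pos, hidden_size)
bert_type_emb = nn.Embedding(2, hidden_size)   # BERT's internal token-type table

# Stand-in for ViLT's self.token_type_embeddings (0 = text, 1 = image).
modal_type_emb = nn.Embedding(2, hidden_size)

input_ids = torch.tensor([[1, 2, 3]])              # a single sentence
token_type_ids = torch.zeros_like(input_ids)       # all zeros: only one sentence
positions = torch.arange(input_ids.size(1)).unsqueeze(0)

# Inside BertEmbeddings: every token already gets bert_type_emb.weight[0].
text_embeds = word_emb(input_ids) + pos_emb(positions) + bert_type_emb(token_type_ids)

# ViLT then adds its own modal-type vector on top, so each text token is
# shifted by TWO type vectors: bert_type_emb.weight[0] and modal_type_emb.weight[0].
text_embeds = text_embeds + modal_type_emb(torch.zeros_like(input_ids))
```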
Looking forward to your reply, thanks!
Hi @AAbathur
Yes, BertEmbeddings has its own self.token_type_embeddings, which distinguishes the two sentences used for the next sentence prediction (NSP) objective.
However, since we only pass one sentence to BertEmbeddings at a time, every text token is shifted by the same vector inside BertEmbeddings: self.token_type_embeddings[0].
Since all tokens are shifted by the same constant vector, you can think of the combined vector (self.word_embeddings + self.token_type_embeddings[0]) as a new self.word_embeddings: all word embeddings are simply shifted by self.token_type_embeddings[0].
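This equivalence can be checked numerically with a small sketch (illustrative table sizes, not the actual ViLT code): adding a constant token-type vector after lookup gives the same result as folding that vector into the word-embedding table beforehand.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size, vocab_size = 4, 10

word_emb = nn.Embedding(vocab_size, hidden_size)
type_emb = nn.Embedding(2, hidden_size)

ids = torch.tensor([3, 1, 4])

# Path 1: what BertEmbeddings effectively does when every token_type_id is 0:
# word embedding plus the constant vector type_emb.weight[0].
out1 = word_emb(ids) + type_emb.weight[0]

# Path 2: fold the constant shift into a new word-embedding table up front.
folded = nn.Embedding(vocab_size, hidden_size)
with torch.no_grad():
    folded.weight.copy_(word_emb.weight + type_emb.weight[0])
out2 = folded(ids)

# Both paths produce identical embeddings for every token.
assert torch.allclose(out1, out2)
```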
FYI, we also tested our text embedder without its internal self.token_type_embeddings, and the results were the same.
OK, now I understand the situation. Thanks very much for your detailed reply!