microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Home Page: https://aka.ms/GeneralAI

Text-Image Matching in LayoutLMV2

fatfishZhao opened this issue

Describe
Model I am using (LayoutLMV2).

Hi, thanks for your great work on LayoutLMV2. I am trying to reproduce the pre-training process of this model, and I have two questions about Text-Image Matching (TIM).

  1. When constructing a negative sample, the paper says to "perform the same masking and covering operations to images in negative samples". So in the image, most of the text is masked by shuffled boxes and text lines are cut randomly, which I think is a very obvious visual cue. Does this make the TIM task very easy to learn?

  2. When the negative image has a different size from the positive image, how should the masks be drawn on the negative image? Do I need to resize it to the positive image's size first?

Hope someone can give me some help.

Thanks.

  1. Text-Image Matching is easier to learn than the Text-Image Alignment task.
  2. All images are resized to 224x224, so positive and negative images have the same size when the masks are drawn.
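
Since every page ends up at 224x224, one way to handle differently sized source images is to rescale each box by the same factors as its own page and draw the masks in the resized coordinate space. The following is only a minimal sketch of that idea, not the authors' pre-training code: `scale_box` and `TARGET_SIZE` are hypothetical names, and it assumes boxes are given as `(x0, y0, x1, y1)` pixel coordinates on the original page.

```python
# Hypothetical helper, not taken from the unilm repository.
TARGET_SIZE = (224, 224)  # (width, height), as stated in the reply above


def scale_box(box, src_size, dst_size=TARGET_SIZE):
    """Map an (x0, y0, x1, y1) pixel box from a page of src_size to dst_size.

    Assumption: the box is in pixel coordinates of the original page.
    """
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    sx, sy = dst_w / src_w, dst_h / src_h  # per-axis resize factors
    x0, y0, x1, y1 = box
    scaled = (round(x0 * sx), round(y0 * sy), round(x1 * sx), round(y1 * sy))
    # Clamp so that rounding never pushes a corner outside the resized page.
    limits = (dst_w, dst_h, dst_w, dst_h)
    return tuple(min(max(v, 0), lim) for v, lim in zip(scaled, limits))


# A positive page (448x896) and a negative page (1120x560) of different sizes.
positive_box = scale_box((100, 200, 200, 400), src_size=(448, 896))
negative_box = scale_box((500, 100, 1000, 200), src_size=(1120, 560))
print(positive_box, negative_box)
```

After this step both pages share one coordinate space, so the same masking and covering routine can be applied to positive and negative images once each has been resized (for example with Pillow's `Image.resize`).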