beichenzbc / Long-CLIP

[ECCV 2024] official code for "Long-CLIP: Unlocking the Long-Text Capability of CLIP"

Impressive work! Having some questions about "Long-CLIP"

psp3dcg opened this issue · comments

Hi, this is really impressive work! After reading the paper, I have some questions:

  1. How did you set the interpolation ratio λ2 in knowledge stretching to 4? Have you tried other values, and how did they perform?
  2. I noticed that you trained the model on ShareGPT4V, which contains only (long caption, image) pairs. Is it possible to train the model on a dataset that includes both long and short captions?
  3. What is the difference between "positional_embedding" and "positional_embedding_res", defined at line 254 of "longclip.py" and line 359 of "model_longclip.py"?

Looking forward to your reply, thank you!

Thanks. Here we respond to your questions.

  1. A lower ratio results in a shorter maximum input length but gives better performance, so it is a trade-off (a minimal interpolation sketch follows this list).
  2. The long captions in ShareGPT4V always provide a summary in their first sentence, e.g. `The image showcases ...`. Therefore, we take the first sentence as the short caption (see the second snippet below).
  3. It is used to keep the first 20 positional embeddings frozen during training (see the third snippet below).
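
For context, here is a minimal PyTorch sketch of the stretching idea from point 1: the first 20 positional embeddings are kept as-is, and the remaining 57 of CLIP's 77 text positions are linearly interpolated by the ratio λ2 = 4, giving 248 positions in total. The function and variable names below are illustrative, not taken from this repository.

```python
import torch
import torch.nn.functional as F

def stretch_positional_embedding(pos_emb: torch.Tensor,
                                 keep: int = 20,
                                 ratio: float = 4.0) -> torch.Tensor:
    """Keep the first `keep` positions unchanged; linearly interpolate the rest by `ratio`."""
    head = pos_emb[:keep]                  # first 20 positions, copied without change
    tail = pos_emb[keep:]                  # remaining 77 - 20 = 57 positions
    tail = tail.t().unsqueeze(0)           # (1, width, 57): treat embedding dim as channels
    tail = F.interpolate(tail, scale_factor=ratio, mode="linear", align_corners=True)
    tail = tail.squeeze(0).t()             # (57 * 4 = 228, width)
    return torch.cat([head, tail], dim=0)  # (20 + 228 = 248, width)

clip_pos_emb = torch.randn(77, 512)        # stand-in for the pretrained CLIP embedding
long_pos_emb = stretch_positional_embedding(clip_pos_emb)
print(long_pos_emb.shape)                  # torch.Size([248, 512])
```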
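
Point 2 amounts to a simple preprocessing rule. A hypothetical helper like the one below illustrates it; the exact splitting logic used in the repository may differ.

```python
def short_caption(long_caption: str) -> str:
    """Take the leading summary sentence of a ShareGPT4V long caption as the short caption."""
    return long_caption.split(".")[0].strip() + "."

print(short_caption("The image showcases a red barn in a green field. The barn has ..."))
# -> "The image showcases a red barn in a green field."
```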
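
For point 3, one simple way to get the behaviour described (first 20 positional embeddings frozen, the stretched remainder trainable) is a frozen base plus a masked trainable residual. This is only a sketch of the idea under that assumption; the actual `positional_embedding_res` in `model_longclip.py` may be wired differently.

```python
import torch

width, keep, total = 512, 20, 248
stretched = torch.randn(total, width)  # stretched positional embedding (see the first snippet)

pos_frozen = torch.nn.Parameter(stretched.clone(), requires_grad=False)  # never updated
pos_res = torch.nn.Parameter(torch.zeros(total, width))                  # trainable residual

mask = torch.ones(total, 1)
mask[:keep] = 0.0  # block any residual contribution for positions 0..19

# Effective positional embedding used in the text encoder's forward pass:
pos_emb = pos_frozen + mask * pos_res
```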

OK, thank you for the quick reply!