beichenzbc / Long-CLIP

[ECCV 2024] official code for "Long-CLIP: Unlocking the Long-Text Capability of CLIP"

Impressive work! Having some questions about "Long-CLIP"

psp3dcg opened this issue · comments

Hi, this is really impressive work! After reading the paper, I have some questions:

  1. How did you set the interpolation ratio λ2 in knowledge stretching to 4? Have you tried other values, and how did they perform?
  2. I noticed that you trained the model on ShareGPT4V, which contains only (long caption, image) pairs. Is it possible to train the model on a dataset that includes both long and short captions?
  3. What is the difference between "positional_embedding" and "positional_embedding_res", defined at line 254 of "longclip.py" and line 359 of "model_longclip.py"?

Looking forward to your reply, thank you!

Thanks. Here we respond to your questions.

  1. A lower ratio results in a shorter maximum input length but gives better performance, so it is a trade-off (a minimal interpolation sketch follows this list).
  2. The long captions in ShareGPT4V always provide a summary in their first sentence, e.g. `The image showcases ...`. Therefore, we take the first sentence as the short caption (see the second snippet below).
  3. It is used to keep the first 20 positional embeddings frozen during training (see the third snippet below).
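
For context, here is a minimal PyTorch sketch of the stretching idea from point 1: the first 20 positional embeddings are kept as-is, and the remaining 57 of CLIP's 77 text positions are linearly interpolated by the ratio λ2 = 4, giving 248 positions in total. The function and variable names below are illustrative, not taken from this repository.

```python
import torch
import torch.nn.functional as F

def stretch_positional_embedding(pos_emb: torch.Tensor,
                                 keep: int = 20,
                                 ratio: float = 4.0) -> torch.Tensor:
    """Keep the first `keep` positions unchanged; linearly interpolate the rest by `ratio`."""
    head = pos_emb[:keep]                  # first 20 positions, copied without change
    tail = pos_emb[keep:]                  # remaining 77 - 20 = 57 positions
    tail = tail.t().unsqueeze(0)           # (1, width, 57): treat embedding dim as channels
    tail = F.interpolate(tail, scale_factor=ratio, mode="linear", align_corners=True)
    tail = tail.squeeze(0).t()             # (57 * 4 = 228, width)
    return torch.cat([head, tail], dim=0)  # (20 + 228 = 248, width)

clip_pos_emb = torch.randn(77, 512)        # stand-in for the pretrained CLIP embedding
long_pos_emb = stretch_positional_embedding(clip_pos_emb)
print(long_pos_emb.shape)                  # torch.Size([248, 512])
```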
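
Point 2 amounts to a simple preprocessing rule. A hypothetical helper like the one below illustrates it; the exact splitting logic used in the repository may differ.

```python
def short_caption(long_caption: str) -> str:
    """Take the leading summary sentence of a ShareGPT4V long caption as the short caption."""
    return long_caption.split(".")[0].strip() + "."

print(short_caption("The image showcases a red barn in a green field. The barn has ..."))
# -> "The image showcases a red barn in a green field."
```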
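
For point 3, one simple way to get the behaviour described (first 20 positional embeddings frozen, the stretched remainder trainable) is a frozen base plus a masked trainable residual. This is only a sketch of the idea under that assumption; the actual `positional_embedding_res` in `model_longclip.py` may be wired differently.

```python
import torch

width, keep, total = 512, 20, 248
stretched = torch.randn(total, width)  # stretched positional embedding (see the first snippet)

pos_frozen = torch.nn.Parameter(stretched.clone(), requires_grad=False)  # never updated
pos_res = torch.nn.Parameter(torch.zeros(total, width))                  # trainable residual

mask = torch.ones(total, 1)
mask[:keep] = 0.0  # block any residual contribution for positions 0..19

# Effective positional embedding used in the text encoder's forward pass:
pos_emb = pos_frozen + mask * pos_res
```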

OK, thank you for the quick reply!