Questions about your training procedure?

Question

GYslchen opened this issue 2 years ago

As I understand it, you use image-text pairs as input and only bounding-box annotations as supervision, without any class labels. Is that right?

Muhammad Maaz · Answer 1 · Mon Dec 05 2022 19:41:37 GMT+0800 (China Standard Time)

Hi @GYslchen,

Thank you for your interest in our work. We use aligned image-text pairs to pretrain our MAVL model. Similar to MDETR, MAVL uses a soft-token alignment loss during pretraining: for each detected object, the model predicts a uniform probability distribution over the text tokens that refer to that object. Please refer to Sec. 2.2.2 and Appendix A of the MDETR paper for more details.
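To make the soft-token idea concrete, here is a minimal sketch in plain Python. It is not the MAVL or MDETR code: the function names `soft_token_target` and `soft_token_loss`, the whitespace tokenization, and the token spans are all assumptions made for illustration. It also leaves out the extra "no object" slot that MDETR reserves for unmatched queries.

```python
import math

def soft_token_target(token_spans, num_tokens):
    """Build a uniform target over the token positions that refer to one object.

    token_spans: list of (start, end) token index ranges, end exclusive (hypothetical format).
    """
    positions = sorted({i for start, end in token_spans for i in range(start, end)})
    if not positions:
        raise ValueError("the object must be grounded in at least one token")
    target = [0.0] * num_tokens
    for i in positions:
        target[i] = 1.0 / len(positions)
    return target

def soft_token_loss(logits, target):
    """Soft cross-entropy between softmax(logits) and the target distribution."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for x, t in zip(logits, target))

# Hypothetical caption, split on whitespace for illustration only.
caption = "a black cat on a sofa".split()
# Assume one box is annotated with the phrase "a black cat" (tokens 0..2).
target = soft_token_target([(0, 3)], num_tokens=len(caption))
# Untrained (all-zero) logits for the query matched to that box.
loss = soft_token_loss([0.0] * len(caption), target)
```

The supervision in this sketch is the box plus the span of caption tokens it is aligned with, so no fixed class vocabulary is needed; the "label" is a distribution over positions in the text.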

I hope this helps. Please let me know if you have any further questions. Thanks!