Why is only the first dimension of the image encoding used?
Yuancheng-Xu opened this issue
Hi!
I have a question about how the image embedding is used in the code here: https://github.com/yunqing-me/AttackVLM/blob/e5806028d490846c76375342ddfd779d197111ae/MiniGPT-4/_train_adv_img_trans.py#L132C13-L132C62

In particular, `tgt_image_features = (tgt_image_features)[:,0,:]`. Why is this step necessary? Isn't `chat.forward_encoder(image_tgt)` already the image embedding? I wonder why we have to extract the image embedding using `[:,0,:]`. Thank you!
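For context on what the slice itself does, here is a minimal sketch. The shapes are illustrative assumptions (a ViT-style encoder typically returns a `[batch, num_tokens, embed_dim]` sequence of token embeddings, where the first token often serves as a CLS/global summary token); the actual shapes in AttackVLM may differ:

```python
import numpy as np

# Assumed (hypothetical) shapes for illustration: the encoder output
# is a [batch, num_tokens, embed_dim] array of per-token embeddings.
batch, num_tokens, embed_dim = 2, 257, 768
tgt_image_features = np.random.randn(batch, num_tokens, embed_dim)

# The slice in question keeps only the first token for each image,
# dropping the other tokens and collapsing the sequence dimension:
first_token = tgt_image_features[:, 0, :]
print(first_token.shape)  # -> (2, 768), one vector per image
```

So the full encoder output is an embedding per token, while `[:, 0, :]` reduces it to a single vector per image, which is convenient when computing a scalar similarity between images.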