yunqing-me / AttackVLM

[NeurIPS-2023] Annual Conference on Neural Information Processing Systems

Home Page: https://arxiv.org/pdf/2305.16934.pdf

Why only using the first dimension of the image encoding?

Yuancheng-Xu opened this issue · comments

Hi!

I have a question about how the image embedding is used in the code here: https://github.com/yunqing-me/AttackVLM/blob/e5806028d490846c76375342ddfd779d197111ae/MiniGPT-4/_train_adv_img_trans.py#L132C13-L132C62

In particular, `tgt_image_features = (tgt_image_features)[:, 0, :]`. Why is this step necessary? Isn't `chat.forward_encoder(image_tgt)` already the image embedding? I wonder why we have to extract the image embedding using `[:, 0, :]`. Thank you!
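For context, a minimal shape sketch of what `[:, 0, :]` does, assuming the encoder is ViT-style and returns a tensor of shape `(batch, num_tokens, dim)` where token 0 is the `[CLS]` token (the shapes and names below are illustrative, not taken from the repo):

```python
import numpy as np

# Hypothetical ViT-style encoder output: (batch, tokens, dim).
# By convention, token 0 is the [CLS] token summarizing the whole image;
# the remaining tokens correspond to image patches.
batch, tokens, dim = 2, 257, 768  # e.g. 256 patch tokens + 1 CLS token
tgt_image_features = np.random.randn(batch, tokens, dim)

# [:, 0, :] keeps only the CLS token for each image -> shape (batch, dim),
# i.e. one global feature vector per image instead of per-patch features.
cls_features = tgt_image_features[:, 0, :]
print(cls_features.shape)  # (2, 768)
```

If the encoder follows this convention, the slice is just selecting the single global image representation rather than the full patch-token sequence.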