Why is only the first dimension of the image encoding used?
Yuancheng-Xu opened this issue
Hi!
I have a question about how the image embedding is used in the code here: https://github.com/yunqing-me/AttackVLM/blob/e5806028d490846c76375342ddfd779d197111ae/MiniGPT-4/_train_adv_img_trans.py#L132C13-L132C62

In particular, `tgt_image_features = (tgt_image_features)[:,0,:]`. Why is this step necessary? Isn't `chat.forward_encoder(image_tgt)` already the image embedding? I wonder why we have to extract the image embedding using `[:,0,:]`. Thank you!
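For context on what the slice itself does, here is a minimal sketch. The shapes are illustrative assumptions (a ViT-style encoder typically returns a `[batch, num_tokens, embed_dim]` sequence of token embeddings, where the first token often serves as a CLS/global summary token); the actual shapes in AttackVLM may differ:

```python
import numpy as np

# Assumed (hypothetical) shapes for illustration: the encoder output
# is a [batch, num_tokens, embed_dim] array of per-token embeddings.
batch, num_tokens, embed_dim = 2, 257, 768
tgt_image_features = np.random.randn(batch, num_tokens, embed_dim)

# The slice in question keeps only the first token for each image,
# dropping the other tokens and collapsing the sequence dimension:
first_token = tgt_image_features[:, 0, :]
print(first_token.shape)  # -> (2, 768), one vector per image
```

So the full encoder output is an embedding per token, while `[:, 0, :]` reduces it to a single vector per image, which is convenient when computing a scalar similarity between images.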