DCDmllm / Cheetah

Question about the I4 benchmark's leaderboard

BellXP opened this issue

Firstly, I really appreciate your outstanding contribution to the LVLM field.

However, I am confused by the I4 Benchmark leaderboard: many of the models listed there do not support a single text prompt with multiple images as input. How did you obtain results for models such as InstructBLIP and LLaMA-Adapter-v2?

Thank you for recognizing our contribution to the LVLM field. For models that do not natively support multiple image inputs, our workaround is to concatenate the embeddings of all the images, which amounts to treating the images as frames of a video. To retain the positional information of each image in an interleaved image-text instruction, we explicitly indicate the location of each image within the context.
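
For readers who want a concrete picture, here is a minimal sketch of the workaround described above. It is not the code used for the leaderboard. The `Image` class, the `flatten_instruction` helper, and the `<image N>` marker format are all hypothetical names chosen for illustration, and each image is assumed to be already encoded as a list of visual-token vectors.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple, Union


@dataclass
class Image:
    """Hypothetical stand-in for one encoded image (a list of visual-token vectors)."""
    tokens: List[List[float]]


def flatten_instruction(
    segments: Sequence[Union[str, Image]],
) -> Tuple[str, List[List[float]]]:
    """Turn an interleaved image-text instruction into one prompt plus one visual sequence.

    Each image is replaced in the text by an explicit marker (assumed format
    "<image N>") so its position in the context is preserved, and the image
    embeddings are concatenated in order, like frames of a video.
    """
    text_parts: List[str] = []
    visual_tokens: List[List[float]] = []
    image_index = 0
    for segment in segments:
        if isinstance(segment, Image):
            image_index += 1
            text_parts.append(f"<image {image_index}>")  # positional marker
            visual_tokens.extend(segment.tokens)          # concatenate embeddings
        else:
            text_parts.append(segment)
    return " ".join(text_parts), visual_tokens


# Toy example: two images with 2 and 1 visual tokens of dimension 2.
segments = [
    "Compare",
    Image(tokens=[[0.1, 0.2], [0.3, 0.4]]),
    "with",
    Image(tokens=[[0.5, 0.6]]),
    "and describe what changed.",
]
prompt, visual = flatten_instruction(segments)
print(prompt)
print(len(visual))
```

The single concatenated sequence could then be passed to a model that accepts only one visual input, while the markers in the prompt tell it where each image belongs. How a given model consumes the sequence depends on its own interface.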