Question about the I4 benchmark's leaderboard

BellXP opened this issue a year ago

Firstly, I really appreciate your outstanding contribution to the LVLM field.

However, I am confused about the I4 benchmark leaderboard, because many of the models listed there do not support a single text prompt with multiple images as input. How did you obtain the results for models such as InstructBLIP and LLaMA-Adapter-v2?

Zhiqi Ge · Answer 1 · Sat Aug 19 2023 12:30:05 GMT+0800 (China Standard Time)

Thank you for recognizing our contribution to the LVLM field. For models that don't natively support multiple image inputs, our workaround is to concatenate the embeddings of all the images, which can be viewed as treating the images as frames of a video. To retain each image's position in an interleaved image-text instruction, we explicitly indicate where each image occurs within the context.
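
To make the workaround above concrete, here is a minimal sketch in plain Python. It is only an illustration, not the code used for the leaderboard: the function names, the `("text", ...)` / `("image", ...)` segment format, and the `<image k>` placeholder are all assumptions, and real models would use tensors from their own vision encoders rather than nested lists.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def concat_image_embeddings(per_image: Sequence[Sequence[Vector]]) -> List[Vector]:
    """Join the token embeddings of several images along the sequence axis.

    The result is one long sequence, as if the images were video frames.
    """
    joined: List[Vector] = []
    for tokens in per_image:
        joined.extend(tokens)
    return joined


def mark_image_positions(segments: Sequence[Tuple[str, str]]) -> str:
    """Render an interleaved instruction with an explicit marker per image.

    The "<image k>" marker is a hypothetical format chosen for this sketch.
    """
    parts: List[str] = []
    image_index = 0
    for kind, value in segments:
        if kind == "image":
            image_index += 1
            parts.append(f"<image {image_index}>")
        elif kind == "text":
            parts.append(value)
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return " ".join(parts)


# Two images with 2 and 1 visual tokens respectively (toy 2-d embeddings).
image_a = [[1.0, 2.0], [3.0, 4.0]]
image_b = [[5.0, 6.0]]
visual_sequence = concat_image_embeddings([image_a, image_b])

prompt = mark_image_positions(
    [("text", "Compare"), ("image", "a.png"), ("text", "with"), ("image", "b.png")]
)
```

Here `visual_sequence` holds three visual tokens in image order, and `prompt` tells the model where each image sits in the instruction, so a single-image model can still see which text refers to which image.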