TRI-ML / vlm-evaluation

VLM Evaluation: Benchmark for VLMs, spanning text generation tasks from VQA to Captioning

Error when evaluating on POPE-full

djghosh13 opened this issue

When evaluating LLaVA on the pope-full dataset, I'm getting the following error:

Traceback (most recent call last):
  File "/weka/home-djghosh/code/llava_ft/vlm-evaluation/scripts/score.py", line 145, in <module>
    score()
  File "/weka/home-djghosh/miniconda3/envs/vlm/lib/python3.10/site-packages/draccus/argparsing.py", line 203, in wrapper_inner
    response = fn(cfg, *args, **kwargs)
  File "/weka/home-djghosh/code/llava_ft/vlm-evaluation/scripts/score.py", line 104, in score
    len(full_results) == cfg.dataset.expected_examples
AssertionError: Expected 9000 model outputs, only found 8910!

It seems like POPE is expected to have 3000 samples each in the popular, random, and adversarial splits, but the random split contains only 2910. Looking into it, the download source for POPE in https://github.com/TRI-ML/vlm-evaluation/tree/main/vlm_eval/tasks/registry.py points to an older commit of the POPE repository, in which the random split does indeed have 2910 examples rather than 3000.
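
For reference, here is a quick way to confirm the per-split counts in the downloaded data. This is only a sketch: the directory and file names below are assumptions about where the POPE annotations end up (upstream POPE ships one JSON-lines file per split), so adjust the paths to your setup.

from pathlib import Path

# Hypothetical location of the downloaded POPE annotation files; adjust as needed.
pope_dir = Path("datasets/pope")

for split in ("adversarial", "popular", "random"):
    split_file = pope_dir / f"coco_pope_{split}.json"  # one JSON object per line in upstream POPE
    with open(split_file) as f:
        n = sum(1 for line in f if line.strip())
    print(f"{split}: {n} examples")

# Against the older POPE commit this reports 3000 / 3000 / 2910 (8910 total),
# while score.py asserts an expected total of 9000.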

Which of these is the intended dataset for comparison against Prismatic VLMs? That is, should pope-full expect 8910 examples, or does the dataset download source need to be updated? Thanks!

It also seems like at

count := {"adversarial": 3000, "popular": 3000, "random": 2910}[split]
in your code, the dataset preparation does indeed expect 2910 examples in the random split. So the dataset preparation scripts and the scoring script disagree about how many examples there should be.
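
One way to keep the two in sync would be to derive the expected total from the per-split counts rather than hard-coding it separately. This is only a hypothetical sketch; the names below are illustrative, not the repo's actual variables.

# Hypothetical consistency check: compute the expected total from the per-split
# counts so the preparation and scoring sides cannot drift apart.
SPLIT_COUNTS = {"adversarial": 3000, "popular": 3000, "random": 2910}  # counts in the older POPE commit
EXPECTED_EXAMPLES = sum(SPLIT_COUNTS.values())  # 8910 here, vs. the 9000 asserted by score.py

def check_outputs(full_results):
    # Mirrors the assertion in score.py, but against the derived total.
    assert len(full_results) == EXPECTED_EXAMPLES, (
        f"Expected {EXPECTED_EXAMPLES} model outputs, only found {len(full_results)}!"
    )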

Our current code (to reproduce the results in the paper) expects the older version of the POPE dataset, with 2910 examples in the random split.

However, happy to update this moving forward — what do you think @djghosh13?

Thanks for clarifying!

The newer commits to POPE fix a number of spelling errors in addition to adding the 90 missing examples, so it would probably be preferable to run on that version. The changes to your code should be minimal (just updating 8910 to 9000 and replacing the pinned commit ID in the download script; roughly sketched below), it's just a question of whether it's worth re-running the POPE evals for the models you've already tested.
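
For concreteness, the change would look roughly like the sketch below. This is illustrative only: the first line mirrors the snippet quoted above, and the commit hash is left as a placeholder rather than a real value.

# Sketch of the update described above (not the repo's actual code).
# Bump the random split to its new size so the three splits add up to the
# 9000 examples that score.py asserts.
count = {"adversarial": 3000, "popular": 3000, "random": 3000}[split]  # random: 2910 -> 3000

# Re-pin the POPE download in vlm_eval/tasks/registry.py to the newer commit.
POPE_COMMIT = "<newer POPE commit hash>"  # placeholder, not a real hash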

@djghosh13 - closing this for now as we're switching over to the newest version of POPE. Will respond to this issue once we get new results on all the models (as part of our next release!)