DAMO-NLP-SG / VCD

[CVPR 2024 Highlight] Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding

Are special hyperparameter values or checkpoints needed?

yejipark-m opened this issue

Hello, thanks for sharing your nice work.

I've encountered a problem trying to reproduce the reported results, even after accounting for the standard deviations. I'm using the following model checkpoints from Hugging Face:

model_paths[instructblip]="~/.cache/huggingface/hub/models--lmsys--vicuna-7b-v1.1"
model_paths[llava]="liuhaotian/llava-v1.5-7b"
model_paths[qwenvl]="Qwen/Qwen-VL"

I've kept the hyperparameters at their default settings:

python3 eval/object_hallucination_vqa_${model}.py --model-path ${model_paths[$model]} --question-file data/POPE/aokvqa/aokvqa_pope_${type}.json --image-folder data/MSCOCO/val2014 --answers-file ./output/${model}/aokvqa_pope_${type}_vcd.jsonl --use_cd

parser.add_argument("--noise_step", type=int, default=500)
parser.add_argument("--use_cd", action='store_true', default=False)
parser.add_argument("--cd_alpha", type=float, default=1)
parser.add_argument("--cd_beta", type=float, default=0.1)
parser.add_argument("--seed", type=int, default=42)

However, with these checkpoints and hyperparameters, my numbers are significantly lower than the reported performance, particularly on the GQA and A-OKVQA datasets. The results without VCD are close to the reported numbers, so the issue seems to lie in the VCD decoding itself.

Could you specify which checkpoints you used for each model? Sharing the exact inference setup or recipe would be greatly appreciated.

Hi, thanks for your interest. Please refer to "Implementation Details" in Section 4.1 and Appendix A for the hyperparameter settings of each experiment.

Thanks for your reply. I had misconfigured the noise step when evaluating POPE.
I'll close the issue.
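For future readers hitting the same thing: --noise_step sets how many forward diffusion steps are applied to the input image to produce the distorted copy that VCD contrasts against. A minimal sketch of that distortion under a standard DDPM linear beta schedule, assuming 1000 total steps and illustrative names:

```python
import torch

def add_diffusion_noise(image_tensor, noise_step=500, total_steps=1000):
    """Distort an image with the DDPM forward process:
    x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps.

    Larger noise_step -> heavier distortion of the visual input.
    """
    betas = torch.linspace(1e-4, 0.02, total_steps)     # linear beta schedule
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # cumulative a_bar_t
    a_bar = alphas_cumprod[noise_step - 1]
    noise = torch.randn_like(image_tensor)              # eps ~ N(0, I)
    return a_bar.sqrt() * image_tensor + (1.0 - a_bar).sqrt() * noise
```

At the default of 500 out of 1000 steps, the distorted image keeps coarse layout but loses fine detail, which is what the contrastive term relies on; a very different noise_step changes the contrast strength, which is presumably why a misconfigured value shifts the POPE numbers.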