BradyFU / Awesome-Multimodal-Large-Language-Models

:sparkles::sparkles:Latest Advances on Multimodal Large Language Models

Add ViP-LLaVA

mu-cai opened this issue

Hi,

Could you consider adding ViP-LLaVA? https://vip-llava.github.io/
ViP-LLaVA is a region-level large multimodal model that accepts arbitrary visual prompts as input.

Thanks
Mu

Along with the proposed method, we release the instruction fine-tuning dataset for visual prompts: https://huggingface.co/datasets/mucai/ViP-LLaVA-Instruct

and the evaluation benchmark for visual prompts, ViP-Bench: https://huggingface.co/datasets/mucai/ViP-Bench
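A minimal sketch of fetching both releases, assuming only that the dataset repo IDs in the URLs above can be pulled with the standard Hugging Face Hub client; the local paths and any further parsing are illustrative, not prescribed by this issue:

```python
# Sketch (assumption): download the released files via huggingface_hub,
# using the dataset repo IDs from the links above.
from huggingface_hub import snapshot_download

# Instruction fine-tuning data for visual prompts
instruct_dir = snapshot_download(
    repo_id="mucai/ViP-LLaVA-Instruct", repo_type="dataset"
)

# Evaluation benchmark for visual prompts (ViP-Bench)
bench_dir = snapshot_download(
    repo_id="mucai/ViP-Bench", repo_type="dataset"
)

print("ViP-LLaVA-Instruct files at:", instruct_dir)
print("ViP-Bench files at:", bench_dir)
```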

These works have been added to our repo.
Please consider citing our work:

@article{yin2023survey,
  title={A Survey on Multimodal Large Language Models},
  author={Yin, Shukang and Fu, Chaoyou and Zhao, Sirui and Li, Ke and Sun, Xing and Xu, Tong and Chen, Enhong},
  journal={arXiv preprint arXiv:2306.13549},
  year={2023}
}

@article{fu2023mme,
  title={MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models},
  author={Fu, Chaoyou and Chen, Peixian and Shen, Yunhang and Qin, Yulei and Zhang, Mengdan and Lin, Xu and Yang, Jinrui and Zheng, Xiawu and Li, Ke and Sun, Xing and Wu, Yunsheng and Ji, Rongrong},
  journal={arXiv preprint arXiv:2306.13394},
  year={2023}
}

@article{yin2023woodpecker,
  title={Woodpecker: Hallucination Correction for Multimodal Large Language Models},
  author={Yin, Shukang and Fu, Chaoyou and Zhao, Sirui and Xu, Tong and Wang, Hao and Sui, Dianbo and Shen, Yunhang and Li, Ke and Sun, Xing and Chen, Enhong},
  journal={arXiv preprint arXiv:2310.16045},
  year={2023}
}

@article{fu2023gemini,
  title={A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise},
  author={Fu, Chaoyou and Zhang, Renrui and Wang, Zihan and Huang, Yubo and Zhang, Zhengye and Qiu, Longtian and Ye, Gaoxiang and Shen, Yunhang and Zhang, Mengdan and Chen, Peixian and Zhao, Sirui and Lin, Shaohui and Jiang, Deqiang and Yin, Di and Gao, Peng and Li, Ke and Li, Hongsheng and Sun, Xing},
  journal={arXiv preprint arXiv:2312.12436},
  year={2023}
}

Can you update the ViP-LLaVA entry (Making Large Multimodal Models Understand Arbitrary Visual Prompts)

to note that it has been accepted by CVPR? Thanks

The corresponding item has been updated.