YuchenLiu98 / COMM

PyTorch code for the paper "From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models"

CLIP-VG (already accepted by TMM 2023) proposed utilizing CLIP's multi-level visual features for visual grounding; perhaps this paper should cite it

linhuixiao opened this issue · comments

Hi, CLIP-VG [1], which has already been accepted by TMM 2023, proposed utilizing the multi-level visual features of CLIP for the visual grounding task; perhaps this paper should cite this reference. Thanks.

[1] Xiao, Linhui, et al. "CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding." IEEE Transactions on Multimedia (2023).
https://ieeexplore.ieee.org/abstract/document/10269126
https://arxiv.org/abs/2305.08685
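
For context, here is a minimal sketch of what "multi-level visual features of CLIP" means in practice, using HuggingFace transformers. It is not taken from either paper; the model name and the layer indices chosen for fusion are illustrative assumptions.

```python
# Minimal sketch: extract hidden states from every layer of a CLIP vision
# encoder, then fuse a few intermediate layers instead of using only the last.
# Model name and layer indices below are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))  # placeholder image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple: the embedding output plus one tensor per
# transformer layer, each of shape (batch, num_patches + 1, hidden_dim).
hidden_states = outputs.hidden_states

# Concatenate features from a few levels (shallow, middle, deep).
selected = [hidden_states[i] for i in (4, 8, 12)]  # illustrative indices
multi_level = torch.cat(selected, dim=-1)
print(multi_level.shape)  # e.g. (1, 50, 3 * 768) for ViT-B/32 at 224x224
```

The design point both works share is that intermediate CLIP layers carry localization-relevant detail that the final layer discards, so fusing several levels tends to help dense or grounding-style tasks more than last-layer features alone.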