YuchenLiu98 / COMM

PyTorch code for the paper "From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models"

CLIP-VG (already accepted by TMM 2023) proposed utilizing CLIP's multi-level visual features for visual grounding; perhaps this paper should cite it

linhuixiao opened this issue · comments

Hi, CLIP-VG [1], which has already been accepted by TMM 2023, proposed utilizing the multi-level visual features of CLIP for the visual grounding task; perhaps this paper should cite this reference. Thanks.

[1] Xiao, Linhui, et al. "CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding." IEEE Transactions on Multimedia (2023).
https://ieeexplore.ieee.org/abstract/document/10269126
https://arxiv.org/abs/2305.08685
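
For context, here is a minimal sketch of what "multi-level visual features of CLIP" means in practice, using HuggingFace transformers. It is not taken from either paper; the model name and the layer indices chosen for fusion are illustrative assumptions.

```python
# Minimal sketch: extract hidden states from every layer of a CLIP vision
# encoder, then fuse a few intermediate layers instead of using only the last.
# Model name and layer indices below are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))  # placeholder image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple: the embedding output plus one tensor per
# transformer layer, each of shape (batch, num_patches + 1, hidden_dim).
hidden_states = outputs.hidden_states

# Concatenate features from a few levels (shallow, middle, deep).
selected = [hidden_states[i] for i in (4, 8, 12)]  # illustrative indices
multi_level = torch.cat(selected, dim=-1)
print(multi_level.shape)  # e.g. (1, 50, 3 * 768) for ViT-B/32 at 224x224
```

The design point both works share is that intermediate CLIP layers carry localization-relevant detail that the final layer discards, so fusing several levels tends to help dense or grounding-style tasks more than last-layer features alone.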