beichenzbc / Long-CLIP

[ECCV 2024] official code for "Long-CLIP: Unlocking the Long-Text Capability of CLIP"

A question about the similarity of long text and image

Yu-xm opened this issue

I tried to compute the image-text similarity for the first example in Fig. 5 of your paper. I shortened the caption, so I now have three captions of different lengths that all correspond to the same image. The captions are as follows:

1. "Man in black jacket crosses city street with green light and colorful cars."
2. "A man in a black jacket crosses a busy street with colorful cars, under a green traffic light, flanked by tall buildings and trees, under a clear sky."
3. "A man in a black jacket is crossing a busy city street. The street is filled with cars of various colors, including yellow taxis and red trucks. A traffic light hangs overhead, currently displaying a green signal. The perspective of the photo is from the sidewalk, giving a sense of being part of the city's hustle and bustle. The sky above is clear, suggesting good weather. The street is lined with tall buildings and trees, creating a vibrant cityscape."

In theory, caption 3 should be more similar to the image than caption 2, and caption 2 more similar than caption 1. Instead, caption 2 scores highest, followed by caption 1, with caption 3 scoring lowest. Can you explain this phenomenon?
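
For reference, the comparison described above comes down to ranking the captions by cosine similarity against one image embedding. Here is a minimal sketch of that ranking step. The vectors are made-up toy values chosen only to reproduce the ordering reported above, not actual Long-CLIP outputs, and `cosine_similarity` and `rank_captions` are hypothetical helper names. In practice the embeddings would come from the model's image and text encoders.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_captions(image_embedding, caption_embeddings):
    """Return (caption_index, similarity) pairs, most similar first."""
    scores = [
        (index, cosine_similarity(image_embedding, embedding))
        for index, embedding in enumerate(caption_embeddings)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for real embeddings (assumption: illustrative values only).
image_embedding = [1.0, 0.0, 0.0]
caption_embeddings = [
    [0.8, 0.6, 0.0],  # stands in for caption 1 (short)
    [1.0, 0.0, 0.0],  # stands in for caption 2 (medium)
    [0.6, 0.0, 0.8],  # stands in for caption 3 (long)
]

ranking = rank_captions(image_embedding, caption_embeddings)
for index, score in ranking:
    print(f"caption {index + 1}: cosine similarity = {score:.3f}")
```

Comparing raw cosine similarities like this, rather than softmax probabilities taken over the three captions, keeps each score independent of which other captions are in the batch.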

Refer to the discussion in issue #9.