kakaobrain / honeybee

Official implementation of project Honeybee (CVPR 2024)


Question about "Projector"

Joshmeaning opened this issue

Hello,

While carefully reading the paper, I found myself confused about the section on 'projectors.'

Background: From what I understand so far, in the case of CLIP ViT-Large, despite the complex computation involved (splitting the image into patches, embedding them, and so on), the output visual feature is known to be a single fixed-size embedding vector. (Source: https://www.pinecone.io/learn/series/image-search/clip/)
This vector is fed into a 'projector,' which, after further computation, outputs another fixed-size embedding vector. That vector then serves as the input to the LLM.
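
A minimal sketch of what that encoder actually returns (assuming the HuggingFace `transformers` library and the public `openai/clip-vit-large-patch14` checkpoint; shapes are for 224x224 input):

```python
import torch
from transformers import CLIPVisionModel

model = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
pixels = torch.randn(1, 3, 224, 224)  # dummy image batch

with torch.no_grad():
    out = model(pixel_values=pixels)

# Full token sequence: 1 [CLS] token + 16*16 = 256 patch tokens, each 1024-d.
print(out.last_hidden_state.shape)  # torch.Size([1, 257, 1024])
# Pooled [CLS] feature: the single fixed-size vector I described above.
print(out.pooler_output.shape)      # torch.Size([1, 1024])
```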

In the Honeybee paper, I felt that the 'projector' design was significantly improved.

  1. When visual features are projected, LLaVA uses a linear projection, which I understood keeps the output the same size as the input and therefore increases the amount of computation, as the paper points out.
  2. When a structure like the Q-former is used, the size of the output is reduced relative to the input, which can be expected to improve efficiency. However, I understood that this could lead to information loss due to dimensionality reduction (see the sketch after this list).
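
Here is a rough sketch of how I picture these two projector types (shapes are illustrative only, e.g. 257 CLIP tokens of dimension 1024 and an LLM dimension of 4096, not Honeybee's actual configuration):

```python
import torch
import torch.nn as nn

visual = torch.randn(1, 257, 1024)  # [batch, tokens, vision dim]

# 1) LLaVA-style linear projection: applied per token, so the LLM
#    receives exactly as many visual tokens as the encoder produced.
linear_proj = nn.Linear(1024, 4096)
print(linear_proj(visual).shape)  # torch.Size([1, 257, 4096])

# 2) Q-former/Resampler-style abstraction: a fixed set of learnable
#    queries cross-attends to the visual tokens, so the output token
#    count drops to the number of queries (here 64).
queries = nn.Parameter(torch.randn(1, 64, 1024))
cross_attn = nn.MultiheadAttention(embed_dim=1024, num_heads=8, batch_first=True)
resampled, _ = cross_attn(query=queries, key=visual, value=visual)
print(resampled.shape)  # torch.Size([1, 64, 1024])
```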

Based on these issues, the Honeybee paper proposes the C-Abstractor and D-Abstractor, which I felt address the two problems mentioned above.
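
For reference, my mental model of the C-Abstractor idea is something like the simplified sketch below: treat the patch tokens as a 2D feature map, and use convolutions (which preserve locality) around an adaptive average pooling step (which sets the output token count). This is only an illustration of the idea, not the module in this repo (which, as I read the paper, uses ResNet-style blocks):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCAbstractor(nn.Module):
    """Illustrative only: conv -> adaptive pool -> conv over patch tokens."""
    def __init__(self, dim=1024, out_tokens=144):
        super().__init__()
        self.out_hw = int(out_tokens ** 0.5)  # e.g. 144 tokens -> 12x12 grid
        self.conv_in = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.conv_out = nn.Conv2d(dim, dim, kernel_size=3, padding=1)

    def forward(self, x):                     # x: [B, N, C] patch tokens, no [CLS]
        b, n, c = x.shape
        hw = int(n ** 0.5)                    # assume a square patch grid
        x = x.transpose(1, 2).reshape(b, c, hw, hw)
        x = F.relu(self.conv_in(x))           # local mixing before pooling
        x = F.adaptive_avg_pool2d(x, self.out_hw)  # reduce tokens, keep layout
        x = F.relu(self.conv_out(x))          # local mixing after pooling
        return x.flatten(2).transpose(1, 2)   # back to [B, out_tokens, C]

patches = torch.randn(1, 256, 1024)           # 16x16 CLIP patch tokens
print(TinyCAbstractor()(patches).shape)       # torch.Size([1, 144, 1024])
```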

Did I understand this correctly? Please point out any mistakes in my understanding, including in the background. That would be a great help.

Thanks Kakao team!

[figure adapted from the ViT paper: a grid of patch tokens with a [CLS] token at index 0]

The CLIP ViT image encoder basically uses only the output feature of the [CLS] token (the first token, with index 0 in the figure). MLLMs utilize not only the [CLS] token but also the other local tokens (tokens 1-9 in the figure). Our design principle argues that the projector should be able to reduce the number of local tokens, for controllability over efficiency.
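
Concretely, the split looks like this (a minimal sketch with a stand-in tensor; 257 = 1 [CLS] + 256 patch tokens for ViT-L/14 at 224x224 input):

```python
import torch

h = torch.randn(1, 257, 1024)  # stand-in for CLIP ViT-L/14 output tokens
cls_feature = h[:, 0]          # [1, 1024]: the global feature CLIP itself uses
local_tokens = h[:, 1:]        # [1, 256, 1024]: the local tokens MLLMs also use
```

The projector's job is to map those 256 local tokens into the LLM's embedding space, ideally with control over how many tokens come out.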

The Q-former and Resampler have flexibility, meaning they can reduce the number of local tokens, but they tend to fail to preserve local context (i.e., locality); note that this is different from dimensionality reduction.
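
To illustrate the distinction with a toy sketch (not our implementation): encode each patch's (row, col) position as its feature, then compare what a single output token aggregates under adaptive pooling versus under a randomly initialized learned query:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hw = 4  # a tiny 4x4 patch grid
rows, cols = torch.meshgrid(
    torch.arange(hw, dtype=torch.float32),
    torch.arange(hw, dtype=torch.float32), indexing="ij")
coords = torch.stack((rows, cols), dim=-1)  # [4, 4, 2]: feature = (row, col)

# Pooling 4x4 down to 2x2: each output token averages one contiguous
# 2x2 neighborhood, so the spatial layout survives.
pooled = F.adaptive_avg_pool2d(coords.permute(2, 0, 1), 2).permute(1, 2, 0)
print(pooled)  # each of the 4 output tokens is the centroid of one quadrant

# A learned query instead cross-attends over all 16 patches at once;
# its output is a global mixture with no fixed spatial region.
attn = nn.MultiheadAttention(embed_dim=2, num_heads=1, batch_first=True)
query = torch.randn(1, 1, 2)
_, weights = attn(query, coords.reshape(1, 16, 2), coords.reshape(1, 16, 2))
print((weights > 0).sum())  # tensor(16): nonzero weight on every patch
```

Pooling keeps a token-to-region correspondence, while each resampler query mixes all patches at once; that mixing, not a smaller embedding dimension, is what tends to hurt locality.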