YuchenLiu98 / COMM

PyTorch code for the paper "From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models"


Clarification on the Implementation of the Linear-LayerNorm (LLN) Module

mu-cai opened this issue · comments

Hello Authors,

I am writing to seek clarification on a specific aspect of your implementation, namely the Linear-LayerNorm (LLN) module used to align the feature spaces of features taken from different layers. The paper states that a Linear-LayerNorm module is applied, followed by a weighted sum through layerscale:

\bar{z} = w_1 \cdot \text{LLN}(z_1) + \cdots + w_N \cdot \text{LLN}(z_N)
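To make sure I am reading this formula correctly, here is a toy PyTorch rendering of my current understanding; it uses a single shared LLN module, which is exactly what my first question below asks about, and all shapes and names are placeholders of my own:

```python
import torch
import torch.nn as nn

# Placeholder shapes: N layer features z_1, ..., z_N of shape (batch, tokens, dim).
N, B, T, D = 4, 2, 16, 1024
feats = [torch.randn(B, T, D) for _ in range(N)]

# A single shared Linear-LayerNorm (LLN) module -- or one per layer? (question 1)
lln = nn.Sequential(nn.Linear(D, D), nn.LayerNorm(D))
w = nn.Parameter(torch.ones(N))  # layerscale weights w_1, ..., w_N

# \bar{z} = w_1 * LLN(z_1) + ... + w_N * LLN(z_N)
z_bar = sum(w[i] * lln(z) for i, z in enumerate(feats))
print(z_bar.shape)  # torch.Size([2, 16, 1024])
```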

My question revolves around the specifics of the implementation:

1. Is there a single shared Linear and LayerNorm module applied to the features from all layers, or does each layer have its own dedicated Linear and LayerNorm module? (I would guess they should be different across layers?)

2. As for the CLIP and DINO features, they should be concatenated along the feature dimension, not the token dimension, right?

3. Could you provide a snippet or a pseudo-code example of how the LLN layers and scaling parameters are implemented for the CLIP and DINO features?

If there is any additional documentation or examples available, I would greatly appreciate it if you could point me in the right direction.

Thank you for your time and for your contributions to the field. I look forward to your response.

Best regards,

Hello! Thanks for your questions.
For question 1, yes, each layer has its own dedicated Linear and LayerNorm module (implemented with nn.Linear and nn.LayerNorm).
For question 2, yes, we concatenate the CLIP and DINO features along the feature dimension. We then apply a linear layer that projects the concatenated features down to half of the concatenated dimension (i.e., the same dimension as the original CLIP features).
I apologize that the code is still under our company's legal review. If you have any other questions, feel free to contact me. Thanks.
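For readers who find this thread before the code is released, below is a minimal PyTorch sketch consistent with the answers above: one dedicated nn.Linear + nn.LayerNorm per layer, learnable layerscale weights, and CLIP and DINO features concatenated along the feature dimension and then projected back to the original CLIP width. The class names, the shapes, the assumption that both encoders share the same feature width, and the exact ordering of the per-encoder fusion and the concatenation are illustrative guesses, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MultiLayerFusion(nn.Module):
    """Weighted sum of multi-layer features with per-layer Linear-LayerNorm (LLN)
    modules and layerscale weights (hypothetical sketch, not the released code)."""

    def __init__(self, num_layers: int, dim: int):
        super().__init__()
        # One dedicated Linear + LayerNorm per layer (per the answer to question 1).
        self.lln = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.LayerNorm(dim))
            for _ in range(num_layers)
        )
        # Layerscale weights w_1, ..., w_N.
        self.weights = nn.Parameter(torch.ones(num_layers))

    def forward(self, feats):
        # feats: list of (batch, tokens, dim) tensors, one per layer.
        # \bar{z} = w_1 * LLN(z_1) + ... + w_N * LLN(z_N)
        return sum(w * lln(z) for w, lln, z in zip(self.weights, self.lln, feats))


class ClipDinoFusion(nn.Module):
    """Fuse multi-layer CLIP and DINO features, concatenate along the feature
    dimension, and project back to the CLIP width (per the answer to question 2)."""

    def __init__(self, num_clip_layers: int, num_dino_layers: int, dim: int):
        super().__init__()
        self.clip_fusion = MultiLayerFusion(num_clip_layers, dim)
        self.dino_fusion = MultiLayerFusion(num_dino_layers, dim)
        # Project the concatenated (2 * dim) features back down to dim.
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, clip_feats, dino_feats):
        fused_clip = self.clip_fusion(clip_feats)            # (batch, tokens, dim)
        fused_dino = self.dino_fusion(dino_feats)            # (batch, tokens, dim)
        fused = torch.cat([fused_clip, fused_dino], dim=-1)  # feature-dim concat
        return self.proj(fused)                              # back to dim


if __name__ == "__main__":
    # Toy usage with placeholder shapes.
    B, T, D = 2, 16, 1024
    clip_feats = [torch.randn(B, T, D) for _ in range(4)]
    dino_feats = [torch.randn(B, T, D) for _ in range(4)]
    model = ClipDinoFusion(num_clip_layers=4, num_dino_layers=4, dim=D)
    print(model(clip_feats, dino_feats).shape)  # torch.Size([2, 16, 1024])
```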