kakaobrain / honeybee

Official implementation of project Honeybee (CVPR 2024)

I don't get it: the code just provides C-Abstractor + MLP and claims Resampler is not good?

lucasjinreal opened this issue · comments

It should at least be C-Abstractor + Resampler to make the Resampler work, not C-Abstractor + MLP.

image

I'm not quite sure I follow your point.

First, C-Abstractor inherently consists of both convolution and MLP layers. Therefore, the code in question simply implements the C-Abstractor itself, not C-Abstractor + MLP.

As for your mention of "CAbstracter + Resampler," are you referring to a combination of convolution with Resampler? Could you clarify the intended purpose and necessity of this experiment? Since both Resampler and C-Abstractor are standalone projectors capable of being evaluated on their own, I'm curious as to why you think a convolution layer is necessary for this comparison.

It is not very convincing to compare a naive Resampler with C-Abstractor and claim that Resampler has limitations. In fact, many Resamplers can achieve much better results than MLP with fewer tokens.

It is also not very convincing to compare C-Abstractor + MLP with MLP and claim that C-Abstractor is better than MLP. The ablation study actually shows that MLP is better than C-Abstractor + MLP at the same number of tokens, even though C-Abstractor introduces more parameters.

image

You can check the MMBench leaderboard: many Resampler-based methods, some with even smaller LLMs, exceed Honeybee.

Lastly, at first glance I thought the paper was pointing out limitations of Resampler and trying to solve them on top of Resampler, but it does not. Even though your approach can have the same effect as a Resampler, it still does not resolve the issues in Resampler itself.

There are several points to address.

MLP or C-Abstractor is usually better than Resampler. In most cases, Resampler does not perform as well as MLP or C-Abstractor. Comparing them on the MMBench leaderboard is not a fair comparison, since the models differ in many ways beyond the projector. Please compare the models under fair conditions.

I found another fair comparison in the InternLM-XComposer2-4KHD paper:

image

(image from InternLM-XComposer2-4KHD paper)

But this does not mean that C-Abstractor is always better than Resampler. We addressed the issue of local context preservation (the problem with Resampler that we pointed out in the paper) from a projector-architecture perspective. It may also be possible to address this issue from a dataset perspective, though we haven't tried this due to resource constraints.

However, personally, I think that using Resampler to achieve better performance with fewer tokens than MLP in the current MLLM structure is very challenging. Even if we address the locality preservation issue in the Resampler from a data perspective, reducing the number of visual tokens while enhancing performance is difficult (as shown in our paper, the number of visual tokens is important). None of the top-performing models on MMBench that you mentioned use Resampler either (InternVL, TransCore-M, LLaVA-NeXT, ...).

Size of MLP. If you believe the number of parameters is crucial, please refer to our appendix on the 6-layer MLP experiment.

image

As you can see from the table above, simply increasing the number of parameters does not improve MLP's performance.

Fair comparison between MLP and C-Abstractor. I do not agree that including an MLP in the C-Abstractor makes it unfair to compare it to MLP. An MLP is also included in Resampler. Does that make MLP and Resampler incomparable as well? How could we compare them then? MLP is a basic building block in deep learning model architectures.

D-Abstractor is the Resampler-based approach. In D-Abstractor, we tried to solve the Resampler's issue within the Resampler architecture itself: we incorporated deformable attention to maintain the structure of the Resampler while addressing the locality preservation issue. Please refer to our paper for more details.

Thanks for the dedicated analysis.

Do you think combining C-Abstractor with Resampler, instead of using DeformableDecoderLayer, would give more informative results than Resampler alone?

For instance, take the 576 tokens output by the ViT, have C-Abstractor reduce them to 256, and then use a Resampler to output 114, etc.?
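To make the token counts in this proposal concrete, here is a rough sketch, not taken from the Honeybee code. It assumes the ViT tokens form a square grid and that the abstractor reduces them with adaptive average pooling to another square grid. `grid_side` and `pooling_windows` are hypothetical helper names. The window rule is, to my understanding, the one PyTorch's adaptive pooling uses (floor for the start index, ceiling for the end index).

```python
import math

def grid_side(num_tokens):
    """Side length if num_tokens fills a square grid, else None."""
    side = math.isqrt(num_tokens)
    return side if side * side == num_tokens else None

def pooling_windows(in_side, out_side):
    """1D index ranges [start, end) that adaptive average pooling
    assigns to each output cell along one axis."""
    return [
        ((i * in_side) // out_side, -(-((i + 1) * in_side) // out_side))
        for i in range(out_side)
    ]

# Token counts from the question: 576 (ViT), 256 (abstractor), 114 (resampler).
for n in (576, 256, 114):
    print(n, "->", grid_side(n))

# Which input rows/columns feed the first few output cells when going 24 -> 16.
print(pooling_windows(24, 16)[:3])
```

Under these assumptions, 576 and 256 correspond to 24x24 and 16x16 grids, while 114 is not a square number. That is fine for a Resampler, whose number of learned queries is arbitrary, but it means the second stage no longer maps onto a spatial grid.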

@lucasjinreal I don't think that introducing a Resampler will benefit performance. Resamplers are challenging to train effectively, and many, like Q-Former, are simply initialized from a BERT-pretrained checkpoint. Our primary objective is to construct a straightforward projector between the visual encoder and the LLM that minimizes information loss between the modalities. While MLP is a good approach, it struggles to compress image tokens efficiently, because image features are more redundant than text features. C-Abstractor, however, is one method capable of both condensing image features and mitigating information loss. Introducing another Resampler on top of it doesn't sound reasonable.

@NormXU Hi, I have managed to get promising results with a Resampler, comparable to MLP with an extremely limited number of tokens. So far so good. But the real problem with Resampler is that it can hardly do very detailed recognition, even though it is very good at reasoning. That's why I'm asking whether C-Abstractor + Resampler could be a good choice or not.

As for the eventual direction of MLLM evolution, I think MLP might not be the final path, since it is very hard to extend to video.

First of all, I agree with @NormXU. I think that Resampler alone and C-Abs + Resampler would produce similar outcomes. As you mentioned, Resampler (trained with common web-crawled image-text pairs) tends to focus on a salient object, resulting in a lack of understanding of spatial relationships or detailed objects. This issue will remain even with the combination of C-Abs and Resampler. I would first recommend using C-Abs or D-Abs instead of Resampler for the same number of visual tokens. However, if you are considering an "extremely" small number of tokens (e.g., 1 or 2 tokens per image) and still want detailed understanding, you might need a new architecture. For example, we could consider a query-aware projector (or visual encoder) that packs the required details into a few visual tokens.

Is C-Abstractor able to scale? Resampler has the nice property that if you make it deeper, its performance keeps growing as long as you feed it more data.

Yes, and partially no. In our experiments, scaling up to 50M pre-training samples (= 200k steps) does not change the relative performance of C-Abs and Resampler. However, if you have datasets that are not only large-scale but also high-quality, this could change (this is the case I mentioned above). Resampler needs to learn locality from data (because it has less inductive bias), so the dataset should be high-quality and include spatial relationships and detailed descriptions. I think that if you rely on datasets of web-crawled quality (e.g., LAION, COYO), Resampler would not scale well even with deeper layers and larger datasets.

Resampler needs to learn locality from data (because it has less inductive bias), so the dataset should be high-quality and include spatial relationships and detailed descriptions.

Do you have any insight into why this happens? Also, I actually found that Resamplers are not very good at learning OCR abilities.

It is because its architecture has less inductive bias. Convolution is designed with a locality inductive bias, but Resampler is not.
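As a toy illustration of that difference (a sketch under simplifying assumptions, not the Honeybee implementation): features are single floats on a 4x4 grid, local average pooling stands in for the convolution/pooling path, and one query with uniform scores stands in for an untrained Resampler query. `adaptive_avg_pool` and `single_query_attention` are hypothetical helper names.

```python
import math

def ceil_div(a, b):
    return -(-a // b)

def adaptive_avg_pool(grid, out_side):
    """Average each local window of a square grid (list of lists of floats)."""
    in_side = len(grid)
    pooled = []
    for r in range(out_side):
        r0, r1 = (r * in_side) // out_side, ceil_div((r + 1) * in_side, out_side)
        row = []
        for c in range(out_side):
            c0, c1 = (c * in_side) // out_side, ceil_div((c + 1) * in_side, out_side)
            window = [grid[i][j] for i in range(r0, r1) for j in range(c0, c1)]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled

def single_query_attention(tokens, scores):
    """One query: softmax over ALL tokens, then a weighted sum of them."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w / total * t for w, t in zip(weights, tokens))

grid = [[0.0] * 4 for _ in range(4)]
before = adaptive_avg_pool(grid, 2)
grid[3][3] = 16.0  # perturb a single token in the bottom-right corner
after = adaptive_avg_pool(grid, 2)

# Pooling: only the output cell covering that corner changes.
print(before[0][0], after[0][0], after[1][1])

# Attention: the query mixes every position, so the far corner leaks in.
flat = [v for row in grid for v in row]
print(single_query_attention(flat, [0.0] * len(flat)))
```

With pooling, each output token is tied to a fixed neighborhood by construction. With attention, which positions a query focuses on depends entirely on learned scores, so any locality has to come from the training data.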