invictus717 / MetaTransformer

Meta-Transformer for Unified Multimodal Learning

Home Page: https://arxiv.org/abs/2307.10802

Question about the "pretrain-finetune" pipeline

yangbang18 opened this issue · comments

Hi, thanks for your great contributions.

I am curious about your "pretrain-finetune" pipeline.

According to the paper and your code, it seems that the pipeline is:

  1. You first carry out pre-training on LAION-2B with a CLIP-style objective to obtain a modality-agnostic encoder.
  2. You then integrate a data-to-sequence tokenizer (whose implementation depends on the modality of the downstream task) with the pre-trained encoder and fine-tune the model.

Am I understanding this correctly? I've sketched my reading of the pipeline below.
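To make my reading concrete, here is a minimal sketch of the pipeline as I understand it (all module names are illustrative placeholders, not the actual classes in this repo):

```python
import torch.nn as nn

# Step 1 (as I understand it): a modality-agnostic Transformer encoder,
# whose weights come from CLIP-style pre-training on LAION-2B.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=12,
)

# Step 2: a modality-specific data-to-sequence tokenizer plus a task head,
# e.g. a 2D patch embedding for images (hypothetical names, for illustration).
image_tokenizer = nn.Conv2d(3, 768, kernel_size=16, stride=16)
task_head = nn.Linear(768, 1000)

def forward(images):
    tokens = image_tokenizer(images).flatten(2).transpose(1, 2)  # (B, N, 768)
    features = encoder(tokens)                                   # shared backbone
    return task_head(features.mean(dim=1))                       # task-specific head
```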

Here are my concerns:

  1. The core idea of Meta-Transformer is one shared backbone + different tokenizers + different heads. However, I don't see any joint training process on data across the 12 modalities. Instead, it seems that you carry out fine-tuning 12 times, where in some cases the so-called "shared" backbone needs to be trained to fit a specific modality to obtain superior performance.
  2. Following from the first concern, the demo you give in the README may produce inferior representations for modalities other than images, right? This is because the released pre-trained weights come from the step 1) pre-training above, not from joint training on the 12 modalities.

Thank you for your insightful questions. I would like to answer them as follows:

  1. "The so-called 'shared' backbone needs to be trained to fit a specific modality"?

No, it is not. The shared backbone is strictly frozen, while the tokenizers and downstream heads are adapted to the corresponding tasks. Meta-Transformer thus provides a parameter-efficient learning paradigm across multiple modalities.
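For readers who want to see what this looks like in practice, here is a minimal sketch of the parameter-efficient setup (illustrative only, not the repo's actual training script): the shared encoder receives no gradients, while the tokenizer and head do.

```python
import torch.nn as nn

def collect_trainable_params(encoder: nn.Module, tokenizer: nn.Module, head: nn.Module):
    # Freeze the shared Meta-Transformer backbone.
    for p in encoder.parameters():
        p.requires_grad = False
    encoder.eval()
    # Only the modality-specific tokenizer and the lightweight head are updated.
    return list(tokenizer.parameters()) + list(head.parameters())

# Usage (with the hypothetical modules from the sketch above):
# optimizer = torch.optim.AdamW(
#     collect_trainable_params(encoder, image_tokenizer, task_head), lr=1e-4)
```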

  1. " give inferior representations for modalities except for images"?

I would like to thank you for this question, which may also concern other readers. Indeed, we find that pre-training on LAION-2B is powerful enough to extract general-purpose representations across the 12 modalities. More specifically, even when 95-99% of the parameters are frozen on unseen modalities, this paradigm still works by fine-tuning only the tokenizers and lightweight heads. Therefore, I think the representations extracted by the Meta-Transformer encoder are generic enough for perception tasks on these modalities.
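As a rough sanity check of the 95-99% figure, the frozen fraction can be computed directly from the parameter counts (a small illustrative snippet, not part of the released code):

```python
def frozen_ratio(*modules):
    params = [p for m in modules for p in m.parameters()]
    total = sum(p.numel() for p in params)
    frozen = sum(p.numel() for p in params if not p.requires_grad)
    return frozen / total

# With a frozen ViT-style backbone (tens to hundreds of millions of parameters)
# and only a small tokenizer plus head trainable, this lands around 0.95-0.99.
```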

If you have any further questions, please feel free to reach out. It's a pleasure to discuss such insightful questions with you.

Thanks for your response.

I have one more question: the released pre-trained weights do not include the tokenizer weights (e.g., the 2D convolution weights for processing images or the 3D convolution weights for processing videos). Am I overlooking something?
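A quick way to check what the released file actually contains is to list the top-level parameter names (assuming the checkpoint is a plain PyTorch state_dict; the filename below is a placeholder):

```python
import torch

state = torch.load("path/to/released_checkpoint.pth", map_location="cpu")
# Some releases wrap the weights, e.g. under a 'state_dict' or 'model' key.
if isinstance(state, dict) and "state_dict" in state:
    state = state["state_dict"]
prefixes = sorted({k.split(".")[0] for k in state.keys()})
print(prefixes)  # only encoder blocks, or also patch-embedding/tokenizer weights?
```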

BTW, the demo code given in the README is incomplete. It would be better to include an example video, image, and audio clip in the repo.

I like the above comment about including an example video, image, and audio clip. Text would also be helpful, because I get the error OSError: Can't load tokenizer for 'openai/clip-vit-large-patch14' when trying to execute text_tokenizer = d2s(modality='text', dim=768). Also, "tokenizer" is misspelled as "tokenier" in the README.
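In case it helps others hitting the same error: the message suggests the text tokenizer relies on Hugging Face transformers fetching the CLIP tokenizer files, so a quick way to check whether those files are reachable (this is my assumption from the error message, not from the repo's code) is:

```python
from transformers import CLIPTokenizer

# If this fails with the same OSError, the tokenizer files cannot be downloaded
# or found in the local cache; pre-download them on a machine with internet
# access and reuse the cache (e.g. via the HF_HOME environment variable).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
print(tokenizer("a test sentence", return_tensors="pt").input_ids.shape)
```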

Thanks for these constructive suggestions!

As we introduced in the paper, tokenizers can be directly initialized with modality-specific pre-training. For example, you can plug MAE or VideoMAE pre-trained tokenizers into Meta-Transformer, which can deliver even better performance than training them from scratch. Meanwhile, Meta-Transformer can also deliver better performance on unseen modalities.
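As an illustration (a rough sketch with placeholder names, not the exact API in this repo), initializing an image tokenizer from an MAE-pretrained checkpoint could look like this, assuming the checkpoint follows the usual timm/MAE naming with a 'patch_embed.proj' key:

```python
import torch
import torch.nn as nn

class ImageTokenizer(nn.Module):
    """Hypothetical data-to-sequence tokenizer: 2D patches -> token embeddings."""
    def __init__(self, dim=768, patch_size=16):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        return self.proj(x).flatten(2).transpose(1, 2)  # (B, N, dim)

tokenizer = ImageTokenizer()

# Assumption: the MAE checkpoint stores its patch embedding under
# 'patch_embed.proj.*'; adjust the keys for other pre-trained tokenizers.
ckpt = torch.load("mae_pretrain_vit_base.pth", map_location="cpu")
state = ckpt.get("model", ckpt)  # MAE releases usually wrap weights under 'model'
tokenizer.proj.weight.data.copy_(state["patch_embed.proj.weight"])
tokenizer.proj.bias.data.copy_(state["patch_embed.proj.bias"])
```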