invictus717 / MetaTransformer

Meta-Transformer for Unified Multimodal Learning

Home Page: https://arxiv.org/abs/2307.10802

Code for Tokenization?

s4lome opened this issue

Thank you for sharing this most exciting work!

I would like to know: has the code for tokenizing the different modalities not been released yet, or am I simply missing where the tokenization happens in the code?

I would like to use Meta-Transformer on a custom dataset with image and text inputs.

As far as I understand it, the workflow would be:

token_text, token_image = tokenize(text), tokenize(image)

embedding_text = pretrained_encoder(token_text)  # as described in demo
embedding_image = pretrained_encoder(token_image)  # as described in demo

downstream_task(embedding_text, embedding_image) 

Is this correct at a very high level?

Thanks in advance!

Thank you for your interest in Meta-Transformer. The tokenization code will be released in 1-2 days; I've been working on it for about 10 days and hope it will be easy to use. For your custom dataset, your pseudocode is accurate.
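
For reference, here is a minimal, self-contained PyTorch sketch of that workflow. The tokenizers and the encoder below are toy stand-ins (a patch-embedding conv, a word embedding, and a generic 12-layer, 768-dim Transformer), not the released Meta-Transformer modules; only the overall flow follows the pseudocode above: tokenize each modality, run the shared pretrained encoder, then feed both embeddings to a downstream head.

import torch
import torch.nn as nn

DIM = 768  # embedding dimension assumed for the base encoder

# Toy stand-in tokenizers (the official ones are not released yet):
# each maps a raw input to a token sequence of shape (B, seq_len, DIM).
class ToyImageTokenizer(nn.Module):
    def __init__(self, patch=16, dim=DIM):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
    def forward(self, x):                                 # x: (B, 3, H, W)
        return self.proj(x).flatten(2).transpose(1, 2)    # (B, N, DIM)

class ToyTextTokenizer(nn.Module):
    def __init__(self, vocab=30522, dim=DIM):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
    def forward(self, ids):                               # ids: (B, L)
        return self.embed(ids)                            # (B, L, DIM)

# Stand-in for the frozen pretrained encoder from the demo.
pretrained_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=DIM, nhead=12, batch_first=True),
    num_layers=12,
)

image_tokenizer, text_tokenizer = ToyImageTokenizer(), ToyTextTokenizer()

image = torch.randn(2, 3, 224, 224)          # dummy image batch
text = torch.randint(0, 30522, (2, 32))      # dummy token-id batch

token_image, token_text = image_tokenizer(image), text_tokenizer(text)
embedding_image = pretrained_encoder(token_image)   # as described in demo
embedding_text = pretrained_encoder(token_text)     # as described in demo

# Downstream task: pool each sequence and classify on the fused features.
fused = torch.cat([embedding_image.mean(1), embedding_text.mean(1)], dim=-1)
logits = nn.Linear(2 * DIM, 10)(fused)       # e.g. a hypothetical 10-class head

Once the official tokenizers are out, only the two tokenizer classes and the encoder construction should need swapping; the downstream part stays the same.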

If you have additional questions, please feel free to let me know; I'm happy to help~