VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation
We regret to inform you that the VL-GPT project has been terminated. Unfortunately, the authors Jinguo and Xiaohan left the company and were unable to refactor the codebase before their departure. As a result, the source code and weights for this work cannot be released.
However, the main contribution of this work, an image tokenizer with continuous embeddings and its application in a large multimodal model, has also been adopted in another project from our team, SEED-X, which is already open source. We recommend referring to the SEED-X project for insights and implementation details.
We sincerely apologize for not being able to release this work as an open-source project. Thank you for your understanding.
Sijie Zhao<sup>2</sup>, Hengshuang Zhao<sup>3</sup>, Xiaohua Wang<sup>1</sup>, Ying Shan<sup>2</sup>
* Equal Contribution
- VL-GPT is a generative pre-trained transformer model for vision and language understanding and generation that can perceive and generate visual and linguistic data concurrently. By employing a straightforward auto-regressive objective, VL-GPT achieves unified pre-training for both image and text modalities.
- We also propose an image tokenizer-detokenizer framework for converting between raw images and continuous visual embeddings, analogous to the role of BPE tokenization in language models (a conceptual sketch of both components follows below).
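Since the original code and weights cannot be released, the sketch below is only a minimal illustration of the idea under stated assumptions: every class, name, and hyper-parameter (`VisualTokenizer`, `VisualDetokenizer`, `VLGPTSketch`, the patch size, embedding width, and losses) is hypothetical and chosen for brevity, not taken from the actual VL-GPT implementation. It shows a tokenizer mapping raw images to continuous embeddings, a single auto-regressive transformer over a sequence of text tokens and visual embeddings, and a detokenizer mapping embeddings back to pixels. For a real, open-source implementation of the continuous image tokenizer idea, please see SEED-X.

```python
# Conceptual sketch only -- not the released VL-GPT code. All names and
# hyper-parameters are hypothetical and kept small so the script runs quickly.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualTokenizer(nn.Module):
    """Encodes a raw image into a short sequence of continuous embeddings."""

    def __init__(self, embed_dim=256, patch=16):
        super().__init__()
        # Stand-in encoder: one patchify convolution instead of a full ViT/CLIP backbone.
        self.patchify = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, image):                                   # (B, 3, H, W)
        return self.patchify(image).flatten(2).transpose(1, 2)  # (B, N, D)


class VisualDetokenizer(nn.Module):
    """Maps continuous visual embeddings back to pixels.

    Real systems typically condition a diffusion decoder on these embeddings;
    here a linear patch decoder keeps the sketch self-contained.
    """

    def __init__(self, embed_dim=256, patch=16):
        super().__init__()
        self.proj = nn.Linear(embed_dim, 3 * patch * patch)      # one embedding -> one patch
        self.patch = patch

    def forward(self, visual_embeds):                            # (B, N, D), N a perfect square
        b, n, _ = visual_embeds.shape
        g, p = int(n ** 0.5), self.patch
        patches = self.proj(visual_embeds).view(b, g, g, 3, p, p)
        return patches.permute(0, 3, 1, 4, 2, 5).reshape(b, 3, g * p, g * p)


class VLGPTSketch(nn.Module):
    """One auto-regressive transformer over text tokens and continuous visual embeddings."""

    def __init__(self, vocab_size=32000, embed_dim=256, depth=2, heads=8):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, depth)
        self.text_head = nn.Linear(embed_dim, vocab_size)        # predicts the next text token
        self.visual_head = nn.Linear(embed_dim, embed_dim)       # regresses the next visual embedding

    def forward(self, text_ids, visual_embeds):
        # Interleaving is simplified to [text ; image] for brevity.
        seq = torch.cat([self.text_embed(text_ids), visual_embeds], dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(seq.size(1)).to(seq.device)
        hidden = self.transformer(seq, mask=causal)
        return self.text_head(hidden), self.visual_head(hidden)


if __name__ == "__main__":
    tokenizer, detokenizer, model = VisualTokenizer(), VisualDetokenizer(), VLGPTSketch()
    image = torch.randn(2, 3, 64, 64)                            # tiny images -> 16 visual tokens
    text_ids = torch.randint(0, 32000, (2, 16))

    visual = tokenizer(image)                                    # continuous visual embeddings
    text_logits, visual_pred = model(text_ids, visual)
    recon = detokenizer(visual)                                  # back to pixel space

    # Unified auto-regressive objective (shift-by-one omitted for brevity):
    # cross-entropy on text positions plus regression on visual positions.
    n_text = text_ids.size(1)
    text_loss = F.cross_entropy(text_logits[:, :n_text].reshape(-1, 32000), text_ids.reshape(-1))
    visual_loss = F.mse_loss(visual_pred[:, n_text:], visual)
    print(recon.shape, text_loss.item(), visual_loss.item())
```

The key design point the sketch tries to convey is that the visual embeddings are continuous rather than discrete codebook indices, so the transformer is supervised with a regression loss on image positions and an ordinary cross-entropy loss on text positions within one auto-regressive sequence.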
This project is released under the Apache 2.0 license. Please see the LICENSE file for more information.