AILab-CVC / VL-GPT

VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation


VL-GPT

Project Termination

We regret to inform you that this project (VL-GPT) has been terminated. Unfortunately, the authors Jinguo and Xiaohan left the company and were unable to refactor the codebase before their departure. As a result, the source code and weights for this work cannot be released.

However, the main contribution of this work, an image tokenizer that produces continuous embeddings and its application in a large multimodal model, has also been adopted in another project from our team, SEED-X, which is already open source. We recommend referring to the SEED-X project for insights and implementation details.

We sincerely apologize for not being able to release this work as an open-source project. Thank you for your understanding.

Introduction

1 Xi'an Jiaotong University 2 Tencent AI Lab 3 The University of Hong Kong 
* Equal Contribution 

License: Apache 2.0

  • VL-GPT is a generative pre-trained transformer model for vision and language understanding and generation tasks, which can perceive and generate visual and linguistic data concurrently. By employing a straightforward auto-regressive objective, VL-GPT achieves unified pre-training over both image and text modalities.

  • We also propose an image tokenizer-detokenizer framework that converts between raw images and continuous visual embeddings, analogous to the role of BPE tokenization in language models.
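To make the two ideas above concrete, here is a minimal toy sketch of how an image tokenizer can map raw pixels to a short sequence of continuous embeddings, which is then interleaved with text embeddings so a single auto-regressive transformer can model both modalities. This is a hypothetical illustration, not the VL-GPT implementation: the function names, dimensions, and random projections are invented stand-ins for the learned tokenizer and embedding tables.

```python
# Hypothetical sketch of unified auto-regressive input construction:
# continuous image embeddings interleaved with text token embeddings.
import numpy as np

EMBED_DIM = 8  # toy embedding dimensionality


def tokenize_image(image: np.ndarray, num_tokens: int = 4) -> np.ndarray:
    """Toy stand-in for the image tokenizer: maps a raw image to a short
    sequence of continuous embeddings (analogous to BPE for text).
    A fixed random projection replaces the learned tokenizer network."""
    flat = image.reshape(num_tokens, -1)          # split image into chunks
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((flat.shape[1], EMBED_DIM))
    return flat @ proj                            # (num_tokens, EMBED_DIM)


def embed_text(token_ids, vocab_size: int = 100) -> np.ndarray:
    """Toy text embedding lookup (random table in place of a learned one)."""
    rng = np.random.default_rng(1)
    table = rng.standard_normal((vocab_size, EMBED_DIM))
    return table[np.asarray(token_ids)]           # (len(token_ids), EMBED_DIM)


def build_sequence(text_ids, image: np.ndarray) -> np.ndarray:
    """Concatenate text and visual embeddings into one sequence, so a single
    auto-regressive transformer can be trained on both modalities at once."""
    return np.concatenate([embed_text(text_ids), tokenize_image(image)], axis=0)


seq = build_sequence([5, 17, 42], np.ones((4, 16)))
print(seq.shape)  # → (7, 8): 3 text positions + 4 continuous image positions
```

A detokenizer would perform the inverse mapping, decoding predicted continuous embeddings back into pixels; here it is omitted for brevity.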

License

This project is released under the Apache 2.0 license. Please see the LICENSE file for more information.
