SkyworkAI / Vitron

A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing

Home Page: https://vitron-llm.github.io/

VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing

Hao Fei$^{1,2}$, Shengqiong Wu$^{1,2}$, Hanwang Zhang$^{1,3}$, Tat-Seng Chua$^{2}$, Shuicheng Yan$^{1}$

▶ $^{1}$ Skywork AI, Singapore ▶ $^{2}$ National University of Singapore ▶ $^{3}$ Nanyang Technological University

📰 News

  • [2024.04.04] 👀👀👀 Vitron is now available! Watch 👀 this repository for the latest updates.

😮 Highlights

Existing vision LLMs still face challenges such as superficial instance-level understanding, a lack of unified support for both images and videos, and insufficient coverage of vision tasks. To fill these gaps, we present Vitron, a universal pixel-level vision LLM designed for comprehensive understanding (perceiving and reasoning), generating, segmenting (grounding and tracking), and editing (inpainting) of both static images and dynamic video content.

🛠️ Requirements and Installation

  • Python >= 3.8
  • PyTorch == 2.1.0
  • CUDA Version >= 11.8
  • Install required packages:
    git clone https://github.com/SkyworkAI/Vitron
    cd Vitron
    conda create -n vitron python=3.10 -y
    conda activate vitron
    pip install --upgrade pip
    pip install -e .
    pip install -e ".[train]"
    pip install flash-attn --no-build-isolation
    pip install decord opencv-python git+https://github.com/facebookresearch/pytorchvideo.git@28fe037d212663c6a24f373b94cc5d478c8c1a1d
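
The version constraints listed above can be checked from inside the new environment before going further. The following is a minimal sketch and not part of the Vitron repository: `parse_version` and `check_requirements` are hypothetical helper names, and the CUDA check assumes that the relevant version is the one PyTorch was built against (`torch.version.cuda`), which may differ from the system toolkit.

```python
import re
import sys


def parse_version(text):
    """Turn a string such as '2.1.0+cu118' into a tuple like (2, 1, 0)."""
    numbers = []
    # Drop any local build suffix after '+', then read leading digits per field.
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def check_requirements(python_version, torch_version, cuda_version):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if parse_version(python_version) < (3, 8):
        problems.append(f"Python {python_version} is older than 3.8")
    if parse_version(torch_version)[:3] != (2, 1, 0):
        problems.append(f"PyTorch {torch_version} is not 2.1.0")
    if cuda_version is None:
        problems.append("PyTorch was built without CUDA support")
    elif parse_version(cuda_version) < (11, 8):
        problems.append(f"CUDA {cuda_version} is older than 11.8")
    return problems


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        current_python = ".".join(str(n) for n in sys.version_info[:3])
        found = check_requirements(
            current_python, torch.__version__, torch.version.cuda
        )
        for line in found or ["All listed requirements are satisfied"]:
            print(line)
```

Run it with the vitron environment activated; an empty problem list means the three listed constraints hold.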

🔥🔥🔥 Installation or Running Fails? 🔥🔥🔥
  1. If ffmpeg fails with Unknown encoder 'x264':

    • Try reinstalling ffmpeg:
    conda uninstall ffmpeg
    conda install -c conda-forge ffmpeg   # `-c conda-forge` cannot be omitted
    
  2. If detectron2 fails to install, try this command:

    python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
    

    or refer to this website.

  3. Error in gradio. Because gradio>=4.0.0 introduced major changes, make sure to install the gradio version pinned in requirements.txt.

  4. Error with deepspeed. If you fine-tune your model, the following error may occur:

    FAILED: cpu_adam.so
    /usr/bin/ld: cannot find -lcurand
    

    This error is caused by incorrect symbolic links created when installing deepspeed. Try the following commands to fix it:

    cd ~/miniconda3/envs/vitron/lib
    ls -al libcurand*  # check the links
    rm libcurand.so   # remove the wrong links
    ln -s libcurand.so.10.3.5.119 libcurand.so  # build new links
    

    Then verify the fix:

    python
    from deepspeed.ops.op_builder import CPUAdamBuilder
    ds_opt_adam = CPUAdamBuilder().load()  # if this loads without error, deepspeed is installed correctly
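
The versioned file name in the `ln -s` command above (libcurand.so.10.3.5.119) is likely to differ between CUDA builds. As a hedged sketch, the snippet below only inspects the library directory and reports whether libcurand.so resolves and which versioned files could serve as link targets; it changes nothing on disk. The helper name `inspect_library_links` is hypothetical, and the fallback path assumes the miniconda layout used in the commands above.

```python
import os
from pathlib import Path


def inspect_library_links(lib_dir, stem="libcurand.so"):
    """Return (link_ok, candidates) for the library `stem` in `lib_dir`.

    link_ok    -- True if `stem` exists and resolves to a real file.
    candidates -- versioned regular files (e.g. libcurand.so.10.3.5.119)
                  that a new symbolic link could point to.
    """
    lib_dir = Path(lib_dir)
    # Path.exists() follows symbolic links, so a dangling link gives False.
    link_ok = (lib_dir / stem).exists()
    candidates = sorted(
        path.name
        for path in lib_dir.glob(stem + ".*")
        if path.is_file() and not path.is_symlink()
    )
    return link_ok, candidates


if __name__ == "__main__":
    # CONDA_PREFIX is set by `conda activate`; the fallback mirrors the
    # path used in the commands above and is an assumption.
    prefix = os.environ.get("CONDA_PREFIX", "~/miniconda3/envs/vitron")
    lib_dir = Path(prefix).expanduser() / "lib"
    if not lib_dir.is_dir():
        print(f"{lib_dir} does not exist")
    else:
        ok, candidates = inspect_library_links(lib_dir)
        print("libcurand.so resolves:", ok)
        for name in candidates:
            print("possible link target:", name)
```

If the link does not resolve and a candidate is listed, use that candidate's name in place of libcurand.so.10.3.5.119 when recreating the link.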
    

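For the ffmpeg issue in item 1, it can help to confirm whether the installed build actually ships an x264 encoder before and after reinstalling. The sketch below parses the table printed by `ffmpeg -encoders`; `has_encoder` is a hypothetical helper, and the check assumes the encoder is listed under the name libx264, which is how ffmpeg builds with x264 support normally expose it.

```python
import shutil
import subprocess


def has_encoder(encoders_output, name):
    """Return True if `name` is listed as an encoder in `ffmpeg -encoders` text.

    Each encoder row looks like ' V....D libx264   description', so the
    encoder name is the second whitespace-separated field.
    """
    for line in encoders_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == name:
            return True
    return False


if __name__ == "__main__":
    if shutil.which("ffmpeg") is None:
        print("ffmpeg is not on PATH")
    else:
        result = subprocess.run(
            ["ffmpeg", "-hide_banner", "-encoders"],
            capture_output=True,
            text=True,
            check=False,
        )
        if has_encoder(result.stdout, "libx264"):
            print("libx264 encoder is available")
        else:
            print("libx264 encoder is missing; reinstall ffmpeg from conda-forge")
```
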
👍 Deploying Gradio Demo

  • First, prepare the checkpoint; then run the demo locally via:
    python app.py

🙌 Related Projects

You may refer to the related work that serves as the foundation for our framework and code repository: Vicuna, SEEM, i2vgenxl, StableVideo, and Zeroscope. We also partially draw inspiration from Video-LLaVA and LanguageBind. Thanks for their wonderful work.

🔒 License

  • The majority of this project is released under the Apache 2.0 license as found in the LICENSE file.
  • The service is a research preview intended for non-commercial use only, subject to the model License of LLaMA, Terms of Use of the data generated by OpenAI, and Privacy Practices of ShareGPT. Please contact us if you find any potential violation.

✏️ Citation

If you find our paper and code useful in your research, please consider giving us a star ⭐ and a citation 📝.

@article{hao2024vitron,
  title={Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing},
  author={Hao Fei and Shengqiong Wu and Hanwang Zhang and Tat-Seng Chua and Shuicheng Yan},
  journal={CoRR},
  year={2024}
}

