ONE-PEACE

A general representation model across vision, audio, and language modalities.




Paper   |   Demo   |   Checkpoints   |   Datasets


ONE-PEACE is a general representation model across vision, audio, and language modalities. Without using any vision or language pretrained model for initialization, ONE-PEACE achieves leading results on vision, audio, audio-language, and vision-language tasks. Furthermore, ONE-PEACE possesses a strong emergent zero-shot retrieval capability, enabling it to align modalities that are not paired in the training data.

The architecture and pretraining tasks of ONE-PEACE are shown below. With its scaling-friendly architecture and modality-agnostic tasks, ONE-PEACE has the potential to expand to unlimited modalities.


News

  • 2023.5.25: Released the easy-to-use API, which enables quick extraction of image, audio, and text representations.
  • 2023.5.23: Released the pretrained checkpoint, as well as finetuning & inference scripts for vision-language tasks.
  • 2023.5.19: Released the paper and code. Pretrained & finetuned checkpoints, training & inference scripts, as well as demos will be released as soon as possible.

Models and Results

Model Card

We list the parameters and pretrained checkpoint of ONE-PEACE below.

| Model | Ckpt | Params | Hidden size | Intermediate size | Attention heads | Layers |
|-------|------|--------|-------------|-------------------|-----------------|--------|
| ONE-PEACE | Download | 4B | 1536 | 6144 | 24 | 40 |
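
As a rough sanity check on the 4B parameter figure, the sketch below estimates the size of the Transformer core from the table above. It assumes a simplified reading of the paper's design (each block has one shared self-attention module plus three modality-specific FFNs) and ignores biases, layer norms, the modality adapters, and embedding tables, so it accounts for only part of the total.

# Back-of-the-envelope estimate of the ONE-PEACE Transformer core.
# Assumptions (not an exact reproduction of the released model):
#   - each block: one shared self-attention + three modality-specific FFNs
#   - biases, layer norms, modality adapters and embeddings are ignored
hidden_size = 1536
intermediate_size = 6144
num_layers = 40

attention_params = 4 * hidden_size ** 2           # Q, K, V and output projections
ffn_params = 2 * hidden_size * intermediate_size  # up- and down-projections of one FFN
per_layer = attention_params + 3 * ffn_params     # three modality FFNs per block

total = num_layers * per_layer
print(f"Transformer core only: ~{total / 1e9:.2f}B parameters")  # ~2.64B
# The remaining parameters come from the components ignored above.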

Results

Vision Tasks

| Task | Image classification | Semantic Segmentation | Object Detection (w/o Object365) | Video Action Recognition |
|------|----------------------|-----------------------|----------------------------------|--------------------------|
| Dataset | ImageNet-1K | ADE20K | COCO | Kinetics 400 |
| Split | val | test | test | test |
| Metric | Acc. | mIoU^ss / mIoU^ms | AP^box / AP^mask | Top-1 Acc. / Top-5 Acc. |
| ONE-PEACE | 89.8 | 62.0 / 63.0 | 60.4 / 52.9 | 88.1 / 97.8 |

Audio(-language) Tasks

| Dataset | AudioCaps | Clotho | ESC-50 | FSD50K | VGGSound (Audio Only) | AVQA (Audio + Question) |
|---------|-----------|--------|--------|--------|-----------------------|-------------------------|
| Task | Audio-Text Retrieval | Audio-Text Retrieval | Audio Classification | Audio Classification | Audio Classification | Audio Question Answering |
| Split | test | evaluation | full | eval | test | val |
| Metric | T2A R@1 / A2T R@1 | T2A R@1 / A2T R@1 | Zero-shot Acc. | mAP | Acc. | Acc. |
| ONE-PEACE | 42.5 / 51.0 | 22.4 / 27.1 | 91.8 | 69.7 | 59.6 | 86.2 |

Vision-Language Tasks

| Dataset | COCO | Flickr30K | RefCOCO | RefCOCO+ | RefCOCOg | VQAv2 | NLVR2 |
|---------|------|-----------|---------|----------|----------|-------|-------|
| Task | Image-Text Retrieval (w/o ranking) | Image-Text Retrieval (w/o ranking) | Visual Grounding | Visual Grounding | Visual Grounding | VQA | Visual Reasoning |
| Split | test | test | val / testA / testB | val / testA / testB | val-u / test-u | test-dev / test-std | dev / test-P |
| Metric | I2T R@1 / T2I R@1 | I2T R@1 / T2I R@1 | Acc@0.5 | Acc@0.5 | Acc@0.5 | Acc. | Acc. |
| ONE-PEACE | 84.1 / 65.4 | 97.6 / 89.6 | 92.58 / 94.18 / 89.26 | 88.77 / 92.21 / 83.23 | 89.22 / 89.27 | 82.6 / 82.5 | 87.8 / 88.3 |


Requirements and Installation

  • Python >= 3.7
  • PyTorch >= 1.10.0 (1.13.1 recommended)
  • CUDA >= 10.2 (11.6 recommended)
  • Install required packages:
git clone https://github.com/OFA-Sys/ONE-PEACE
cd ONE-PEACE && pip install -r requirements.txt
  • For faster training install Apex library (recommended but not necessary):
git clone https://github.com/NVIDIA/apex
cd apex && pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--distributed_adam" --global-option="--deprecated_fused_adam"
  • Install Xformers library to use Memory-efficient attention (recommended but not necessary):
conda install xformers -c xformers
  • Install FlashAttention library to use faster LayerNorm (recommended but not necessary; a quick import check for the optional libraries is sketched after this list):
git clone --recursive https://github.com/HazyResearch/flash-attention
cd flash-attention && pip install .
cd csrc/layer_norm && pip install .
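
After the optional steps above, a minimal check like the one below (plain Python; the module names torch, apex, xformers, and flash_attn are the usual import names and are assumed here) shows which of the optional acceleration libraries are actually importable in the current environment:

# Sanity check for the optional acceleration libraries.
# The module names are assumptions based on the usual import names;
# adjust them if your installation differs.
import importlib.util

import torch

print("CUDA available:", torch.cuda.is_available())
for name in ("apex", "xformers", "flash_attn"):
    found = importlib.util.find_spec(name) is not None
    print(f"{name}: {'installed' if found else 'not installed'}")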

Datasets and Checkpoints

See datasets.md and checkpoints.md.

Usage

API

We provide a simple code snippet to show how to use the ONE-PEACE API. It computes embeddings for text, images, and audio, as well as the similarities between them:

import torch
from one_peace.models import from_pretrained

device = "cuda" if torch.cuda.is_available() else "cpu"
# "ONE-PEACE" can also be replaced with ckpt path
model = from_pretrained("ONE-PEACE", device=device, dtype="float32")

# process raw data
src_tokens = model.process_text(["cow", "dog", "elephant"])
src_images = model.process_image(["dog.JPEG", "elephant.JPEG"])
src_audios, audio_padding_masks = model.process_audio(["cow.flac", "dog.flac"])

with torch.no_grad():
    # extract normalized features
    text_features = model.extract_text_features(src_tokens)
    image_features = model.extract_image_features(src_images)
    audio_features = model.extract_audio_features(src_audios, audio_padding_masks)

    # compute similarity
    i2t_similarity = image_features @ text_features.T
    a2t_similarity = audio_features @ text_features.T

print("Image-to-text similarities:", i2t_similarity)
print("Audio-to-text similarities:", a2t_similarity)

Training & Inference

In addition to the API, we also provide instructions for training and inference in getting_started.



Gallery

Visual Grounding (unseen domain)

grounding

Emergent Zero-shot Retrieval

a2i

a+t2i

a+i2i

Related Codebase

Getting Involved

Feel free to submit GitHub issues or pull requests. Contributions to our project are welcome!

To contact us, never hesitate to send an email to zheluo.wp@alibaba-inc.com or saimeng.wsj@alibaba-inc.com!

Citation

If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝 :)

@article{ONEPEACE,
  title={ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities},
  author={Wang, Peng and Wang, Shijie and Lin, Junyang and Bai, Shuai and Zhou, Xiaohuan and Zhou, Jingren and Wang, Xinggang and Zhou, Chang},
  journal={arXiv preprint arXiv:2305.11172},
  year={2023}
}


License

Apache License 2.0

