UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All

The source code for "UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All" (CVPR 2024). Project page: https://vlislab22.github.io/UniBind/


Requirements

  • Clone the repository:
    git clone https://github.com/qc-ly/UniBind
    
    cd UniBind
    
  • Create an environment:
    conda create -n unibind python=3.9
    
    conda activate unibind
    
  • Install the required packages:
    conda install pytorch==1.13.0 torchvision==0.14.0 torchaudio==0.13.0 pytorch-cuda=11.6 -c pytorch -c nvidia
    
    conda install cartopy
    
    pip install -r requirements.txt
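
To confirm the environment is set up correctly, a quick sanity check (illustrative, not part of the repository):

    import torch

    # With the pinned packages above, this should report 1.13.0 and,
    # on a CUDA 11.6 machine, an available GPU.
    print(torch.__version__)
    print(torch.cuda.is_available())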
    

Quick Start

  1. Download the pre-trained weights from [link] to ./ckpts/

  2. Download the centre embeddings from [link] to ./centre_embs/

  3. Inference for six modalities: we provide sample data in ./assets/ so you can quickly see how UniBind works. (A sketch of the classification step the demos share follows this list.)

    For image modality:

    CUDA_VISIBLE_DEVICES=0 python demo_for_image.py

    Output:

    The categories are: ['folding chair', 'Shetland sheepdog, Shetland sheep dog, Shetland']
    

    For audio modality:

    CUDA_VISIBLE_DEVICES=0 python demo_for_audio.py

    Output:

    The categories are: ['airplane', 'car_horn']
    

    For video modality:

    CUDA_VISIBLE_DEVICES=0 python demo_for_video.py

    Output:

    The categories are: ['vehicles/autos', 'education']
    

    For point cloud modality:

    CUDA_VISIBLE_DEVICES=0 python demo_for_point.py

    Output:

    The categories are: ['airplane', 'car']
    

    For thermal modality:

    CUDA_VISIBLE_DEVICES=0 python demo_for_thermal.py

    Output:

    The categories are: ['person', 'background']
    

    For event modality:

    CUDA_VISIBLE_DEVICES=0 python demo_for_event.py

    Output:

    The categories are: ['gerenuk', 'sea_horse']
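
All six demos share the same classification step: embed the input with its modality encoder and pick the category whose centre embedding (downloaded to ./centre_embs/) is most similar. The repository's actual API may differ; the following is a minimal sketch of that nearest-centre lookup in plain PyTorch, with all names and dimensions hypothetical:

    import torch
    import torch.nn.functional as F

    # Hypothetical label set and pre-computed, L2-normalised class centre embeddings.
    categories = ['folding chair', 'airplane', 'person']
    centre_embs = F.normalize(torch.randn(len(categories), 1024), dim=-1)

    # Hypothetical sample embedding produced by a modality encoder.
    sample_emb = F.normalize(torch.randn(1024), dim=-1)

    # Cosine similarity against every centre; the most similar centre wins.
    scores = centre_embs @ sample_emb
    print("The category is:", categories[scores.argmax().item()])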
    

Zero-shot

  1. Download the pre-trained weights from [link] to ./ckpts/

  2. Download the centre embeddings from [link] to ./centre_embs/

  3. Arrange the datasets in the following structure:

    |----|datasets/
    |----|---|ImageNet-1k/
    |----|---|---|train_dataset/
    |----|---|---|---0.jpg
    |----|---|---|---1.jpg
    |----|---|---|---...
    |----|---|---|eval_dataset/
    |----|---|---|test_dataset/
    |----|---|---|train_data.json
    |----|---|---|eval_data.json
    |----|---|---|test_data.json
    |----|---|...
    |----|---|ESC-50/
    |----|---|---|...
    
  4. The data format of test_data.json is as follows (a minimal script for producing this file appears after these steps):

    [
     {
       "data": data_file_name,
       "label": label,
     }
     ...
    ]
  5. Run the zero-shot evaluation:

    cd scripts
    bash zero_shot.sh
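
For reference, a minimal script that writes a test_data.json in this format (the file names, labels, and output path below are placeholders):

    import json

    # Placeholder entries; replace with your dataset's file names and labels.
    test_data = [
        {"data": "0.jpg", "label": "folding chair"},
        {"data": "1.jpg", "label": "airplane"},
    ]

    with open("datasets/ImageNet-1k/test_data.json", "w") as f:
        json.dump(test_data, f, indent=2)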

Fine-tune

  1. Download the pre-trained weights from [link] to ./ckpts/

  2. Download the centre embeddings from [link] to ./centre_embs/

  3. Generate multi-modal data descriptions for your dataset via LLaMA-Adapter and GPT-4. Demo code for generating descriptions via LLaMA-Adapter (set LLaMA_weight_path, video_meta_data_path, and video_path to your local paths):

    import json
    import pickle

    import torch
    from tqdm import tqdm

    import ImageBind.data as ib_data  # renamed so the JSON loaded below does not shadow the module
    import llama

    llama_dir = "llama/llama_model_weights"
    # LLaMA_weight_path, video_meta_data_path, and video_path must point to your local files.
    model = llama.load(LLaMA_weight_path, llama_dir, knn=True)
    model.eval()
    model.cuda()

    # Collect the video file names listed in the meta-data file.
    video_data_list = []
    with open(video_meta_data_path, 'r') as f:
        meta_data = json.load(f)
    for line in meta_data:
        video_data_list.append(line['video'])

    # Generate one short description per video.
    descriptions = []
    for video in tqdm(video_data_list):
        video_data = ib_data.load_and_transform_video_data([video_path + video], device='cuda')
        inputs = {'Video': [video_data, 1]}
        with torch.no_grad():
            results = model.generate(
                inputs,
                [llama.format_prompt("Describe the video.")],
                max_gen_len=77,
            )
        descriptions.append(results[0].strip())

    # Persist the descriptions for use in train_data.json (step 5).
    with open('descriptions.pkl', 'wb') as f:
        pickle.dump(descriptions, f)
  4. Arrange the datasets in the following structure:

    |----|datasets/
    |----|---|ImageNet-1k/
    |----|---|---|train_dataset/
    |----|---|---|---0.jpg
    |----|---|---|---1.jpg
    |----|---|---|---...
    |----|---|---|eval_dataset/
    |----|---|---|test_dataset/
    |----|---|---|train_data.json
    |----|---|---|eval_data.json
    |----|---|---|test_data.json
    |----|---|...
    |----|---|ESC-50/
    |----|---|---|...
    
  5. The data format of train_data.json is as follows (a minimal script for assembling this file from the step-3 descriptions appears after these steps):

    [
      {
        "data": data_file_name,
        "description": descriptive_text,
        "label": label,
      }
      ...
    ]
  6. The data format of eval_data.json and test_data.json is as follows:

    [
      {
        "data": data_file_name,
        "label": label,
      }
      ...
    ]
  7. Run the fine-tuning setting:

    cd scripts
    bash fine_tune.sh
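
For reference, a minimal script that assembles train_data.json from the descriptions pickled in step 3 (the metadata path and its "data"/"label" keys are hypothetical; adapt them to your dataset):

    import json
    import pickle

    # Descriptions saved by the LLaMA-Adapter script in step 3.
    with open("descriptions.pkl", "rb") as f:
        descriptions = pickle.load(f)

    # Hypothetical metadata: one {"data": ..., "label": ...} entry per sample,
    # in the same order as the descriptions.
    with open("meta_data.json", "r") as f:
        meta_data = json.load(f)

    train_data = [
        {"data": m["data"], "description": d, "label": m["label"]}
        for m, d in zip(meta_data, descriptions)
    ]

    with open("datasets/ImageNet-1k/train_data.json", "w") as f:
        json.dump(train_data, f, indent=2)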

Acknowledgement

Our code builds on the following open-source projects: ImageBind, Point-Bind & Point-LLM, LLaMA-Adapter, and ImageBind-LLM. Thanks for their outstanding work and open-source contributions!

Citation

If you find this repository useful, please consider giving it a star ⭐ and citing our work:

@inproceedings{girdhar2023imagebind,
  title={Imagebind: One embedding space to bind them all},
  author={Girdhar, Rohit and El-Nouby, Alaaeldin and Liu, Zhuang and Singh, Mannat and Alwala, Kalyan Vasudev and Joulin, Armand and Misra, Ishan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={15180--15190},
  year={2023}
}
@article{guo2023point,
  title={Point-bind \& point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following},
  author={Guo, Ziyu and Zhang, Renrui and Zhu, Xiangyang and Tang, Yiwen and Ma, Xianzheng and Han, Jiaming and Chen, Kexin and Gao, Peng and Li, Xianzhi and Li, Hongsheng and others},
  journal={arXiv preprint arXiv:2309.00615},
  year={2023}
}
@article{zhang2023llama,
  title={Llama-adapter: Efficient fine-tuning of language models with zero-init attention},
  author={Zhang, Renrui and Han, Jiaming and Zhou, Aojun and Hu, Xiangfei and Yan, Shilin and Lu, Pan and Li, Hongsheng and Gao, Peng and Qiao, Yu},
  journal={arXiv preprint arXiv:2303.16199},
  year={2023}
}
@article{han2023imagebind,
  title={Imagebind-llm: Multi-modality instruction tuning},
  author={Han, Jiaming and Zhang, Renrui and Shao, Wenqi and Gao, Peng and Xu, Peng and Xiao, Han and Zhang, Kaipeng and Liu, Chris and Wen, Song and Guo, Ziyu and others},
  journal={arXiv preprint arXiv:2309.03905},
  year={2023}
}
@article{lyu2024unibind,
  title={UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All},
  author={Lyu, Yuanhuiyi and Zheng, Xu and Zhou, Jiazhou and Wang, Lin},
  journal={arXiv preprint arXiv:2403.12532},
  year={2024}
}

Contact

If you have questions, suggestions, or bug reports, please email:

lvyuanhuiyi@foxmail.com

License

This project is released under the MIT License.

