
Chat-3D v2

This is the official repo for the paper "Chat-3D v2: Bridging 3D Scene and Large Language Models with Object Identifiers". [paper]

News

[2024.04] πŸ”₯ A refined implementation of Chat-3D v2 is released. The old version v2.0 has been archived in branch v2.0. This main branch is now for the new version (v2.1).

[2024.01] Update training guide for grounding on ScanRefer.

[2023.12] Code release. The main training architecture is based on our former work Chat-3D.

πŸ”₯ v2.1 vs v2.0

  • Performance comparison

    | Version | ScanRefer Acc@0.25 | ScanRefer Acc@0.5 | ScanQA CIDEr | ScanQA B-4 | Scan2Cap CIDEr@0.5 | Scan2Cap B-4@0.5 | Multi3dRefer F1@0.25 | Multi3dRefer F1@0.5 | SQA3D EM |
    |---------|--------------------|-------------------|--------------|------------|--------------------|------------------|----------------------|---------------------|----------|
    | v2.0    | 35.9               | 30.4              | 77.1         | 7.3        | 28.1               | 15.5             | -                    | -                   | -        |
    | v2.1    | 42.5               | 38.4              | 87.6         | 14.0       | 63.9               | 31.8             | 45.1                 | 41.6                | 54.7     |

    All v2.1 results are obtained from a single model, without fine-tuning on specific tasks.

  • Main changes

    • LLM backbone: Vicuna v0 -> Vicuna v1.5 + LoRA finetuning

    • Training scheme: three-stage training -> one-stage joint training

    • Segmentor: PointGroup -> Mask3D

    • Code Optimization:

      • batch size: 1 -> 32
      • Simplified training and evaluation process

πŸ”¨ Preparation

  • Prepare the environment:

    (Different from v2.0)

    conda create -n chat-3d-v2 python=3.9.17
    conda activate chat-3d-v2
    conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 pytorch-cuda=11.8 -c pytorch -c nvidia
    pip install -r requirements.txt
  • Download LLM backbone:

    • We use Vicuna-7B v1.5 in our experiments, which can be downloaded from Hugging Face.

    • Change the llama_model_path in config.py to the location of vicuna-7b-v1.5.
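    For example, the relevant setting in config.py might look like the sketch below. Only the llama_model_path key is named in this README; the surrounding layout and the example path are assumptions for illustration.

    ```python
    # config.py (illustrative fragment):
    # point llama_model_path at the directory containing the downloaded
    # vicuna-7b-v1.5 weights from Hugging Face.
    llama_model_path = "/path/to/vicuna-7b-v1.5"
    ```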

  • Annotations and extracted features:

    Please follow the instructions in preprocess.

πŸ€– Training and Inference

  • Training

    • Modify run.sh:

      train_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
      val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
      evaluate=False
      Explanation of "train_tag" and "val_tag":

      • Use # to separate different datasets

      • Datasets:

        • scanrefer: ScanRefer Dataset
        • scan2cap: Scan2Cap Dataset
        • scanqa: ScanQA Dataset
        • sqa3d: SQA3D Dataset
        • multi3dref: Multi3dRefer Dataset
        • nr3d_caption: A captioning dataset derived from Nr3D.
        • obj_align: A dataset derived from ScanRefer, used to align the object identifiers with object tokens.
      • You can try different combinations of training datasets or add customized datasets.
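    The #-separated tag convention above can be sketched with a small hypothetical helper (parse_tags is not part of the repo; it only illustrates how a tag string decomposes into individual dataset names):

    ```python
    def parse_tags(tag: str) -> list[str]:
        """Split a '#'-separated tag string into a list of dataset names."""
        return [t for t in tag.split("#") if t]

    train_tag = "scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
    print(parse_tags(train_tag)[:3])  # → ['scanrefer', 'scan2cap', 'scanqa']
    ```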

    • Run: bash scripts/run.sh

    • Brief training info:

      | Batch Size | GPUs     | VRAM Usage per GPU | Training Time | ckpt         |
      |------------|----------|--------------------|---------------|--------------|
      | 32         | 4 Γ— A100 | ~70 GB             | ~8 hours      | Google Drive |
      | 1          | 1 Γ— A100 | ~28 GB             | ~3 days       | -            |
  • Inference

    • Modify run.sh: (We provide the pretrained checkpoint in Google Drive)

      val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
      evaluate=True
      pretrained_path="/path/to/pretrained_model.pth"
    • Run: bash scripts/run.sh

πŸ“„ Citation

If you find this project useful in your research, please consider citing:

@article{huang2023chat,
  title={Chat-3D v2: Bridging 3D Scene and Large Language Models with Object Identifiers},
  author={Huang, Haifeng and Wang, Zehan and Huang, Rongjie and Liu, Luping and Cheng, Xize and Zhao, Yang and Jin, Tao and Zhao, Zhou},
  journal={arXiv preprint arXiv:2312.08168},
  year={2023}
}
@article{wang2023chat,
  title={Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes},
  author={Wang, Zehan and Huang, Haifeng and Zhao, Yang and Zhang, Ziang and Zhao, Zhou},
  journal={arXiv preprint arXiv:2308.08769},
  year={2023}
}

Stay tuned for our project. πŸ”₯

If you have any questions or suggestions, feel free to drop us an email (huanghaifeng@zju.edu.cn, wangzehan01@zju.edu.cn) or open an issue.

😊 Acknowledgement

Thanks to the following open-source projects:

LLMs: LLaMA, Vicuna

3D Datasets: ScanNet, ScanRefer, ReferIt3D, Scan2Cap, ScanQA, SQA3D, Multi3dRefer

3D Segmentors: PointGroup, Mask3D

3D Encoders: ULIP, Uni3D

Multi-modal LLMs: VideoChat, LEO

3D Expert Models: vil3dref


License: MIT License

