3d contrastive-learning foundation-models multimodal-learning pytorch zero-shot-classification

Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training

This repository contains the official implementation of "MixCon3D" in our paper.

Overview of the MixCon3D. We integrate the image and point cloud modality information, formulating a holistic 3D instance-level representation for cross-modal alignment.

News

[2023.11.2] We release the training code of the MixCon3D.

Installation

Please refer to this instruction for step-by-step installation guidance. Both the necessary packages and some helpful debug experience are provided.

Data Downloading

First, modify the path in the download_data.py. Then, execute the following command to download data from Hugging Face:

python3 download_data.py

The datasets used for experiments are the same as OpenShape. Please refer to OpenShape for more details of the data.

Training

To train the PointBERT model using the MixCon3D, please change the [data_path] to your specified local data path. Then, running the following command:

On an 8 GPU server, training with batchsize 2048

torchrun --nproc_per_node=8 src/main.py --ngpu 8 dataset.folder=data_path dataset.train_batch_size=256 model.name=PointBERT model.scaling=3 model.use_dense=True --trial_name MixCon3D --config src/configs/train.yaml

If the GPU memory is not enough, you can lower the train_batch_size and run with a specific accumulated iter num as follows (The equivalent batchsize = train_batch_size * accum_freq):

torchrun --nproc_per_node=8 src/main.py --ngpu 8 dataset.folder=data_path dataset.train_batch_size=128 dataset.accum_freq=2 model.name=PointBERT model.scaling=3 model.use_dense=True --trial_name MixCon3D --config src/configs/train.yaml

To train on different datasets, use the following command:

(Ensemble_no_LVIS)

torchrun --nproc_per_node=8 src/main.py --ngpu 8 dataset.folder=data_path dataset.train_split=meta_data/split/train_no_lvis.json dataset.train_batch_size=128 dataset.accum_freq=2 model.name=PointBERT model.scaling=3 model.use_dense=True --trial_name MixCon3D --config src/configs/train.yaml

(ShapeNet_only)

torchrun --nproc_per_node=8 src/main.py --ngpu 8 dataset.folder=data_path dataset.train_split=meta_data/split/ablation/train_shapenet_only.json dataset.train_batch_size=128 dataset.accum_freq=1 model.name=PointBERT model.scaling=3 model.use_dense=True --trial_name MixCon3D --config src/configs/train.yaml

We use the wandb for logging.

Acknowledgment

This codebase is based on OpenShape, timm and PointBERT. Thanks to the authors for their awesome contributions! This work is partially supported by TPU Research Cloud (TRC) program and Google Cloud Research Credits program.

Citation

@inproceedings{gao2023mixcon3d,
  title     = {Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training},
  author    = {Yipeng Gao and Zeyu Wang and Wei-Shi Zheng and Cihang Xie and Yuyin Zhou},
  booktitle = {CVPR},
  year      = {2024},
}

Contact

Questions and discussions are welcome via gaoyp23@mail2.sysu.edu.cn or open an issue here.

About

[CVPR 2024] The official implementation of paper "Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training"

3d contrastive-learning foundation-models multimodal-learning pytorch zero-shot-classification

Languages

Language:Python 100.0%