salesforce / ULIP

ULIP-2: Towards Scalable Multimodal Pre-training For 3D Understanding (CVPR2024)

ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding (CVPR2023)

Official implementation of ULIP-2: Towards Scalable Multimodal Pre-training For 3D Understanding

Official implementation of ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding

Project Website

News

[06/20/2024] The upgraded ULIP-2 pre-trained 3D backbone (scaled up, with support for colored point clouds) has been released here.

[02/26/2024] "ULIP-2: Towards Scalable Multimodal Pre-training For 3D Understanding" has been accepted to CVPR 2024!

[06/09/2023] The PointBERT ULIP-2 pre-trained model has been released; please find it here.

[06/09/2023] A smaller version of "ULIP - ShapeNet Triplets" has been released here; it is around 420GB. If you only need the rendered images that ULIP actually uses, download the "only_rgb_depth_images" image folder instead of the full "rendered_images" folder (more than 1TB).

[05/22/2023] "ULIP - Objaverse Triplets" and "ULIP - ShapeNet Triplets" have been uploaded here.

[05/14/2023] ULIP-2 has been released!

[02/28/2023] ULIP has been accepted by CVPR 2023! 🔥🔥🔥

Animation

Pipeline Animation

What is ULIP

ULIP is a model-agnostic multimodal pre-training framework that leverages information from other modalities (images, language) to improve a model's ability to understand 3D data without introducing any extra latency.

Pipeline

Overall Pipeline

Instructions

ULIP is a highly extensible multimodal pre-training framework, and it is model-architecture agnostic: you can easily plug in any 3D backbone model and pre-train it using our framework to get a jump-start for various downstream tasks!

[Install environments]

We pre-train ULIP on 8 Nvidia A100 GPUs; the code is tested with CUDA==11.0 and pytorch==1.10.1.
conda create -n ulip python=3.7.15
conda activate ulip
conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch -c conda-forge
pip install -r requirements.txt
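
As an optional sanity check after installation (this snippet is not part of the official scripts), the following few lines verify that the expected PyTorch and torchvision versions are installed and that GPUs are visible:

# Optional environment sanity check; not part of the official ULIP scripts.
import torch
import torchvision

print("torch:", torch.__version__)              # expected 1.10.1
print("torchvision:", torchvision.__version__)  # expected 0.11.2
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())  # pre-training assumes 8 A100s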

[optional]
If you want to pre-train PointNeXt, we embed a modified PointNeXt codebase inside ./models/pointnext; please run the following to install it:

cd ./models/pointnext/PointNeXt
bash update.sh
bash install.sh

[Download datasets and initialize models, put them in the right paths.]

Download the datasets and initialization models from here. For now, you ONLY need to download "initialize_models", "modelnet40_normal_resampled", and "shapenet-55". You might need a Gmail account to access the link.
After you download the datasets and initialization models, you can choose one of the following options:
(1) Put them in (or soft-link them into) the data folder. By default, the data folder should have the following structure:

./data
-- ModelNet40.yaml
-- ShapeNet-55.yaml
-- dataset_3d.py
-- dataset_catalog.json
-- initialize_models
-- labels.json
-- modelnet40_normal_resampled
-- shapenet-55
-- templates.json
-- objaverse-lvis
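
As an optional convenience (this helper is hypothetical and not part of the repository), a short check like the following can confirm that the expected entries from the layout above are in place before pre-training:

# Hypothetical helper: verify the ./data layout described above.
import os

expected = [
    "ModelNet40.yaml", "ShapeNet-55.yaml", "dataset_3d.py", "dataset_catalog.json",
    "initialize_models", "labels.json", "modelnet40_normal_resampled",
    "shapenet-55", "templates.json",
    # "objaverse-lvis" is only needed for the ULIP-2 evaluation described later.
]
missing = [name for name in expected if not os.path.exists(os.path.join("./data", name))]
if missing:
    print("missing entries:", missing)
else:
    print("data folder looks complete")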

(2) Change the paths accordingly (only needed if you don't want to put or link the downloaded files in the data folder):

# Change the "DATA_PATH", "PC_PATH", "IMAGE_PATH"
./data/ShapeNet-55.yaml
# Change the "DATA_PATH"
./data/ModelNet40.yaml
# Change the initialize_models path
./models/ULIP_models.py
Modify this line: "pretrain_slip_model = torch.load('./data/initialize_models/slip_base_100ep.pt', map_location=torch.device('cpu'))"
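
For example, if you keep the downloaded files somewhere else, that line simply needs to point at your own location. The path below is a placeholder, not a real path:

# Inside ./models/ULIP_models.py (torch is already imported there); replace the placeholder path
# with wherever you stored the downloaded initialize_models folder.
pretrain_slip_model = torch.load('/your/path/to/initialize_models/slip_base_100ep.pt',
                                 map_location=torch.device('cpu'))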

[Pre-train 3D backbones]

Our framework is model-architecture agnostic; currently, four 3D backbones are supported:
Pointnet2(ssg)
PointBERT
PointMLP
PointNeXt

Please change the script to accommodate your system; by default, the script pre-trains on 8 GPUs. You can also modify the desired output folder in the script.

# the scripts are named after their corresponding 3D backbone names.
bash ./scripts/(choose your pre-train script)

[Test pre-trained models for zero-shot classification on ModelNet40]

You may also change the output path in the test scripts.

bash ./scripts/(choose your test script) /path/to/your/checkpoint.pt
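
Conceptually, the zero-shot evaluation follows the usual CLIP-style recipe: the category names are embedded with the text encoder, each point cloud is embedded with the pre-trained 3D backbone, and the prediction is the category with the highest cosine similarity. The sketch below only illustrates that idea with made-up tensor names; it is not the repository's evaluation code:

# Illustrative CLIP-style zero-shot classification (made-up names; not the repo's evaluation code).
import torch
import torch.nn.functional as F

def zero_shot_predict(pc_feats, text_feats):
    # pc_feats:   (B, D) features from the pre-trained 3D backbone
    # text_feats: (C, D) features of the C category prompts from the text encoder
    pc_feats = F.normalize(pc_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = pc_feats @ text_feats.t()   # cosine similarities, shape (B, C)
    return logits.argmax(dim=-1)         # predicted category index per point cloud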


[Pre-train & Test using different number of points]

Change the npoints argument in the scripts; by default it is 8192.
Note: we currently use FPS to subsample to 8192 points, which might slow down training. If you'd like, you can cache or save pre-processed datasets with different numbers of points to speed up your pre-training.
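
One way to implement the caching suggested above is to subsample each point cloud once, save the result to disk, and reuse it across epochs. The sketch below uses a simple NumPy farthest-point-sampling loop and a hypothetical cache layout; it is not the FPS implementation used in the repository:

# Rough caching sketch (hypothetical paths; not the repo's FPS implementation).
import os
import numpy as np

def farthest_point_sample(points, npoints):
    # points: (N, 3) array; returns an (npoints, 3) subsample.
    n = points.shape[0]
    selected = np.zeros(npoints, dtype=np.int64)
    dists = np.full(n, np.inf)
    selected[0] = np.random.randint(n)
    for i in range(1, npoints):
        dists = np.minimum(dists, np.linalg.norm(points - points[selected[i - 1]], axis=1))
        selected[i] = dists.argmax()
    return points[selected]

def load_subsampled(pc_path, npoints, cache_dir="./cache"):
    os.makedirs(cache_dir, exist_ok=True)
    cache_file = os.path.join(cache_dir, f"{os.path.basename(pc_path)}_{npoints}.npy")
    if os.path.exists(cache_file):
        return np.load(cache_file)                      # reuse the cached subsample
    pts = farthest_point_sample(np.load(pc_path), npoints)
    np.save(cache_file, pts)                            # cache for later epochs
    return pts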

[Pre-train your customized 3D backbones]

There are only two things you need to change to pre-train your own customized 3D backbone:
(1) Define your own 3D backbone in the ./models folder.
We put a template "customized_backbone" here; you can refer to its comments for the expected input and output shapes. You can also refer to how pointnet2 is defined here.
(2) Use or modify the "ULIP_CUSTOMIZED" class in ./models/ULIP_models.py.
Please refer to the comments in the "ULIP_CUSTOMIZED" class; it should be straightforward to follow. Be sure to change "pc_feat_dims" accordingly, since the framework is agnostic to the point-cloud feature dimension output by your customized 3D backbone.
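
As a rough illustration (the exact interface and expected shapes are documented in the "customized_backbone" template's comments; the shapes below are assumptions), a customized backbone is just a module that maps a batch of point clouds to one feature vector per cloud, whose dimension you report via "pc_feat_dims":

# Minimal sketch of a customized 3D backbone; shapes are assumptions, see the template for the
# exact interface expected by the framework.
import torch
import torch.nn as nn

class CustomizedBackbone(nn.Module):
    def __init__(self, pc_feat_dims=512):
        super().__init__()
        self.pc_feat_dims = pc_feat_dims  # report this value in the ULIP_CUSTOMIZED class
        self.mlp = nn.Sequential(nn.Linear(3, 256), nn.ReLU(), nn.Linear(256, pc_feat_dims))

    def forward(self, pc):
        # pc: (B, N, 3) point clouds -> per-point features -> (B, pc_feat_dims) via max pooling.
        return self.mlp(pc).max(dim=1).values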

Pre-trained models for zero-shot classification

Zero-shot classification on ModelNet40, 8k points pre-train, 8k points test, best checkpoint:

model top1 top5
Pointnet2(ssg) 57.7 78.9
PointMLP 60.0 79.4
PointBERT 60.3 84.0
PointNeXt 56.2 77.0
PointBERT_ULIP-2(xyz input) 75.6 93.7

ULIP-2

To ensure a fair comparison, we use the same ModelNet40 and Objaverse-LVIS test sets preprocessed by OpenShape, which contain colored point clouds with 10k points each. You can either follow OpenShape's repo to prepare the data, or download it from our bucket; we duplicated one copy for your convenience.

Extra instructions for running the upgraded ULIP-2 pre-trained models:
- Make sure open_clip is installed: pip install open_clip_torch
- Download the checkpoint.
- For both ModelNet40 and Objaverse-LVIS, prepare the 10k colored point cloud test sets, either from OpenShape or by downloading a copy directly from our GCP bucket. By default, the code expects a folder named objaverse-lvis in the data folder (with the same structure as shown in our GCP bucket) and two more files added to the modelnet40_normal_resampled folder under the data folder.

Running the evaluation:
For ModelNet40 with 10k colored point clouds using ULIP-2:
bash scripts/test_ulip2_pointbert_modelnet40.sh /path/to/ckpt
For Objaverse-LVIS with 10k colored point clouds using ULIP-2:
bash scripts/test_ulip2_pointbert_objaverse_lvis.sh ./ULIP-2-PointBERT-10k-colored-pc-pretrained.pt

model: PointBERT_ULIP-2 (10k xyzrgb input, scaled up) (32.5M)
ModelNet40: top1 84.1, top5 97.3
Objaverse-LVIS: top1 50.6, top5 79.1

Note that during the code clean-up for this release, some sources of randomness may have changed, and we noticed a minor fluctuation in performance on our current server. In the paper, the same model scores 84.7 (top1) and 97.1 (top5) on ModelNet40 (~2.5k samples); after the clean-up it scores 84.1 and 97.3. On Objaverse-LVIS (~46k samples), the paper reports 50.634 and 79.054; after the clean-up it is 50.629 and 79.052. Since Objaverse-LVIS has more samples and is thus more robust, we recommend using it for benchmarking.

TODO

More supported backbones will be released soon.

License and term of use for the released pre-train datasets

The code is released under the BSD 3-Clause License: https://github.com/salesforce/ULIP/blob/main/LICENSE.txt.

The released "ULIP - Objaverse Triplets" is under https://opendatacommons.org/licenses/by/1-0/, consistent with Objaverse's license.

The released "ULIP - ShapeNet Triplets" is under the terms of use from https://shapenet.org/terms, consistent with ShapeNet's terms of use.

Citation

@article{xue2022ulip,
  title={ULIP: Learning Unified Representation of Language, Image and Point Cloud for 3D Understanding},
  author={Xue, Le and Gao, Mingfei and Xing, Chen and Mart{\'\i}n-Mart{\'\i}n, Roberto and Wu, Jiajun and Xiong, Caiming and Xu, Ran and Niebles, Juan Carlos and Savarese, Silvio},
  journal={arXiv preprint arXiv:2212.05171},
  year={2022}
}
@misc{xue2023ulip2,
  title={ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding}, 
  author={Le Xue and Ning Yu and Shu Zhang and Junnan Li and Roberto Martín-Martín and Jiajun Wu and Caiming Xiong and Ran Xu and Juan Carlos Niebles and Silvio Savarese},
  year={2023},
  eprint={2305.08275},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

Contact

If you have any questions about this project, please contact lxue@salesforce.com.
