MultiMAE: Multi-modal Multi-task Masked Autoencoders

Roman Bachmann*, David Mizrahi*, Andrei Atanov, Amir Zamir

Website | arXiv | BibTeX

Official PyTorch implementation and pre-trained models for MultiMAE: Multi-modal Multi-task Masked Autoencoders.

We introduce Multi-modal Multi-task Masked Autoencoders (MultiMAE), an efficient and effective pre-training strategy for Vision Transformers. Given a small random sample of visible patches from multiple modalities, the MultiMAE pre-training objective is to reconstruct the masked-out regions. Once pre-trained, a single MultiMAE encoder can be used for both single-modal and multi-modal downstream transfer, yielding results that are competitive with or significantly better than the baselines.
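
As a rough illustration of the masking step described above, the following sketch samples visible patch tokens uniformly at random across modalities, using only the Python standard library. The function names, the uniform sampling scheme, the 196-token grid per modality (a 224x224 input with 16x16 patches), and the budget of 98 visible tokens are all assumptions made for illustration; the sampling strategy actually used by the pre-training code may differ.

```python
import random


def sample_visible_tokens(tokens_per_modality, num_visible, seed=None):
    """Pick `num_visible` patch tokens uniformly at random across all modalities.

    tokens_per_modality: dict mapping a modality name to its number of patch tokens.
    Returns a dict mapping each modality name to a sorted list of visible patch indices.
    """
    rng = random.Random(seed)
    # Flatten every (modality, patch index) pair into a single pool.
    pool = [
        (name, idx)
        for name, count in tokens_per_modality.items()
        for idx in range(count)
    ]
    if num_visible > len(pool):
        raise ValueError("num_visible exceeds the total number of tokens")
    visible = {name: [] for name in tokens_per_modality}
    for name, idx in rng.sample(pool, num_visible):
        visible[name].append(idx)
    return {name: sorted(indices) for name, indices in visible.items()}


def masked_targets(tokens_per_modality, visible):
    """Every token that is not visible is a reconstruction target."""
    targets = {}
    for name, count in tokens_per_modality.items():
        seen = set(visible[name])
        targets[name] = [idx for idx in range(count) if idx not in seen]
    return targets


# Hypothetical setting: three modalities with 196 patch tokens each.
layout = {"rgb": 196, "depth": 196, "semseg": 196}
visible = sample_visible_tokens(layout, num_visible=98, seed=0)
targets = masked_targets(layout, visible)
print({name: len(indices) for name, indices in visible.items()})
```

In this sketch the encoder would only see the tokens listed in `visible`, while the decoders would be trained to reconstruct the tokens listed in `targets`.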

Catalog

Pre-trained models
MultiMAE pre-training code
ImageNet-1K classification fine-tuning code
Semantic segmentation fine-tuning code (single-modal & multi-modal)
Depth estimation fine-tuning code
Taskonomy fine-tuning code
Colab & Hugging Face demos
Download links for ImageNet-1K depth and semantic segmentation pseudo labels

Pre-trained models

We provide the weights of our pre-trained MultiMAE ViT-B model, in MultiViT (multi-modal) format and timm (RGB-only) format.

For comparison, we also provide the weights of a MAE ViT-B model that we pre-trained using the official MAE codebase following the recommended settings.

Method	Arch.	Pre-training modalities	Pre-training epochs	Weights (MultiViT)	Weights (timm)	Config
MAE	ViT-B	RGB	1600	download	download	See MAE
MultiMAE	ViT-B	RGB+D+S	1600	download	download	link

These pre-trained models can then be fine-tuned using this codebase to reach the following performance:

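Method	ImageNet-1K classif. (RGB, top-1)	ADE20K sem. seg. (RGB, mIoU)	Hypersim sem. seg. (RGB, mIoU)	Hypersim sem. seg. (D, mIoU)	Hypersim sem. seg. (RGB + D, mIoU)	NYUv2 sem. seg. (RGB, mIoU)	NYUv2 sem. seg. (D, mIoU)	NYUv2 sem. seg. (RGB + D, mIoU)	NYUv2 depth (RGB, δ1)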
Sup. (DeiT)	81.8	45.8	33.9	-	-	50.1	-	-	80.7
MAE	83.3	46.2	36.5	-	-	50.8	-	-	85.1
MultiMAE	83.3	46.2	37.0	38.5	47.6	52.0	41.4	56.0	86.4
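
For readers unfamiliar with the mIoU and δ1 metrics reported in the table above, here is a minimal sketch of their common definitions in plain Python. It is not the evaluation code of this repository: the function names, the flat-list inputs, the 1.25 threshold (the usual convention for δ1), and the choice to skip classes that appear in neither prediction nor ground truth are assumptions for illustration. Both functions return fractions in [0, 1], whereas the table lists percentages.

```python
def delta1(pred, target, threshold=1.25):
    """Fraction of valid pixels where max(pred/target, target/pred) < threshold."""
    # Only pixels with positive predicted and ground-truth depth are considered.
    valid = [(p, t) for p, t in zip(pred, target) if p > 0 and t > 0]
    if not valid:
        raise ValueError("no valid depth pixels")
    hits = sum(1 for p, t in valid if max(p / t, t / p) < threshold)
    return hits / len(valid)


def mean_iou(pred, target, num_classes, ignore_index=None):
    """Mean intersection-over-union over the classes that occur at least once."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # Skip unlabeled pixels.
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            # The pixel belongs to the predicted set of class p
            # and to the ground-truth set of class t.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    if not ious:
        raise ValueError("no labeled pixels")
    return sum(ious) / len(ious)


# Toy inputs: flattened depth maps and label maps.
print(delta1([1.0, 2.0, 4.0], [1.0, 2.0, 2.0]))
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```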

Model formats

We provide pre-trained weights in two different formats: the single-modal ViT / timm format, which is compatible with other popular ViT repositories (e.g., timm, DINO, MAE), and the multi-modal MultiMAE / MultiViT format, which is used throughout this codebase for multi-modal pre-training and fine-tuning. See multimae/multimae.py for the documentation and implementation of MultiMAE / MultiViT.

You can convert between these formats using the provided vit2multimae_converter.py and multimae2vit_converter.py scripts.

Usage

Set-up

See SETUP.md for set-up instructions.

Pre-training

See PRETRAINING.md for pre-training instructions.

Fine-tuning

See FINETUNING.md for fine-tuning instructions.

Demo & visualizations

For interactive demos, please see our website. Open our Colab notebook to play around with the visualization code, or simply upload an image to our Hugging Face Spaces demo.

Acknowledgement

This repository is built using the timm, DeiT, DINO, MoCo v3, BEiT, MAE-priv, and MAE repositories.

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.

Citation

If you find this repository helpful, please consider citing our work:

@inproceedings{bachmann2022multimae,
  author    = {Roman Bachmann and David Mizrahi and Andrei Atanov and Amir Zamir},
  title     = {{MultiMAE}: Multi-modal Multi-task Masked Autoencoders},
  booktitle = {European Conference on Computer Vision},
  year      = {2022},
}

About

MultiMAE: Multi-modal Multi-task Masked Autoencoders, ECCV 2022

https://multimae.epfl.ch
