Roman Bachmann*, David Mizrahi*, Andrei Atanov, Amir Zamir
Official PyTorch implementation and pre-trained models for MultiMAE: Multi-modal Multi-task Masked Autoencoders.
We introduce Multi-modal Multi-task Masked Autoencoders (MultiMAE), an efficient and effective pre-training strategy for Vision Transformers. Given a small random sample of visible patches from multiple modalities, the MultiMAE pre-training objective is to reconstruct the masked-out regions. Once pre-trained, a single MultiMAE encoder can be used for both single-modal and multi-modal downstream transfer, yielding results that are competitive with, or significantly better than, those of the baselines.
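As a rough illustration of this objective, the sketch below keeps a small random subset of tokens jointly across three modalities and treats the rest as reconstruction targets. The tensor names, shapes, and the joint uniform sampling are assumptions for illustration only; the actual masking scheme and model are implemented in `multimae/multimae.py`.

```python
# Illustrative sketch only (not the repository's implementation): given patch
# tokens from several modalities, keep a small random subset as "visible" and
# treat the rest as masked reconstruction targets.
import torch

B, N, D = 2, 196, 768            # batch size, patches per modality, embed dim
num_modalities = 3               # e.g. RGB, depth, semantic segmentation
num_visible = 98                 # small number of visible tokens across all modalities

# Pretend each modality has already been patchified and linearly projected,
# then concatenated along the token dimension.
tokens = torch.randn(B, num_modalities * N, D)

# Jointly sample which tokens stay visible (the real sampling scheme differs).
perm = torch.rand(B, num_modalities * N).argsort(dim=1)
visible_idx, masked_idx = perm[:, :num_visible], perm[:, num_visible:]

visible_tokens = torch.gather(tokens, 1, visible_idx.unsqueeze(-1).expand(-1, -1, D))
# A shared Transformer encoder processes only the visible tokens; task-specific
# decoders then reconstruct the masked patches of every modality.
print(visible_tokens.shape)      # torch.Size([2, 98, 768])
```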
This repository contains:
- Pre-trained models
- MultiMAE pre-training code
- ImageNet-1K classification fine-tuning code
- Semantic segmentation fine-tuning code (single-modal & multi-modal)
- Depth estimation fine-tuning code
- Taskonomy fine-tuning code
- Colab & Hugging Face demos
We provide the weights of our pre-trained MultiMAE ViT-B model in both MultiViT (multi-modal) and timm (RGB-only) formats.
For comparison, we also provide the weights of a MAE ViT-B model that we pre-trained using the official MAE codebase following the recommended settings.
| Method | Arch. | Pre-training modalities | Pre-training epochs | Weights (MultiViT) | Weights (timm) | Config |
|---|---|---|---|---|---|---|
| MAE | ViT-B | RGB | 1600 | download | download | See MAE |
| MultiMAE | ViT-B | RGB+D+S | 1600 | download | download | link |
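As a minimal sketch of how the timm-format (RGB-only) weights might be loaded into a standard ViT-B/16, assuming the checkpoint is a plain state dict (it may instead be nested under a key such as `model`); the path below is a placeholder:

```python
# Sketch: loading the RGB-only (timm-format) weights into a timm ViT-B/16.
# The checkpoint path is a placeholder; adjust the state-dict unwrapping to
# however the downloaded file is actually structured.
import timm
import torch

model = timm.create_model("vit_base_patch16_224", pretrained=False)

ckpt = torch.load("path/to/multimae_vitb_timm.pth", map_location="cpu")
state_dict = ckpt["model"] if "model" in ckpt else ckpt

missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing keys:", missing)        # e.g. a classification head that was not pre-trained
print("unexpected keys:", unexpected)
```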
These pre-trained models can then be fine-tuned using this codebase to reach the following performance:
| Method | ImageNet-1K Classif. (@1, RGB) | ADE20K Sem. Seg. (mIoU, RGB) | Hypersim Sem. Seg. (mIoU, RGB) | Hypersim Sem. Seg. (mIoU, D) | Hypersim Sem. Seg. (mIoU, RGB+D) | NYUv2 Sem. Seg. (mIoU, RGB) | NYUv2 Sem. Seg. (mIoU, D) | NYUv2 Sem. Seg. (mIoU, RGB+D) | NYUv2 Depth (δ1, RGB) |
|---|---|---|---|---|---|---|---|---|---|
| Sup. (DeiT) | 81.8 | 45.8 | 33.9 | - | - | 50.1 | - | - | 80.7 |
| MAE | 83.3 | 46.2 | 36.5 | - | - | 50.8 | - | - | 85.1 |
| MultiMAE | 83.3 | 46.2 | 37.0 | 38.5 | 47.6 | 52.0 | 41.4 | 56.0 | 86.4 |
We provide pre-trained weights in two different formats: the single-modal ViT / timm format, which is compatible with other popular ViT repositories (e.g., timm, DINO, MAE), and the multi-modal MultiMAE / MultiViT format, which is used throughout this codebase for multi-modal pre-training and fine-tuning. See `multimae/multimae.py` for the documentation and implementation of MultiMAE / MultiViT.

You can convert between these formats using the provided `vit2multimae_converter.py` and `multimae2vit_converter.py` scripts.
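Purely as an illustration of what such a conversion involves (keeping the shared Transformer weights and the RGB branch while dropping the other adapters), here is a hypothetical sketch; the key prefixes and paths are assumptions, and the provided converter scripts should be used for the actual mapping:

```python
# Hypothetical sketch of a MultiViT -> ViT/timm conversion: keep the shared
# Transformer weights, drop the extra input/output adapters. The key prefixes
# and paths below are assumptions; use multimae2vit_converter.py in practice.
import torch

ckpt = torch.load("path/to/multimae_multivit.pth", map_location="cpu")
state_dict = ckpt["model"] if "model" in ckpt else ckpt

keep = {
    k: v for k, v in state_dict.items()
    if not k.startswith(("output_adapters.", "input_adapters.depth", "input_adapters.semseg"))
}
# A real converter would also rename keys to match timm's ViT naming.
torch.save(keep, "path/to/multimae_vitb_timm.pth")
```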
See SETUP.md for set-up instructions.
See PRETRAINING.md for pre-training instructions.
See FINETUNING.md for fine-tuning instructions.
For interactive demos, please see our website. Open our Colab notebook to play around with the visualization code, or simply upload an image to our Hugging Face Spaces demo.
This repository is built using the timm, DeiT, DINO, MoCo v3, BEiT, MAE-priv, and MAE repositories.
This project is under the CC-BY-NC 4.0 license. See LICENSE for details.
If you find this repository helpful, please consider citing our work:
@article{bachmann2022multimae,
author = {Roman Bachmann and David Mizrahi and Andrei Atanov and Amir Zamir},
title = {{MultiMAE}: Multi-modal Multi-task Masked Autoencoders},
journal = {arXiv preprint arXiv:2204.01678},
year = {2022},
}