InDomainGeneralizationBenchmark

We propose a simple generalization benchmark with various systematic out-of-distribution test splits (composition, interpolation and extrapolation). This procedure is visualized in the figure below.

Fig. 1: The four scatter plots show the different splits along the generative factors of variation for the dSprites dataset. The axes correspond to factors of variation in the data, e.g., scale, as visualized for extrapolation on the right.
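To make the three systematic split types concrete, here is a minimal toy sketch on a grid of two hypothetical factors. The factor ranges, thresholds, and the helper name `split_masks` are assumptions chosen for illustration; they do not reproduce the released benchmark splits.

```python
import numpy as np

# Toy grid over two hypothetical factors, each taking the values 0..5.
f1, f2 = np.meshgrid(np.arange(6), np.arange(6), indexing="ij")
f1, f2 = f1.ravel(), f2.ravel()


def split_masks(f1: np.ndarray, f2: np.ndarray, kind: str):
    """Return boolean (train, test) masks for one illustrative split."""
    if kind == "composition":
        # Every single factor value is seen in training, but the
        # combination "both factors large" is held out.
        test = (f1 >= 3) & (f2 >= 3)
    elif kind == "interpolation":
        # Intermediate values of factor 1 are held out; training
        # covers values on both sides of the gap.
        test = np.isin(f1, [2, 3])
    elif kind == "extrapolation":
        # The largest values of factor 1 are held out; training
        # never sees anything beyond the threshold.
        test = f1 >= 4
    else:
        raise ValueError(f"unknown split kind: {kind}")
    return ~test, test


train, test = split_masks(f1, f2, "extrapolation")
print(train.sum(), "training points,", test.sum(), "test points")
```

In this toy setup, only the extrapolation split requires predictions outside the range of factor values seen during training.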

Datasets

We consider the dSprites, Shapes3D and MPI3D-Real datasets. The splits corresponding to random, composition, interpolation and extrapolation can be found at dSprites splits, Shapes3D splits, MPI3D splits.

Training

In this benchmark, we allow a wide variety of modelling approaches, including approaches that leverage external data. Furthermore, a practitioner can sample from the training data in whatever way is optimal for the learning algorithm. For instance, this enables supervision types ranging from unsupervised and weakly supervised to supervised and transfer learning. However, the test set must remain untouched and may only be used for evaluation.

Evaluation

The random, composition and interpolation splits can be used for hyperparameter tuning. The final evaluation and ranking can be done on the extrapolation setting. Please submit a pull request with an updated leaderboard to include novel results.
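The protocol above can be sketched as a small model-selection helper. This is only an illustration: `select_config` and the user-supplied `score_fn` are hypothetical names that are not part of the benchmark package, and averaging over the tuning splits is one possible selection rule among many.

```python
from typing import Callable, Hashable, Iterable

# Splits that may be used for hyperparameter tuning.
TUNING_SPLITS = ("random", "composition", "interpolation")


def select_config(
    configs: Iterable[Hashable],
    score_fn: Callable[[Hashable, str], float],
) -> Hashable:
    """Return the config with the highest mean score over the tuning splits.

    `score_fn(config, split)` is a hypothetical callable that trains and
    evaluates one configuration and returns a scalar R2 score for `split`.
    The extrapolation split is deliberately never queried here.
    """

    def mean_tuning_score(config: Hashable) -> float:
        scores = [score_fn(config, split) for split in TUNING_SPLITS]
        return sum(scores) / len(scores)

    return max(configs, key=mean_tuning_score)
```

Only the selected configuration would then be evaluated once on the extrapolation split and reported on the leaderboard.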

Evaluating your model on this benchmark can be done with as little as 3 lines of code:

import lablet_generalization_benchmark as lgb
import numpy as np


def model_fn(images: np.ndarray) -> np.ndarray:
    # integrate your tensorflow, pytorch, jax model here;
    # `model` is a placeholder for your own trained model
    predictions = model(images)
    return predictions

dataloader = lgb.load_dataset('shapes3d', 'extrapolation', mode='test')
# returns a dictionary with the R2 and MSE per factor
score = lgb.evaluate_model(model_fn, dataloader)

We use the R2 score (coefficient of determination) to evaluate and rank models.
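For reference, a per-factor R2 score can be computed with the textbook formula, one minus the ratio of residual to total sum of squares. The sketch below is an assumption about the general form of the metric, not the benchmark's own implementation, which may normalize differently; `r2_per_factor` is a hypothetical helper name.

```python
import numpy as np


def r2_per_factor(targets: np.ndarray, predictions: np.ndarray) -> np.ndarray:
    """Textbook R2 per factor for arrays of shape (n_samples, n_factors)."""
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    # Residual sum of squares per factor.
    ss_res = ((targets - predictions) ** 2).sum(axis=0)
    # Total sum of squares per factor, relative to the per-factor mean.
    ss_tot = ((targets - targets.mean(axis=0)) ** 2).sum(axis=0)
    # Note: undefined (division by zero) if a factor is constant in `targets`.
    return 1.0 - ss_res / ss_tot
```

With this definition, a perfect prediction yields 1 for every factor, always predicting the per-factor mean yields 0, and worse predictions can give negative values.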

MPI3D Leaderboard

Method	Reference	R2 score Extrapolation
RN50 (ImageNet-21k)	Kolesnikov et al.	54.1%
RN101 (ImageNet-21k)	Kolesnikov et al.	41.6%
PlaceHolder3	placeholder	--%
PlaceHolder4	placeholder	--%

Shapes3D Leaderboard

Method	Reference	R2 score Extrapolation
RN101	He et al.	67.8%
RN50	He et al.	62.8%
PlaceHolder3	placeholder	--%
PlaceHolder4	placeholder	--%

dSprites Leaderboard

Method	Reference	R2 score Extrapolation
PCL	Hyvärinen et al.	66.7%
DenseNet121	Huang et al.	64.4%
PlaceHolder3	placeholder	--%
PlaceHolder4	placeholder	--%

Citation

Please cite our paper as:

@misc{schott2021visual,
title={Visual Representation Learning Does Not Generalize Strongly Within the Same Domain},
author={Lukas Schott and Julius von Kügelgen and Frederik Träuble and Peter Gehler and Chris Russell and Matthias Bethge and Bernhard Schölkopf and Francesco Locatello and Wieland Brendel},
year={2021},
eprint={2107.08221},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
