lucidrains / vit-pytorch

Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in PyTorch

ViT MAE reconstruction size mismatch

RhinigtasSalvex opened this issue

I'm trying to train a ViT with masked autoencoder (MAE) pretraining, but I'm getting an error when running MAE.forward():
the tensor of predicted pixel values is off by a factor of 4 compared to the masked_patches tensor in the MSE loss call.

RuntimeError: The size of tensor a (1024) must match the size of tensor b (4096) at non-singleton dimension 2
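
For what it's worth, the two sizes line up with the per-patch pixel count at different channel counts. This is my own sanity check, assuming patches are flattened as channels * patch_size ** 2:

```python
patch_size = 32

# 1 channel, as configured below: 1 * 32 * 32 = 1024 pixel values per patch,
# which matches tensor a (the predicted pixel values).
print(1 * patch_size ** 2)   # 1024

# 4 channels would give 4 * 32 * 32 = 4096, which matches tensor b
# (the masked_patches target).
print(4 * patch_size ** 2)   # 4096
```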

I've tried different settings, but the factor-of-4 size mismatch persists.

I've also tried a hack to fix the size of the predicted pixel values by multiplying the neuron count of the to_pixels output layer by 4 (roughly as sketched below).
This fixes the size mismatch in the MSE loss call but introduces a new problem: the gradient shapes no longer match up in the backward call.

RuntimeError: Function MmBackward returned an invalid gradient at index 1 - got [4096, 1024] but expected shape compatible with [1024, 1024]
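
The hack looked roughly like this (a paraphrase rather than my exact code; the stand-in numbers follow the settings below):

```python
import torch.nn as nn

decoder_dim = 512              # 'decoder_dim' from my settings
pixel_values_per_patch = 1024  # channels * patch_size ** 2 = 1 * 32 * 32

# Original reconstruction head, as I understand the library builds it:
to_pixels = nn.Linear(decoder_dim, pixel_values_per_patch)

# My hack: widen the output by 4x so the predictions match the 4096-wide
# masked_patches target in the MSE loss.
to_pixels_hacked = nn.Linear(decoder_dim, pixel_values_per_patch * 4)
```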

But now I don't know how to debug further.

My last settings were:

```python
'model': {
    'encoder_depth': 5,
    'decoder_depth': 5,
    'patch_size': 32,
    'num_classes': 1000,
    'channels': 1,
    'dim': 1024,
    'heads': 8,
    'mlp_dim': 2048,
    'masking_ratio': 0.75,
    'decoder_dim': 512,
},
```
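
For reference, here is a minimal sketch of how I expect these settings to map onto the ViT/MAE API from the README (the image size and batch size are placeholders I picked; everything else mirrors the config above):

```python
import torch
from vit_pytorch import ViT
from vit_pytorch.mae import MAE

# Encoder built with the settings above; image_size = 256 is a placeholder.
encoder = ViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    channels = 1,
    dim = 1024,
    depth = 5,        # 'encoder_depth'
    heads = 8,
    mlp_dim = 2048
)

mae = MAE(
    encoder = encoder,
    masking_ratio = 0.75,
    decoder_dim = 512,
    decoder_depth = 5
)

# The input must have the same channel count the encoder was built with
# (1 here); each flattened 32x32 patch then holds 1 * 32 * 32 = 1024 values.
images = torch.randn(2, 1, 256, 256)

loss = mae(images)
loss.backward()
```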

Hi Rhinigtas! Could you show what your full training script looks like? Perhaps I can spot the error more easily that way.

Hi Lucidrains, I've uploaded a stripped-down version of my training script.

vit_train_tmp.txt