huggingface / pytorch-image-models

The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXt, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more

Home Page: https://huggingface.co/docs/timm

Rotary embeddings do not work properly in eva02_base_patch14_448.mim_in22k_ft_in1k

bobyard-com opened this issue

Describe the bug

  File "C:\Users\Bobyard\miniconda3\envs\py310\lib\site-packages\timm\layers\pos_embed_sincos.py", line 214, in apply_rot_embed_cat
    return x * cos_emb + rot(x) * sin_emb
RuntimeError: The size of tensor a (768) must match the size of tensor b (64) at non-singleton dimension 2

x.shape
Out[3]: torch.Size([BS, 1024, 768])
cos_emb.shape
Out[4]: torch.Size([1024, 64])

x.shape is the result of PatchEmbed applied to the model input of shape [BS, 3, 448, 448]:
('patch_embed', PatchEmbed(
(proj): Conv2d(3, 768, kernel_size=(14, 14), stride=(14, 14))
(norm): Identity()
))
The 1024 in cos_emb comes from the product of patch_embed.grid_size (32 × 32 = 1024): feat_shape=None if dynamic_img_size else self.patch_embed.grid_size
The 64 in cos_emb is the per-head dimension: embed_dim // num_heads = 768 // 12 = 64
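
In other words, the RoPE table is sized for a single attention head, while the tensor being multiplied still carries the full embedding. A minimal sketch of the mismatch in plain PyTorch (shapes taken from the report above; the reshape mirrors how multi-head attention splits the embedding, not timm's exact code):

```python
import torch

B, N, D, H = 2, 1024, 768, 12      # batch, tokens, embed_dim, num_heads
x = torch.randn(B, N, D)           # patch_embed output: full embed_dim
cos_emb = torch.randn(N, D // H)   # RoPE table sized for head_dim = 64

# x * cos_emb                      # RuntimeError: 768 vs 64 at dim 2 (as reported)

# Split into heads first and the [1024, 64] table broadcasts cleanly:
q = x.view(B, N, H, D // H).transpose(1, 2)   # [B, 12, 1024, 64]
print((q * cos_emb).shape)                    # torch.Size([2, 12, 1024, 64])
```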

To Reproduce
Steps to reproduce the behavior:

  1. timm.create_model('eva02_base_patch14_448.mim_in22k_ft_in1k', num_classes=1000, pretrained=True)
  2. Some basic train loop
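
For reference, a hypothetical standalone check (not from the report): calling the model's own forward on a 448×448 batch runs cleanly, which is consistent with the maintainer's result below.

```python
import timm
import torch

# pretrained=True in the report; False here to avoid a weight download
model = timm.create_model('eva02_base_patch14_448.mim_in22k_ft_in1k',
                          num_classes=1000, pretrained=False)
x = torch.randn(2, 3, 448, 448)
print(model(x).shape)  # torch.Size([2, 1000]) -- no RoPE shape error
```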

@bobyard-com The model works fine for me with the built-in train script, so you'll have to provide a lot more detail on the 'basic train loop' and your environment if you think it's the lib's issue...

python train.py --dataset hfds/timm/resisc45 --amp -j 8 --model eva02_base_patch14_448.mim_in22k_ft_in1k --num-classes 45  -b 32 --pretrained --opt adamw --lr 1e-4
Training with a single process on 1 device (cuda).
Loading pretrained weights from Hugging Face hub (timm/eva02_base_patch14_448.mim_in22k_ft_in1k)
[timm/eva02_base_patch14_448.mim_in22k_ft_in1k] Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors.
Missing keys (head.weight, head.bias) discovered while loading pretrained weights. This is expected if model is being adapted.
Model eva02_base_patch14_448_mim_in22k_ft_in1k created, param count:86383149
Data processing configuration for current model + dataset:
	input_size: (3, 448, 448)
	interpolation: bicubic
	mean: (0.48145466, 0.4578275, 0.40821073)
	std: (0.26862954, 0.26130258, 0.27577711)
	crop_pct: 1.0
	crop_mode: center
Using native Torch AMP. Training in mixed precision.
PyTorch version 2.2.1 available.
Scheduled epochs: 300. LR stepped per epoch.
Train: 0 [   0/590 (  0%)]  Loss: 3.81 (3.81)  Time: 1.598s,   20.03/s  (1.598s,   20.03/s)  LR: 1.000e-05  Data: 0.423 (0.423)
Train: 0 [  50/590 (  8%)]  Loss: 3.67 (3.75)  Time: 0.646s,   49.52/s  (0.662s,   48.37/s)  LR: 1.000e-05  Data: 0.004 (0.013)
Train: 0 [ 100/590 ( 17%)]  Loss: 3.50 (3.67)  Time: 0.648s,   49.38/s  (0.655s,   48.89/s)  LR: 1.000e-05  Data: 0.005 (0.009)

Sorry, we figured it out. This happens because the model has a custom forward function: our code calls the network's modules sequentially in a for loop, but in this model RoPE is applied inside attention, not as a standalone first layer, so the loop ends up feeding the full-width token tensor straight into the rope module.
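
A sketch of that failure mode (a hypothetical reconstruction, assuming a sequential loop over submodules; the model.rope and model.patch_embed attributes match the timm EVA implementation at the time of writing but may differ across versions):

```python
import timm
import torch

model = timm.create_model('eva02_base_patch14_448.mim_in22k_ft_in1k', pretrained=False)
x = torch.randn(1, 3, 448, 448)

# Broken pattern: running submodules one by one. EVA's own forward() fetches
# the RoPE table and passes it into each attention block, where it is applied
# to per-head [*, 1024, 64] slices. A flat loop instead calls the rope module
# on the full [B, 1024, 768] token tensor, reproducing the reported error.
tokens = model.patch_embed(x)   # [1, 1024, 768]
# model.rope(tokens)            # -> RuntimeError: 768 vs 64 at dim 2

# Correct pattern: let the model's forward wire RoPE into attention.
out = model(x)
print(out.shape)                # torch.Size([1, 1000])
```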