huggingface / pytorch-image-models

The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXt, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more

Home Page: https://huggingface.co/docs/timm

Rotary embeddings do not work properly in eva02_base_patch14_448.mim_in22k_ft_in1k

bobyard-com opened this issue

Describe the bug

  File "C:\Users\Bobyard\miniconda3\envs\py310\lib\site-packages\timm\layers\pos_embed_sincos.py", line 214, in apply_rot_embed_cat
    return x * cos_emb + rot(x) * sin_emb
RuntimeError: The size of tensor a (768) must match the size of tensor b (64) at non-singleton dimension 2

x.shape
Out[3]: torch.Size([BS, 1024, 768])
cos_emb.shape
Out[4]: torch.Size([1024, 64])

x.shape is the result of PatchEmbed applied to the model input of shape [BS, 3, 448, 448]:
('patch_embed', PatchEmbed(
(proj): Conv2d(3, 768, kernel_size=(14, 14), stride=(14, 14))
(norm): Identity()
))
The 1024 in cos_emb comes from the product of patch_embed.grid_size (32 × 32 = 1024): feat_shape=None if dynamic_img_size else self.patch_embed.grid_size
The 64 in cos_emb is the per-head dimension: embed_dim // num_heads = 768 // 12 = 64
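
In other words, the RoPE table is sized for a single attention head, while the tensor being multiplied still carries the full embedding. A minimal sketch of the mismatch in plain PyTorch (shapes taken from the report above; the reshape mirrors how multi-head attention splits the embedding, not timm's exact code):

```python
import torch

B, N, D, H = 2, 1024, 768, 12      # batch, tokens, embed_dim, num_heads
x = torch.randn(B, N, D)           # patch_embed output: full embed_dim
cos_emb = torch.randn(N, D // H)   # RoPE table sized for head_dim = 64

# x * cos_emb                      # RuntimeError: 768 vs 64 at dim 2 (as reported)

# Split into heads first and the [1024, 64] table broadcasts cleanly:
q = x.view(B, N, H, D // H).transpose(1, 2)   # [B, 12, 1024, 64]
print((q * cos_emb).shape)                    # torch.Size([2, 12, 1024, 64])
```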

To Reproduce
Steps to reproduce the behavior:

  1. timm.create_model('eva02_base_patch14_448.mim_in22k_ft_in1k', num_classes=1000, pretrained=True)
  2. Some basic train loop
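
For reference, a hypothetical standalone check (not from the report): calling the model's own forward on a 448×448 batch runs cleanly, which is consistent with the maintainer's result below.

```python
import timm
import torch

# pretrained=True in the report; False here to avoid a weight download
model = timm.create_model('eva02_base_patch14_448.mim_in22k_ft_in1k',
                          num_classes=1000, pretrained=False)
x = torch.randn(2, 3, 448, 448)
print(model(x).shape)  # torch.Size([2, 1000]) -- no RoPE shape error
```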

@bobyard-com The model works fine for me with the built-in train script, so you'll have to provide a lot more detail on the 'basic train loop' and your environment if you think it's the lib's issue...

python train.py --dataset hfds/timm/resisc45 --amp -j 8 --model eva02_base_patch14_448.mim_in22k_ft_in1k --num-classes 45  -b 32 --pretrained --opt adamw --lr 1e-4
Training with a single process on 1 device (cuda).
Loading pretrained weights from Hugging Face hub (timm/eva02_base_patch14_448.mim_in22k_ft_in1k)
[timm/eva02_base_patch14_448.mim_in22k_ft_in1k] Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors.
Missing keys (head.weight, head.bias) discovered while loading pretrained weights. This is expected if model is being adapted.
Model eva02_base_patch14_448_mim_in22k_ft_in1k created, param count:86383149
Data processing configuration for current model + dataset:
	input_size: (3, 448, 448)
	interpolation: bicubic
	mean: (0.48145466, 0.4578275, 0.40821073)
	std: (0.26862954, 0.26130258, 0.27577711)
	crop_pct: 1.0
	crop_mode: center
Using native Torch AMP. Training in mixed precision.
PyTorch version 2.2.1 available.
Scheduled epochs: 300. LR stepped per epoch.
Train: 0 [   0/590 (  0%)]  Loss: 3.81 (3.81)  Time: 1.598s,   20.03/s  (1.598s,   20.03/s)  LR: 1.000e-05  Data: 0.423 (0.423)
Train: 0 [  50/590 (  8%)]  Loss: 3.67 (3.75)  Time: 0.646s,   49.52/s  (0.662s,   48.37/s)  LR: 1.000e-05  Data: 0.004 (0.013)
Train: 0 [ 100/590 ( 17%)]  Loss: 3.50 (3.67)  Time: 0.648s,   49.38/s  (0.655s,   48.89/s)  LR: 1.000e-05  Data: 0.005 (0.009)

Sorry, we figured it out. This happens because the model has a custom forward function: our code calls the network's modules sequentially in a for loop, but in this model RoPE is applied inside attention, not as a standalone first layer, so the loop ends up feeding the full-width token tensor straight into the rope module.
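
A sketch of that failure mode (a hypothetical reconstruction, assuming a sequential loop over submodules; the model.rope and model.patch_embed attributes match the timm EVA implementation at the time of writing but may differ across versions):

```python
import timm
import torch

model = timm.create_model('eva02_base_patch14_448.mim_in22k_ft_in1k', pretrained=False)
x = torch.randn(1, 3, 448, 448)

# Broken pattern: running submodules one by one. EVA's own forward() fetches
# the RoPE table and passes it into each attention block, where it is applied
# to per-head [*, 1024, 64] slices. A flat loop instead calls the rope module
# on the full [B, 1024, 768] token tensor, reproducing the reported error.
tokens = model.patch_embed(x)   # [1, 1024, 768]
# model.rope(tokens)            # -> RuntimeError: 768 vs 64 at dim 2

# Correct pattern: let the model's forward wire RoPE into attention.
out = model(x)
print(out.shape)                # torch.Size([1, 1000])
```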