huggingface / pytorch-image-models

The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXt, EfficientNet, NFNet, Vision Transformer (ViT), MobileNet-V3/V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more

Home Page: https://huggingface.co/docs/timm

[BUG] The last n-batches in the log always show 0.00% accuracy

shunmian opened this issue · comments

Hey, first thanks for the fantastic code!

Describe the bug

The last n batches in the log always show 0.00% accuracy, regardless of how many epochs have been run.

                     Test: [8600/8722]  Time: 0.09  Loss:  0.21 (0.243)  Acc@1: 100.00 (95.60)  Acc@5: 100.00 (99.83)
                     Test: [8650/8722]  Time: 0.10  Loss:  2.81 (0.247)  Acc@1: 12.50 (95.42)  Acc@5: 100.00 (99.83)
                     Test: [8700/8722]  Time: 0.09  Loss:  0.11 (0.253)  Acc@1: 100.00 (95.26)  Acc@5: 100.00 (99.83)
Acc always 0.00 ->   Test: [8722/8722]  Time: 0.03  Loss:  5.06 (0.254)  Acc@1:  0.00 (95.24)  Acc@5: 50.00 (99.83) 
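For context on reading these lines: the first Acc@1/Acc@5 value on each line is for that single batch, and the parenthesized value is the running average. The final line's Acc@5 of 50.00 is consistent with a small remainder batch (e.g. only two samples left over after 8721 full batches), where one misclassified image alone drives the batch's Acc@1 to 0.00. A minimal sketch of top-k accuracy (analogous in spirit to timm's accuracy utility, though not its exact implementation) reproduces the last log line from a hypothetical two-sample batch:

```python
import torch

def topk_accuracy(output, target, topk=(1, 5)):
    # Percentage of samples whose true label appears in the top-k logits,
    # mirroring the Acc@1 / Acc@5 columns in the validation log.
    maxk = max(topk)
    _, pred = output.topk(maxk, dim=1)      # (batch, maxk) predicted classes
    correct = pred.eq(target.view(-1, 1))   # (batch, maxk) boolean hits
    return [correct[:, :k].any(dim=1).float().mean().item() * 100 for k in topk]

# A hypothetical 2-sample remainder batch: one sample wrong at top-1 but
# right at top-5, the other wrong at both.
logits = torch.full((2, 10), -10.0)
logits[0, 3] = 5.0    # sample 0: top-1 prediction is class 3 (wrong)
logits[0, 7] = 4.0    # ... but true class 7 still lands in the top-5
logits[1, 2] = 5.0    # sample 1: top-1 prediction is class 2 (wrong)
logits[1, 9] = -20.0  # ... and true class 9 is pushed out of the top-5
targets = torch.tensor([7, 9])
print(topk_accuracy(logits, targets))  # -> [0.0, 50.0]
```

With only two samples in the batch, the per-batch accuracy can only be 0, 50, or 100, matching the `Acc@1: 0.00 ... Acc@5: 50.00` in the last line.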

To Reproduce
Steps to reproduce the behavior:

./distributed_train.sh 1 "/home/pytorch-image-models/Datasets/1" --model timm/maxxvitv2_rmlp_base_rw_384.sw_in12k_ft_in1k --lr 0.0005 --warmup-epochs 0 --epochs 50 --weight-decay 1e-4 --sched cosine --scale 0.8 1 --aa rand-m1-n1-mstd0.01-mmax5 -b 24 -j 6 --amp --dist-bn reduce --num-classes 500 --pretrained --class-map  "/home/pytorch-image-models/Datasets1/class.txt" --input-size 3 384 384

Expected behavior
The last n batches should produce an Acc@1 higher than 0.00 after training.

Desktop (please complete the following information):

  • OS: Ubuntu 20.04
  • timm: 0.8.19dev0
  • PyTorch version: 1.12.1+cu113

@shunmian It's very unlikely there is a bug in the code/scripts. You should try shuffling your validation set and see what happens; I expect the samples the model can't predict will be spread out, and you won't see them mostly lumped into the last batch. There are clearly other trouble batches as well.

The fact that you are getting batches at either close to 100 or 0 suggests your dataset is imbalanced, which means accuracy is a poor metric. timm's train scripts are biased towards ImageNet-style pretraining, where the classes are fairly balanced. You'd be better off changing the scripts to use some sort of F-score.
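To illustrate why accuracy misleads on imbalanced data, here is a toy sketch using scikit-learn's `f1_score` (timm itself does not compute F1, so this would be added to the eval loop by hand; the 90/10 class split is made up for the example):

```python
import numpy as np
from sklearn.metrics import f1_score

# Imbalanced toy labels: 90% class 0, 10% class 1.
y_true = np.array([0] * 90 + [1] * 10)

# A degenerate "predict the majority class" model...
y_pred = np.zeros_like(y_true)

# ...scores 90% accuracy while completely ignoring the minority class.
acc = (y_true == y_pred).mean()
# Macro-F1 averages per-class F1, so the ignored class drags it down.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(f"accuracy={acc:.2f}  macro-F1={macro_f1:.2f}")  # accuracy=0.90  macro-F1=0.47
```

Macro-averaged F1 weights every class equally, so a model that only ever predicts the dominant classes can no longer look deceptively good the way it does under plain top-1 accuracy.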