tahmid0007 / VisionTransformer

A complete, easy-to-follow PyTorch implementation of Google's Vision Transformer, proposed in "An Image Is Worth 16x16 Words". The code is commented for easier understanding.

Help for vit

ozanpkr opened this issue

Hello @tahmid0007,
I am using your repo on my dataset. I have 2 classes and the image size is 320. The loss values during the training and validation epochs are huge. How can I solve this problem?

Epoch: 1
[ 0/ 8379 ( 0%)] Loss: 1.3278
[ 800/ 8379 ( 10%)] Loss: 1.4273
[ 1600/ 8379 ( 19%)] Loss: 2.1631
[ 2400/ 8379 ( 29%)] Loss: 2.9920
[ 3200/ 8379 ( 38%)] Loss: 2.0978
[ 4000/ 8379 ( 48%)] Loss: 1.1218
[ 4800/ 8379 ( 57%)] Loss: 0.6290
[ 5600/ 8379 ( 67%)] Loss: 6.8021
[ 6400/ 8379 ( 76%)] Loss: 15.2172
[ 7200/ 8379 ( 86%)] Loss: 30.6987
[ 8000/ 8379 ( 95%)] Loss: 134.8796
Execution time: 486.72 seconds

Average test loss: 76.5416 Accuracy: 282/ 563 (50.09%)

Epoch: 2
[ 0/ 8379 ( 0%)] Loss: 114.6219
[ 800/ 8379 ( 10%)] Loss: 47.4449
[ 1600/ 8379 ( 19%)] Loss: 10.6302
[ 2400/ 8379 ( 29%)] Loss: 11.6979
[ 3200/ 8379 ( 38%)] Loss: 14.6085
[ 4000/ 8379 ( 48%)] Loss: 15.8703
[ 4800/ 8379 ( 57%)] Loss: 13.2383
[ 5600/ 8379 ( 67%)] Loss: 10.3679
[ 6400/ 8379 ( 76%)] Loss: 24.3444
[ 7200/ 8379 ( 86%)] Loss: 38.1470
[ 8000/ 8379 ( 95%)] Loss: 15.9280
Execution time: 345.13 seconds

Average test loss: 13.4653 Accuracy: 281/ 563 (49.91%)

Transformers need a larger dataset to perform better. Try data augmentation and train for longer to see what happens.
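
For example, a minimal augmentation sketch using torchvision; the specific transforms and magnitudes below are only illustrative assumptions, not values fixed by this repo:

```python
# Illustrative augmentation pipeline (assumed values, not part of this repo).
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(320, scale=(0.8, 1.0)),   # random crop/zoom, keeps the 320x320 input size
    transforms.RandomHorizontalFlip(),                      # mirror half of the images
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # mild photometric jitter
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```

Pass this as the transform of your training dataset (e.g. torchvision.datasets.ImageFolder), so each epoch sees a slightly different view of every image, and keep the validation/test transform deterministic (just Resize, ToTensor, Normalize).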

Thanks for the quick reply, @tahmid0007. I will try your suggestion.