peteryuX / arcface-tf2

ArcFace unofficially implemented in TensorFlow 2.0+ (ResNet50, MobileNetV2). "ArcFace: Additive Angular Margin Loss for Deep Face Recognition", published in CVPR 2019. With Colab.


I can't achieve the benchmark accuracy, could somebody help?

GranMin opened this issue · comments

[screenshot: validation results at train loss 19.42]
I used the same training and test datasets you proposed, but the best result I've gotten so far is shown in the picture above.
I used the SGD optimizer with lr = 0.1, 0.05, 0.01, 0.0001, 0.00001, one epoch per learning rate. When I found the loss increasing rather than decreasing, I stopped training. The picture above shows the test result at loss 19.42.
In addition, the picture below shows the test result when the train loss was 21.15.
[screenshot: validation results at train loss 21.15]

@GranMin what backbone do you use?
I tried MobileNetV2 and got results similar to the author's.
I used a constant learning rate of 0.01 for about 10 epochs and reached a loss of about 10.
I think you should first try a schedule similar to mine and only decrease the learning rate after that (and maybe not so quickly). Otherwise your model doesn't have enough time to use the relatively big gradients to drive the loss down.
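
In TF2 that kind of schedule might look like this (just a sketch; `steps_per_epoch`, the boundary epochs, and the momentum value are assumptions, not this repo's config):

```python
import tensorflow as tf

steps_per_epoch = 1000  # assumed value; depends on dataset size and batch size

# Hold lr at 0.01 for the first ~10 epochs, then step it down.
lr_schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[10 * steps_per_epoch, 16 * steps_per_epoch],
    values=[0.01, 0.001, 0.0001])
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9)
```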

@Androsimus I used ResNet50. I will try that and reply as soon as I can. Thanks for your advice.

The newest info: with ResNet50, the SGD optimizer, and lr = 0.01, after 30 epochs I reached 99.08% on LFW and 92% on AgeDB-30, as shown below.
[screenshot: validation results after 30 epochs]

@GranMin Maybe 30 epochs is too many and you got some overfitting?
Do you still have checkpoints from epochs 10-15? If so, try checking them on the validation datasets.

@Androsimus I have the same feeling about too many epochs. But the loss at epochs 10-15 was about 20, and the accuracy on LFW was about 98.

@GranMin This is very strange. Maybe you changed some other parameters? Maybe the ArcFace parameters, margin and scale?
I ask because in other issues people wrote about getting good results with ResNet50 on this repository's native dataset.

@Androsimus I didn't change any other parameters. I also tried NormHead for one epoch and then switched to ArcHead; amazingly, after just one epoch with ArcHead the loss came down to about 11. But then the same phenomenon occurred: the loss increased a little at the beginning of the epoch and then decreased, but very slowly. Like this:
[screenshot: training log]

@GranMin I'm not sure how NormHead is supposed to be used. Maybe as a warmup.
But NormHead and ArcHead are completely different. As far as I understand, NormHead is for ordinary classification; if so, that classification problem is much easier, which is why you reach a low loss much faster. With ArcHead, on the other hand, the margin and scale parameters and the different concept behind it make the classification problem harder.
But I don't understand your situation: on the one hand you said the loss was ~20 at epochs 10-15, on the other hand after the first epoch you had a loss <11 that then decreased slowly...

To sum up: I used strictly ArcHead, and I suppose other people did too. So I propose trying only ArcHead.
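
For illustration, the conceptual difference looks roughly like this (a sketch, not the repository's actual modules/models.py; the class names and the margin/scale defaults here are assumptions):

```python
import math
import tensorflow as tf

class NormHeadSketch(tf.keras.layers.Layer):
    """Plain fully connected head: ordinary softmax classification logits."""
    def __init__(self, num_classes):
        super().__init__()
        self.dense = tf.keras.layers.Dense(num_classes)

    def call(self, embeddings):
        return self.dense(embeddings)

class ArcHeadSketch(tf.keras.layers.Layer):
    """Additive angular margin head: s * cos(theta + m) for the true class."""
    def __init__(self, num_classes, margin=0.5, scale=64.0):  # assumed defaults
        super().__init__()
        self.num_classes, self.m, self.s = num_classes, margin, scale

    def build(self, input_shape):
        self.w = self.add_weight(
            name="w", shape=(int(input_shape[-1]), self.num_classes),
            initializer="glorot_uniform", trainable=True)

    def call(self, embeddings, labels):
        x = tf.nn.l2_normalize(embeddings, axis=1)   # unit-length embeddings
        w = tf.nn.l2_normalize(self.w, axis=0)       # unit-length class weights
        cos_t = tf.matmul(x, w)                      # cos(theta) for every class
        sin_t = tf.sqrt(tf.maximum(1.0 - cos_t**2, 1e-9))
        cos_tm = cos_t * math.cos(self.m) - sin_t * math.sin(self.m)  # cos(theta + m)
        one_hot = tf.one_hot(labels, self.num_classes)
        # Apply the margin only to the ground-truth class, then scale.
        return self.s * tf.where(one_hot > 0, cos_tm, cos_t)
```

With the margin applied, the same weights produce strictly lower true-class logits than plain softmax, which is part of why the ArcHead loss starts (and stays) much higher for a model of the same quality.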

@Androsimus In fact, I have tried training the model twice recently.
The first time, I used only ArcHead and trained at lr = 0.01 with SGD; at epochs 10-15 the loss was ~20. Eventually, after about 30 epochs, I got a loss of ~6.5 and 99.06% accuracy on LFW.
The second time, I tried NormHead for one epoch and then changed to ArcHead; after one epoch with ArcHead the loss was down to 10. But as described above, the speed then dropped off.
As for the mathematical difference between the two heads: NormHead just uses softmax to ensure correct classification, so it does not work well on the boundary between classes, whereas ArcHead forces an angular margin between two classes so that they do not adjoin each other.
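
In equation form (the loss from the ArcFace paper, with margin $m$ and scale $s$):

$$
\mathcal{L} = -\log \frac{e^{s\cos(\theta_y + m)}}{e^{s\cos(\theta_y + m)} + \sum_{j \neq y} e^{s\cos\theta_j}}
$$

where $\theta_j$ is the angle between the normalized embedding and the normalized weight vector of class $j$, and $y$ is the true class. Setting $m = 0$ recovers a normalized softmax.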

@GranMin
If you look at this post and the thread at #4 (comment), you will see big differences from your results. Strange.
Nevertheless, if you still have your old logs and checkpoints from training with only ArcHead, try finding the checkpoint that corresponds to a loss of about 8-9 and evaluating it on the validation datasets.
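
Something along these lines (a sketch; `build_arcface_model` is a hypothetical builder and the checkpoint layout is assumed):

```python
import glob

model = build_arcface_model()  # hypothetical: must match the training config

# Try each saved checkpoint on the validation sets and log the accuracy.
for index_file in sorted(glob.glob("checkpoints/*.ckpt.index")):
    prefix = index_file[:-len(".index")]
    model.load_weights(prefix)
    # ... run the LFW / AgeDB-30 evaluation on `model` here ...
```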

Results as follows:
the first for loss 9.16, the second for loss 8.00.
[screenshot 1: validation results at loss 9.16]
[screenshot 2: validation results at loss 8.00]

@GranMin this is some mystery )

@GranMin could you post your *.yaml config file?
Anyway, if you figure out the reason for that strange training behavior, please write about it.

@GranMin There is one idea. For correct inference the model must be called as
model(input, training=False)
The author didn't do this for some reason.
So you can try adding training=False to the perform_val function in modules/evaluations.py.
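
A hypothetical helper showing the call (the real perform_val loop in modules/evaluations.py differs in its details):

```python
import numpy as np

def embed_batches(model, images, batch_size=64):
    """Embeddings for evaluation: training=False puts BatchNormalization
    (and Dropout) into inference mode, using the moving statistics."""
    outs = []
    for i in range(0, len(images), batch_size):
        outs.append(model(images[i:i + batch_size], training=False).numpy())
    return np.concatenate(outs, axis=0)
```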

@Androsimus Sorry for the long wait. Haha... I took a vacation to the Jiuzhaigou nature reserve last week.
I talked to my teacher and learned that they had broadened my dataset with some Asian faces.
I re-downloaded the dataset the author provided and got this result after just 2 epochs, using only the ArcFace head.
[screenshot: validation results after 2 epochs]
I've decided to train ResNet152 from scratch next. The trick of using softmax first may also help.
Again, thank you for your advice! Best wishes, my friend!

And with 5 epochs finished, this is the final result:
[screenshot: final validation results after 5 epochs]

@GranMin glad you got nice results :)
Best wishes!


How do you change the head (NormHead to ArcHead)? I tried it but got this error:
ValueError: Cannot assign value to variable 'conv2d/bias:0': Shape mismatch. The variable shape (24,), and the assigned value shape (32,) are incompatible.
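
That mismatch usually means the checkpoint was written by a model built from a different config than the one loading it (here a 24- vs. 32-filter conv, possibly a different backbone or width), so the first thing to verify is that the same *.yaml is used for both runs. One way to make the head swap safe (a sketch with assumed names, not this repository's API) is to keep the backbone as its own sub-model and transfer only its weights between phases:

```python
import tensorflow as tf

# Phase 1: backbone + NormHead-style softmax classifier.
backbone = tf.keras.applications.MobileNetV2(
    input_shape=(112, 112, 3), include_top=False, weights=None)
# ... train with the softmax head attached ...
backbone.save_weights("backbone.ckpt")  # save the backbone only, not the head

# Phase 2: rebuild the *same* backbone, restore its weights, attach ArcHead.
backbone2 = tf.keras.applications.MobileNetV2(
    input_shape=(112, 112, 3), include_top=False, weights=None)
backbone2.load_weights("backbone.ckpt")  # no head variables, so no shape conflict
```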