orrzohar / PROB

[CVPR 2023] Official Pytorch code for PROB: Probabilistic Objectness for Open World Object Detection


Experimental results

lshssel opened this issue · comments

Hi,
Because our group only assigned me a single 2080Ti, training takes a long time: task 1 of MOWODB took 43 hours.
Unfortunately, wandb crashed at the 35th epoch, so its curve also stops there.
However, the program kept running without errors, "checkpoint0040.pth" was generated at the end, and the program runs smoothly when I use this file to train task 2.

Below are the wandb graphs and hyperparameters. The results are not very good, so I may need to tune the parameters to get as close to the original performance as possible.

K_AP50 is 52.476, U_R50 is 21.042

[screenshots: wandb training curves and hyperparameters]

Second experiment
[screenshots: wandb curves for the second experiment]

Hi @lshssel,

Hmmm, batch_size=1 will be more difficult to fine-tune, but let's try.
Given your experiments, I would try:
lr=2e-5, lr_drop=60, epochs=70

The main idea is that you want the improvement to saturate and then reduce the learning rate. Continuing to train after the improvement has saturated doesn't help at all (U_R just goes down and AP50 doesn't go up), but if you set lr_drop too early (before AP50 starts to saturate), then K_AP50 is 'frozen' too soon and doesn't improve enough.
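For reference, a single-GPU run with those settings might look roughly like the sketch below. The flag names are assumptions based on the hyperparameters discussed in this thread; check main_open_world.py's argument parser before running.

```bash
# Hypothetical single-GPU task-1 training run on a 2080Ti (flag names assumed, not verified)
python main_open_world.py \
    --batch_size 1 \
    --lr 2e-5 \
    --lr_drop 60 \
    --epochs 70
```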

Best,
Orr

Thanks for the reply, I'll try it later.

Hi @lshssel,
Were your results sufficiently improved?
If so, can you give me all the details regarding your system/hyperparameters so I can add them to the README for future users?
Best,
Orr

One 2080Ti for all experiments, batch_size=1.
For epochs, the first value is the one in "main_open_world.py" and the second is the one in "M_OWOD_BENCHMARK.sh".
| Run | lr | lr_backbone | epochs | lr_drop | K_AP50 | U_R50 |
| --- | --- | --- | --- | --- | --- | --- |
| t1.2 | 4e-5 | 4e-6 | 51 / 41 | 35 | 58.36 | 16.50 |
| t1.3 | 2e-5 | 4e-6 | 51 / 41 | 35 | 57.99 | 19.27 |
| t1.4 | 2e-5 | 4e-6 | 56 / 46 | 40 | 57.60 | 18.55 |
| t1.6 | 2e-5 | 4e-6 | 61 / 41 | 40 | 57.17 | 19.34 |

Looking forward to your suggestions!

Hi @lshssel,
I would like to try something new with you. My idea is that with a different batch size, the objectness temperature also needs to change.
Good news: no training needed. I would take the t1.2 and t1.3 checkpoints and re-evaluate them with different --obj_temp values, sweeping a few (e.g., 0.9, 1.1, 1.2); the default is 1. This should be relatively quick, as you only need to evaluate (use the --eval flag).
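A minimal sketch of such a sweep is below. The --eval and --obj_temp flags come from this thread; using --resume to point at a checkpoint is an assumption, so check main_open_world.py's arguments before running.

```bash
# Hypothetical obj_temp sweep at evaluation time (no retraining); --resume usage is assumed
for T in 0.9 1.1 1.2; do
    python main_open_world.py \
        --eval \
        --resume path/to/checkpoint0040.pth \
        --obj_temp "$T" \
        --batch_size 1
done
```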

Best,
Orr

I evaluated with t1.3's checkpoint0040 (t1.2 has already been deleted). During training, obj_temp=1.3 and obj_loss_coef=8e-4.
I also evaluated with different obj_loss_coef values, but nothing changed:

| obj_temp (eval) | K_AP50 | U_R50 |
| --- | --- | --- |
| 1.1 | 57.1914 | 19.2453 |
| 1.2 | 57.6161 | 19.2581 |
| 1.3 (training value, obj_loss_coef=8e-4) | 57.9826 | 19.2624 |
| 1.4 | 57.9075 | 19.2367 |
| 1.5 | 57.8653 | 19.2453 |

| obj_loss_coef (eval) | K_AP50 | U_R50 |
| --- | --- | --- |
| 4e-4 | 57.9826 | 19.2624 |
| 8e-4 | 57.9826 | 19.2624 |
| 1.6e-3 | 57.9826 | 19.2624 |
| 4e-3 | 57.9826 | 19.2624 |

So t1.3 is probably the best result a 2080Ti can produce.

Hi @lshssel,
I want to ensure you understand you don't need to train with a different obj_temp -- you can change this just for evaluation. Unfortunately, it does seem that this is the best result with batch_size=1. Perhaps we could improve it a little more, but probably not much.

I want to add this to the readme. Would you mind providing all the hyperparameters you changed?

Hi,
Yes, I understand what you mean; I used the different obj_temp values only for evaluation.
As mentioned above, changing obj_temp did not improve performance.
With batch_size=2 I get a CUDA out-of-memory error, so it can only be 1 on a 2080Ti (11 GB).
My hyperparameters are:
lr=2e-5, lr_backbone=4e-6, batch_size=1; nothing else was changed.
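Put together as a command line, that configuration would correspond to something like the sketch below (flag names assumed from main_open_world.py; everything else left at the repository defaults).

```bash
# Hypothetical task-1 command for a 2080Ti (11 GB); only these flags differ from the defaults
python main_open_world.py \
    --batch_size 1 \
    --lr 2e-5 \
    --lr_backbone 4e-6
```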
Thank you again for your excellent work and answering my questions.

WangPingA commented:

[screenshot: full benchmark results on a 2080Ti]
Hello, I also used a 2080Ti to run the entire set of experiments, following "lr=2e-5, lr_backbone=4e-6, batch_size=1, obj_temp=1.3". My results are shown in the attached figure. I don't know why some results are actually higher than those reported in the paper. By the way, it took me about 8 days to complete all the experiments.

Hi @WangPingA,

When you train a model with a different batch size, your results will vary, because the gradient updates are not the same. Variations of ±2 seem reasonable.

lshssel also ran experiments with a 2080Ti; their results are quoted earlier in this thread.

If you are interested in applications, then perhaps my recent work, FOMO, will interest you; it is much less compute-heavy to train and has relatively strong open-world performance by leveraging a foundation object detection model. An easy upgrade there is to switch owl-vit to owlv2.

Best,
Orr