orrzohar / PROB

[CVPR 2023] Official Pytorch code for PROB: Probabilistic Objectness for Open World Object Detection


Four 3090s cannot reproduce the authors' results, why is that?

Rzx520 opened this issue · comments

> As you can see, I got the same results as @orrzohar shows in the paper. I wonder how many cards you used with batch_size = 2. I think if you use a single card, the result may be worse than what I got (I used four cards with batch_size = 3) @Rzx520. By the way, what are your final results? Are they far from the authors' results?

I used four cards with batch_size = 3; the result is:

{"train_lr": 1.999999999999943e-05, "train_class_error": 15.52755644357749, "train_grad_norm": 119.24543388206256, "train_loss": 5.189852057201781, "train_loss_bbox": 0.2700958194790585, "train_loss_bbox_0": 0.29624945830832017, "train_loss_bbox_1": 0.27978440371434526, "train_loss_bbox_2": 0.275065722955665, "train_loss_bbox_3": 0.27241891570675625, "train_loss_bbox_4": 0.27063051075218725, "train_loss_ce": 0.18834440561282928, "train_loss_ce_0": 0.27234036786085974, "train_loss_ce_1": 0.23321395799885028, "train_loss_ce_2": 0.20806531186409408, "train_loss_ce_3": 0.19453731594314128, "train_loss_ce_4": 0.18820172232765492, "train_loss_giou": 0.3351372324140976, "train_loss_giou_0": 0.3679243937037491, "train_loss_giou_1": 0.3483400315024699, "train_loss_giou_2": 0.34171414935044225, "train_loss_giou_3": 0.3379105142249501, "train_loss_giou_4": 0.3368650070453053, "train_loss_obj_ll": 0.02471167313379382, "train_loss_obj_ll_0": 0.034151954339996814, "train_loss_obj_ll_1": 0.03029250531194649, "train_loss_obj_ll_2": 0.0288731191750343, "train_loss_obj_ll_3": 0.028083207809715446, "train_loss_obj_ll_4": 0.026900355121292352, "train_cardinality_error_unscaled": 0.44506890101437985, "train_cardinality_error_0_unscaled": 0.6769398279525907, "train_cardinality_error_1_unscaled": 0.5726976196583499, "train_cardinality_error_2_unscaled": 0.4929900999093851, "train_cardinality_error_3_unscaled": 0.46150593285633223, "train_cardinality_error_4_unscaled": 0.45256225438417086, "train_class_error_unscaled": 15.52755644357749, "train_loss_bbox_unscaled": 0.054019163965779084, "train_loss_bbox_0_unscaled": 0.059249891647616536, "train_loss_bbox_1_unscaled": 0.055956880831476395, "train_loss_bbox_2_unscaled": 0.055013144572493046, "train_loss_bbox_3_unscaled": 0.054483783067331704, "train_loss_bbox_4_unscaled": 0.05412610215448962, "train_loss_ce_unscaled": 0.09417220280641464, "train_loss_ce_0_unscaled": 0.13617018393042987, "train_loss_ce_1_unscaled": 0.11660697899942514, "train_loss_ce_2_unscaled": 0.10403265593204704, "train_loss_ce_3_unscaled": 0.09726865797157064, "train_loss_ce_4_unscaled": 0.09410086116382746, "train_loss_giou_unscaled": 0.1675686162070488, "train_loss_giou_0_unscaled": 0.18396219685187454, "train_loss_giou_1_unscaled": 0.17417001575123495, "train_loss_giou_2_unscaled": 0.17085707467522113, "train_loss_giou_3_unscaled": 0.16895525711247505, "train_loss_giou_4_unscaled": 0.16843250352265265, "train_loss_obj_ll_unscaled": 30.889592197686543, "train_loss_obj_ll_0_unscaled": 42.68994404527915, "train_loss_obj_ll_1_unscaled": 37.86563257517548, "train_loss_obj_ll_2_unscaled": 36.09139981038161, "train_loss_obj_ll_3_unscaled": 35.10401065181873, "train_loss_obj_ll_4_unscaled": 33.62544476769816, "test_metrics": {"WI": 0.05356004827184098, "AOSA": 5220.0, "CK_AP50": 58.3890380859375, "CK_P50": 25.75118307055908, "CK_R50": 71.51227713815234, "K_AP50": 58.3890380859375, "K_P50": 25.75118307055908, "K_R50": 71.51227713815234, "U_AP50": 2.7862398624420166, "U_P50": 0.409358215516747, "U_R50": 16.530874785591767}, "test_coco_eval_bbox": [14.451444625854492, 14.451444625854492, 77.8148193359375, 57.15019607543945, 66.93928527832031, 49.282108306884766, 27.985671997070312, 70.54130554199219, 55.28901290893555, 82.7206039428711, 26.307403564453125, 65.15182495117188, 21.9127197265625, 77.91541290283203, 73.61457061767578, 67.8846206665039, 49.1287841796875, 36.78118896484375, 69.1879653930664, 53.060150146484375, 79.1402359008789, 59.972835540771484, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.7862398624420166], "epoch": 40, "n_parameters": 39742295}

The authors' results are:
U-R: 19.4, K-AP: 59.5
Why can't the authors' performance be reproduced?
@Hatins @orrzohar

Originally posted by @Rzx520 in #26 (comment)
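
As an aside on reading these log dumps: the headline metrics can be pulled out of an entry like the one above with a few lines of Python. A minimal sketch; the log path is a placeholder, not the repository's actual layout:

import json

# hypothetical path -- point it at the training log containing JSON entries like the one above
with open("exps/prob_t1/log.txt") as f:
    last = json.loads(f.readlines()[-1])   # one JSON object per line (one per epoch)

m = last["test_metrics"]
print("K_AP50 =", m["K_AP50"], "U_R50 =", m["U_R50"], "WI =", m["WI"])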

Hi @Rzx520,
When you change optimization hyperparameters, the results will inevitably change. That is true for PROB and nearly all deep learning models.

Luckily, PROB is relatively robust and requires minimal hyperparameter tuning to match our performance, at least on all the systems I have encountered. Specifically, with Titan RTX 3090 our results were already reproduced (see Issue #26). On a 3090x4 system, lr_drop needed to be increased to 40 to match our reported results. If you have a different number of GPUs, there may be a better lr_drop value for your system.
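
For context, in Deformable DETR-style training code lr_drop is usually just the step size of a StepLR schedule, so changing it moves the epoch at which the learning rate is cut by 10x. A minimal sketch with a stand-in model, not PROB's actual training loop:

import torch

model = torch.nn.Linear(10, 2)   # stand-in for the detector
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=1e-4)
lr_drop = 40                     # the value discussed in this thread
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=lr_drop)

for epoch in range(51):
    # ... one training epoch ...
    lr_scheduler.step()          # lr is multiplied by 0.1 every lr_drop epochs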

I am happy to help with this process, but to do so, I need to see your training curves.

Best,
Orr

The result above was obtained after adjusting lr_drop to 40, so I am quite confused.

Did you use the same number of GPUs as in #26?
If not, then if you share your training curves I could try and help you with hyperparameter optimization.

Yes, I also used 4 GPUs. Thank you very much. Since I turned off Wandb, I have to retrain to obtain the training curves. This may take a while, as the server is being used.

[training curve screenshots]

Above are the results of training with the following parameter settings. @orrzohar

################ Deformable DETR ################
parser.add_argument('--lr', default=2e-4, type=float)
parser.add_argument('--lr_backbone_names', default=["backbone.0"], type=str, nargs='+')
parser.add_argument('--lr_backbone', default=2e-5, type=float)
parser.add_argument('--lr_linear_proj_names', default=['reference_points', 'sampling_offsets'], type=str, nargs='+')
parser.add_argument('--lr_linear_proj_mult', default=0.1, type=float)
#parser.add_argument('--batch_size', default=5, type=int)
#parser.add_argument('--batch_size', default=3, type=int)
parser.add_argument('--batch_size', default=2, type=int)
parser.add_argument('--weight_decay', default=1e-4, type=float)
parser.add_argument('--epochs', default=51, type=int)
#parser.add_argument('--lr_drop', default=35, type=int)
parser.add_argument('--lr_drop', default=40, type=int)

parser.add_argument('--lr_drop_epochs', default=None, type=int, nargs='+')
parser.add_argument('--clip_max_norm', default=0.1, type=float,
                    help='gradient clipping max norm')
parser.add_argument('--sgd', action='store_true')

Hi @Rzx520,
You are overtraining the model; you should reduce lr_drop to about 150k iterations (lr_drop = 30).
I am concerned that you are using the same system as in #26 but getting different optimization results; I wonder how the two systems differ.
Best,
Orr
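
As a rough back-of-the-envelope for how an iteration budget like "150k iterations" maps onto an lr_drop epoch; the dataset size below is a placeholder, not the actual T1 split size:

# Rough arithmetic only; num_train_images is hypothetical -- use your own T1 split size.
num_train_images = 60_000
batch_size_per_gpu = 3
num_gpus = 4

effective_batch = batch_size_per_gpu * num_gpus        # 12 images per optimizer step
iters_per_epoch = num_train_images // effective_batch  # 5000 steps/epoch in this example
epochs_for_150k = 150_000 / iters_per_epoch            # 30 epochs in this example

print(f"{iters_per_epoch} iters/epoch -> drop the lr after ~{epochs_for_150k:.0f} epochs")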

I am trying lr_drop = 30; I will report back when the training results are available. I also wonder how the two systems differ, so I asked some questions in #26 (comment).

Hi @Rzx520,
I see; I do not know Hatins, so I have no way of facilitating communication.
I am very surprised that you both used 4x3090s but are getting different results.

[training curve screenshots]

Above are the results of training with lr_drop = 30 and the parameter settings below. @orrzohar

################ Deformable DETR ################
parser.add_argument('--lr', default=2e-4, type=float)
parser.add_argument('--lr_backbone_names', default=["backbone.0"], type=str, nargs='+')
parser.add_argument('--lr_backbone', default=2e-5, type=float)
parser.add_argument('--lr_linear_proj_names', default=['reference_points', 'sampling_offsets'], type=str, nargs='+')
parser.add_argument('--lr_linear_proj_mult', default=0.1, type=float)
#parser.add_argument('--batch_size', default=5, type=int)
#parser.add_argument('--batch_size', default=3, type=int)
parser.add_argument('--batch_size', default=2, type=int)
parser.add_argument('--weight_decay', default=1e-4, type=float)
parser.add_argument('--epochs', default=51, type=int)
#parser.add_argument('--lr_drop', default=35, type=int)
parser.add_argument('--lr_drop', default=40, type=int)

parser.add_argument('--lr_drop_epochs', default=None, type=int, nargs='+')
parser.add_argument('--clip_max_norm', default=0.1, type=float,
                    help='gradient clipping max norm')
parser.add_argument('--sgd', action='store_true')

Hi @Rzx520,
I noticed that you used batch_size=2, not batch_size=3 like Hatins did in #26.
Why is that the case? That could be a reason for the U_R50 discrepancy.
A broad note: a general trend I see is that the smaller the batch size, the less training that can be done without hurting U_R50.

I also noticed that Hatins reported similarly poorer results when using batch_size=2.
Best,
Orr
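
As a general aside, not something prescribed in this thread: one common rule of thumb when the effective batch size changes is to scale the base learning rate linearly. A sketch with an assumed reference setup:

# Linear LR scaling heuristic -- NOT the PROB authors' recommendation, just a common starting point.
reference_lr = 2e-4         # default --lr in the argparse snippets above
reference_batch = 4 * 5     # assumed reference: 4 GPUs x batch_size 5 (hypothetical)
your_batch = 4 * 2          # 4 GPUs x batch_size 2, as used in this run

scaled_lr = reference_lr * your_batch / reference_batch
print(f"linearly scaled lr: {scaled_lr:.1e}")   # 8.0e-05 in this example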


What I tried at the beginning was batch_size = 3, and those results are shown above. I set batch_size to 2 following the parameter settings of OW-DETR. @orrzohar

I have some gains now: when I set lr to 1e-4 with lr_drop = 35 and batch_size = 3, the results improve, but K_AP only reached 58.3, not 59.4. Can you provide some suggestions?

################ Deformable DETR ################
parser.add_argument('--lr', default=1e-4, type=float)
parser.add_argument('--lr_backbone_names', default=["backbone.0"], type=str, nargs='+')
parser.add_argument('--lr_backbone', default=2e-5, type=float)
parser.add_argument('--lr_linear_proj_names', default=['reference_points', 'sampling_offsets'], type=str, nargs='+')
parser.add_argument('--lr_linear_proj_mult', default=0.1, type=float)
#parser.add_argument('--batch_size', default=5, type=int)
parser.add_argument('--batch_size', default=3, type=int)
#parser.add_argument('--batch_size', default=2, type=int)
parser.add_argument('--weight_decay', default=1e-4, type=float)
parser.add_argument('--epochs', default=51, type=int)
#parser.add_argument('--lr_drop', default=30, type=int)
parser.add_argument('--lr_drop', default=35, type=int)
#parser.add_argument('--lr_drop', default=40, type=int)

parser.add_argument('--lr_drop_epochs', default=None, type=int, nargs='+')
parser.add_argument('--clip_max_norm', default=0.1, type=float,
                    help='gradient clipping max norm')
parser.add_argument('--sgd', action='store_true')

[training curve screenshots]

Hi @Rzx520,
Are you still using 4 x Titan RTX?
Generally, to get a higher K_AP50 you need to train for longer, but the longer you train, the more U_R goes down. The trick is to hit the right balance between the two.
Looking at your chart, I think you can reduce lr_drop to 30, as the last 5 epochs are saturated before the lr_drop. This will give you 5 additional epochs at the lower learning rate and will hopefully improve the results.

To clarify, to run this experiment you DO NOT need to restart from scratch -- your model should have saved the checkpoint for epoch 30 and then you only need to train for the last 10 epochs after the lr_drop. Just make sure the lr is indeed lowered.
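
A minimal sketch of that resume-and-verify step; the checkpoint path and keys are assumptions about a Deformable DETR-style checkpoint, not necessarily PROB's exact format:

import torch

# stand-ins; in practice these are the detector, its optimizer, and its lr scheduler
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30)

ckpt = torch.load("exps/checkpoint0030.pth", map_location="cpu")  # hypothetical path
model.load_state_dict(ckpt["model"])              # assumed checkpoint keys
optimizer.load_state_dict(ckpt["optimizer"])
lr_scheduler.load_state_dict(ckpt["lr_scheduler"])

# verify the drop actually happened before training the last ~10 epochs
for group in optimizer.param_groups:
    print("lr =", group["lr"])                    # expect the base lr x 0.1 after epoch 30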

Best,
Orr

I am trying lr_drop = 30; I will present the results here. @orrzohar

Hi @Rzx520,
OK great, thank you.
Would you mind confirming what system you are using, for future reproducibility on similar systems?
Best,
Orr

lr_drop = 30 with parser.add_argument('--eval_every', default=1, type=int): @orrzohar this result is not as good as lr_drop = 35.

[training curve screenshots]

Linux ubuntu 5.15.0-86-generic #96~20.04.1-Ubuntu

Hi @Rzx520,
OK I am trying to compile everything we have seen thus far:

--lr 2e-4, --lr_drop 40, --epochs 51 --batch_size 2 -> AP50=58.4, U_R=16.5
--lr 1e-4, --lr_drop 35, --epochs 51 --batch_size 3 -> AP50=58.4, U_R=19.4
--lr 1e-4, --lr_drop 30, --epochs 51 --batch_size 3 -> worse than the above

Is that correct? Also, since lr_drop 35->30 had an adverse effect, have you tried:
--lr 1e-4, --lr_drop 40, --epochs 51 --batch_size 3

Best,
Orr

Yes, I did.
--lr 1e-4, --lr_drop 30, --epochs 41 --batch_size 3
[training curve screenshot]

AP50=58.1 U_R=19.5

@orrzohar One note: the number of epochs for the results above is not the default value, but 41.

Hi @Rzx520,

Are the results above for:
--lr 1e-4, --lr_drop 40, --epochs 51 --batch_size 3?

And of course the hyperparameters changed -- you changed the batch size because it did not fit on your GPUs, and that changes other hyperparameters.
Best,
Orr

Hi @Rzx520,
I am closing this for now. If you can confirm what configuration you used to get the best results, I will add this to the README.
Best,
Orr