YueLiao / CDN

Code for "Mining the Benefits of Two-stage and One-stage HOI Detection"

dynamic reweighting causes performance degradation in reproducing

Charles-Xie opened this issue · comments

Hi,
thanks for sharing the code! Great work!

I have a small question about reproducing your result.
I ran the CDN-S model (res50, 3+3). It reached about 31.5 and 31.2 (I ran it twice) after the first training stage (training the whole model with the regular loss). But after the second training stage (decoupled training) finished, performance dropped to 31.0 and 30.4 for the two runs respectively. For full mAP, rare mAP and non-rare mAP, this trick does not seem to help.

So I wonder what could go wrong in my reproduction, or what the reason might be. I will paste the commands and logs below. Thanks. Nice day :3

command (exactly the same as the one provided in readme, except the output_dir):

python3 -m torch.distributed.launch \
        --nproc_per_node=8 \
        --use_env \
        main.py \
        --pretrained params/detr-r50-pre-2stage-q64.pth \
        --output_dir logs/base \
        --dataset_file hico \
        --hoi_path data/hico_20160224_det \
        --num_obj_classes 80 \
        --num_verb_classes 117 \
        --backbone resnet50 \
        --num_queries 64 \
        --dec_layers_hopd 3 \
        --dec_layers_interaction 3 \
        --epochs 90 \
        --lr_drop 60 \
        --use_nms_filter

python3 -m torch.distributed.launch \
        --nproc_per_node=8 \
        --use_env \
        main.py \
        --pretrained logs/base/checkpoint_last.pth \
        --output_dir logs/base \
        --dataset_file hico \
        --hoi_path data/hico_20160224_det \
        --num_obj_classes 80 \
        --num_verb_classes 117 \
        --backbone resnet50 \
        --num_queries 64 \
        --dec_layers_hopd 3 \
        --dec_layers_interaction 3 \
        --epochs 10 \
        --freeze_mode 1 \
        --obj_reweight \
        --verb_reweight \
        --use_nms_filter

echo "base"

corresponding result (log): 31.5 after 1st training stage and 31.0 after 2nd training stage:
log.txt
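As an aside on `--freeze_mode 1`: the decoupled second stage presumably freezes part of the stage-one model while fine-tuning the rest. A generic PyTorch-style sketch of such parameter freezing (my own simplification; the module names are illustrative, not CDN's actual ones):

```python
import torch.nn as nn

def freeze_except(model: nn.Module, trainable_prefixes):
    """Leave requires_grad=True only for parameters under the given name prefixes."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in trainable_prefixes)

# illustrative two-decoder layout, not the repo's real module names
model = nn.ModuleDict({
    "hopd_decoder": nn.Linear(4, 4),
    "interaction_decoder": nn.Linear(4, 4),
})
freeze_except(model, trainable_prefixes=["interaction_decoder"])
assert not any(p.requires_grad for p in model["hopd_decoder"].parameters())
assert all(p.requires_grad for p in model["interaction_decoder"].parameters())
```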

for the 2nd run:
command (exactly the same as the one provided in readme, except the output_dir):

python3 -m torch.distributed.launch \
        --nproc_per_node=8 \
        --use_env \
        main.py \
        --pretrained params/detr-r50-pre-2stage-q64.pth \
        --output_dir logs/base_4worker \
        --dataset_file hico \
        --hoi_path data/hico_20160224_det \
        --num_obj_classes 80 \
        --num_verb_classes 117 \
        --backbone resnet50 \
        --num_queries 64 \
        --dec_layers_hopd 3 \
        --dec_layers_interaction 3 \
        --epochs 90 \
        --lr_drop 60 \
        --use_nms_filter \
        --num_workers 4

python3 -m torch.distributed.launch \
        --nproc_per_node=8 \
        --use_env \
        main.py \
        --pretrained logs/base_4worker/checkpoint_last.pth \
        --output_dir logs/base_4worker \
        --dataset_file hico \
        --hoi_path data/hico_20160224_det \
        --num_obj_classes 80 \
        --num_verb_classes 117 \
        --backbone resnet50 \
        --num_queries 64 \
        --dec_layers_hopd 3 \
        --dec_layers_interaction 3 \
        --epochs 10 \
        --freeze_mode 1 \
        --obj_reweight \
        --verb_reweight \
        --use_nms_filter \
        --num_workers 4

echo "base_4worker"

corresponding result (log): 31.2 after 1st training stage and 30.4 after 2nd training stage:
log.txt

This module is implemented by @zhangaixi2008, and he will reply to you later.


Aha, 31.5%, a new SOTA with CDN-S.

You can have a try with the following command.
python -m torch.distributed.launch \
        --master_port 10026 \
        --nproc_per_node=4 \
        --use_env \
        main.py \
        --pretrained logs/base/checkpoint_last.pth \
        --output_dir logs/base \
        --hoi \
        --dataset_file hico \
        --hoi_path data/hico_20160224_det \
        --num_obj_classes 80 \
        --num_verb_classes 117 \
        --backbone resnet50 \
        --set_cost_bbox 2.5 \
        --set_cost_giou 1 \
        --bbox_loss_coef 2.5 \
        --giou_loss_coef 1 \
        --num_queries 64 \
        --dec_layers_stage1 3 \
        --dec_layers_stage2 3 \
        --epochs 10 \
        --freeze_mode 1 \
        --obj_reweight \
        --verb_reweight \
        --queue_size 9408 \
        --p_obj 0.7 \
        --p_verb 0.7 \
        --lr 5e-6 \
        --lr_backbone 5e-7 \
        --use_nms_filter
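For context, `--obj_reweight` / `--verb_reweight` enable dynamic class reweighting, with `--p_obj` / `--p_verb` as exponents and `--queue_size` bounding a running queue of recent labels. A minimal inverse-frequency sketch in that spirit (my own simplification, not the repo's actual implementation):

```python
from collections import Counter, deque

def class_weights(label_queue, num_classes, p=0.7, eps=1e-6):
    """Inverse-frequency weights with exponent p, normalized to mean 1."""
    counts = Counter(label_queue)
    freqs = [counts.get(c, 0) + eps for c in range(num_classes)]
    raw = [(1.0 / f) ** p for f in freqs]
    mean = sum(raw) / num_classes
    return [w / mean for w in raw]

# toy usage: a bounded queue of recently seen verb labels
queue = deque([0, 0, 0, 1, 1, 2], maxlen=9408)
w = class_weights(queue, num_classes=3)
assert w[2] > w[1] > w[0]  # rarer classes get larger weights
```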

commented

@zhangaixi2008 It does not work for me either. The reweighted retraining leads to a performance drop.
Details are as follows:

  1. Using given script training on HICO-DET:

CDN S:

  • best in the first 90 epochs: 31.71
  • fine-tuning degrades it to 30.96

CDN B:

  • best in the first 90 epochs: 31.6
  • fine-tuning degrades it to 30.6

I'm using the above script to re-run the CDN-S fine-tuning.

@Haak0 please upload your model here, and let me have a look.

commented

@zhangaixi2008 Hi, some of my checkpoints are overwritten. I am re-running the experiments.

commented

Hi, I re-did the experiments, and here are the logs.
CDN small:

  • best in the first 90 epochs: 30.99
  • fine-tuning degrades it to 30.3
    Here are the script and log: small.txt

CDN base:

  • best in the first 90 epochs: 31.98
  • fine-tuning degrades it to 30.6
    Here are the script and log: base.txt

Hi, I made a mistake in the previous readme for the fine-tuning step. Please use the script I provided above in this issue. As claimed in the paper, we use a small learning rate to fine-tune the first-stage model. Thus, we set lr to 5e-6 and lr_backbone to 5e-7 for bs=8, or lr to 1e-5 and lr_backbone to 1e-6 for bs=16. Please try again and let's see the results.
Sorry for our carelessness; we have already updated the readme.
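The two settings above are consistent with the usual linear scaling rule between batch size and learning rate; a trivial helper (my own, not from the repo) to check the arithmetic:

```python
def scale_lr(base_lr: float, base_bs: int, bs: int) -> float:
    """Linearly scale the learning rate with batch size."""
    return base_lr * bs / base_bs

# bs=8 -> bs=16 doubles both lr and lr_backbone
assert abs(scale_lr(5e-6, 8, 16) - 1e-5) < 1e-18
assert abs(scale_lr(5e-7, 8, 16) - 1e-6) < 1e-18
```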

commented

@zhangaixi2008 Hi, I reproduced the fine-tuning result following your script, and the result is reasonable.
CDN-base:

  • first 90 epochs: 32.05
  • fine-tune: 32.12
    All the results are evaluated using the Python script.

BTW, what's the meaning of the "vis_tag" in hico_eval.py?

For CDN-base, you have already surpassed our reported results (official MATLAB eval: 31.78, Python eval: 31.86) in our paper. Good job ^^
For 'vis_tag', you can check the evaluation script. In short, we filter out already-matched ground-truth HOIs when computing FP and TP during evaluation.
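A rough sketch of that matching logic (my own simplified version, not the actual hico_eval.py code): each ground-truth HOI carries a visited tag; a detection counts as TP only if it matches a not-yet-visited GT, which is then marked so that duplicate detections of the same GT count as FP:

```python
def match_detections(dets, gts, is_match):
    """Greedy TP/FP assignment: dets sorted by score, each GT matched at most once.

    dets: detections, highest score first
    gts: ground-truth HOIs
    is_match(det, gt) -> bool: whether det matches gt (e.g. IoU + class check)
    Returns a list of 1 (TP) / 0 (FP) flags, one per detection.
    """
    visited = [False] * len(gts)  # the "vis_tag" idea
    flags = []
    for det in dets:
        tp = 0
        for i, gt in enumerate(gts):
            if not visited[i] and is_match(det, gt):
                visited[i] = True  # filter this GT from later matches
                tp = 1
                break
        flags.append(tp)
    return flags

# toy example: two detections hitting the same single GT
flags = match_detections(["d_high", "d_low"], ["gt0"], lambda d, g: True)
assert flags == [1, 0]  # the duplicate detection becomes an FP
```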

The issue about the re-weight module seems to be solved. If you run into anything else, feel free to open a new issue.