arthurdouillard / CVPR2021_PLOP

Official code of CVPR 2021's PLOP: Learning without Forgetting for Continual Semantic Segmentation

Home Page: https://arxiv.org/abs/2011.11390

Extremely hard incremental scenario

qjadud1994 opened this issue · comments

commented

Thank you for your great work, PLOP!

Using your code, we can reproduce performance nearly identical to the paper.

Additionally, we also conducted experiments on extremely hard incremental scenarios, such as 5-1 (16 steps) and 2-1 (19 steps).

For this, we added the lines below to task.py:

"5-1": {
        0 : [0, 1, 2, 3, 4, 5],
        1 : [6, ],
        2 : [7, ],
        3 : [8, ],
        4 : [9, ],
        5 : [10, ],
        6 : [11, ],
        7 : [12, ],
        8 : [13, ],
        9 : [14, ],
        10: [15, ],
        11: [16, ],
        12: [17, ],
        13: [18, ],
        14: [19, ],
        15: [20, ],
    },   
    "2-1":{
        0 : [0, 1, 2],
        1 : [3, ],
        2 : [4, ],
        3 : [5, ],
        4 : [6, ],
        5 : [7, ],
        6 : [8, ],
        7 : [9, ],
        8 : [10, ],
        9 : [11, ],
        10: [12, ],
        11: [13, ],
        12: [14, ],
        13: [15, ],
        14: [16, ],
        15: [17, ],
        16: [18, ],
        17: [19, ],
        18: [20, ],
    },   
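
A quick standalone sanity check for such a split (just a snippet on the side, not part of the repository) is to verify that every VOC label 0-20 appears exactly once across the steps:

def check_split(split):
    # All 21 VOC labels (background + 20 classes) must appear exactly once.
    labels = sorted(c for step in split.values() for c in step)
    assert labels == list(range(21)), "missing or duplicated labels"
    assert 0 in split[0], "background (0) should be in the initial step"

# The 2-1 split above: [0, 1, 2] first, then one class per step up to 20.
check_split({0: [0, 1, 2], **{s: [s + 2] for s in range(1, 19)}})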

However, during training, the loss diverges to NaN.

I already noticed that someone ran into the loss divergence issue (#8) on the 15-5 task; however, I can reproduce the reported 15-5 performance in my environment.
Also, MiB trained fine on these extremely hard scenarios without any loss divergence, whereas PLOP shows the issue.

Therefore, I wonder whether you also hit the same issue in the extremely hard scenarios, 5-1 and 2-1.
If so, please tell me how I can solve it (e.g., which hyperparameters should be changed).

Thanks.

Hey,

  1. I'm happy to see other people try hard scenarios with only one class at a time :D! It's so important, but so few people even try it...
  2. Do you have mixed precision activated (related to issue #8)?
  3. Have you tried adding gradient clipping? https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html You may have a gradient explosion problem due to the LocalPOD loss; see the sketch right after this list.
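
For reference, this is roughly the pattern I have in mind (a minimal, self-contained sketch with a toy model standing in for the segmentation network; the max_norm value is a placeholder, not something I have tuned):

import torch

model = torch.nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

loss = model(torch.randn(4, 8)).pow(2).mean()  # placeholder for CE + LocalPOD
optimizer.zero_grad()
loss.backward()
# Rescale gradients so their global L2 norm is at most 1.0, capping the effect
# of one exploding LocalPOD batch on the weight update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()

# With apex amp (opt_level O1), the master params are clipped instead:
# torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm=1.0)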

I'll try this week to see if I can run your experiments. However, just so I know: when does the loss diverge? In which tasks and at which epochs? For instance, if I run only 1 epoch per task, can I reproduce this error?

commented

Thank you for your prompt reply!

  1. We tried both with AMP and without AMP; both showed the same loss divergence.
  2. We did not try any other technique; we strictly followed the original implementation.

On the 15-5 and 19-1 tasks, we saw loss=NaN at a certain epoch, but the loss returned to a normal value in the following epoch.
However, on the 5-1 and 2-1 tasks, the loss stays at NaN and never recovers.

Ok, I don't have a lot of GPUs available for PLOP right now, so it may take me a few days to sort this problem out.

I'll first try to reproduce it with 5-1, and then add gradient-norm clipping to see if that solves the exploding loss.

I'll keep you updated.

I got some results with 5-1:

INFO:rank0: Closing the Logger.
Last Step: 15
Final Mean IoU 5.62
Average Mean IoU 26.01
Mean IoU first 3.64
Mean IoU last 6.41
voc_5-1_PLOP On GPUs 0,1
Run in 33315s

Given the following script:

#!/bin/bash

set -e

start=`date +%s`

START_DATE=$(date '+%Y-%m-%d')

PORT=$((9000 + RANDOM % 1000))
GPU=0,1
NB_GPU=2


DATA_ROOT=/local/douillard/pascal_voc_2012

DATASET=voc
TASK=5-1
NAME=PLOP
METHOD=PLOP
OPTIONS="--checkpoint /local/douillard/continual_segmentation/checkpoints/step"

SCREENNAME="${DATASET}_${TASK}_${NAME} On GPUs ${GPU}"

RESULTSFILE=results/${START_DATE}_${DATASET}_${TASK}_${NAME}.csv
rm -f ${RESULTSFILE}

echo -ne "\ek${SCREENNAME}\e\\"

echo "Writing in ${RESULTSFILE}"

# If you already trained the model for the first step, you can re-use those weights
# in order to skip this initial step --> faster iteration on your model
# Set this variable with the weights path
# FIRSTMODEL=/path/to/my/first/weights
# Then, for the first step, append those options:
# --ckpt ${FIRSTMODEL} --test
# And for the second step, this option:
# --step_ckpt ${FIRSTMODEL}

BATCH_SIZE=12
INITIAL_EPOCHS=30
EPOCHS=10

CUDA_VISIBLE_DEVICES=${GPU} python3 -m torch.distributed.launch --master_port ${PORT} --nproc_per_node=${NB_GPU} run.py --date ${START_DATE} --data_root ${DATA_ROOT} --overlap --batch_size ${BATCH_SIZE} --dataset ${DATASET} --name ${NAME} --task ${TASK} --step 0 --lr 0.01 --epochs ${INITIAL_EPOCHS} --method ${METHOD} --opt_level O1 ${OPTIONS}
CUDA_VISIBLE_DEVICES=${GPU} python3 -m torch.distributed.launch --master_port ${PORT} --nproc_per_node=${NB_GPU} run.py --date ${START_DATE} --data_root ${DATA_ROOT} --overlap --batch_size ${BATCH_SIZE} --dataset ${DATASET} --name ${NAME} --task ${TASK} --step 1 --lr 0.001 --epochs ${EPOCHS} --method ${METHOD} --opt_level O1 ${OPTIONS}
CUDA_VISIBLE_DEVICES=${GPU} python3 -m torch.distributed.launch --master_port ${PORT} --nproc_per_node=${NB_GPU} run.py --date ${START_DATE} --data_root ${DATA_ROOT} --overlap --batch_size ${BATCH_SIZE} --dataset ${DATASET} --name ${NAME} --task ${TASK} --step 2 --lr 0.001 --epochs ${EPOCHS} --method ${METHOD} --opt_level O1 ${OPTIONS}
CUDA_VISIBLE_DEVICES=${GPU} python3 -m torch.distributed.launch --master_port ${PORT} --nproc_per_node=${NB_GPU} run.py --date ${START_DATE} --data_root ${DATA_ROOT} --overlap --batch_size ${BATCH_SIZE} --dataset ${DATASET} --name ${NAME} --task ${TASK} --step 3 --lr 0.001 --epochs ${EPOCHS} --method ${METHOD} --opt_level O1 ${OPTIONS}
CUDA_VISIBLE_DEVICES=${GPU} python3 -m torch.distributed.launch --master_port ${PORT} --nproc_per_node=${NB_GPU} run.py --date ${START_DATE} --data_root ${DATA_ROOT} --overlap --batch_size ${BATCH_SIZE} --dataset ${DATASET} --name ${NAME} --task ${TASK} --step 4 --lr 0.001 --epochs ${EPOCHS} --method ${METHOD} --opt_level O1 ${OPTIONS}
CUDA_VISIBLE_DEVICES=${GPU} python3 -m torch.distributed.launch --master_port ${PORT} --nproc_per_node=${NB_GPU} run.py --date ${START_DATE} --data_root ${DATA_ROOT} --overlap --batch_size ${BATCH_SIZE} --dataset ${DATASET} --name ${NAME} --task ${TASK} --step 5 --lr 0.001 --epochs ${EPOCHS} --method ${METHOD} --opt_level O1 ${OPTIONS}
CUDA_VISIBLE_DEVICES=${GPU} python3 -m torch.distributed.launch --master_port ${PORT} --nproc_per_node=${NB_GPU} run.py --date ${START_DATE} --data_root ${DATA_ROOT} --overlap --batch_size ${BATCH_SIZE} --dataset ${DATASET} --name ${NAME} --task ${TASK} --step 6 --lr 0.001 --epochs ${EPOCHS} --method ${METHOD} --opt_level O1 ${OPTIONS}
CUDA_VISIBLE_DEVICES=${GPU} python3 -m torch.distributed.launch --master_port ${PORT} --nproc_per_node=${NB_GPU} run.py --date ${START_DATE} --data_root ${DATA_ROOT} --overlap --batch_size ${BATCH_SIZE} --dataset ${DATASET} --name ${NAME} --task ${TASK} --step 7 --lr 0.001 --epochs ${EPOCHS} --method ${METHOD} --opt_level O1 ${OPTIONS}
CUDA_VISIBLE_DEVICES=${GPU} python3 -m torch.distributed.launch --master_port ${PORT} --nproc_per_node=${NB_GPU} run.py --date ${START_DATE} --data_root ${DATA_ROOT} --overlap --batch_size ${BATCH_SIZE} --dataset ${DATASET} --name ${NAME} --task ${TASK} --step 8 --lr 0.001 --epochs ${EPOCHS} --method ${METHOD} --opt_level O1 ${OPTIONS}
CUDA_VISIBLE_DEVICES=${GPU} python3 -m torch.distributed.launch --master_port ${PORT} --nproc_per_node=${NB_GPU} run.py --date ${START_DATE} --data_root ${DATA_ROOT} --overlap --batch_size ${BATCH_SIZE} --dataset ${DATASET} --name ${NAME} --task ${TASK} --step 9 --lr 0.001 --epochs ${EPOCHS} --method ${METHOD} --opt_level O1 ${OPTIONS}
CUDA_VISIBLE_DEVICES=${GPU} python3 -m torch.distributed.launch --master_port ${PORT} --nproc_per_node=${NB_GPU} run.py --date ${START_DATE} --data_root ${DATA_ROOT} --overlap --batch_size ${BATCH_SIZE} --dataset ${DATASET} --name ${NAME} --task ${TASK} --step 10 --lr 0.001 --epochs ${EPOCHS} --method ${METHOD} --opt_level O1 ${OPTIONS}
CUDA_VISIBLE_DEVICES=${GPU} python3 -m torch.distributed.launch --master_port ${PORT} --nproc_per_node=${NB_GPU} run.py --date ${START_DATE} --data_root ${DATA_ROOT} --overlap --batch_size ${BATCH_SIZE} --dataset ${DATASET} --name ${NAME} --task ${TASK} --step 11 --lr 0.001 --epochs ${EPOCHS} --method ${METHOD} --opt_level O1 ${OPTIONS}
CUDA_VISIBLE_DEVICES=${GPU} python3 -m torch.distributed.launch --master_port ${PORT} --nproc_per_node=${NB_GPU} run.py --date ${START_DATE} --data_root ${DATA_ROOT} --overlap --batch_size ${BATCH_SIZE} --dataset ${DATASET} --name ${NAME} --task ${TASK} --step 12 --lr 0.001 --epochs ${EPOCHS} --method ${METHOD} --opt_level O1 ${OPTIONS}
CUDA_VISIBLE_DEVICES=${GPU} python3 -m torch.distributed.launch --master_port ${PORT} --nproc_per_node=${NB_GPU} run.py --date ${START_DATE} --data_root ${DATA_ROOT} --overlap --batch_size ${BATCH_SIZE} --dataset ${DATASET} --name ${NAME} --task ${TASK} --step 13 --lr 0.001 --epochs ${EPOCHS} --method ${METHOD} --opt_level O1 ${OPTIONS}
CUDA_VISIBLE_DEVICES=${GPU} python3 -m torch.distributed.launch --master_port ${PORT} --nproc_per_node=${NB_GPU} run.py --date ${START_DATE} --data_root ${DATA_ROOT} --overlap --batch_size ${BATCH_SIZE} --dataset ${DATASET} --name ${NAME} --task ${TASK} --step 14 --lr 0.001 --epochs ${EPOCHS} --method ${METHOD} --opt_level O1 ${OPTIONS}
CUDA_VISIBLE_DEVICES=${GPU} python3 -m torch.distributed.launch --master_port ${PORT} --nproc_per_node=${NB_GPU} run.py --date ${START_DATE} --data_root ${DATA_ROOT} --overlap --batch_size ${BATCH_SIZE} --dataset ${DATASET} --name ${NAME} --task ${TASK} --step 15 --lr 0.001 --epochs ${EPOCHS} --method ${METHOD} --opt_level O1 ${OPTIONS}
python3 average_csv.py ${RESULTSFILE}

echo ${SCREENNAME}


end=`date +%s`
runtime=$((end-start))
echo "Run in ${runtime}s"

So I had neither an exploding loss (the POD loss is around 7 at most) nor a NaN loss.

commented

Thank you for the experiment!

Actually, we thought the log below indicated loss divergence, but that doesn't seem to be the case.

INFO:rank0: Epoch 28, lr = 0.000087
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.5
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.5
INFO:rank0: Epoch 28, Batch 10/22, Loss=4.541375775728374
INFO:rank0: Loss made of: CE 0.04276547580957413, LKD 0.0, LDE 0.0, LReg 0.0, POD 4.295156478881836 EntMin 0.0
INFO:rank0: Epoch 28, Batch 20/22, Loss=4.600663200672716
INFO:rank0: Loss made of: CE 0.0746140405535698, LKD 0.0, LDE 0.0, LReg 0.0, POD 3.7756705284118652 EntMin 0.0
INFO:rank0: Epoch 28, Class Loss=nan, Reg Loss=0.0
INFO:rank0: End of Epoch 28/30, Average Loss=nan, Class Loss=nan, Reg Loss=0.0
WARNING:rank0: NaN or Inf found in input tensor.
WARNING:rank0: NaN or Inf found in input tensor.
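
In case it helps, a small helper we use on our side to see which loss term goes non-finite first (just a debugging sketch; ce_loss / pod_loss stand for the individual terms logged above):

import torch

def assert_finite(**loss_terms):
    """Raise on the first non-finite loss term, naming it, so the culprit
    (CE vs. POD) is visible before the epoch average turns into NaN."""
    for name, term in loss_terms.items():
        if not torch.isfinite(term).all():
            raise RuntimeError(f"{name} loss is non-finite: {term}")

# Usage inside the training loop, before backward():
# assert_finite(ce=ce_loss, pod=pod_loss)
# torch.autograd.set_detect_anomaly(True) at startup additionally pinpoints the
# backward op that produced the NaN (much slower, debugging only).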

In fact, the reason for this question is that MiB outperforms PLOP in these hard scenarios.
Currently, I am analyzing several class-incremental semantic segmentation methods in various scenarios.
For this, I measured the final mIoU (all classes) on three hard scenarios, and the results below show that MiB outperforms PLOP.
So I'm having a hard time figuring out the cause.
Could you guess why these results are happening?

Task             MiB      PLOP
5-1 (16 steps)   10.03%   6.46%
2-1 (19 steps)   9.88%    4.47%
2-2 (10 steps)   25.60%   13.66%

I'm not sure about that.

How many epochs did you use for both MiB and PLOP?

Regardless, PLOP (and its predecessor PODNet) seems to struggle to learn with a small first task. That may be one reason.

Or you could tune the factor of the Local POD loss.

commented

I used 30 epochs following the default setting.

Lastly, could you give me some tips on which hyperparameters I should tune?

opts.pod = "local"  # Local POD distillation on intermediate features
opts.pod_factor = 0.01  # weight of the Local POD term on the backbone features
opts.pod_logits = True  # also apply POD to the logits
opts.pod_options = {"switch": {"after": {"extra_channels": "sum", "factor": 0.0005, "type": "local"}}}  # logits-specific POD factor
opts.pseudo = "entropy"  # entropy-based uncertainty for pseudo-labeling old classes
opts.threshold = 0.001  # threshold used by the pseudo-labeling
opts.classif_adaptive_factor = True  # adaptive weighting of the classification loss
opts.init_balanced = True  # balanced initialization of the new classifier weights (as in MiB)

You may want to try

  • epochs: 10
  • pod_factor: [0.1, 0.001]
  • pod logits factor: [0.05, 0.005, 0.0001]
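
Concretely, that means passing --epochs 10 in the launch script instead of 30, and editing the preset above with the suggested values substituted, e.g. (whether it actually helps on 5-1/2-1 is exactly what would need testing):

opts.pod_factor = 0.001  # backbone Local POD weight (instead of 0.01)
opts.pod_options = {"switch": {"after": {"extra_channels": "sum", "factor": 0.005, "type": "local"}}}  # logits factor (instead of 0.0005)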

But again, that's just my intuition. If you get significantly better results with PLOP, please tell me about it. I'd gladly add your hyperparameters to this repository =)

Hello @arthurdouillard, @qjadud1994,

I am facing a similar issue with a different task (4x5s). I was able to run the three tasks described in the paper (19-1, 15-5, 15-1) on the VOC dataset. Even there I had the exploding loss issue, but it was resolved after using mixed precision (--opt_level O1).

task:
"4x5s":
{
0: [0, 2, 4, 3, 5, 1],
1: [16, 17, 18, 19, 20],
2: [15, 14, 13, 12, 11],
3: [6, 7, 10, 8, 9]
},

Command:

python -m torch.distributed.launch --nproc_per_node=1 /home/shikka2s/Download/PLOP/CVPR2021_PLOP/run.py --data_root /home/shikka2s/Download/P_VOC/PascalVOC12/ --batch_size 12 --dataset voc --name PLOP5555o --task 4x5s --step 0 --lr 0.01 --epochs 30 --method PLOP --logdir /home/shikka2s/Download/PLOP/PLOP_5555_logs_overlap/ --sample_num 5 --overlap --opt_level O1 --code_directory /home/shikka2s/Download/PLOP/CVPR2021_PLOP/

python -m torch.distributed.launch --nproc_per_node=1 /home/shikka2s/Download/PLOP/CVPR2021_PLOP/run.py --data_root /home/shikka2s/Download/P_VOC/PascalVOC12/ --batch_size 12 --dataset voc --name PLOP5555o --task 4x5s --step 1 --lr 0.001 --epochs 30 --method PLOP --logdir /home/shikka2s/Download/PLOP/PLOP_5555_logs_overlap/ --sample_num 5 --overlap --opt_level O1 --code_directory /home/shikka2s/Download/PLOP/CVPR2021_PLOP/

Error:
Traceback (most recent call last):
File "/home/shikka2s/Download/PLOP/CVPR2021_PLOP/run.py", line 586, in
main(opts)
File "/home/shikka2s/Download/PLOP/CVPR2021_PLOP/run.py", line 157, in main
val_score = run_step(opts, world_size, rank, device)
File "/home/shikka2s/Download/PLOP/CVPR2021_PLOP/run.py", line 421, in run_step
logger=logger
File "/home/shikka2s/Download/PLOP/CVPR2021_PLOP/train.py", line 436, in train
nb_new_classes=self.nb_new_classes
File "/home/shikka2s/Download/PLOP/CVPR2021_PLOP/train.py", line 960, in features_distillation
assert torch.isfinite(layer_loss).all(), layer_loss
AssertionError: tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
device='cuda:0', grad_fn=)

For step 0 there was no issue, since knowledge distillation is not yet involved, but from step 1 I received the following warnings:

Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Warning: NaN or Inf found in input tensor.
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Warning: NaN or Inf found in input tensor.
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
Warning: NaN or Inf found in input tensor.

The training fails from step 2.

Was the issue solved for you? If yes, could you let me know what was done?

FYI: I followed the other related issues and cross-checked installing apex and using the mixed-precision option, but the issue is not solved.

I think the LocalPOD loss suffers from some instabilities, and needs gradient clipping. Not sure I can help more.

@qjadud1994 @Sreeni1204 : @fcdl94 found the main source of instability, where the classif adaptive factor could be NaN.

I've fixed it, and the code should be more stable I hope: b70fb8f
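
For anyone hitting this before pulling the fix: the instability is the adaptive factor becoming NaN, presumably a 0/0 on images where the ratio's denominator is empty. A generic illustration of the kind of guard (not the exact patch in b70fb8f):

import torch

def safe_ratio(num: torch.Tensor, den: torch.Tensor) -> torch.Tensor:
    """Per-image ratio that stays finite when the denominator is zero."""
    factor = num / den.clamp(min=1e-8)        # avoid 0/0 -> NaN
    return torch.nan_to_num(factor, nan=1.0)  # fallback: neutral weight

# 0/0 would be NaN and poison the loss; the guarded version returns 0.0 instead.
print(safe_ratio(torch.tensor([3.0, 0.0]), torch.tensor([6.0, 0.0])))
# tensor([0.5000, 0.0000])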