How to solve the problem of the loss becoming NaN?
xiaoxiAries opened this issue
Hi,
I am reproducing this code on the SSv2 dataset. I follow blr 0.1 and use 2 GPUs with a batch size of 7 (an effective total batch size of 14), but the loss becomes NaN at epoch 14. How can I solve this problem? Thanks~
Hi,
Which configuration are you using? The full-tuning baseline or AdaptFormer?
Hi,
I follow this configuration:
OMP_NUM_THREADS=1 python3 -m torch.distributed.launch \
    --nproc_per_node=2 \
    --use_env main_video.py \
    --finetune /path/to/pre_trained/mae.pyth \
    --output_dir /path/to/output \
    --batch_size 7 --epochs 90 --blr 0.1 --weight_decay 0.0 --dist_eval \
    --data_path /path/to/SSV2 --data_set SSV2 \
    --ffn_adapt
I am sorry, I didn't experiment with your specific configuration. Try reducing the learning rate.
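For example, reducing the learning rate only requires changing the --blr flag in the command above. The value 0.01 below is an illustrative starting point, not a tested setting; the paths are placeholders as in the original command:

```shell
# Same configuration, but with the base learning rate lowered from 0.1 to 0.01.
# 0.01 is an untested guess; if the loss still diverges, try lowering it further.
OMP_NUM_THREADS=1 python3 -m torch.distributed.launch \
    --nproc_per_node=2 \
    --use_env main_video.py \
    --finetune /path/to/pre_trained/mae.pyth \
    --output_dir /path/to/output \
    --batch_size 7 --epochs 90 --blr 0.01 --weight_decay 0.0 --dist_eval \
    --data_path /path/to/SSV2 --data_set SSV2 \
    --ffn_adapt
```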