How to solve the problem of the loss becoming NaN?
xiaoxiAries opened this issue
Hi,
I am reproducing this code on the SSv2 dataset. I follow blr 0.1 and use 2 GPUs with a batch size of 7 (an effective total batch size of 14), but the loss becomes NaN at epoch 14. How can I solve this problem? Thanks~
Hi,
Which configuration are you using? The full-tuning baseline or AdaptFormer?
Hi,
I follow this configuration:
OMP_NUM_THREADS=1 python3 -m torch.distributed.launch \
    --nproc_per_node=2 \
    --use_env main_video.py \
    --finetune /path/to/pre_trained/mae.pyth \
    --output_dir /path/to/output \
    --batch_size 7 --epochs 90 --blr 0.1 --weight_decay 0.0 --dist_eval \
    --data_path /path/to/SSV2 --data_set SSV2 \
    --ffn_adapt
I am sorry, I didn't experiment with your specific configuration. Try reducing the learning rate.
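For example, reducing the learning rate only requires changing the --blr flag in the command above. The value 0.01 below is an illustrative starting point, not a tested setting; the paths are placeholders as in the original command:

```shell
# Same configuration, but with the base learning rate lowered from 0.1 to 0.01.
# 0.01 is an untested guess; if the loss still diverges, try lowering it further.
OMP_NUM_THREADS=1 python3 -m torch.distributed.launch \
    --nproc_per_node=2 \
    --use_env main_video.py \
    --finetune /path/to/pre_trained/mae.pyth \
    --output_dir /path/to/output \
    --batch_size 7 --epochs 90 --blr 0.01 --weight_decay 0.0 --dist_eval \
    --data_path /path/to/SSV2 --data_set SSV2 \
    --ffn_adapt
```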