NaN loss and no learning for any class when training from scratch without a pretrained model
wuyujack opened this issue · comments
When I train MiB from scratch on the VOC dataset (no pretrained model), the loss stays NaN for the whole run. Any idea about this issue?
- command:
python -m torch.distributed.launch --nproc_per_node=2 run.py --data_root data --batch_size 12 --dataset voc --name test_MIB_voc_15_5_lr_0.01_no_pretrained --task 15-5 --lr 0.01 --epochs 30 --method MiB --no_pretrained
*****************************************Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.*****************************************
INFO:rank1: Device: cuda:1
INFO:rank0: [!] starting logging at directory ./logs/15-5-voc/test_MIB_voc_15_5_lr_0.01_no_pretrained/
INFO:rank0: Device: cuda:0
INFO:rank0: Dataset: voc, Train set: 8437, Val set: 1240, Test set: 1240, n_classes 16
INFO:rank0: Total batch size is 24
INFO:rank0: Backbone: resnet101
INFO:rank0: [!] Model made without pre-trained
Warning: apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
Selected optimization level O0: Pure FP32 training.
Defaults for this optimization level are:
enabled : True
opt_level : O0
cast_model_type : torch.float32
patch_torch_functions : False
keep_batchnorm_fp32 : None
master_weights : False
loss_scale : 1.0
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O0
cast_model_type : torch.float32
patch_torch_functions : False
keep_batchnorm_fp32 : None
master_weights : False
loss_scale : 1.0
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'",)
Warning: apex was installed without --cpp_ext. Falling back to Python flatten and unflatten.
INFO:rank0: [!] Train from scratch
INFO:rank1: tensor([[79]])
INFO:rank0: tensor([[79]])
INFO:rank0: Epoch 0, lr = 0.010000
INFO:rank0: Epoch 0, Batch 10/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 20/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 30/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 40/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 50/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 60/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 70/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 80/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 90/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 100/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 110/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 120/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 130/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 140/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 150/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 160/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 170/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 180/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 190/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 200/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 210/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 220/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 230/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 240/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 250/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 260/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 270/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 280/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 290/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 300/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 310/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 320/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 330/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 340/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Batch 350/351, Loss=nan
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 0, Class Loss=nan, Reg Loss=0.0
INFO:rank0: End of Epoch 0/30, Average Loss=nan, Class Loss=nan, Reg Loss=0.0
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
INFO:rank0: validate on val set...
INFO:rank0: Validation, Class Loss=nan, Reg Loss=0.0 (without scaling)
INFO:rank1: Done validation
INFO:rank0: Done validation
INFO:rank0: End of Validation 0/30, Validation Loss=nan, Class Loss=nan, Reg Loss=0.0
INFO:rank0:
Total samples: 1240.000000
Overall Acc: 0.694367
Mean Acc: 0.062500
FreqW Acc: 0.482146
Mean IoU: 0.043398
Class IoU:
class 0: 0.6943672262601215
class 1: 0.0
class 2: 0.0
class 3: 0.0
class 4: 0.0
class 5: 0.0
class 6: 0.0
class 7: 0.0
class 8: 0.0
class 9: 0.0
class 10: 0.0
class 11: 0.0
class 12: 0.0
class 13: 0.0
class 14: 0.0
class 15: 0.0
Class Acc:
class 0: 0.9999999999999951
class 1: 0.0
class 2: 0.0
class 3: 0.0
class 4: 0.0
class 5: 0.0
class 6: 0.0
class 7: 0.0
class 8: 0.0
class 9: 0.0
class 10: 0.0
class 11: 0.0
class 12: 0.0
class 13: 0.0
class 14: 0.0
class 15: 0.0
INFO:rank0: [!] Checkpoint saved.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
INFO:rank0: Epoch 1, lr = 0.009699
[Epoch 1 batch log trimmed: batches 10-350 again report Loss=nan with "Warning: NaN or Inf found in input tensor", identical to Epoch 0.]
INFO:rank0: Epoch 1, Class Loss=nan, Reg Loss=0.0
INFO:rank0: End of Epoch 1/30, Average Loss=nan, Class Loss=nan, Reg Loss=0.0
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
INFO:rank0: validate on val set...
INFO:rank1: Done validation
INFO:rank0: Validation, Class Loss=nan, Reg Loss=0.0 (without scaling)
INFO:rank0: Done validation
INFO:rank0: End of Validation 1/30, Validation Loss=nan, Class Loss=nan, Reg Loss=0.0
INFO:rank0:
[Validation metrics after Epoch 1 are identical to Epoch 0: Overall Acc 0.694367, Mean IoU 0.043398; only class 0 (background) has non-zero IoU and accuracy.]
By the way, could you share your Linux kernel version and gcc version? As the log shows, the apex installation was not fully successful, so I'd like to know more about the environment in which you installed apex.
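For convenience, a quick way to gather those versions (a sketch; the gcc and nvcc lines only print if the tools are on PATH):

```shell
# Kernel release and compiler versions relevant to building apex extensions
uname -r
if command -v gcc >/dev/null; then gcc --version | head -n 1; fi
if command -v nvcc >/dev/null; then nvcc --version | tail -n 1; fi
```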
Fixed the problem after successfully installing apex; the loss is no longer NaN.
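For anyone hitting a similar problem: it helps to fail on the first non-finite loss instead of training through hundreds of NaN batches. A minimal sketch (the `check_finite` helper is hypothetical, not part of the MiB code):

```python
import torch

def check_finite(loss: torch.Tensor, batch_idx: int) -> None:
    # Abort immediately on the first NaN/Inf loss value.
    if not torch.isfinite(loss).all():
        raise RuntimeError(f"Non-finite loss at batch {batch_idx}: {loss}")

# torch.autograd.set_detect_anomaly(True) additionally pinpoints the op
# that produced the NaN in the backward pass (slow; debugging only).
```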
Glad you solved.
In case it is useful for anyone else: I installed apex from the NVIDIA repository, https://github.com/NVIDIA/apex, using the latest version, which is 0.1.
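For reference, the log's "apex was installed without --cpp_ext" warnings go away when apex is built with its compiled extensions, roughly as the apex README describes (flags may differ for newer apex or pip versions, and a working CUDA toolkit is required):

```shell
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir \
    --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```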
I hope to switch to PyTorch 1.6 and its native AMP (torch.cuda.amp) instead of apex in the future.
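For the record, PyTorch 1.6 ships mixed precision natively as `torch.cuda.amp`, removing the apex dependency. A minimal sketch of the usual pattern (degrades to plain FP32 when CUDA is unavailable):

```python
import torch

model = torch.nn.Linear(8, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

# autocast/GradScaler become no-ops with enabled=False, so this runs on CPU too.
use_cuda = torch.cuda.is_available()
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

x = torch.randn(4, 8)
y = torch.randint(0, 2, (4,))
with torch.cuda.amp.autocast(enabled=use_cuda):
    loss = torch.nn.functional.cross_entropy(model(x), y)

scaler.scale(loss).backward()  # scales the loss to avoid FP16 underflow
scaler.step(opt)               # unscales grads, skips step if they are inf/NaN
scaler.update()
```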