Model train execution fails with a PyTorch error message.

shivamsnaik opened this issue 3 years ago · comments

Hello,
I would like to ask whether anyone else is seeing the following error:
TypeError: add(): argument 'alpha' must be Number, not NoneType

The steps I followed are:

1. Installed the package:
   python -m pip install -e DynamicHead
2. Added a custom dataset using register_coco_instances.
3. Updated the config in def setup(args) with the custom dataset name and the number of classes:
   cfg.DATASETS.TRAIN = ('coco_docbank_train',)
   cfg.MODEL.ROI_HEADS.NUM_CLASSES = 13
4. Ran the model:
   python train_net.py --config configs/dyhead_r50_rcnn_fpn_1x.yaml --num-gpus 1

The detailed error message goes like this:

File "train_net.py", line 204, in <module> 
    launch(
  File "/opt/conda/lib/python3.8/site-packages/detectron2/engine/launch.py", line 82, in launch
    main_func(*args)
  File "train_net.py", line 198, in main
    return trainer.train()
  File "/opt/conda/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 484, in train
    super().train(self.start_iter, self.max_iter)
  File "/opt/conda/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/opt/conda/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/opt/conda/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 294, in run_step
    self.optimizer.step()
  File "/opt/conda/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/optim/optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/optim/sgd.py", line 136, in step
    F.sgd(params_with_grad,
  File "/opt/conda/lib/python3.8/site-packages/torch/optim/_functional.py", line 164, in sgd
    d_p = d_p.add(param, alpha=weight_decay)
TypeError: add(): argument 'alpha' must be Number, not NoneType
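
The last frame of the traceback shows weight_decay reaching add() as None. A likely reason it gets that far is that a guard of the form `weight_decay != 0` is true for None, so the decay branch runs anyway. The sketch below is a plain-Python check and not part of PyTorch or detectron2: find_unset_weight_decay is a hypothetical helper that scans a list of parameter-group dicts (the same shape as optimizer.param_groups) and reports the groups whose weight_decay is None or missing. The numeric values are placeholders.

```python
def find_unset_weight_decay(param_groups):
    """Return indices of parameter groups whose weight_decay is None or missing.

    `param_groups` is a list of dicts, the same shape as
    `optimizer.param_groups` on a torch.optim optimizer.
    This helper is hypothetical and only illustrates the check.
    """
    return [
        index
        for index, group in enumerate(param_groups)
        if group.get("weight_decay") is None
    ]


# None is not equal to 0, so a `weight_decay != 0` guard does not skip it.
assert (None != 0) is True

# Stand-in groups: the second one mimics a bias group with no decay set.
example_groups = [
    {"lr": 0.01, "weight_decay": 0.0001},
    {"lr": 0.01, "weight_decay": None},
]
print(find_unset_weight_decay(example_groups))
```

Calling it as find_unset_weight_decay(optimizer.param_groups) right after the optimizer is built would point at the offending group before the first optimizer.step().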

Environment Details:
sys.platform = linux
Python = 3.8.12
numpy = 1.21.2
detectron2 = 0.6
CUDA = 11.4
PyTorch = 1.10.0
torchvision = 0.11.0a0
fvcore = 0.1.5.post20211023
iopath = 0.1.9
cv2 = 4.5.4

I would appreciate any help if you know the solution.

MajorityRreport · Answer 1 · Thu Dec 30 2021 15:47:58 GMT+0800 (China Standard Time)

Hi, I have the same problem as you. Have you solved it?

Houseqin · Answer 2 · Thu Jan 06 2022 20:41:37 GMT+0800 (China Standard Time)

I have the same problem. Have you solved it?

Shivam Naik · Answer 3 · Thu Jan 06 2022 20:49:07 GMT+0800 (China Standard Time)

@Houseqin @MajorityRreport Hi. Yes, I did solve the issue.

The error occurs because the weight decay is not passed to PyTorch.
Add the following setting to your config YAML file to fix it:
cfg.SOLVER.WEIGHT_DECAY_BIAS = <desired_value>

This was causing the issue in my case. Also, if cfg.SOLVER.WEIGHT_DECAY is missing from your config, include that as well.

That should solve the issue.
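
As a sketch of the fix described above, the helper below fills in both solver keys before training starts. ensure_weight_decay and its default argument are hypothetical names, and SimpleNamespace stands in for the detectron2 config object so the snippet runs on its own. Reusing the general weight decay for bias parameters is an assumption, and the numeric value is a placeholder, not a recommendation.

```python
from types import SimpleNamespace


def ensure_weight_decay(solver, default):
    """Make sure both weight decay settings hold a number.

    `solver` is any object with WEIGHT_DECAY and WEIGHT_DECAY_BIAS
    attributes (cfg.SOLVER in a detectron2 config). `default` is a
    caller-supplied fallback; this helper is illustrative only.
    """
    if getattr(solver, "WEIGHT_DECAY", None) is None:
        solver.WEIGHT_DECAY = default
    if getattr(solver, "WEIGHT_DECAY_BIAS", None) is None:
        # Assumption: reuse the general weight decay for bias parameters.
        solver.WEIGHT_DECAY_BIAS = solver.WEIGHT_DECAY
    return solver


# Stand-in for cfg.SOLVER with the bias decay left unset.
solver = SimpleNamespace(WEIGHT_DECAY=0.0001, WEIGHT_DECAY_BIAS=None)
ensure_weight_decay(solver, default=0.0001)
print(solver.WEIGHT_DECAY, solver.WEIGHT_DECAY_BIAS)
```

In setup(args), the equivalent call would take cfg.SOLVER as its first argument and would need to run before the config is frozen.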

MajorityRreport · Answer 4 · Thu Jan 06 2022 20:51:39 GMT+0800 (China Standard Time)

Thank you very much!!

Shivam Naik · Answer 5 · Thu Jan 06 2022 20:53:56 GMT+0800 (China Standard Time)

@MajorityRreport Glad to help. Let me know if it still throws the same error.

Houseqin · Answer 6 · Thu Jan 06 2022 21:47:40 GMT+0800 (China Standard Time)

I get the same problem with the official config dyhead_swint_atss_fpn_2x_ms.yaml, even after adding cfg.SOLVER.WEIGHT_DECAY_BIAS and cfg.SOLVER.WEIGHT_DECAY.

Shivam Naik · Answer 7 · Thu Jan 06 2022 23:08:23 GMT+0800 (China Standard Time)

@Houseqin I have never seen this error before. Is it interfering with normal operation?

If yes, you can try replicating the versions of CUDA, PyTorch, etc. that I listed in the issue description.
It works perfectly for me after the changes mentioned above.

Houseqin · Answer 8 · Fri Jan 07 2022 16:13:31 GMT+0800 (China Standard Time)

I found that it was a compilation problem and fixed it by following the guidance for "nvcc not found", "Not compiled with GPU support", or "Detectron2 CUDA Compiler: not available" at https://detectron2.readthedocs.io/en/latest/tutorials/install.html#common-installation-issues
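
A quick way to check for the compile-related symptoms mentioned here is to see whether nvcc is visible from the Python environment. The sketch below uses only the standard library. cuda_toolchain_report is a hypothetical helper, and reading CUDA_HOME assumes the build consults that environment variable.

```python
import os
import shutil


def cuda_toolchain_report():
    """Collect basic facts about the CUDA compiler visible to this process."""
    return {
        # Full path to nvcc if it is on PATH, otherwise None.
        "nvcc_path": shutil.which("nvcc"),
        # CUDA_HOME as set in the environment, otherwise None.
        "cuda_home": os.environ.get("CUDA_HOME"),
    }


report = cuda_toolchain_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

If nvcc_path comes back as None, the compiler is not on PATH for this process, which matches the "nvcc not found" entry in the installation guide. Running python -m detectron2.utils.collect_env should give a fuller picture of the environment.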

Thank you most sincerely.

Edwardmark · Answer 9 · Mon Mar 28 2022 15:13:16 GMT+0800 (China Standard Time)

Regarding cfg.SOLVER.WEIGHT_DECAY_BIAS and cfg.SOLVER.WEIGHT_DECAY:

Hi, could you please tell me what values should be set for these? Thanks.

MajorityRreport · Answer 10 · Mon Mar 28 2022 15:44:48 GMT+0800 (China Standard Time)

WEIGHT_DECAY: 0.05
WEIGHT_DECAY_BIAS: 0.01
Sorry, I'm just a beginner. I set them like this (arbitrary numbers), and the network could run. But maybe the network didn't match my dataset, so it didn't work well.
I hope it helps.