open-mmlab / mmyolo

OpenMMLab YOLO series toolbox and benchmark. Implemented RTMDet, RTMDet-Rotated, YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLOX, PPYOLOE, etc.

Home Page: https://mmyolo.readthedocs.io/zh_CN/dev/


The problem of loading pre-trained models in multi-GPU training

bichunyang419 opened this issue · comments

Prerequisite

🐞 Describe the bug

bash tools/dist_train.sh configs/yolov5/yolov5_s-v61_syncbn_fast_1xb4-300e_balloon.py 2 --cfg-options load_from='yolov5_s-v61_syncbn_fast_8xb16-300e_coco_20220918_084700-86e02187.pth' model.backbone.frozen_stages=4
outputs:
11/07 16:55:40 - mmengine - INFO - Epoch(train) [1][1/2730] lr: 0.0000e+00 eta: 81 days, 10:12:59 time: 8.5900 data_time: 7.5419 memory: 2460 loss: 8.3048 loss_cls: 5.7765 loss_obj: 0.5974 loss_bbox: 1.9309
Traceback (most recent call last):
File "/media/sdb1/bcy/code/mmyolo-main0.1.2/tools/train.py", line 106, in
main()
File "/media/sdb1/bcy/code/mmyolo-main0.1.2/tools/train.py", line 102, in main
runner.train()
File "/home/lab532/anaconda3/lib/python3.9/site-packages/mmengine/runner/runner.py", line 1661, in train
model = self.train_loop.run() # type: ignore
File "/home/lab532/anaconda3/lib/python3.9/site-packages/mmengine/runner/loops.py", line 90, in run
self.run_epoch()
File "/home/lab532/anaconda3/lib/python3.9/site-packages/mmengine/runner/loops.py", line 106, in run_epoch
self.run_iter(idx, data_batch)
File "/home/lab532/anaconda3/lib/python3.9/site-packages/mmengine/runner/loops.py", line 122, in run_iter
outputs = self.runner.model.train_step(
File "/home/lab532/anaconda3/lib/python3.9/site-packages/mmengine/model/wrappers/distributed.py", line 121, in train_step
losses = self._run_forward(data, mode='loss')
File "/home/lab532/anaconda3/lib/python3.9/site-packages/mmengine/model/wrappers/distributed.py", line 161, in _run_forward
results = self(**data, mode=mode)
File "/home/lab532/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/lab532/anaconda3/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 873, in forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by
making sure all forward function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
Traceback (most recent call last):
File "/media/sdb1/bcy/code/mmyolo-main0.1.2/tools/train.py", line 106, in
main()
File "/media/sdb1/bcy/code/mmyolo-main0.1.2/tools/train.py", line 102, in main
runner.train()
File "/home/lab532/anaconda3/lib/python3.9/site-packages/mmengine/runner/runner.py", line 1661, in train
model = self.train_loop.run() # type: ignore
File "/home/lab532/anaconda3/lib/python3.9/site-packages/mmengine/runner/loops.py", line 90, in run
self.run_epoch()
File "/home/lab532/anaconda3/lib/python3.9/site-packages/mmengine/runner/loops.py", line 106, in run_epoch
self.run_iter(idx, data_batch)
File "/home/lab532/anaconda3/lib/python3.9/site-packages/mmengine/runner/loops.py", line 122, in run_iter
outputs = self.runner.model.train_step(
File "/home/lab532/anaconda3/lib/python3.9/site-packages/mmengine/model/wrappers/distributed.py", line 121, in train_step
losses = self._run_forward(data, mode='loss')
File "/home/lab532/anaconda3/lib/python3.9/site-packages/mmengine/model/wrappers/distributed.py", line 161, in _run_forward
results = self(**data, mode=mode)
File "/home/lab532/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/lab532/anaconda3/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 873, in forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by
making sure all forward function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 12373) of binary: /home/lab532/anaconda3/bin/python
Traceback (most recent call last):
File "/home/lab532/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/lab532/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/lab532/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/home/lab532/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/lab532/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/lab532/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/lab532/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/lab532/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

tools/train.py FAILED

Failures:
[1]:
time : 2022-11-07_16:55:46
host : lab532-All-Series
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 12374)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2022-11-07_16:55:46
host : lab532-All-Series
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 12373)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Environment

fatal: not a git repository (or any parent up to mount point /media)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
sys.platform: linux
Python: 3.9.12 (main, Apr 5 2022, 06:56:58) [GCC 7.5.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0,1: NVIDIA GeForce GTX 1080 Ti
CUDA_HOME: /usr/local/cuda:/usr/local/cuda
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.10.1+cu111
PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 11.1
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  • CuDNN 8.0.5
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.11.2+cu111
OpenCV: 4.6.0
MMEngine: 0.3.0
MMCV: 2.0.0rc2
MMDetection: 3.0.0rc2
MMYOLO: 0.1.2+

Additional information

No response

@bichunyang419 Please set find_unused_parameters=True in your config. With model.backbone.frozen_stages=4 the frozen backbone parameters never receive gradients, so DistributedDataParallel raises the error above unless unused parameter detection is enabled.
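For reference, a minimal sketch of what that fix could look like as a derived config file. The base config and checkpoint names are taken from the command at the top of this report; the file name my_yolov5_s_balloon_frozen.py is only a hypothetical example, so adjust the paths to your setup.

# my_yolov5_s_balloon_frozen.py -- hypothetical config illustrating the suggested fix
_base_ = './yolov5_s-v61_syncbn_fast_1xb4-300e_balloon.py'

# Let DistributedDataParallel tolerate parameters that never receive gradients
# (the frozen backbone stages), which is what triggers the
# "Expected to have finished reduction" error above.
find_unused_parameters = True

# Same overrides that were passed via --cfg-options in the original command.
model = dict(backbone=dict(frozen_stages=4))
load_from = 'yolov5_s-v61_syncbn_fast_8xb16-300e_coco_20220918_084700-86e02187.pth'

Alternatively, the same flag can be passed without editing any file by appending --cfg-options find_unused_parameters=True to the dist_train.sh command.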