open-mmlab / mmyolo

OpenMMLab YOLO series toolbox and benchmark. Implemented RTMDet, RTMDet-Rotated, YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLOX, PPYOLOE, etc.

Home Page: https://mmyolo.readthedocs.io/zh_CN/dev/


The problem of loading pre-trained models in multi-GPU training

bichunyang419 opened this issue · comments

Prerequisite

🐞 Describe the bug

bash tools/dist_train.sh configs/yolov5/yolov5_s-v61_syncbn_fast_1xb4-300e_balloon.py 2 --cfg-options load_from='yolov5_s-v61_syncbn_fast_8xb16-300e_coco_20220918_084700-86e02187.pth' model.backbone.frozen_stages=4
outputs:
11/07 16:55:40 - mmengine - INFO - Epoch(train) [1][1/2730] lr: 0.0000e+00 eta: 81 days, 10:12:59 time: 8.5900 data_time: 7.5419 memory: 2460 loss: 8.3048 loss_cls: 5.7765 loss_obj: 0.5974 loss_bbox: 1.9309
Traceback (most recent call last):
File "/media/sdb1/bcy/code/mmyolo-main0.1.2/tools/train.py", line 106, in
main()
File "/media/sdb1/bcy/code/mmyolo-main0.1.2/tools/train.py", line 102, in main
runner.train()
File "/home/lab532/anaconda3/lib/python3.9/site-packages/mmengine/runner/runner.py", line 1661, in train
model = self.train_loop.run() # type: ignore
File "/home/lab532/anaconda3/lib/python3.9/site-packages/mmengine/runner/loops.py", line 90, in run
self.run_epoch()
File "/home/lab532/anaconda3/lib/python3.9/site-packages/mmengine/runner/loops.py", line 106, in run_epoch
self.run_iter(idx, data_batch)
File "/home/lab532/anaconda3/lib/python3.9/site-packages/mmengine/runner/loops.py", line 122, in run_iter
outputs = self.runner.model.train_step(
File "/home/lab532/anaconda3/lib/python3.9/site-packages/mmengine/model/wrappers/distributed.py", line 121, in train_step
losses = self._run_forward(data, mode='loss')
File "/home/lab532/anaconda3/lib/python3.9/site-packages/mmengine/model/wrappers/distributed.py", line 161, in _run_forward
results = self(**data, mode=mode)
File "/home/lab532/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/lab532/anaconda3/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 873, in forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by
making sure all forward function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
Traceback (most recent call last):
File "/media/sdb1/bcy/code/mmyolo-main0.1.2/tools/train.py", line 106, in
main()
File "/media/sdb1/bcy/code/mmyolo-main0.1.2/tools/train.py", line 102, in main
runner.train()
File "/home/lab532/anaconda3/lib/python3.9/site-packages/mmengine/runner/runner.py", line 1661, in train
model = self.train_loop.run() # type: ignore
File "/home/lab532/anaconda3/lib/python3.9/site-packages/mmengine/runner/loops.py", line 90, in run
self.run_epoch()
File "/home/lab532/anaconda3/lib/python3.9/site-packages/mmengine/runner/loops.py", line 106, in run_epoch
self.run_iter(idx, data_batch)
File "/home/lab532/anaconda3/lib/python3.9/site-packages/mmengine/runner/loops.py", line 122, in run_iter
outputs = self.runner.model.train_step(
File "/home/lab532/anaconda3/lib/python3.9/site-packages/mmengine/model/wrappers/distributed.py", line 121, in train_step
losses = self._run_forward(data, mode='loss')
File "/home/lab532/anaconda3/lib/python3.9/site-packages/mmengine/model/wrappers/distributed.py", line 161, in _run_forward
results = self(**data, mode=mode)
File "/home/lab532/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/lab532/anaconda3/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 873, in forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by
making sure all forward function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 12373) of binary: /home/lab532/anaconda3/bin/python
Traceback (most recent call last):
File "/home/lab532/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/lab532/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/lab532/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/home/lab532/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/lab532/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/lab532/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/lab532/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/lab532/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

tools/train.py FAILED

Failures:
[1]:
time : 2022-11-07_16:55:46
host : lab532-All-Series
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 12374)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2022-11-07_16:55:46
host : lab532-All-Series
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 12373)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Environment

fatal: not a git repository (or any parent up to mount point /media)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
sys.platform: linux
Python: 3.9.12 (main, Apr 5 2022, 06:56:58) [GCC 7.5.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0,1: NVIDIA GeForce GTX 1080 Ti
CUDA_HOME: /usr/local/cuda:/usr/local/cuda
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.10.1+cu111
PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 11.1
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  • CuDNN 8.0.5
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.11.2+cu111
OpenCV: 4.6.0
MMEngine: 0.3.0
MMCV: 2.0.0rc2
MMDetection: 3.0.0rc2
MMYOLO: 0.1.2+

Additional information

No response

@bichunyang419 Please set find_unused_parameters=True in your config. With model.backbone.frozen_stages=4 the frozen backbone parameters never receive gradients, so DistributedDataParallel raises the error above unless unused parameter detection is enabled.
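For reference, a minimal sketch of what that fix could look like as a derived config file. The base config and checkpoint names are taken from the command at the top of this report; the file name my_yolov5_s_balloon_frozen.py is only a hypothetical example, so adjust the paths to your setup.

# my_yolov5_s_balloon_frozen.py -- hypothetical config illustrating the suggested fix
_base_ = './yolov5_s-v61_syncbn_fast_1xb4-300e_balloon.py'

# Let DistributedDataParallel tolerate parameters that never receive gradients
# (the frozen backbone stages), which is what triggers the
# "Expected to have finished reduction" error above.
find_unused_parameters = True

# Same overrides that were passed via --cfg-options in the original command.
model = dict(backbone=dict(frozen_stages=4))
load_from = 'yolov5_s-v61_syncbn_fast_8xb16-300e_coco_20220918_084700-86e02187.pth'

Alternatively, the same flag can be passed without editing any file by appending --cfg-options find_unused_parameters=True to the dist_train.sh command.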