WongKinYiu / yolov7

Implementation of paper - YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors

Problem encountered during training on a 4090

ZhenshengWu opened this issue · comments

CUDA version (nvidia-smi output):
Thu Mar 28 13:09:21 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA Graphics... On | 00000000:17:00.0 Off | Off |
| 66% 28C P8 25W / 450W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA Graphics... On | 00000000:18:00.0 Off | Off |
| 66% 33C P8 27W / 450W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA Graphics... On | 00000000:31:00.0 Off | Off |
| 66% 29C P8 23W / 450W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA Graphics... On | 00000000:32:00.0 Off | Off |
| 65% 29C P8 18W / 450W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA Graphics... On | 00000000:4B:00.0 Off | Off |
| 68% 28C P8 22W / 450W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA Graphics... On | 00000000:67:00.0 Off | Off |
| 66% 32C P8 29W / 450W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA Graphics... On | 00000000:98:00.0 Off | Off |
| 63% 35C P8 17W / 450W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA Graphics... On | 00000000:E3:00.0 Off | Off |
| 66% 32C P8 25W / 450W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
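
Since nvidia-smi reports all eight cards as essentially idle (1 MiB used) while the job still fails with out-of-memory, it may be worth confirming that PyTorch itself can create a CUDA context and allocate on every card inside the training environment. A minimal sketch, run in the same conda env, using only standard torch.cuda calls:

# Sanity check: can PyTorch allocate a tensor on every visible GPU?
python3 - <<'EOF'
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)   # free/total device memory in bytes
    x = torch.ones(1, device=f"cuda:{i}")      # forces CUDA context creation on card i
    print(f"cuda:{i} {torch.cuda.get_device_name(i)} "
          f"free={free / 2**30:.1f} GiB / total={total / 2**30:.1f} GiB, alloc ok={bool(x.item())}")
EOF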

Run command:
python3 -m torch.distributed.run --nproc_per_node 8 --master_port 29519 train.py --sync-bn --cfg cfg_fire_and_smoke/yolov7_fire_smoke.yaml --data cfg_fire_and_smoke/fire_smoke_data.yaml --img-size 640 --batch-size 16 --weights '' --device 0,1,2,3,4,5,6,7
The same command runs fine on another GPU machine.
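
To narrow down whether this is real memory pressure or a per-device initialization problem, one option is to first reproduce on a single card with a small batch and then scale back up. A sketch using the same config/data paths and the standard train.py flags (the batch sizes here are arbitrary small values):

# 1) single-GPU sanity run on GPU 0 (no --sync-bn needed without DDP)
python3 train.py --cfg cfg_fire_and_smoke/yolov7_fire_smoke.yaml --data cfg_fire_and_smoke/fire_smoke_data.yaml --img-size 640 --batch-size 4 --weights '' --device 0

# 2) two-GPU DDP run, restricting which cards the job can see
CUDA_VISIBLE_DEVICES=0,1 python3 -m torch.distributed.run --nproc_per_node 2 --master_port 29519 train.py --sync-bn --cfg cfg_fire_and_smoke/yolov7_fire_smoke.yaml --data cfg_fire_and_smoke/fire_smoke_data.yaml --img-size 640 --batch-size 4 --weights '' --device 0,1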

I got this error:
/root/anaconda3/envs/test/lib/python3.8/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1678402421473/work/aten/src/ATen/native/TensorShape.cpp:3483.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
(the warning above is printed once per worker process; repeated copies omitted)

Model Summary: 415 layers, 37201950 parameters, 37201950 gradients, 105.1 GFLOPS
Traceback (most recent call last):
  File "train.py", line 616, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 95, in train
    model = Model(opt.cfg, ch=3, nc=nc, anchors=hyp.get('anchors')).to(device)  # create
  File "/root/anaconda3/envs/test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/root/anaconda3/envs/test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/root/anaconda3/envs/test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/root/anaconda3/envs/test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/root/anaconda3/envs/test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/root/anaconda3/envs/test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

(the same traceback is raised by all eight worker processes; the interleaved duplicate copies are omitted here)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 13378) of binary: /root/anaconda3/envs/test/bin/python3
Traceback (most recent call last):
  File "/root/anaconda3/envs/test/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/anaconda3/envs/test/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/anaconda3/envs/test/lib/python3.8/site-packages/torch/distributed/run.py", line 798, in <module>
    main()
  File "/root/anaconda3/envs/test/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/test/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/root/anaconda3/envs/test/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/anaconda3/envs/test/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/test/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
[1]:
time : 2024-03-28_13:15:41
host : root123
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 13379)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-03-28_13:15:41
host : root123
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 13380)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-03-28_13:15:41
host : root123
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 13381)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
time : 2024-03-28_13:15:41
host : root123
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 13382)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
time : 2024-03-28_13:15:41
host : root123
rank : 5 (local_rank: 5)
exitcode : 1 (pid: 13383)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
time : 2024-03-28_13:15:41
host : root123
rank : 6 (local_rank: 6)
exitcode : 1 (pid: 13384)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
time : 2024-03-28_13:15:41
host : root123
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 13385)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-03-28_13:15:41
host : root123
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 13378)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

I do not know why I got this. How should I solve this problem? Thanks very much.

The error occurs because you ran out of memory on your GPU. One way to solve it is to reduce the batch size until your code runs without this error.
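
For the launch command in this issue, that suggestion would look roughly like the line below. Note that in this repo the --batch-size value is, as far as I can tell, the total batch that gets split across the DDP workers, so dropping it (or --img-size) in small steps is a quick way to rule memory pressure in or out:

# same launch, smaller total batch (split across the 8 workers)
python3 -m torch.distributed.run --nproc_per_node 8 --master_port 29519 train.py --sync-bn --cfg cfg_fire_and_smoke/yolov7_fire_smoke.yaml --data cfg_fire_and_smoke/fire_smoke_data.yaml --img-size 640 --batch-size 8 --weights '' --device 0,1,2,3,4,5,6,7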