NVlabs / eg3d

torch.multiprocessing.spawn.ProcessRaisedException:

KID-1412-git opened this issue

Hi,
When I run train.py on 8 A100 GPUs, I get the following error:
Creating output directory...
Launching processes...
Loading training set...

Num images: 22990
Image shape: [3, 512, 512]
Label shape: [25]

Constructing networks...
Traceback (most recent call last):
  File "/liudw/head_avatar/eg3d/eg3d/train.py", line 396, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "/liudw/Myminiconda/envs/eg3d/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/liudw/Myminiconda/envs/eg3d/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/liudw/Myminiconda/envs/eg3d/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/liudw/Myminiconda/envs/eg3d/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/liudw/head_avatar/eg3d/eg3d/train.py", line 391, in main
    launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
  File "/liudw/head_avatar/eg3d/eg3d/train.py", line 103, in launch_training
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(c, temp_dir), nprocs=c.num_gpus)
  File "/liudw/Myminiconda/envs/eg3d/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/liudw/Myminiconda/envs/eg3d/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/liudw/Myminiconda/envs/eg3d/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/liudw/Myminiconda/envs/eg3d/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/liudw/head_avatar/eg3d/eg3d/train.py", line 52, in subprocess_fn
    training_loop.training_loop(rank=rank, **c)
  File "/liudw/head_avatar/eg3d/eg3d/training/training_loop.py", line 158, in training_loop
    G = dnnlib.util.construct_class_by_name(**G_kwargs, **common_kwargs).train().requires_grad_(False).to(device) # subclass of torch.nn.Module
  File "/liudw/Myminiconda/envs/eg3d/lib/python3.10/site-packages/torch/nn/modules/module.py", line 927, in to
    return self._apply(convert)
  File "/liudw/Myminiconda/envs/eg3d/lib/python3.10/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/liudw/Myminiconda/envs/eg3d/lib/python3.10/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/liudw/Myminiconda/envs/eg3d/lib/python3.10/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/liudw/Myminiconda/envs/eg3d/lib/python3.10/site-packages/torch/nn/modules/module.py", line 602, in _apply
    param_applied = fn(param)
  File "/liudw/Myminiconda/envs/eg3d/lib/python3.10/site-packages/torch/nn/modules/module.py", line 925, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: uncorrectable ECC error encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
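
The failure is raised in process 3 while the generator is being moved to its device with `.to(device)`, and the message is an ECC (GPU memory) error rather than a Python exception from the training code. One way to narrow this down might be to try a tiny allocation on each GPU separately and see which index fails. The sketch below is only an illustration: `probe_devices` and `_torch_probe` are hypothetical helper names, not part of eg3d or PyTorch, and it assumes that the spawned process index maps to the CUDA device index of the same number.

```python
from typing import Callable, Dict, Optional


def _torch_probe(index: int) -> None:
    """Allocate a small tensor on cuda:<index> and wait for the device to finish.

    Hypothetical helper; torch is imported lazily so the rest of the module
    can be imported on a machine without PyTorch or a GPU.
    """
    import torch

    torch.zeros(1024, device=f"cuda:{index}")
    torch.cuda.synchronize(index)  # surface asynchronous CUDA errors here


def probe_devices(
    num_devices: int,
    probe: Callable[[int], None] = _torch_probe,
) -> Dict[int, Optional[str]]:
    """Run `probe` on each device index; map index -> error text, or None if it passed."""
    results: Dict[int, Optional[str]] = {}
    for index in range(num_devices):
        try:
            probe(index)
            results[index] = None
        except RuntimeError as err:  # PyTorch reports CUDA errors as RuntimeError
            results[index] = str(err)
    return results


# Example use on the training machine (requires PyTorch and visible GPUs):
#   import torch
#   for index, error in probe_devices(torch.cuda.device_count()).items():
#       print(index, "ok" if error is None else error)
```

If only one index reports an error, restricting training to the other GPUs (for example via the `CUDA_VISIBLE_DEVICES` environment variable) may be a workaround until the hardware is checked. A CUDA error can leave the current process in a bad state, so if several indices fail in one run, the first failing index is the one to look at, and probing each device in a fresh process is likely more reliable.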