CoinCheung / BiSeNet

Add bisenetv2. My implementation of BiSeNet

Error when training on a single machine with a single GPU

AD122583SD opened this issue

When training on a single machine with a single GPU:
python -m torch.distributed.launch --nproc_per_node=1 tools/train_amp.py --config configs/bisenetv2_city.py

the following error appears:
usage: train_amp.py [-h] [--config CONFIG] [--finetune-from FINETUNE_FROM]
train_amp.py: error: unrecognized arguments: --local_rank=0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 24379) of binary: /home/cby/anaconda3/envs/bisenet/bin/python
Is there any way to solve this?

Try this:

torchrun --nproc_per_node=1 tools/train_amp.py --config configs/bisenetv2_city.py
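For context: the error above appears because torch.distributed.launch injects a --local_rank=0 argument into the command line, while torchrun passes the local rank through the LOCAL_RANK environment variable instead. Below is a minimal sketch (not the repository's actual tools/train_amp.py) of an argument parser that tolerates both launchers; only --config and --finetune-from come from the usage line above, the --local_rank handling is illustrative:

import argparse
import os

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--config', type=str, default='configs/bisenetv2_city.py')
    parser.add_argument('--finetune-from', type=str, default=None)
    # accept the flag injected by torch.distributed.launch so argparse does not abort
    parser.add_argument('--local_rank', type=int, default=-1)
    args = parser.parse_args()
    # torchrun sets LOCAL_RANK in the environment instead of passing a flag
    if args.local_rank < 0:
        args.local_rank = int(os.environ.get('LOCAL_RANK', 0))
    return args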

Hello. I have a similar problem when training on a single machine with a single card:
when I run torchrun --nproc_per_node=1 tools/train_amp.py --config configs/bisenetv2_city.py, it fails with the error "failed to create process".
I've searched for a solution but haven't found one that works. I would like to ask if there is any way to fix this.

@AtaraxyAdong @AD122583SD What is your platform, please? How did you launch training? What does the error message look like?

Please add CUDA_VISIBLE_DEVICES=0 if your machine has more than one GPU.

@CoinCheung Thanks for your advice; I solved this problem.
The error message was only "failed to create process". Later I realized that my machine's GPU memory was too small. After switching to a machine with more GPU memory, it ran normally.

When training on a single machine with a single GPU:
export CUDA_VISIBLE_DEVICES=0
torchrun --nproc_per_node=1 tools/train_amp.py --config configs/bisenetv2_city.py
the following error appears:
Traceback (most recent call last):
File "tools/train_amp.py", line 268, in
main()
File "tools/train_amp.py", line 264, in main
train()
File "tools/train_amp.py", line 193, in train
optim = set_optimizer(net)
File "tools/train_amp.py", line 70, in set_optimizer
wd_params, nowd_params, lr_mul_wd_params, lr_mul_nowd_params = model.get_params(),
ValueError: not enough values to unpack (expected 4, got 1)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 6160) of binary: /home/zy/miniconda3/envs/bise/bin/python
Traceback (most recent call last):
File "/home/zy/miniconda3/envs/bise/bin/torchrun", line 8, in
sys.exit(main())
File "/home/zy/miniconda3/envs/bise/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/zy/miniconda3/envs/bise/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
run(args)
File "/home/zy/miniconda3/envs/bise/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/home/zy/miniconda3/envs/bise/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/zy/miniconda3/envs/bise/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

tools/train_amp.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-05-15_10:31:29
host : com
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 6160)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Is there any way to solve this?
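One hedged observation about the traceback above: the quoted source line at tools/train_amp.py line 70 ends with a comma after model.get_params(). If that comma is really present in the file, the right-hand side becomes a one-element tuple, which by itself produces "not enough values to unpack (expected 4, got 1)"; the same message would also appear if get_params itself returned a single value. A standalone reproduction of that Python behaviour (get_params here is a stand-in, not the model's real method):

def get_params():
    # stand-in returning four parameter lists, as the unpacking expects
    return [], [], [], []

wd, nowd, lr_mul_wd, lr_mul_nowd = get_params()    # OK: four values unpack cleanly

try:
    wd, nowd, lr_mul_wd, lr_mul_nowd = get_params(),    # trailing comma wraps the result in a 1-tuple
except ValueError as e:
    print(e)    # not enough values to unpack (expected 4, got 1)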

Sorry for being so late. Have you solved this problem?

Then what is your platform like? And what modifications did you make to the original code?

I have the same problem as you. May I ask whether you solved it, and how? Hope you reply, thanks!

I am closing this since no more information has been provided; maybe this is no longer a problem.