CoinCheung / BiSeNet

Add bisenetv2. My implementation of BiSeNet

Error when training on a single machine with a single GPU

AD122583SD opened this issue

When training on a single machine with a single GPU:
python -m torch.distributed.launch --nproc_per_node=1 tools/train_amp.py --config configs/bisenetv2_city.py

the following error appears:
usage: train_amp.py [-h] [--config CONFIG] [--finetune-from FINETUNE_FROM]
train_amp.py: error: unrecognized arguments: --local_rank=0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 24379) of binary: /home/cby/anaconda3/envs/bisenet/bin/python
Is there any way to solve this?

Try this:

torchrun --nproc_per_node=1 tools/train_amp.py --config configs/bisenetv2_city.py
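For context: the error above appears because torch.distributed.launch injects a --local_rank=0 argument into the command line, while torchrun passes the local rank through the LOCAL_RANK environment variable instead. Below is a minimal sketch (not the repository's actual tools/train_amp.py) of an argument parser that tolerates both launchers; only --config and --finetune-from come from the usage line above, the --local_rank handling is illustrative:

import argparse
import os

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--config', type=str, default='configs/bisenetv2_city.py')
    parser.add_argument('--finetune-from', type=str, default=None)
    # accept the flag injected by torch.distributed.launch so argparse does not abort
    parser.add_argument('--local_rank', type=int, default=-1)
    args = parser.parse_args()
    # torchrun sets LOCAL_RANK in the environment instead of passing a flag
    if args.local_rank < 0:
        args.local_rank = int(os.environ.get('LOCAL_RANK', 0))
    return args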

Hello. I have a similar problem when training on a single machine with a single card:
when I run torchrun --nproc_per_node=1 tools/train_amp.py --config configs/bisenetv2_city.py, it fails with the error "failed to create process".
I've searched for a solution but haven't found one that works. I would like to ask if there is any way to fix this.

@AtaraxyAdong @AD122583SD What is your platform, please? How did you launch training? What does the error message look like?

Please add CUDA_VISIBLE_DEVICES=0 if your machine has more than one GPU.

@CoinCheung Thanks for your advice; I solved this problem.
The error message was only "failed to create process". Later I realized that my machine's GPU memory was too small. After switching to a machine with more GPU memory, it ran normally.

When training on a single machine with a single GPU:
export CUDA_VISIBLE_DEVICES=0
torchrun --nproc_per_node=1 tools/train_amp.py --config configs/bisenetv2_city.py
the following error appears:
Traceback (most recent call last):
File "tools/train_amp.py", line 268, in
main()
File "tools/train_amp.py", line 264, in main
train()
File "tools/train_amp.py", line 193, in train
optim = set_optimizer(net)
File "tools/train_amp.py", line 70, in set_optimizer
wd_params, nowd_params, lr_mul_wd_params, lr_mul_nowd_params = model.get_params(),
ValueError: not enough values to unpack (expected 4, got 1)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 6160) of binary: /home/zy/miniconda3/envs/bise/bin/python
Traceback (most recent call last):
File "/home/zy/miniconda3/envs/bise/bin/torchrun", line 8, in
sys.exit(main())
File "/home/zy/miniconda3/envs/bise/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/zy/miniconda3/envs/bise/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
run(args)
File "/home/zy/miniconda3/envs/bise/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/home/zy/miniconda3/envs/bise/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/zy/miniconda3/envs/bise/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

tools/train_amp.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-05-15_10:31:29
host : com
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 6160)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Is there any way to solve this?
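One hedged observation about the traceback above: the quoted source line at tools/train_amp.py line 70 ends with a comma after model.get_params(). If that comma is really present in the file, the right-hand side becomes a one-element tuple, which by itself produces "not enough values to unpack (expected 4, got 1)"; the same message would also appear if get_params itself returned a single value. A standalone reproduction of that Python behaviour (get_params here is a stand-in, not the model's real method):

def get_params():
    # stand-in returning four parameter lists, as the unpacking expects
    return [], [], [], []

wd, nowd, lr_mul_wd, lr_mul_nowd = get_params()    # OK: four values unpack cleanly

try:
    wd, nowd, lr_mul_wd, lr_mul_nowd = get_params(),    # trailing comma wraps the result in a 1-tuple
except ValueError as e:
    print(e)    # not enough values to unpack (expected 4, got 1)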

Sorry for being so late. Have you solved this problem?

Then what is your platform like? And what modifications did you make to the original code?

I have the same problem as you. May I ask whether you solved it, and how? Hope you reply, thanks!

I am closing this since no more information has been provided; maybe this is no longer a problem.