language identification training problem
fclearner opened this issue · comments
Hi all, does your language identification training keep crashing? Mine dies once every epoch. I've tried:
(1) increasing the timeout: dist.init_process_group(backend='nccl', timeout=datetime.timedelta(seconds=7200000));
(2) reducing batch_size;
(3) setting num_workers to 0;
but it still fails with this error:
Python 3.8, torch 2.2.1+cu118, CUDA 11.6
Traceback (most recent call last):
File "speakerlab/bin/train.py", line 182, in <module>
main()
File "speakerlab/bin/train.py", line 94, in main
train_stats = train(
File "speakerlab/bin/train.py", line 159, in train
writer = SummaryWriter(tensorboard_dir)
File "miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/tensorboardX/writer.py", line 300, in __init__
self._get_file_writer()
File "miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/tensorboardX/writer.py", line 348, in _get_file_writer
self.file_writer = FileWriter(logdir=self.logdir,
File "miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/tensorboardX/writer.py", line 104, in __init__
self.event_writer = EventFileWriter(
File "miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/tensorboardX/event_file_writer.py", line 113, in __init__
self._worker.start()
File "miniconda3/envs/3D-Speaker/lib/python3.8/threading.py", line 852, in start
_start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
[2024-03-05 16:26:57,461] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2946 closing signal SIGTERM
[2024-03-05 16:26:59,546] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2945) of binary: miniconda3/envs/3D-Speaker/bin/python
Traceback (most recent call last):
File "miniconda3/envs/3D-Speaker/bin/torchrun", line 8, in <module>
sys.exit(main())
File "miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
speakerlab/bin/train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-03-05_16:26:57
host : gpu02
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2945)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
From the traceback, the failure seems to come from writer = SummaryWriter(tensorboard_dir). Did you modify the train.py code? Language identification training normally doesn't crash; maybe try a fresh git clone?
Right, I did add some logging code and hadn't suspected it was the cause. Thanks, I'll go debug it now.
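For future readers: a plausible mechanism for "RuntimeError: can't start new thread" here is that each SummaryWriter (via tensorboardX's EventFileWriter) starts a background worker thread, so constructing a new writer every epoch without closing it leaks one thread per epoch until the process hits its thread limit. The sketch below uses a stand-in class instead of tensorboardX (so it stays self-contained); EpochLogger and both training functions are hypothetical names, not part of 3D-Speaker:

```python
import threading

class EpochLogger:
    """Stand-in for a SummaryWriter-like object: each instance starts a
    background worker thread, mimicking tensorboardX's EventFileWriter."""
    def __init__(self):
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._stop.wait, daemon=True)
        self._worker.start()

    def close(self):
        # Signal the worker to exit and wait for it, releasing the thread.
        self._stop.set()
        self._worker.join()

def leaky_training(epochs):
    # Anti-pattern: a fresh writer (and thread) per epoch, never closed.
    for _ in range(epochs):
        EpochLogger()
    return threading.active_count()

def fixed_training(epochs):
    # Fix: create one writer, reuse it across epochs, close it at the end.
    logger = EpochLogger()
    for _ in range(epochs):
        pass  # train one epoch, logging through the same writer
    logger.close()
    return threading.active_count()
```

Running fixed_training leaves the thread count unchanged, while leaky_training grows it by one thread per epoch, which is the failure mode the traceback shows once the OS refuses to start another thread.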