modelscope / 3D-Speaker

A Repository for Single- and Multi-modal Speaker Verification, Speaker Recognition and Speaker Diarization

language identification training problem

fclearner opened this issue · comments

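Hi all, does training keep crashing for you when you train language identification? On my side it crashes once after every training epoch. I tried the following changes:

(1) increasing the timeout: dist.init_process_group(backend='nccl', timeout=datetime.timedelta(seconds=7200000));
(2) reducing batch_size;
(3) setting num_workers to 0;
but it still fails with the error below: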

Python 3.8
torch version: 2.2.1+cu118
CUDA 11.6

Traceback (most recent call last):
  File "speakerlab/bin/train.py", line 182, in <module>
    main()
  File "speakerlab/bin/train.py", line 94, in main
    train_stats = train(
  File "speakerlab/bin/train.py", line 159, in train
    writer = SummaryWriter(tensorboard_dir)
  File "miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/tensorboardX/writer.py", line 300, in __init__
    self._get_file_writer()
  File "miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/tensorboardX/writer.py", line 348, in _get_file_writer
    self.file_writer = FileWriter(logdir=self.logdir,
  File "miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/tensorboardX/writer.py", line 104, in __init__
    self.event_writer = EventFileWriter(
  File "miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/tensorboardX/event_file_writer.py", line 113, in __init__
    self._worker.start()
  File "miniconda3/envs/3D-Speaker/lib/python3.8/threading.py", line 852, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
[2024-03-05 16:26:57,461] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2946 closing signal SIGTERM
[2024-03-05 16:26:59,546] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2945) of binary: miniconda3/envs/3D-Speaker/bin/python
Traceback (most recent call last):
  File "miniconda3/envs/3D-Speaker/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
speakerlab/bin/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-05_16:26:57
  host      : gpu02
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2945)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

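Judging from the error, the failure seems to come from the line writer = SummaryWriter(tensorboard_dir). Did you modify train.py? Language identification training normally does not crash. Could you try a fresh git clone?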

Yes, I did add some log-printing code, and I had not suspected it. Thanks a lot, I will go debug that first.
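
For anyone hitting the same "can't start new thread" error: the traceback shows that SummaryWriter is constructed inside train(), and that tensorboardX's EventFileWriter starts a worker thread on construction. One possible explanation, not confirmed in this thread, is that a writer gets created repeatedly (for example once per epoch) and never closed, so worker threads pile up until the process hits its thread limit. The sketch below illustrates that pattern with a hypothetical StandInWriter class in place of the real SummaryWriter, so it runs without tensorboardX; the function names and the log directory are also made up for illustration.

```python
import threading


class StandInWriter:
    """Hypothetical stand-in for a writer that owns a background thread.

    It is NOT tensorboardX; it only mimics "start a worker thread on
    construction, stop it on close()".
    """

    def __init__(self, logdir):
        self.logdir = logdir
        self._stop = threading.Event()
        # The worker simply blocks until close() sets the event.
        self._worker = threading.Thread(target=self._stop.wait, daemon=True)
        self._worker.start()

    def add_scalar(self, tag, value, step):
        pass  # a real writer would queue an event for the worker here

    def close(self):
        self._stop.set()
        self._worker.join()


def train_leaky(num_epochs, logdir="exp/tensorboard"):
    """Anti-pattern: a new writer per epoch that is never closed."""
    writers = []
    for epoch in range(num_epochs):
        writer = StandInWriter(logdir)  # starts one more thread each epoch
        writer.add_scalar("loss", 0.0, epoch)
        writers.append(writer)  # returned only so the caller can clean up
    return writers


def train_reusing(num_epochs, logdir="exp/tensorboard"):
    """Create the writer once, reuse it for every epoch, always close it."""
    writer = StandInWriter(logdir)
    try:
        for epoch in range(num_epochs):
            writer.add_scalar("loss", 0.0, epoch)
    finally:
        writer.close()


if __name__ == "__main__":
    before = threading.active_count()
    leaked = train_leaky(5)
    print("threads added by leaky loop:", threading.active_count() - before)
    for w in leaked:
        w.close()
    train_reusing(5)
    print("threads left after reuse:", threading.active_count() - before)
```

Logging threading.active_count() once per epoch in the real training script is a cheap way to check whether this is what is happening. With the real tensorboardX SummaryWriter, the equivalent fix would be to construct it once and call its close() method when training ends.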