modelscope / 3D-Speaker

A Repository for Single- and Multi-modal Speaker Verification, Speaker Recognition and Speaker Diarization

Problem with training part.

NathanJHLee opened this issue · comments

Hi, I am Nathan and I am facing a problem with the training part.

My environment:
CentOS 7.5
# pip
pytorch-wpe 0.0.1
rotary-embedding-torch 0.5.3
torch 1.12.1+cu113  # To use CUDA, I reinstalled torch and torchaudio.
torch-complex 0.4.3
torchaudio 0.12.1+cu113
torchvision 0.13.1+cu113

# rpm
libcudnn8-devel-8.2.0.53-1.cuda11.3.x86_64
libcudnn8-8.2.0.53-1.cuda11.3.x86_64

libnccl-devel-2.9.9-1+cuda11.3.x86_64
libnccl-2.9.9-1+cuda11.3.x86_64
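
For reference, a quick sanity check of what this torch build actually sees (a minimal sketch; the expected values in the comments are taken from the list above):

```python
import torch
import torch.distributed as dist

print("torch:", torch.__version__)              # expect 1.12.1+cu113 per the pip list
print("CUDA runtime:", torch.version.cuda)      # expect 11.3
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
print("NCCL available:", dist.is_nccl_available())
print("bundled NCCL:", torch.cuda.nccl.version())  # PyTorch ships its own NCCL build
```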

To run training, I follow 'egs/voxceleb/sv-ecapa/run.sh'.
I set 4 GPUs. (It does not work with a single GPU either.)
But I get the error below.

Stage3: Training the speaker model...
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
2024-02-15 14:31:58,001 - INFO: Use GPU: 3 for training.
2024-02-15 14:31:58,003 - INFO: Use GPU: 2 for training.
2024-02-15 14:31:58,009 - INFO: Use GPU: 1 for training.
2024-02-15 14:31:58,011 - INFO: Use GPU: 0 for training.
Traceback (most recent call last):
File "speakerlab/bin/train.py", line 176, in
main()
File "speakerlab/bin/train.py", line 60, in main
model = torch.nn.parallel.DistributedDataParallel(model)
File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 646, in init
Traceback (most recent call last):
File "speakerlab/bin/train.py", line 176, in
_verify_param_shape_across_processes(self.process_group, parameters)
File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
main()
File "speakerlab/bin/train.py", line 60, in main
model = torch.nn.parallel.DistributedDataParallel(model)
File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 646, in init
_verify_param_shape_across_processes(self.process_group, parameters)
File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
Traceback (most recent call last):
File "speakerlab/bin/train.py", line 176, in
main()
File "speakerlab/bin/train.py", line 60, in main
model = torch.nn.parallel.DistributedDataParallel(model)
File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 646, in init
_verify_param_shape_across_processes(self.process_group, parameters)
File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
Traceback (most recent call last):
File "speakerlab/bin/train.py", line 176, in
main()
File "speakerlab/bin/train.py", line 60, in main
model = torch.nn.parallel.DistributedDataParallel(model)
File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 646, in init
_verify_param_shape_across_processes(self.process_group, parameters)
File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 121550 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 121547) of binary: /home/asr/miniconda3/envs/3D-Speaker/bin/python
Traceback (most recent call last):
File "/home/asr/miniconda3/envs/3D-Speaker/bin/torchrun", line 8, in
sys.exit(main())
File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

speakerlab/bin/train.py FAILED

Failures:
[1]:
time : 2024-02-15_14:32:03
host : e7bcf3a85e2c
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 121548)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-02-15_14:32:03
host : e7bcf3a85e2c
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 121549)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-02-15_14:32:03
host : e7bcf3a85e2c
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 121547)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
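
The NCCL error itself suggests checking NCCL warnings for the failure reason; one way to surface them is to set NCCL's standard debug variables in the training process before the process group is created (a minimal sketch, not part of the recipe):

```python
import os

# Standard NCCL environment variables; set before torch.distributed.init_process_group
# (or export them in the shell before launching torchrun).
os.environ["NCCL_DEBUG"] = "INFO"        # print NCCL warnings and info to stderr
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"  # include all NCCL subsystems in the output
```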

I successfully re-cloned the repository and executed the run.sh script without encountering any errors. The versions of PyTorch and CUDA installed on my system are 1.12.0 and 10.2, respectively.

You could try the following steps:

  1. Verify the execution permissions of the Python script to make sure it is executable.
  2. The speakerlab in the run.sh directory is a symbolic link. Consider copying the directory that 3D-Speaker/speakerlab points to and replacing the symbolic link with the actual directory. (A sketch of both checks follows this list.)
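
For reference, a minimal Python sketch of both checks, assuming it is run from the recipe directory containing run.sh (the paths are illustrative):

```python
import os
import shutil
import stat

script = "speakerlab/bin/train.py"  # training script invoked by run.sh (illustrative path)
link = "speakerlab"                 # symbolic link present in the recipe directory

# 1. Ensure the training script carries execute permission.
if not os.access(script, os.X_OK):
    os.chmod(script, os.stat(script).st_mode | stat.S_IXUSR)

# 2. Replace the speakerlab symlink with a real copy of the directory it points to.
if os.path.islink(link):
    target = os.path.realpath(link)
    os.unlink(link)
    shutil.copytree(target, link)
```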

Hi, I have one more question.

I believe I have solved the NCCL problem, but I have run into another error.

The error occurs at 'model = torch.nn.parallel.DistributedDataParallel(model)' in train.py.
It is a tensor size mismatch; I think '192' is the embedding_size from ecapa_tdnn.yaml.
Please check my error log.
Thank you.

Here is the error log when I try to use a single GPU.
Stage3: Training the speaker model...
2024-02-22 18:15:58,831 - INFO: Use GPU: 1 for training.
d5acf849f4d8:167887:167887 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
d5acf849f4d8:167887:167887 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

d5acf849f4d8:167887:167887 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
d5acf849f4d8:167887:167887 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
d5acf849f4d8:167887:167887 [1] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.1
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 00/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 01/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 02/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 03/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 04/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 05/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 06/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 07/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 08/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 09/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 10/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 11/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 12/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 13/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 14/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 15/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 16/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 17/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 18/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 19/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 20/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 21/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 22/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 23/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 24/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 25/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 26/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 27/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 28/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 29/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 30/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 31/32 : 0
d5acf849f4d8:167887:167958 [1] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
d5acf849f4d8:167887:167958 [1] NCCL INFO Setting affinity for GPU 1 to 5555,55555555,55555555
d5acf849f4d8:167887:167958 [1] NCCL INFO Connected all rings
d5acf849f4d8:167887:167958 [1] NCCL INFO Connected all trees
d5acf849f4d8:167887:167958 [1] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
d5acf849f4d8:167887:167958 [1] NCCL INFO comm 0x7fa6b0002010 rank 0 nranks 1 cudaDev 1 busId 13000 - Init COMPLETE
Traceback (most recent call last):
File "speakerlab/bin/train.py", line 193, in
main()
File "speakerlab/bin/train.py", line 70, in main
model = torch.nn.parallel.DistributedDataParallel(model)
File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 580, in init
self._sync_params_and_buffers(authoritative_rank=0)
File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 597, in _sync_params_and_buffers
self._distributed_broadcast_coalesced(
File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1334, in _distributed_broadcast_coalesced
dist._broadcast_coalesced(

RuntimeError: The size of tensor a (192) must match the size of tensor b (0) at non-singleton dimension 1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 167887) of binary: /home/asr/miniconda3/envs/3D-Speaker/bin/python
Traceback (most recent call last):
File "/home/asr/miniconda3/envs/3D-Speaker/bin/torchrun", line 8, in
sys.exit(main())
File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
run(args)
File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

speakerlab/bin/train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-02-22_18:16:07
host : d5acf849f4d8
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 167887)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
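
A minimal diagnostic sketch for locating the zero-sized tensor, assuming access to the `model` object built in train.py just before the DistributedDataParallel wrap (the helper name is hypothetical):

```python
import torch

def report_empty_tensors(model: torch.nn.Module) -> None:
    # DDP broadcasts parameters and buffers when it is constructed, which is
    # where the "size of tensor a (192) ... tensor b (0)" error is raised.
    # Printing any zero-sized entries can help identify which module ended up
    # with an empty weight (for example, a classifier built with zero classes).
    for name, p in model.named_parameters():
        if 0 in p.shape:
            print(f"empty parameter: {name} {tuple(p.shape)}")
    for name, b in model.named_buffers():
        if 0 in b.shape:
            print(f"empty buffer: {name} {tuple(b.shape)}")

# Call just before: model = torch.nn.parallel.DistributedDataParallel(model)
```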