NCCL Tests
Stargazers: 705
Watchers: 28
Issues: 186
Forks: 214
NVIDIA/nccl-tests Issues
Multi node test hang phenomenon
Updated
3 months ago
How is the maximum number of bytes for all_reduce operation calculated?
Updated
3 months ago
Comments count
3
Interaction between NCCL_IB_SL and NCCL_IB_ADAPTIVE_ROUTING
Updated
3 months ago
How to explain Bus Bandwidth in Allreduce Operation?
Updated
3 months ago
busbw exceeds network bandwidth (2 nodes, 16 gpus, 100Gbps intel NIC, no NVSwitch) - what algorithm is used?
Closed
4 months ago
Comments count
2
undefined reference to ncclRedOpDestroy
Updated
4 months ago
Comments count
2
all_reduce_perf between NVLINK connected H100 PCIe GPUs lower than A100 SXM4 GPUs
Updated
4 months ago
NCCL Test hang when the number of nodes goes beyond 18, and CPU usage is very high
Updated
5 months ago
Comments count
1
hypercube out-of-bound errors with single-proc + `gpus-per-thread=4`, not with multi-proc + `gpus-per-thread=1`
Updated
5 months ago
Comments count
1
NCCL Test Does not work with GID 3 or GID 1, but it works fine for GID 0
Updated
5 months ago
A100 - All reduce performance
Updated
5 months ago
Comments count
1
nccl-tests result is only a half of ib_write_bw
Updated
5 months ago
misc/socket.cc:441 NCCL WARN socketFinalizeAccept: wrong type 4 != 3
Closed
5 months ago
Comments count
6
NCCL alltoall_perf hangs via PXN
Closed
5 months ago
Comments count
1
Expected bandwidth results? 8x A100 GPUs over NVLink
Updated
6 months ago
Comments count
9
Bandwidth result not equal to ib_write_bw result
Closed
10 months ago
Comments count
3
misc/ibvwrap.cc:187 NCCL WARN Call to ibv_modify_qp failed with error Network is unreachable
Updated
6 months ago
Comments count
3
all_reduce_perf(--op='sum') get wrong results when size is over specific value
Closed
6 months ago
Comments count
9
how can i run nccl-test use max bandwidth
Updated
6 months ago
Test NCCL failure common.cu:954 'unhandled cuda error' when test on >2 GPUs
Closed
6 months ago
Comments count
4
Although it is an InfiniBand environment, it seems that the average Bandwidth is not as good as expected.
Updated
6 months ago
Comments count
4
Test NCCL failure common.cu:958 'internal error - please report this issue to the NCCL developers / '
Closed
6 months ago
Comments count
10
Nsight Profiling: one ncclAllReduce takes too long
Updated
6 months ago
nccl-test is throwing timeout error on two nodes
Updated
6 months ago
Comments count
26
AlltoAllGetBw is incorrect when used with multiple nodes
Updated
6 months ago
Comments count
1
./build/all_reduce_perf between nodes failed
Updated
6 months ago
Comments count
1
bus error
Closed
7 months ago
Comments count
3
Two A800 nodes cannot reach ideal all-reduce performance
Updated
7 months ago
Comments count
18
what does error in nccl-test output represent?
Updated
7 months ago
Comments count
3
Two A100 nodes cannot reach ideal all-reduce performance
Updated
7 months ago
Comments count
4
Test in dockers of multi-node
Updated
7 months ago
Issue Running NCCL Tests on Gentoo with Varying GPU Availability: CUDA failure common.cu:892 'invalid device ordinal'
Closed
7 months ago
Comments count
3
When I am running on multiple nodes, I can get the corresponding results when running on 3 nodes, and an exception will occur when more than 3 nodes are executed.
Updated
7 months ago
Comments count
3
Why need more than one iteration to check data?
Closed
7 months ago
Comments count
4
No explanation on BusBW factor regarding alltoall in docs
Updated
7 months ago
if the bandwidth results of the Nccl test are related to the number of nodes?
Updated
8 months ago
Comments count
2
unhandled cuda error during test
Closed
8 months ago
Comments count
1
Calculating "net_bw" in addition to "bus_bw"
Updated
9 months ago
Test CUDA failure common.cu:892 'invalid device ordinal'
Closed
10 months ago
Comments count
10
when i am running this command : mpirun -np 1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2. I found this
Updated
10 months ago
Comments count
2
Nccl test fails on 8 x V100- misc/socket.cc:483 NCCL WARN socketStartConnect: Connect to xxx failed : Software caused connection abort
Closed
10 months ago
Comments count
9
Origin of Poor Internode NCCL Performance
Closed
10 months ago
Comments count
11
What does algbw actually mean when I run the test on more than one node: speed between nodes or speed between GPUs?
Closed
10 months ago
Comments count
3
`busbw` does not reflect the speed of hardware bottleneck in H800
Updated
10 months ago
Comments count
7
The difference between algbw and busbw
Updated
10 months ago
Debugging with cuda-gdb causes problems
Updated
10 months ago
all_reduce_perf fails on 2 nodes
Closed
a year ago
Comments count
2
test error: stuck when run test example
Updated
a year ago
Comments count
4
question regarding versioning
Closed
a year ago
nvcc fatal : Unsupported gpu architecture 'compute_35' make[1]: *** [Makefile:84: ../build/all_reduce.o] Error 1 for nvcr.io/nvidia/pytorch:23.02-py3
Closed
a year ago
Comments count
2