nccl

There are 0 repository under nccl topic.

cupy / cupy
NumPy & SciPy for GPU
cublas cuda cudnn cupy curand cusolver cusparse cusparselt cutensor gpu nccl numpy nvrtc nvtx python rocm scipy tensor
Language:Python 10597
coreylowman / cudarc
Safe rust wrapper around CUDA toolkit
cuda cuda-programming cuda-toolkit gpu gpu-acceleration rust cublas curand cuda-kernels nvrtc cudnn nccl
Language:Rust 963
huggingface / llm_training_handbook
An open collection of methodologies to help with successful training of large language models.
cuda large-language-models llm nccl nlp performance python pytorch scalability troubleshooting
Language:Python 538
LambdaLabsML / distributed-training-guide
Best practices & guides on how to write distributed pytorch training code
cuda deepspeed distributed-training gpu gpu-cluster kuberentes nccl pytorch slurm cluster fsdp lambdalabs mpi sharding
Language:Python 530
huggingface / large_language_model_training_playbook
An open collection of implementation tips, tricks and resources for training large language models
cuda llm nccl nlp performance python pytorch scalability troubleshooting large-language-models
Language:Python 481
FZJ-JSC / tutorial-multi-gpu
Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial
hpc mpi exascale-computing gpu multi-gpu sc21 supercomputing nccl nvshmem cuda isc22 sc22 sc23 isc23 isc24 isc25 sc24
Language:Cuda 318
Bluefog-Lib / bluefog
Distributed and decentralized training framework for PyTorch over graph
asynchronous decentralized deeplearning distributed-computing machine-learning mpi nccl one-sided pytorch
Language:Python 253
microsoft / msrflute
Federated Learning Utilities and Tools for Experimentation
federated-learning distributed-learning gloo machine-learning nccl personalization privacy-tools pytorch simulation transformers-models
Language:Python 191
google / nccl-fastsocket
NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud.
nccl training machine-learning
Language:C++ 120
openhackathons-org / nways_multi_gpu
N-Ways to Multi-GPU Programming
cuda hpc mpi nccl nsight-systems nvshmem
Language:C 37
muriloboratto / NCCL
Sample examples of how to call collective operation functions on multi-GPU environments. A simple example of using broadcast, reduce, allGather, reduceScatter and sendRecv operations.
nccl cuda mpi laplacian-matrix
35
JuliaGPU / NCCL.jl
A Julia wrapper for the NVIDIA Collective Communications Library.
cuda gpu julia nccl
Language:Julia 29
lanl / pyDNMFk
Python Distributed Non Negative Matrix Factorization with custom clustering
tensorfactorization nonnegative-matrix-factorization distributed-computing hpc mpi4py latent-features cupy machine-learning nccl outofmemory python
Language:Python 24
Zerohertz / Instruct_KR_2025_Summer_Meetup_vLLM
🎹 Instruct.KR 2025 Summer Meetup: 오픈소스 LLM, vLLM으로 Production까지 🎹
cuda kubernetes llm mlops nccl rdma serving vllm ray
Language:Shell 21
1duo / nccl-examples
NCCL Examples from Official NVIDIA NCCL Developer Guide.
deep-learning distributed-systems nccl nvidia
Language:CMake 19
BaguaSys / bagua-net
High performance NCCL plugin for Bagua.
nccl bagua distributed-computing
Language:Rust 15
Zerohertz / PyCon_KR_2025_Tutorial_vLLM
🐍 PyCon Korea 2025 Tutorial: vLLM의 OpenAI-Compatible Server 톺아보기 🐍
cuda kubernetes llm mlops nccl pycon ray rdma serving vllm
Language:Shell 10
YinLiu-91 / ncclOperationPlus
use ncclSend ncclRecv realize ncclSendrecv ncclGather ncclScatter ncclAlltoall
nccl cuda ncclsendrecv ncclgather ncclscatter ncclalltoall mpi cpp cplusplus
Language:Cuda 8
YconquestY / nccl
Summary of call graphs and data structures of NVIDIA Collective Communication Library (NCCL)
allreduce collective-communication computer-network nccl
Language:D2 7
lancelee82 / pynccl
Nvidia NCCL2 Python bindings using ctypes and numba.
nccl numba ctypes python
Language:Python 6
UCBerkeley-Spring2022-CS267-project / blinkplus
Blink+: Increase GPU group bandwidth by utilizing across tenant NVLink.
collective-communication gpu nccl
Language:Jupyter Notebook 6
asprenger / distributed-training-patterns
Experiments with low level communication patterns that are useful for distributed training.
mpi mpi4py nccl horovod tensorflow distributed-training
Language:Python 5
rohwid / auto-nvidia-cuda-driver
Installation script to install Nvidia driver and CUDA automatically in Ubuntu
nvidia-driver cuda cudnn nccl bash-scripting ubuntu1604 ubuntu1804
Language:Shell 5
lcskrishna / nccl-rccl-parser
Tool to run rccl-tests/nccl-tests based on from an application and gather performance.
nccl rccl
Language:Python 3
dereklstinson / nccl
golang wrapper for nccl
cuda parallel-computing go deep-learning nccl
Language:Go 2
lancelee82 / necklace
Distributed deep learning framework based on pytorch/numba/nccl and zeromq.
distributed-deep-learning nccl zerorpc pytorch mxnet numba deep-learning distributed distributed-training
Language:Python 2
MurrellGroup / Conflux.jl
Single-node data parallelism in Julia with CUDA
cuda data-parallelism flux julia nccl
Language:Julia 2
rodhuega / tfgMatrixNccl
Librería de operaciones matemáticas con matrices multi-gpu utilizando Nvidia NCCL.
nccl gpu-computing gpu gpu-programming cuda cpp11 nvidia
Language:Cuda 2
superlinear-ai / scipy-notebook-gpu
jupyter/scipy-notebook with CUDA Toolkit, cuDNN, NCCL, and TensorRT
docker scipy-notebook tensorflow cuda cudnn nccl tensorrt
Language:Dockerfile 2
Keysight / kai-dc-builder
KAI Data Center Builder
ai-network ai-performance ai-performance-optimization nccl rdma-benchmarks rmda rocev2 ultraethernet
Language:Makefile 1
nikombr / hpccuda
Advanced High Performance Computing in C with OpenMP, CUDA, MPI and NCCL. The folder project includes my final project for the special course. I implemented a Jacobi-solver for the Poisson partial differential problem both using OpenMP in the CPU, using CUDA on the GPU and using CUDA, MPI and NCCL on multiple GPUs.
cpu cuda gpu high-performance-computing hpc mpi nccl openmp
Language:C++ 1
ParCoreLab / aCG
GPU-accelerated linear solvers based on the conjugate gradient (CG) method, supporting NVIDIA and AMD GPUs with GPU-aware MPI, NCCL, RCCL or NVSHMEM
conjugate-gradient cuda gpu hip linear-solvers multi-gpu nccl nvshmem rccl exascale-computing hpc
Language:C 1
YconquestY / ncclAllReduce
allreduce mpi multi-gpu multi-node nccl nvlink sharp
Language:Cuda 1
HsinfangChao / nccl-precheck
Infra environment checklist before nccl-test
aiinfra gpu llm nccl numa nvidia
jman4162 / Sizing-AI-Training-by-Cost-per-Memory-Bandwidth
A practical model (with math + Python) to tell if you’re compute-, memory-, or network-bound—and what to buy next
ai cost-optimization hbm llm machine-learning ml transformer ai-infrastructure aws aws-ec2 distributed-systems distributed-training llm-training memory-bandwidth nccl pytorch roofline-model systems-performance
Language:Jupyter Notebook
TyBruceChen / Tutorial-Conda-cuDNN-NCCL-installation-for-Pytorch
This is a tutorial for installing CUDA (v11.8) and cuDNN (8.6.9) to enable programming torch with GPU. It also mentions about implementation of NCCL for distributed GPU DNN model training.
cuda-installation nccl pytorch-installation ubuntu ubuntu-server windows-10
Language:Jupyter Notebook

nccl

cupy / cupy

coreylowman / cudarc

huggingface / llm_training_handbook

LambdaLabsML / distributed-training-guide

huggingface / large_language_model_training_playbook

FZJ-JSC / tutorial-multi-gpu

Bluefog-Lib / bluefog

microsoft / msrflute

google / nccl-fastsocket

openhackathons-org / nways_multi_gpu

muriloboratto / NCCL

JuliaGPU / NCCL.jl

lanl / pyDNMFk

Zerohertz / Instruct_KR_2025_Summer_Meetup_vLLM

1duo / nccl-examples

BaguaSys / bagua-net

Zerohertz / PyCon_KR_2025_Tutorial_vLLM

YinLiu-91 / ncclOperationPlus

YconquestY / nccl

lancelee82 / pynccl

UCBerkeley-Spring2022-CS267-project / blinkplus

asprenger / distributed-training-patterns

rohwid / auto-nvidia-cuda-driver

lcskrishna / nccl-rccl-parser

dereklstinson / nccl

lancelee82 / necklace

MurrellGroup / Conflux.jl

rodhuega / tfgMatrixNccl

superlinear-ai / scipy-notebook-gpu

Keysight / kai-dc-builder

nikombr / hpccuda

ParCoreLab / aCG

YconquestY / ncclAllReduce

HsinfangChao / nccl-precheck

jman4162 / Sizing-AI-Training-by-Cost-per-Memory-Bandwidth

TyBruceChen / Tutorial-Conda-cuDNN-NCCL-installation-for-Pytorch