jundaf's starred repositories
ThunderKittens
Tile primitives for speedy kernels
MatmulTutorial
An Easy-to-understand TensorOp Matmul Tutorial
remote-dataloader
PyTorch DataLoader processed on multiple remote compute machines for heavy data processing
git-filter-repo
Quickly rewrite git repository history (filter-branch replacement)
nccl-rdma-sharp-plugins
RDMA and SHARP plugins for the NCCL library
python-bpe
Byte Pair Encoding for Python!
dataloader-benchmarks
DL Dataloader Benchmarks
tokenizers
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
CPU-Free-model
Source code for the CPU-Free model - a fully autonomous execution model for multi-GPU applications that completely excludes the involvement of the CPU beyond the initial kernel launch.
multi-gpu-programming-models
Examples demonstrating available options to program multiple GPUs in a single node or a cluster
tutorial-multi-gpu
Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial
cuda_scheduling_examiner_mirror
A tool for examining GPU scheduling behavior.
awesome-courses
📚 List of awesome university courses for learning Computer Science!
open-gpu-kernel-modules
NVIDIA Linux open GPU kernel module source
gpumembench
A GPU benchmark suite for assessing on-chip GPU memory bandwidth
detectron2
Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.