kf-zhang's starred repositories
modern-cpp-tutorial
📚 Modern C++ Tutorial: C++11/14/17/20 On the Fly | https://changkun.de/modern-cpp/
ml-visuals
🎨 ML Visuals contains figures and templates which you can reuse and customize to improve your scientific writing.
AITemplate
AITemplate is a Python framework which renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.
tiny-cuda-nn
Lightning-fast C++/CUDA neural network framework
tvm_mlir_learn
A collection of compiler learning resources.
TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
core-to-core-latency
Measures the latency between CPU cores
Triton-Puzzles
Puzzles for learning Triton
ringattention
Transformers with Arbitrarily Large Context
ring-flash-attention
Ring attention implementation with flash attention
multi-gpu-programming-models
Examples demonstrating available options to program multiple GPUs in a single node or a cluster
hack-SysML
The road to hack SysML and become a system expert
how-to-learn-deep-learning-framework
how to learn PyTorch and OneFlow
tiny-flash-attention
Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS
ring-attention
ring-attention experiments
gpu-arch-microbenchmark
Dissecting NVIDIA GPU Architecture
compile-time-printer
Prints values and types during compilation!
ring-attention-pytorch
Tiny ring attention implementation for learning purposes