Supplementary Material for Lectures

Lecture 1: Profiling and Integrating CUDA kernels in PyTorch

Lecture 2: Recap Ch. 1-3 from the PMPP book

Lecture 3: Getting Started With CUDA

Lecture 4: Intro to Compute and Memory Architecture

Lecture 5: Going Further with CUDA for Python Programmers

Lecture 6: Optimizing PyTorch Optimizers

Lecture 7: Advanced Quantization

Lecture 8: CUDA Performance Checklist

Lecture 9: Reductions

Lecture 10: Build a Prod Ready CUDA Library

Lecture 11: Sparsity

Lecture 12: Flash Attention

Lecture 13: Ring Attention

Lecture 14: Practitioner's Guide to Triton

Lecture 15: CUTLASS

Lecture 16: On Hands profiling

Bonus Lecture: CUDA C++ llm.cpp

Lecture 17: GPU Collective Communication (NCCL)

Lecture 18: Fused Kernels

Lecture 19: Data Processing on GPUs

Lecture 20: Scan Algorithm

Lecture 21: Scan Algorithm Part 2

Lecture 22: Hacker's Guide to Speculative Decoding in VLLM

Lecture 23: Tensor Cores

Lecture 24: Scan at the Speed of Light

Lecture 25: Speaking Composable Kernel

Lecture 26: SYCL MODE (Intel GPU)

Lecture 27: gpu.cpp

Lecture 28: Liger Kernel

Lecture 29: Triton Internals

Lecture 30: Quantized training

About

Material for gpu-mode lectures

Apache License 2.0

Languages

Language:Jupyter Notebook 86.0%Language:Python 9.4%Language:Cuda 2.1%Language:C 2.1%Language:CMake 0.1%Language:C++ 0.1%Language:Makefile 0.0%