CUDA Progress
| Day | Code Summary |
|---|---|
| Day 1 | CUDA set up and kernel that prints "Hello World" |
| Day 2 | CUDA kernel that adds two vectors |
| Day 3 | Adding matrices |
| Day 4 | Vector addition using cuBLAS |
| Day 5 | Naive matmul |
| Day 6 | Tiled matmul using shared memory (sketch below the table) |
| Day 7 | Naive 1D convolution with boundary checks |
| Day 8 | Matrix multiplication using cuBLAS |
| Day 9 | Matrix Transpose |
| Day 10 🥳 | Naive Softmax |
| Day 11 | Softmax using shared memory and reductions |
| Day 12 | Softmax using warp shuffle functions (reduction sketch below the table) |
| Day 13 | 1D complex-to-complex Fourier transform using cuFFT |
| Day 14 | Naive layer normalization |
| Day 15 | Optimizing layer norm using shared memory |
| Day 16 | Optimizing layer norm using warp shuffle functions |
| Day 17 | Optimizing layer norm using vectorized loads |
| Day 18 | Tiled 1D convolution and halo cells |
| Day 19 | 1D convolution using L2 cache |
| Day 20 🥳 | Blog Post: Optimizing Layer Normalization with CUDA |
| Day 21 | Simple self attention |
| Day 22 | Optimizing self attention |
| Day 23 | Causal attention with masking |
| Day 24 | Causal attention + torch binding |
| Day 25 | Multi-head attention |
| Day 26 | Parallel scan (prefix sum) using the Kogge-Stone algorithm |
| Day 27 | MHA debug |
| Day 28 | Flash Attention 1 (algorithm 1) Forward pass |
| Day 29 | Flash Attention 1 (algorithm 1) Forward pass continued |
| Day 30 🥳 | Flash Attention 1 (algorithm 1) Forward pass |
| Day 31 | HGEMV matvec using FP16 |
| Day 32 | HGEMV matvec using bfloat16 |
| Day 33 | Matmul using Tensor cores |
| Day 34 | Swizzle patterns on matrix transpose |
| Day 35 | Swizzled matrix transpose using the Tensor Memory Accelerator (TMA) |
| Day 36 | Brent-Kung parallel scan algorithm |
| Day 37 | Matvec using fixed-point integer arithmetic |
| Day 38 | Transferred a 1D array gmem -> smem -> gmem using TMA |
| Day 39 | Memory-coalesced layer norm + revisited Flash Attention |
| Day 40 🥳 | Revisited Flash Attention 1 |
| Day 41 | Flash Attention 1 |
| Day 42 | Flash Attention 1 |
| Day 43 | ReLU activation: FP32, FP32x4, FP16, and FP16x2 vectorized variants |
| Day 44 | Overlapping data transfers using CUDA Streams (Vector add) |
| Day 45 | ReLU using CUDA Streams + benchmarked |
| Day 46 | Packed 128-bit ReLU FP16x8 kernel |
| Day 47 | Sparse matrix-vector mul (SpMV) |
| Day 48 | Padded sparse matrix-vector mul |
| Day 49 | RoPE kernel: naive FP32 rotary position embedding (sketch below the table) |
| Day 50 🥳 | Optimized RoPE using vectorized loads and half precision (18x speedup) |
| Day 51 | Flash Attention 2 Forward |
| Day 52 | Flash Attention 2 Forward |
| Day 53 | Flash Attention 2 Forward |
| Day 54 | Gaussian Elimination |
| Day 55 | PTX vector add kernel |
| Day 56 | GELU activation: naive FP32 kernel |
| Day 57 | GELU activation vectorized |
| Day 58 | Backward pass kernel for ReLU activation |
| Day 59 | Backward pass kernel for GELU activation |
| Day 60 🥳 | LeetGPU challenge: reduction |
| Day 61 | Optimized + benchmarked GELU kernels |
| Day 62 | Micrograd in CUDA |
| Day 63 | Micrograd in CUDA |
| Day 64 | Micrograd in CUDA |
| Day 65 | Micrograd in CUDA |
| Day 66 | Optimized Sigmoid activation |
| Day 67 - Day 70 🥳 | Micrograd in CUDA |
| Day 71 | Sigmoid with half precision |
| Day 72 | Sigmoid with vectorized FP16 |
| Day 73 | Swish kernel |
| Day 74 | Vectorized Swish kernel |
| Day 75 | AMD HIP kernel intro + vector add kernel |
| Day 76 | Revisiting GEMM optimizations |
| Day 77 | Memory-coalesced GEMM |
| Day 78 | FP16 Swish |
| Day 79 | AMD competition: FP8 GEMM & Swish optimizations |
| Day 80 🥳 | AMD competition: FP8 GEMM optimizations |
| Day 81 | AMD competition: FP8 GEMM optimizations |
| Day 82 - 83 | Micrograd in CUDA |
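
For flavor, a few of the recurring kernels are sketched below. First, the shared-memory tiling behind Day 6. This is a minimal sketch, assuming square N x N matrices with N divisible by `TILE`; the kernel name and tile size are illustrative, not the exact code in the repo.

```cuda
#define TILE 16

// Minimal tiled matmul sketch: each block computes a TILE x TILE patch of C,
// staging matching tiles of A and B through shared memory.
__global__ void tiled_matmul(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // Slide the tile window along the shared K dimension.
    for (int t = 0; t < N / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```

Launched with `dim3 block(TILE, TILE)` and `dim3 grid(N / TILE, N / TILE)`, each element of A and B is read from global memory N / TILE times instead of N times for the naive kernel.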
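
The warp-shuffle softmax and layer norm (Days 12 and 16) lean on the same reduction primitive. Below is a minimal sketch, assuming `blockDim.x` is a multiple of 32 and at most 1024; the helper names are illustrative.

```cuda
// Sum across the 32 lanes of a warp using register shuffles; lane 0 ends up
// with the warp-wide total, no shared memory needed.
__inline__ __device__ float warp_reduce_sum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

// Combine per-warp partials through one shared-memory slot per warp,
// then let the first warp reduce those partials. Thread 0 holds the result.
__inline__ __device__ float block_reduce_sum(float val) {
    __shared__ float partial[32];              // max 32 warps per block
    int lane = threadIdx.x % warpSize;
    int warp = threadIdx.x / warpSize;

    val = warp_reduce_sum(val);
    if (lane == 0) partial[warp] = val;
    __syncthreads();

    int num_warps = blockDim.x / warpSize;
    val = (threadIdx.x < num_warps) ? partial[lane] : 0.0f;
    if (warp == 0) val = warp_reduce_sum(val);
    return val;
}
```

In a typical layout, a max variant (swapping `+=` for `fmaxf`) supplies the row max for a numerically stable softmax, while the sum version handles the softmax normalizer and the layer norm mean and variance.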
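
Finally, a sketch in the spirit of the Day 49 naive FP32 RoPE kernel. The adjacent-pair rotation convention, launch configuration, and parameter names are assumptions for illustration.

```cuda
// Rotate each adjacent (even, odd) pair of a token's vector by an angle that
// depends on the token position and the pair index (base 10000 is the usual
// RoPE default). One block per token, one thread per pair.
__global__ void rope_fp32(float *x, int seq_len, int head_dim, float theta_base) {
    int token = blockIdx.x;
    int pair  = threadIdx.x;
    if (token >= seq_len || pair >= head_dim / 2) return;

    float freq  = powf(theta_base, -2.0f * pair / head_dim);
    float angle = token * freq;
    float c = cosf(angle), s = sinf(angle);

    float *v = x + token * head_dim;
    float x0 = v[2 * pair];
    float x1 = v[2 * pair + 1];
    v[2 * pair]     = x0 * c - x1 * s;   // in-place 2D rotation
    v[2 * pair + 1] = x0 * s + x1 * c;
}

// Example launch:
// rope_fp32<<<seq_len, head_dim / 2>>>(d_x, seq_len, head_dim, 10000.0f);
```

The Day 50 variant then vectorizes these paired loads and stores and moves to half precision, which is where the reported 18x speedup comes from.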