massively-parallel

A growing list of resources, exercises, and experimentation related to getting started with parallel processing, with the ultimate goal of writing fast implementations of large ML/AI models. These are also my notes on where I'm at with each of them, and my public journal of my learning process.

PMPP (Programming Massively Parallel Processors):

GPU Puzzles:

  • The original GPU-Puzzles (Numba - Python syntax for CUDA): My fork of srush/GPU-Puzzles: 14 GPU puzzles in a Colab notebook. I altered the first couple of tests because a beginner (like myself) can easily pass several by accident before realizing they're wrong. When I revisit this I'll extend those tests to more of the puzzles. As of writing, I'm still working on Puzzle 13, because I'm too stubborn and only want to use the walkthrough to check my answers. (For a taste of the Numba style, see the sketch after this list.)
  • CUDA C++ GPU Puzzles: I haven't taken a look at this yet. Maybe as I work through PMPP.
  • Triton Puzzles (another by Sasha Rush, who made the original puzzles): Srush worked with some interpreter folks on a visualizer for Triton debugging. Specifically, it tries to make it easier to view the spatial structure of loads/stores when implementing complex functions. As if there weren't enough options to learn, here's one if you want to feel like you work at OpenAI.
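To give a taste of what "Python syntax for CUDA" means, here is a minimal Numba kernel in the spirit of the early puzzles. It's my own toy example (the name add_ten and the one-block launch are mine), not a puzzle solution:

```python
from numba import cuda
import numpy as np

@cuda.jit
def add_ten(out, a):
    i = cuda.threadIdx.x      # this thread's index within the (single) block
    if i < a.size:            # guard: don't write out of bounds
        out[i] = a[i] + 10

a = np.arange(8, dtype=np.float32)
d_a = cuda.to_device(a)                   # copy input to the GPU
d_out = cuda.device_array_like(a)         # allocate output on the GPU
add_ten[1, 8](d_out, d_a)                 # launch 1 block of 8 threads
print(d_out.copy_to_host())               # [10. 11. 12. ... 17.]
```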

Selected CUDA MODE Lectures:

  • Getting Started With CUDA for Python Programmers: Jeremy Howard teaching an introductory lesson as part of the CUDA MODE lecture series. This lecture aims to set you up well to work through the PMPP book, and explains an awesome hack to write CUDA code that runs on Colab GPUs. No longer do we need our own NVIDIA GPU! One can learn on a Chromebook, or, as I plan to do, use it to learn CUDA and PP concepts in general, to then apply to Metal, Triton, & Mojo.
  • Starter code only: I stripped down the notebook from Lecture 3 to the bare bones of what one needs to write a kernel on Colab (a rough sketch of that approach follows this list). The compile time is rough, so I attempted, and still would like, to add the optimizations discussed here and put them into a minimal repo here.
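As I understand it, the hack boils down to compiling an inline CUDA kernel through PyTorch with torch.utils.cpp_extension.load_inline. Here is a rough sketch of that approach; the square kernel and all names below are my own toy example, not the lecture's code:

```python
import torch
from torch.utils.cpp_extension import load_inline

# A trivial elementwise-square kernel, plus a C++ wrapper PyTorch can call.
cuda_src = r"""
#include <torch/extension.h>

__global__ void square_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

torch::Tensor square(torch::Tensor x) {
    auto out = torch::empty_like(x);
    int n = x.numel();
    int threads = 256;
    square_kernel<<<(n + threads - 1) / threads, threads>>>(
        x.data_ptr<float>(), out.data_ptr<float>(), n);
    return out;
}
"""

# Declare the wrapper in cpp_sources, define it in cuda_sources; load_inline
# compiles the extension and exposes the listed functions as a Python module.
module = load_inline(
    name="square_ext",
    cpp_sources=["torch::Tensor square(torch::Tensor x);"],
    cuda_sources=[cuda_src],
    functions=["square"],
    verbose=True,
)

x = torch.arange(4, dtype=torch.float32, device="cuda")
print(module.square(x))   # tensor([0., 1., 4., 9.], device='cuda:0')
```

The load_inline compile step is the part with the rough compile time mentioned above.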

Selected Resources:

More:

  • Awesome GPU Resources: Resources, mostly papers, on [GPU] architecture, algorithms, applications, tools, runtime, & code generation.

Journal:

  • 4/29/24: Spent another session just looking at the optimizers: trying to understand the schedule-free implementations, reading the optimizer-step sections of llm.c (there's a C reference and some CUDA kernels), thinking about what a matching implementation would look like, keeping an eye on how hard they're going in the CUDA MODE Discord, reading the PyTorch optim AdamW and learning-rate scheduler implementations, and trying to put it all together. This is gonna suck if it doesn't all click soon and I get some testing environment up. I suppose I should resist fearing that I'm burning time not producing anything, and accept that a task like this will take time for me at this stage.
  • 4/27/24: Woke up this morning deciding I need to understand the new schedule-free optimizers, so I spent roughly 3 hours refreshing my understanding of SGD, Momentum, RMSProp, and Adam [and AdamW]. Listing them out like that probably makes it seem more complicated than it is - they're all slight modifications of the weight update rule. Schedule-free is a bit different, and before learning about it I first need to better understand AdamW, which is Adam with decoupled weight decay: in the weight update (w <- w - LR*dw), the w is additionally multiplied by (1 - LR*weight_decay), where weight_decay is an explicitly defined value. So Adam < AdamW < schedule-free AdamW (good looking evidence), and with schedule-free the learning-rate schedule no longer needs to be explicitly defined. (A small sketch of the AdamW step is at the end of this journal.)

I have a lofty goal of getting a working fork of llm.c to this (then a loftier goal of writing a CUDA kernel for it), but first I'm working on understanding the torch implementation of AdamW (or SGD), so I can then better understand the schedule-free version, so I can actually write the goddamn code lol.

Later, going to help record today's regularly scheduled CUDA MODE lecture,

& lastly, gonna celebrate my 2-year anniversary with my wife❤️‍🔥

  • 4/26/24: I helped out recording a bonus lecture on CCCL (the CUDA C++ Core Libraries), which I believe are basically libraries that make writing production CUDA C++ doable. Edited & shared a link for Mark to upload.
  • 4/25/24: I had reached back out to Jesse after a couple weeks asking if he'd still like me to do some benchmarking, and he got back to me today and we hopped on a call. Fortunately, although I was concerned about having to pay to benchmark on an A100 (a server-grade GPU of the Ampere generation, Ampere being what's needed to benefit from Sparsity), my brother has a nice PC that he's left at my place for months (video game addiction), and it has a 3060, which is also Ampere architecture. So I spun up his PC: installed ... everything. This issue is tracked here.
  • 4/23/24: Finished reading & exercises for Chapter 6, Part I complete 😎
  • 4/22/24: Started Chapter 6, the last chapter of Part I, which is on fundamentals. As of writing, I just read 6.1 on memory coalescing. Thoughts: Thinking about warps, I suppose the max amount of memory that can be part of a coalesced access is the max threads per warp (generally 32) * the max number of warps in an SM that can run simultaneously? Can threads that are part of different warps make use of access bursts?
  • 4/20/24: Finished Chapter 5 exercises. Helped record & edit Lecture 15 on CUTLASS!
  • 4/11/24: Finished Chapter 4 exercises. Still need a way to check them.
  • 4/8/24: Read Chapter 4 after a 12 hour delivery shift. Goal is to finish the exercises by the end of the week.
  • 4/5/24: Finally finished Ch3 exercise 1. a & b 🥳. I was hesitant to draw a picture, thinking I shouldn't need to, but honestly not only did it make it so much easier, but I think my understanding is better after having drawn it. I have an iPad, but straight up nothing beats a pencil and sketch notebook. I truly wonder if I'll ever get as comfortable on a tablet. It's noon now, going on a last training run with my brother (we're running in a 10K in 2 days), and going go-karting with the homies around 6:30pm. That gives me between maybe 1:30pm-6pm to finish 2 & 5, no problem😼. Done. Go karting was extremely fun.
  • 4/4/24: I stripped down the Lecture 3 starter code (here); I'm now going to use it as a starting point for all the PMPP exercises.
  • 3/31/24: Not wanting me to pay for compute (thanks), Jesse instead gave me some tasks to run some sparse kernel benchmarks. I think I'm going to finish Ch 3 & 4 before picking this thread back up.
  • 3/30/24: Colab gave me an A100 today, and I ran through the tutorial, but I tried to save and load the model to avoid ever needing to retrain. When I didn't get matching results in the end, I realized that somehow saving and loading the model changed it. At this point I was out of compute units.
  • 3/28/24: As Jesse requested, I ran through this tutorial using Google Colab, but even just a few epochs used all the compute units available to me. When I realized I was burning through the units, I wanted to track the loss before doing a full training run. To do this, I subclassed transformers.Trainer and overrode transformers.Trainer.training_step() to record the loss at each step to Weights & Biases (a sketch of this override is at the end of this journal). Colab only gave me a V100 that day; Jesse let me know that it may not be able to benefit from sparsity and that he'd look into it.
  • 3/27 - 4/2: BS interview, recovered from cold & frustration, continued training for 10k, bowled with the homies, funeral & lecture 12 on flash attention, work all day Sunday-Tuesday.
  • 3/26/24: Wife and I were sick, she went to work, I slept through my alarm (which I never do), and my supervisor suggested I take the day off, which I gladly took, but my wife was not glad. Anyways I finished reading Chapter 3 and did problems 3 & 4, but my sick-ass brain was not working, and I had an interview the next morning, so I decided to rest with my wife the rest of the day. Talked to Jesse a bit, got access to ImageNet for running some ViT training experiments.
  • 3/23/24: Out of frustration I brazenly set a goal that I wasn't able to reach: basically, that I was going to study/practice every night after a 10-hour shift. I had a good session the first night at the cost of my wife being pissed at me. The struggle of finding work/life balance continues; I feel like I lose either way. However, I think it's a good signal that throughout my delivery shifts, I have this deep yearning, wishing I was instead spending my time progressing my engineering skills - it reminds me of how I used to feel when I would be in school and couldn't wait to get out to go skateboard, being in love with the feeling of progressing.
  • 3/22/24: In the morning finished chapter 2 and working on the exercises now. Will finish chapter 3 and probably skim/skip Lecture 2, and today I at least want to upload a CUDA C++ vector addition program whether in the Colab notebook format like in JH's lecture, or if I'm really ambitious, get CUDA set up on my PC and run it locally. I think I set it up but never actually wrote/ran anything so idk. Update: I did neither that day, but I did attend Lecture 11 on Sparsity by Jesse Cai, and expressed my interest to him in helping contribute to torchao.
  • 3/21/24: Started Lecture 2 then realized it's a recap of chapters 1-3. Read most of chapter 2 before bed.
  • 3/20/24: Hoping to watch Lecture 2, and see how far along I can Read & do the exercises of chapter 2.
  • 3/19/24: Read parts of the preface and through most of Chapter 1 of PMPP. Thanks to my annotation system I got going on, I see I read 1.1, and skimmed the rest. I'm eager to get to more of the meat, and have indicated where I could go back to read more thoroughly. Pointing out from "Preface/A two-phased approach":
    • Most chapters are designed to be covered in one 75-min lecture; chapters 11, 14, & 15 might need 2.
    • (7 weeks) Phase 1: Parts I & II, 12 chapters - fundamentals, basic patterns, guided programming assignments
    • ... The reason I'm pointing this out is because it looks to me like learning PP is going to be a marathon rather than a sprint, so I'm going to remind myself to stay patient, consistent, & to try and keep a good pace throughout my practice.
  • 3/1/24: Watched CUDA MODE Lecture 1
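For reference, the AdamW update from the 4/27 entry, as I currently understand it. This is my own simplified sketch for intuition, not the PyTorch or schedule-free implementation:

```python
import torch

def adamw_step(w, dw, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update on parameter tensor w (in place); m, v are running moments."""
    w.mul_(1 - lr * weight_decay)                     # decoupled weight decay: shrink w first
    m.mul_(beta1).add_(dw, alpha=1 - beta1)           # first moment (momentum)
    v.mul_(beta2).addcmul_(dw, dw, value=1 - beta2)   # second moment (RMSProp-style)
    m_hat = m / (1 - beta1 ** t)                      # bias correction
    v_hat = v / (1 - beta2 ** t)
    w.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-lr)  # w <- w - LR * m_hat / (sqrt(v_hat) + eps)
    return w

# Usage: step count t starts at 1; m and v start at zero.
w, dw = torch.randn(4), torch.randn(4)
m, v = torch.zeros_like(w), torch.zeros_like(w)
adamw_step(w, dw, m, v, t=1)
```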
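And the Trainer override from the 3/28 entry, reconstructed roughly from memory (the exact training_step signature depends on the transformers version):

```python
import wandb
from transformers import Trainer

class LossLoggingTrainer(Trainer):
    def training_step(self, model, inputs):
        # Run the normal training step, then log the per-step loss to W&B.
        loss = super().training_step(model, inputs)
        wandb.log({"train/step_loss": loss.item()})
        return loss
```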
