LLaMA from First Principles
I wanted to learn more about how transformers work, so I spent a night hacking together an implementation of LLaMA from scratch.
No ML frameworks. No BLAS. Minimal abstractions. Clarity over performance.
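Since there's no BLAS, every matrix product is a hand-written loop. A minimal sketch of what that looks like over row-major float buffers; the names here are illustrative, not taken from the actual code:

```c
#include <stddef.h>

/* out[m][n] = a[m][k] * b[k][n], all row-major.
   The classic naive triple loop: no blocking, no SIMD, just clarity. */
void matmul(float *out, const float *a, const float *b,
            size_t m, size_t k, size_t n) {
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++) {
                acc += a[i * k + p] * b[p * n + j];
            }
            out[i * n + j] = acc;
        }
    }
}
```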
It isn't done yet, but it's probably more than halfway there.
Remaining pieces
- Rotary Position Embedding (see the RoPE sketch after this list)
- Finish implementing the remaining layer operations
- Fix all the bugs in the matrix-multiplication implementations (a naive baseline is sketched above)
- Token decoding (see the decoding sketch after this list)
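For reference, RoPE rotates each (even, odd) pair of the query and key vectors by an angle that depends on the token position, using the base frequency of 10000 from the RoPE paper. A minimal sketch, assuming `x` is one head's vector and the names are illustrative:

```c
#include <math.h>
#include <stddef.h>

/* Apply RoPE in place: rotate each (x[i], x[i+1]) pair by
   theta = pos * 10000^(-i/dim), a standard 2-D rotation per pair. */
void rope_rotate(float *x, size_t dim, size_t pos) {
    for (size_t i = 0; i < dim; i += 2) {
        float freq  = powf(10000.0f, -(float)i / (float)dim);
        float theta = (float)pos * freq;
        float c = cosf(theta), s = sinf(theta);
        float x0 = x[i], x1 = x[i + 1];
        x[i]     = x0 * c - x1 * s;
        x[i + 1] = x0 * s + x1 * c;
    }
}
```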
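Token decoding is mostly a vocabulary lookup: each sampled token id maps to a string piece, and the pieces get concatenated. A sketch under the assumption of a SentencePiece-style vocab table; the real LLaMA tokenizer also handles byte-fallback tokens and a leading-space marker, which this ignores:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical vocab table: vocab[id] is the UTF-8 piece for token `id`.
   Decoding a sequence is just printing the pieces back-to-back. */
void decode_tokens(const char **vocab, const int *tokens, size_t n) {
    for (size_t i = 0; i < n; i++) {
        fputs(vocab[tokens[i]], stdout);
    }
}
```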