So LLMs are practically everywhere right now, and most of us know how they work, but we don't understand how they work. Yes, there is a difference... That's why I'm making this repo: to compile and implement everything and every technique associated with LLMs in one place. Basically a clusterfuck of everything related to (L)LM training and serving. We'll use a LLama-like model for the whole thing, like a minimal af LLama.
Model:
- Add arXiv Dataset
- Add RoPE Embedding Module (see the sketch after this list)
- Add PicoLLama Model
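For reference, here's roughly what the RoPE module could look like in PyTorch. This is a minimal sketch: the `RotaryEmbedding` name, the interleaved-pair layout, and the `(batch, seq, heads, head_dim)` input shape are my assumptions, not a fixed API for this repo.

```python
import torch

class RotaryEmbedding(torch.nn.Module):
    """Rotary position embeddings: rotate each (even, odd) feature pair
    by an angle that depends on the token position and pair index."""

    def __init__(self, head_dim: int, base: float = 10000.0):
        super().__init__()
        # One inverse frequency per feature pair.
        inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
        self.register_buffer("inv_freq", inv_freq)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, n_heads, head_dim)
        seq_len = x.shape[1]
        t = torch.arange(seq_len, device=x.device, dtype=self.inv_freq.dtype)
        freqs = torch.outer(t, self.inv_freq)   # (seq_len, head_dim // 2)
        cos = freqs.cos()[None, :, None, :]     # broadcast over batch and heads
        sin = freqs.sin()[None, :, None, :]
        x1, x2 = x[..., 0::2], x[..., 1::2]
        # 2D rotation of each (x1, x2) pair, then re-interleave.
        rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
        return rotated.flatten(-2).type_as(x)
```

You'd apply this to the query and key tensors right before the attention score computation; values are left untouched.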
Training:
- Add Training Script
- Add QLoRA
- Add Distributed Training
- Add Mixed-Precision Training (see the sketch after this list)
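Mixed-precision training in PyTorch mostly boils down to an autocast forward pass plus a gradient scaler, so fp16 activations don't make gradients underflow. A minimal sketch follows; the `train_step_amp` helper and its arguments are hypothetical, just to show the moving parts:

```python
import torch
import torch.nn.functional as F

def train_step_amp(model, optimizer, scaler, tokens, targets):
    """One mixed-precision training step for a causal LM."""
    optimizer.zero_grad(set_to_none=True)
    # Forward pass runs in float16 where it's safe; loss math stays float32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        logits = model(tokens)  # (batch, seq_len, vocab)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    scaler.scale(loss).backward()  # scale loss so fp16 grads don't underflow
    scaler.step(optimizer)         # unscales grads; skips the step on inf/nan
    scaler.update()                # adjusts the scale factor for the next step
    return loss.item()

# Usage: create the scaler once, reuse it every step.
# scaler = torch.cuda.amp.GradScaler()
# loss = train_step_amp(model, optimizer, scaler, tokens, targets)
```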
Serving:
- Add KV Caching (see the decoding sketch after this list)
- Add Speculative Decoding
- Add Quantization
- Add Flash Attention v2
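KV caching is what keeps autoregressive decoding from being quadratic in practice: run the prompt once, stash every layer's keys and values, then feed only the newest token on each step. A minimal greedy-decoding sketch, assuming the model's forward takes and returns a `past_kv` list of per-layer `(key, value)` tensors; that interface is an assumption on my part, not the actual PicoLLama API:

```python
import torch

@torch.no_grad()
def generate(model, prompt_ids: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    # Prefill: run the full prompt once and keep the per-layer K/V cache.
    logits, past_kv = model(prompt_ids, past_kv=None)
    out = prompt_ids
    for _ in range(max_new_tokens):
        # Greedy pick from the last position's logits.
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        out = torch.cat([out, next_id], dim=1)
        # Decode step: only the new token goes in; the cache covers the rest.
        logits, past_kv = model(next_id, past_kv=past_kv)
    return out
```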