A minimal decoder-only transformer implemented in under 50 lines of PyTorch.
Implementing the Transformer architecture can be challenging for beginners because of its non-trivial information flow (attention, causal masking, etc.). To address this, we offer a stripped-down, "simple as possible" implementation of a decoder-only transformer for pedagogical purposes.
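To make the "non-trivial information flow" concrete, here is a minimal sketch of the core piece, causal self-attention, in PyTorch. This is an illustrative example, not the repository's code: the class and parameter names (`CausalSelfAttention`, `d_model`, `n_heads`, `max_len`) are our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a causal (lower-triangular) mask.

    Illustrative sketch; names and layout are assumptions, not the repo's code.
    """

    def __init__(self, d_model: int, n_heads: int, max_len: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        # One fused projection producing queries, keys, and values.
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        # Lower-triangular mask: position t may attend only to positions <= t.
        mask = torch.tril(torch.ones(max_len, max_len, dtype=torch.bool))
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split channels into heads: (B, T, C) -> (B, n_heads, T, head_dim).
        q = q.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        k = k.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        v = v.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
        # Scaled dot-product attention scores.
        att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)
        # Block attention to future positions before the softmax.
        att = att.masked_fill(~self.mask[:T, :T], float("-inf"))
        att = F.softmax(att, dim=-1)
        # Weighted sum of values, then merge heads back to (B, T, C).
        y = (att @ v).transpose(1, 2).reshape(B, T, C)
        return self.proj(y)
```

For example, `CausalSelfAttention(d_model=32, n_heads=4, max_len=16)` maps an input of shape `(batch, seq_len, 32)` to an output of the same shape, with each position's output depending only on itself and earlier positions.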