- efficient tiling-based large matrix multiplication
- warp-reduce-based matrix-vector multiplication
- warp-reduce-based vector dot product
- warp reduce
- FlashAttention module 1: fused QKV attention (tiling-based)
- softmax(Q^T K / scale) fuses easily within a tile
- the extra multiply by V is the painful part: the softmax denominator is only known after the whole row of scores, so partial outputs have to be rescaled on the fly
- coalesced memory access benchmarking
- thread pool (condition variable and simple multi-threading)
- double buffer (std::timed_mutex and simple multi-threading) with simple benchmarking
- cache update algorithms:
- LRU (least recently used)
- LFU (least frequently used)
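The tiling-based matrix multiplication stages tiles of A and B so they stay resident in fast memory while a block of C is accumulated. The same blocking idea, sketched on the CPU (function and tile names are mine; a CUDA kernel would stage the tiles into shared memory instead):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Multiply two N x N row-major matrices in TILE x TILE blocks so each
// block of A and B stays hot in cache -- the CPU analogue of staging
// tiles into CUDA shared memory.
constexpr std::size_t TILE = 32;

std::vector<float> tiled_matmul(const std::vector<float>& A,
                                const std::vector<float>& B,
                                std::size_t N) {
    std::vector<float> C(N * N, 0.0f);
    for (std::size_t ii = 0; ii < N; ii += TILE)
        for (std::size_t kk = 0; kk < N; kk += TILE)
            for (std::size_t jj = 0; jj < N; jj += TILE)
                // accumulate one TILE x TILE block of C
                for (std::size_t i = ii; i < std::min(ii + TILE, N); ++i)
                    for (std::size_t k = kk; k < std::min(kk + TILE, N); ++k) {
                        float a = A[i * N + k];
                        for (std::size_t j = jj; j < std::min(jj + TILE, N); ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
    return C;
}
```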
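In CUDA the warp reduce is typically a loop of `__shfl_down_sync` calls with the offset halved each step. The same tree pattern, simulated on the CPU for one 32-lane warp (a sketch with my own names, not the kernel itself):

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Tree-sum across one warp of 32 lanes. Each step, lanes [0, offset)
// pull in the value offset lanes above them -- the pattern that
// __shfl_down_sync implements in hardware. Lane 0 ends with the total.
float warp_reduce_sum(std::array<float, 32> lane) {
    for (std::size_t offset = 16; offset > 0; offset /= 2)
        for (std::size_t i = 0; i < offset; ++i)
            lane[i] += lane[i + offset];
    return lane[0];
}
```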
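The rescaling trick behind the fused V multiply (the online-softmax idea FlashAttention builds on) can be sketched for a single query row in plain C++ (illustrative only; names are mine). The running max and denominator change as keys stream in, so the un-normalized V accumulator must be corrected at each step:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// One query row: given precomputed scores s[j] = q . k_j / scale, compute
// sum_j softmax(s)_j * v_j in a single pass. Keep a running max m,
// denominator d, and un-normalized output acc; whenever m grows, the old
// contributions to d and acc are rescaled by exp(m_old - m_new).
std::vector<float> online_attention_row(const std::vector<float>& scores,
                                        const std::vector<std::vector<float>>& V) {
    std::size_t dim = V[0].size();
    float m = -std::numeric_limits<float>::infinity();
    float d = 0.0f;
    std::vector<float> acc(dim, 0.0f);
    for (std::size_t j = 0; j < scores.size(); ++j) {
        float m_new = std::max(m, scores[j]);
        float correction = std::exp(m - m_new);  // rescale old contributions
        float p = std::exp(scores[j] - m_new);
        d = d * correction + p;
        for (std::size_t t = 0; t < dim; ++t)
            acc[t] = acc[t] * correction + p * V[j][t];
        m = m_new;
    }
    for (float& x : acc) x /= d;  // final normalization
    return acc;
}
```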
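On the GPU, coalescing means consecutive threads in a warp touch consecutive addresses. The closest CPU analogue for a quick benchmark is unit-stride versus large-stride traversal of the same array (a sketch; absolute timings vary by machine, and both loops compute the same sum):

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <utility>
#include <vector>

// Sum an R x C row-major array twice: row-wise (unit stride, the
// "coalesced" analogue) and column-wise (stride C, the "uncoalesced"
// analogue). Prints the elapsed time of each and returns both sums.
std::pair<double, double> sum_row_vs_col(std::size_t R, std::size_t C) {
    std::vector<float> a(R * C, 1.0f);

    auto t0 = std::chrono::steady_clock::now();
    double row = 0.0;
    for (std::size_t i = 0; i < R; ++i)
        for (std::size_t j = 0; j < C; ++j) row += a[i * C + j];
    auto t1 = std::chrono::steady_clock::now();

    double col = 0.0;
    for (std::size_t j = 0; j < C; ++j)
        for (std::size_t i = 0; i < R; ++i) col += a[i * C + j];
    auto t2 = std::chrono::steady_clock::now();

    using us = std::chrono::microseconds;
    std::printf("row-wise: %lld us, column-wise: %lld us\n",
                (long long)std::chrono::duration_cast<us>(t1 - t0).count(),
                (long long)std::chrono::duration_cast<us>(t2 - t1).count());
    return {row, col};
}
```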
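A minimal sketch of the condition-variable thread pool pattern (assuming the usual mutex + task-queue shape; class and member names are mine, not the repo's):

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Fixed-size thread pool: workers sleep on a condition variable and wake
// when a task is enqueued or the pool shuts down. The destructor drains
// the remaining tasks before joining the workers.
class ThreadPool {
public:
    explicit ThreadPool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lk(m_);
                        cv_.wait(lk, [this] { return stop_ || !tasks_.empty(); });
                        if (stop_ && tasks_.empty()) return;
                        task = std::move(tasks_.front());
                        tasks_.pop();
                    }
                    task();  // run outside the lock
                }
            });
    }
    void submit(std::function<void()> f) {
        {
            std::lock_guard<std::mutex> lk(m_);
            tasks_.push(std::move(f));
        }
        cv_.notify_one();
    }
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lk(m_);
            stop_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }

private:
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex m_;
    std::condition_variable cv_;
    bool stop_ = false;
};
```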
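For the double-buffer item, one way std::timed_mutex fits in (a sketch under my own names, not the repo's implementation): the writer fills the back buffer and swaps it in under the lock, while a reader uses try_lock_for so it can skip a frame within a deadline instead of blocking indefinitely:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <mutex>
#include <vector>

// Double buffer: the writer fills back() while readers snapshot front;
// publish() swaps the two under a std::timed_mutex, and try_read() gives
// up after its deadline so a slow frame never stalls the reader.
class DoubleBuffer {
public:
    explicit DoubleBuffer(std::size_t n) : front_(n), back_(n) {}

    std::vector<int>& back() { return back_; }  // writer-owned staging area

    void publish() {                            // writer: make back visible
        std::lock_guard<std::timed_mutex> lk(m_);
        front_.swap(back_);
    }

    bool try_read(std::vector<int>& out, std::chrono::milliseconds deadline) {
        std::unique_lock<std::timed_mutex> lk(m_, std::defer_lock);
        if (!lk.try_lock_for(deadline)) return false;  // skip, keep old data
        out = front_;
        return true;
    }

private:
    std::vector<int> front_, back_;
    std::timed_mutex m_;
};
```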
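The LRU policy above is commonly implemented as a linked list in recency order plus a hash map pointing into it, giving O(1) get and put; a sketch with my own class names:

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

// LRU cache: the list keeps (key, value) pairs in recency order (front =
// most recently used); the map gives O(1) access to each list node.
// Eviction removes the list's back, i.e. the least recently used key.
class LRUCache {
public:
    explicit LRUCache(std::size_t capacity) : cap_(capacity) {}

    bool get(int key, int& value) {
        auto it = map_.find(key);
        if (it == map_.end()) return false;
        order_.splice(order_.begin(), order_, it->second);  // mark most recent
        value = it->second->second;
        return true;
    }

    void put(int key, int value) {
        auto it = map_.find(key);
        if (it != map_.end()) {                 // update + mark most recent
            it->second->second = value;
            order_.splice(order_.begin(), order_, it->second);
            return;
        }
        if (map_.size() == cap_) {              // evict least recently used
            map_.erase(order_.back().first);
            order_.pop_back();
        }
        order_.emplace_front(key, value);
        map_[key] = order_.begin();
    }

private:
    std::size_t cap_;
    std::list<std::pair<int, int>> order_;
    std::unordered_map<int, std::list<std::pair<int, int>>::iterator> map_;
};
```

LFU keeps the same map-into-nodes shape but groups keys by access count and evicts from the lowest non-empty count bucket instead of the recency tail.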