google / gemma.cpp

A lightweight, standalone C++ inference engine for Google's Gemma models.


Use a MatMul implementation over MatVec for Prefill Computations

austinvhuang opened this issue · comments

Call for contributions for anyone interested in taking this on (@jan-wassenberg feel free to tag anyone who might be interested). The Prefill() computation is set up to allow batched computation (currently statically sized as kPrefillBatchSize).

Some pointers:

  • The Activations type is templated by batch size with this in mind, so to a first approximation, this can be done by replacing MatVec operations with a MatMul over the activation data that is batched when kPrefillBatchSize > 1
  • Prefill() calls FFW() and Attention(), so the implementation changes probably happen there. Since kBatchSize is known at compile time, this could probably even be done with if constexpr
  • As a first step, it might be easiest to try just FFW() and assess the performance difference, since there's less implementation complexity to deal with

Thanks! @pculliton @samkaufman FYI.
We'll soon have a basic MatMul to test with.

This is now done :D