restructure the multi-head attention layer
jijoongmoon opened this issue
We can optimize the memory consumption of the multi-head attention layer by composing it from a combination of existing layers. Doing so should reduce peak memory further. Two candidate approaches:
- Compute the attention heads one by one (see the sketch below this list).
- Re-implement the multi-head attention layer as a backbone layer.
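To make the first approach concrete, here is a minimal NumPy sketch (not NNtrainer code; all names are hypothetical) of computing the heads one at a time, so that peak activation memory stays at one head's worth instead of all heads at once:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mha_head_by_head(q, k, v, w_q, w_k, w_v, w_o, num_heads):
    """Multi-head attention computed one head at a time.

    q, k, v: (seq_len, d_model); w_q, w_k, w_v, w_o: (d_model, d_model).
    Only one head's activations are alive at any point in the loop.
    """
    seq_len, d_model = q.shape
    d_head = d_model // num_heads
    out = np.zeros((seq_len, d_model))
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        qh, kh, vh = q @ w_q[:, s], k @ w_k[:, s], v @ w_v[:, s]
        attn = softmax(qh @ kh.T / np.sqrt(d_head))
        # Project this head's context straight into the output, so the
        # concatenated (seq_len, d_model) context never materializes.
        out += (attn @ vh) @ w_o[s, :]
    return out
```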
Task list for approach 1:
- Enhance the split layer to split its input by a given number (the number of heads). #2025
- Replace the multi-head attention layer with an equivalent sub-graph (see the sketch after this list).
- Compare peak memory consumption and latency before and after the changes.
- Compare peak memory consumption and latency before and after enabling the swap feature.
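For the sub-graph replacement, the data flow could look like the following NumPy sketch (a hedged illustration, not NNtrainer code; input and output projections are omitted for brevity): the enhanced split layer divides each projected input into `num_heads` chunks, each chunk goes through scaled dot-product attention, and the per-head results are concatenated.

```python
import numpy as np

def mha_as_subgraph(q, k, v, num_heads):
    # Assumes q, k, v are already projected, shape (seq_len, d_model).
    d_head = q.shape[-1] // num_heads
    # Split layer enhanced to split by a given number of heads (#2025).
    qs = np.split(q, num_heads, axis=-1)
    ks = np.split(k, num_heads, axis=-1)
    vs = np.split(v, num_heads, axis=-1)
    heads = []
    for qh, kh, vh in zip(qs, ks, vs):
        scores = qh @ kh.T / np.sqrt(d_head)  # (seq_len, seq_len)
        e = np.exp(scores - scores.max(axis=-1, keepdims=True))
        heads.append((e / e.sum(axis=-1, keepdims=True)) @ vh)
    # Concatenate per-head contexts; output projection omitted.
    return np.concatenate(heads, axis=-1)
```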