restructure the multi-head attention layer
jijoongmoon opened this issue
We can optimize the memory consumption of the multi-head attention layer by composing it from a combination of existing layers. Doing so should reduce peak memory further. Two candidate approaches:
- Compute the attention heads one by one (see the sketch below this list).
- Re-implement the multi-head attention layer as a backbone layer.
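To make the first approach concrete, here is a minimal NumPy sketch (not NNtrainer code; all names are hypothetical) of computing the heads one at a time, so that peak activation memory stays at one head's worth instead of all heads at once:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mha_head_by_head(q, k, v, w_q, w_k, w_v, w_o, num_heads):
    """Multi-head attention computed one head at a time.

    q, k, v: (seq_len, d_model); w_q, w_k, w_v, w_o: (d_model, d_model).
    Only one head's activations are alive at any point in the loop.
    """
    seq_len, d_model = q.shape
    d_head = d_model // num_heads
    out = np.zeros((seq_len, d_model))
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        qh, kh, vh = q @ w_q[:, s], k @ w_k[:, s], v @ w_v[:, s]
        attn = softmax(qh @ kh.T / np.sqrt(d_head))
        # Project this head's context straight into the output, so the
        # concatenated (seq_len, d_model) context never materializes.
        out += (attn @ vh) @ w_o[s, :]
    return out
```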
Task list for approach 1:
- Enhance the split layer to split its input by a given number (the number of heads). #2025
- Replace the multi-head attention layer with an equivalent sub-graph (see the sketch after this list).
- Compare peak memory consumption and latency before and after the changes.
- Compare peak memory consumption and latency before and after enabling the swap feature.
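For the sub-graph replacement, the data flow could look like the following NumPy sketch (a hedged illustration, not NNtrainer code; input and output projections are omitted for brevity): the enhanced split layer divides each projected input into `num_heads` chunks, each chunk goes through scaled dot-product attention, and the per-head results are concatenated.

```python
import numpy as np

def mha_as_subgraph(q, k, v, num_heads):
    # Assumes q, k, v are already projected, shape (seq_len, d_model).
    d_head = q.shape[-1] // num_heads
    # Split layer enhanced to split by a given number of heads (#2025).
    qs = np.split(q, num_heads, axis=-1)
    ks = np.split(k, num_heads, axis=-1)
    vs = np.split(v, num_heads, axis=-1)
    heads = []
    for qh, kh, vh in zip(qs, ks, vs):
        scores = qh @ kh.T / np.sqrt(d_head)  # (seq_len, seq_len)
        e = np.exp(scores - scores.max(axis=-1, keepdims=True))
        heads.append((e / e.sum(axis=-1, keepdims=True)) @ vh)
    # Concatenate per-head contexts; output projection omitted.
    return np.concatenate(heads, axis=-1)
```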