flame / blis

BLAS-like Library Instantiation Software Framework

Review obj_t-related stack consumption

hominhquan opened this issue

BLIS's internal layers mostly re-clone and re-alias the obj_t operands a, b, and c at each level (bli_?_front, bli_l3_thread_entry, bli_gemm_int, as well as bli_?_blk_var?). This adds management overhead (obj_t aliasing) and consumes a lot of stack, which can be problematic on memory-constrained platforms.
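To make the stack cost concrete, here is a minimal sketch of the pattern described above. It does not use BLIS itself: `view_t` is a hypothetical stand-in for obj_t (the real structure has a different layout and size), and `layers_with_aliases` / `layers_without_aliases` are illustrative names, not BLIS functions. The sketch only counts the bytes taken by the local descriptor copies when every layer aliases all three operands, compared with forwarding them by pointer.

```c
#include <stddef.h>

/* Hypothetical stand-in for a matrix descriptor; NOT the real obj_t layout. */
typedef struct {
    void  *buffer;        /* base pointer of the underlying storage */
    size_t m, n;          /* dimensions of the view */
    size_t rs, cs;        /* row and column strides */
    size_t off_m, off_n;  /* offsets of the view into its parent */
} view_t;

/* Each layer makes local shallow copies ("aliases") of a, b, c before
   calling the next layer. Returns the number of stack bytes occupied by
   those copies alone, summed over all layers. */
static size_t layers_with_aliases(const view_t *a, const view_t *b,
                                  const view_t *c, int depth)
{
    if (depth == 0) return 0;
    view_t a_local = *a, b_local = *b, c_local = *c;  /* three copies per layer */
    return sizeof a_local + sizeof b_local + sizeof c_local
         + layers_with_aliases(&a_local, &b_local, &c_local, depth - 1);
}

/* Same call chain, but the descriptors are forwarded by const pointer,
   so no layer holds a copy of its own. */
static size_t layers_without_aliases(const view_t *a, const view_t *b,
                                     const view_t *c, int depth)
{
    if (depth == 0) return 0;
    return layers_without_aliases(a, b, c, depth - 1);
}
```

With four layers, the first function accounts for `12 * sizeof(view_t)` bytes of descriptor copies and the second for none. Actual stack usage in BLIS depends on the real size of obj_t, the compiler, and inlining, so this is only an order-of-magnitude illustration.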

Could we look at whether some of the cloning logic can be relaxed? The distinction I have in mind is between multi-threading isolation (must clone) and each thread's own execution (clone only if the algorithm requires it).

Some of this will naturally be addressed when @devinamatthews obviates the need for the bli_gemm_int() function, which is on his docket. But yes, we do a lot of aliasing under the assumption that it's cheap.

We could probably get by with aliasing each matrix obj_t only once, near the very top of the call stack.

Not if you want to be able to use task-based parallelism... However, only aliasing two of the three matrices in each gemm variant is sufficient. This is maybe 30-40% of the current number of aliases?
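One possible reading of the "two of three" remark, and it is an assumption rather than something stated in the thread, is that a blocked variant only needs fresh descriptors for the two operands it partitions, while the third can be forwarded unchanged. The sketch below illustrates that idea for a variant that partitions the n dimension. `view_t`, `col_block`, `gemm_ref`, and `gemm_blk_n` are hypothetical names for illustration, not BLIS APIs.

```c
#include <stddef.h>

/* Hypothetical column-major matrix view; NOT the real obj_t. */
typedef struct {
    double *buf;   /* pointer to element (0,0) of the view */
    size_t  m, n;  /* rows, columns */
    size_t  ld;    /* leading dimension (column stride) */
} view_t;

/* Return a view of columns [j, j+nb) of x. The view shares x's buffer. */
static view_t col_block(const view_t *x, size_t j, size_t nb)
{
    view_t v = *x;
    v.buf = x->buf + j * x->ld;
    v.n   = nb;
    return v;
}

/* Unblocked reference kernel: C += A * B. */
static void gemm_ref(const view_t *a, const view_t *b, const view_t *c)
{
    for (size_t j = 0; j < c->n; ++j)
        for (size_t p = 0; p < a->n; ++p)
            for (size_t i = 0; i < c->m; ++i)
                c->buf[i + j * c->ld] +=
                    a->buf[i + p * a->ld] * b->buf[p + j * b->ld];
}

/* Variant that partitions the n dimension in blocks of nb columns.
   Only B and C get local sub-views; A is forwarded by pointer as-is. */
static void gemm_blk_n(const view_t *a, const view_t *b, const view_t *c,
                       size_t nb)
{
    for (size_t j = 0; j < c->n; j += nb) {
        size_t w = (c->n - j < nb) ? c->n - j : nb;  /* width of this block */
        view_t b1 = col_block(b, j, w);
        view_t c1 = col_block(c, j, w);
        gemm_ref(a, &b1, &c1);
    }
}
```

Because `b1` and `c1` are private to each loop iteration, each block could in principle be handed to a separate task without sharing a mutable descriptor, which is one way to keep the task-based parallelism mentioned above while making fewer copies.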

"Some of this will naturally be addressed when @devinamatthews obviates the need for the bli_gemm_int() function, which is on his docket."

+1
@devinamatthews As far as I can see, there is also some aliasing in bli_?_front, bli_l3_thread_entry, and bli_?_blk_var?.