taichi-dev / taichi

Productive, portable, and performant GPU programming in Python.

Home Page: https://taichi-lang.org


Adding support for larger SharedArray

Linyou opened this issue

Concisely describe the proposed feature

SharedArray lets us keep intermediate results in shared memory instead of writing them back to global memory, which can significantly improve performance.

However, as quoted from the CUDA documentation: link

Devices of compute capability 8.0 allow a single thread block to address up to 163 KB of shared memory, while devices of compute capabilities 8.6 and 8.9 allow up to 99 KB of shared memory. Kernels relying on shared memory allocations over 48 KB per block are architecture-specific, and must use dynamic shared memory rather than statically sized shared memory arrays. These kernels require an explicit opt-in by using cudaFuncSetAttribute() to set the cudaFuncAttributeMaxDynamicSharedMemorySize; see Shared Memory for the Volta architecture.

Right now, we can only allocate 48 KB of shared memory through SharedArray, even on newer high-end GPUs that offer 99 KB or more per block. So, would it be possible for Taichi to support dynamic shared memory in CUDA, which would let us allocate more shared memory?
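For reference, here is a minimal sketch of the opt-in pattern the CUDA documentation above describes. The kernel, sizes, and the 96 KB request are all hypothetical and not Taichi's codegen; the exact ceiling depends on the device's compute capability.

```cuda
// Minimal sketch: opting in to more than 48 KB of dynamic shared memory.
__global__ void scale(float *data, int n) {
  // Dynamic shared memory: the size is supplied at launch time,
  // not fixed at compile time like a statically sized array.
  extern __shared__ float smem[];
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    smem[threadIdx.x] = data[i];
    __syncthreads();
    data[i] = smem[threadIdx.x] * 2.0f;
  }
}

int main() {
  const int n = 1 << 20;
  float *data;
  cudaMalloc(&data, n * sizeof(float));

  // Explicit opt-in is required for allocations beyond 48 KB per block.
  int smem_bytes = 96 * 1024;  // hypothetical; must not exceed the device's
                               // opt-in limit (e.g. 99 KB on sm_86)
  cudaFuncSetAttribute(scale, cudaFuncAttributeMaxDynamicSharedMemorySize,
                       smem_bytes);

  // The third launch parameter is the dynamic shared memory size per block.
  scale<<<(n + 255) / 256, 256, smem_bytes>>>(data, n);
  cudaDeviceSynchronize();
  cudaFree(data);
  return 0;
}
```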

Thank you for bringing this up!

We need to generate code for different device capabilities. We currently pin the codegen at sm75, which only supports up to 64 KB. It's a little complex for Taichi at the moment. Let me think about this carefully.
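For context, the per-device limits can be queried at runtime, which is roughly what arch-aware codegen would need to consult. A minimal sketch (not Taichi code):

```cuda
#include <cstdio>

int main() {
  int device = 0;
  cudaGetDevice(&device);

  int default_limit = 0, optin_limit = 0;
  // Statically sized shared arrays are capped at the default per-block
  // limit (48 KB); the opt-in attribute reports how far dynamic
  // allocations may go on this device.
  cudaDeviceGetAttribute(&default_limit,
                         cudaDevAttrMaxSharedMemoryPerBlock, device);
  cudaDeviceGetAttribute(&optin_limit,
                         cudaDevAttrMaxSharedMemoryPerBlockOptin, device);

  printf("default per-block shared memory: %d bytes\n", default_limit);
  printf("opt-in  per-block shared memory: %d bytes\n", optin_limit);
  return 0;
}
```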

Thanks! This would be a great feature for Taichi.