taichi-dev / taichi

Productive, portable, and performant GPU programming in Python.

Home Page: https://taichi-lang.org


Adding support for larger SharedArray

Linyou opened this issue

Concisely describe the proposed feature

SharedArray lets us keep intermediate results in shared memory instead of writing them back to global memory, which can significantly improve performance.

However, as quoted from the CUDA documentation: link

Devices of compute capability 8.0 allow a single thread block to address up to 163 KB of shared memory, while devices of compute capabilities 8.6 and 8.9 allow up to 99 KB of shared memory. Kernels relying on shared memory allocations over 48 KB per block are architecture-specific, and must use dynamic shared memory rather than statically sized shared memory arrays. These kernels require an explicit opt-in by using cudaFuncSetAttribute() to set the cudaFuncAttributeMaxDynamicSharedMemorySize; see Shared Memory for the Volta architecture.

Right now, we can only allocate 48 KB of shared memory through SharedArray, even on newer high-end GPUs that offer 99 KB or more per block. So, would it be possible for Taichi to support dynamic shared memory in CUDA, which would let us allocate more shared memory?
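For reference, here is a minimal sketch of the opt-in pattern the CUDA documentation above describes. The kernel, sizes, and the 96 KB request are all hypothetical and not Taichi's codegen; the exact ceiling depends on the device's compute capability.

```cuda
// Minimal sketch: opting in to more than 48 KB of dynamic shared memory.
__global__ void scale(float *data, int n) {
  // Dynamic shared memory: the size is supplied at launch time,
  // not fixed at compile time like a statically sized array.
  extern __shared__ float smem[];
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    smem[threadIdx.x] = data[i];
    __syncthreads();
    data[i] = smem[threadIdx.x] * 2.0f;
  }
}

int main() {
  const int n = 1 << 20;
  float *data;
  cudaMalloc(&data, n * sizeof(float));

  // Explicit opt-in is required for allocations beyond 48 KB per block.
  int smem_bytes = 96 * 1024;  // hypothetical; must not exceed the device's
                               // opt-in limit (e.g. 99 KB on sm_86)
  cudaFuncSetAttribute(scale, cudaFuncAttributeMaxDynamicSharedMemorySize,
                       smem_bytes);

  // The third launch parameter is the dynamic shared memory size per block.
  scale<<<(n + 255) / 256, 256, smem_bytes>>>(data, n);
  cudaDeviceSynchronize();
  cudaFree(data);
  return 0;
}
```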

Thank you for bringing this up!

We need to generate code for different device capabilities. We currently pin the codegen at sm75, which only supports up to 64 KB. It's a little complex for Taichi at the moment. Let me think about this carefully.
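For context, the per-device limits can be queried at runtime, which is roughly what arch-aware codegen would need to consult. A minimal sketch (not Taichi code):

```cuda
#include <cstdio>

int main() {
  int device = 0;
  cudaGetDevice(&device);

  int default_limit = 0, optin_limit = 0;
  // Statically sized shared arrays are capped at the default per-block
  // limit (48 KB); the opt-in attribute reports how far dynamic
  // allocations may go on this device.
  cudaDeviceGetAttribute(&default_limit,
                         cudaDevAttrMaxSharedMemoryPerBlock, device);
  cudaDeviceGetAttribute(&optin_limit,
                         cudaDevAttrMaxSharedMemoryPerBlockOptin, device);

  printf("default per-block shared memory: %d bytes\n", default_limit);
  printf("opt-in  per-block shared memory: %d bytes\n", optin_limit);
  return 0;
}
```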

Thanks! This would be a great feature for Taichi.