TiledTensor / TiledCUDA

TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.

Enable `cp.async` when loading data from global memory to shared memory.

haruhi55 opened this issue

The cp.async feature is currently disabled in the implementation.

static const bool enable_cp_async = false; // change this flag

This is because CuTe's TiledCopy fails a compile-time static check when the Layout is created with runtime values. The error message is:

"Copy_Traits: src failed to vectorize into registers. Layout is incompatible with this CopyOp."

However, I am wondering whether CuTe's check is overly strict. After I commented out this compile-time static check, the correctness of the implementation was not affected. The check in question is here:

https://github.com/NVIDIA/cutlass/blob/033d9efd2db0bbbcf3b3b0650acde6c472f3948e/include/cute/atom/copy_traits.hpp#L122-L125
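
To illustrate why such a check might reject a layout built from runtime values, here is a minimal host-side C++ sketch. It does not use CuTe: `StaticExtent`, `DynamicExtent`, `provably_vectorizable`, and `vectorizable_at_runtime` are hypothetical names made up for this example, and the 16-byte width is an assumption based on the widest copy that `cp.async` accepts. The idea is that a compile-time check can only prove vectorization when the contiguous extent is known at compile time, so it may conservatively fail for a runtime extent even when the actual value would be fine.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical stand-ins for a compile-time and a runtime extent.
// These are NOT CuTe types; they only mimic the static/dynamic split.
template <std::size_t N>
struct StaticExtent {
    static constexpr bool is_static = true;
    static constexpr std::size_t value = N;
    constexpr std::size_t get() const { return N; }
};

struct DynamicExtent {
    static constexpr bool is_static = false;
    std::size_t v;  // only known when the program runs
    constexpr std::size_t get() const { return v; }
};

// Assumed vector width: 16 bytes, the widest copy cp.async accepts.
constexpr std::size_t kVectorBytes = 16;

// Static extent: the divisibility condition can be evaluated by the compiler.
template <typename Element, typename Extent>
constexpr bool provable(std::true_type) {
    return (Extent::value * sizeof(Element)) % kVectorBytes == 0;
}

// Dynamic extent: nothing can be proven, so the answer is conservatively "no".
template <typename Element, typename Extent>
constexpr bool provable(std::false_type) {
    return false;
}

// Compile-time check, dispatching on whether the extent is static.
template <typename Element, typename Extent>
constexpr bool provably_vectorizable() {
    return provable<Element, Extent>(
        std::integral_constant<bool, Extent::is_static>{});
}

// Runtime check: the same condition evaluated on the actual value.
template <typename Element, typename Extent>
bool vectorizable_at_runtime(Extent contiguous) {
    return (contiguous.get() * sizeof(Element)) % kVectorBytes == 0;
}

// 64 floats = 256 bytes: provable at compile time.
static_assert(provably_vectorizable<float, StaticExtent<64>>(),
              "static extent of 64 floats vectorizes");
// 3 floats = 12 bytes: provably not a multiple of 16 bytes.
static_assert(!provably_vectorizable<float, StaticExtent<3>>(),
              "static extent of 3 floats does not vectorize");
// A runtime extent cannot be proven, whatever value it later holds.
static_assert(!provably_vectorizable<float, DynamicExtent>(),
              "dynamic extent cannot be proven at compile time");
```

In this sketch, `vectorizable_at_runtime<float>(DynamicExtent{64})` returns true even though the compile-time check rejects `DynamicExtent`. That is one plausible reading of why removing the static check leaves results correct, provided the runtime values do satisfy the alignment and size requirements. Whether CuTe's check behaves this way for a given layout is an assumption that should be verified against the linked source.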