Enable `cp.async` when loading data from global memory to shared memory.
haruhi55 opened this issue
The `cp.async` feature is currently disabled in the implementation:
`TiledCUDA/include/cell/traits/gemm.hpp`, line 60 in 8ad3974
This is because CuTe's `TiledCopy` fails a compile-time check when the `Layout` is created with runtime values:
"Copy_Traits: src failed to vectorize into registers. Layout is incompatible with this CopyOp.");
However, I wonder whether CuTe's check is overly strict here. After I commented out the compile-time static check, the implementation still produced correct results.
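To illustrate why a static check like this can reject a layout built from runtime values, here is a minimal C++ sketch. It is not CuTe's actual code: `StaticInt`, `Layout1D`, `is_static`, and `vectorizes` are hypothetical names made up for this example, and the rule it encodes (unit stride, total size a multiple of the copy width) is only an assumed stand-in for the real vectorization condition.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical compile-time integer (a stand-in, not CuTe's own type).
template <int N>
struct StaticInt {
  static constexpr int value = N;
};

// Hypothetical 1-D layout: Extent and Stride are either StaticInt<N>
// (known at compile time) or plain int (known only at run time).
template <class Extent, class Stride>
struct Layout1D {
  Extent extent;
  Stride stride;
};

template <class T>
struct is_static : std::false_type {};
template <int N>
struct is_static<StaticInt<N>> : std::true_type {};

// Primary template: nothing can be proven at compile time, so the
// answer is conservatively "does not vectorize".
template <class Elem, int Bytes, class Layout>
struct vectorizes : std::false_type {};

// Fully static layout: the compiler can verify unit stride and that the
// total size is a multiple of the copy width (e.g. 16 bytes).
template <class Elem, int Bytes, int N, int S>
struct vectorizes<Elem, Bytes, Layout1D<StaticInt<N>, StaticInt<S>>>
    : std::bool_constant<S == 1 && (N * sizeof(Elem)) % Bytes == 0> {};

using StaticLayout = Layout1D<StaticInt<8>, StaticInt<1>>;  // 8 contiguous elements
using DynamicLayout = Layout1D<int, int>;                   // values known at run time

// Passes: 8 floats with unit stride are 32 contiguous bytes.
static_assert(vectorizes<float, 16, StaticLayout>::value,
              "static layout should vectorize into 16-byte copies");

// The same check applied to DynamicLayout would fail to compile, even if
// the run-time values turn out to be extent = 8, stride = 1.
static_assert(!vectorizes<float, 16, DynamicLayout>::value,
              "dynamic layout cannot be proven vectorizable at compile time");
```

In this sketch the dynamic layout is rejected because the check cannot see its values, not because the data is actually misaligned or strided. That would be consistent with the behavior described above, where removing the static check left the results correct, but whether CuTe's check behaves the same way is an assumption that would need to be confirmed against its source.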