Need for clean organization of register tile for Tensor Core output
haruhi55 opened this issue · comments
To get a quick implementation working, we currently use enumerated values. This approach will not hold up in the long run. The WMMA instruction has critical parameters that determine how data is distributed in memory, such as the output data type and the execution order of the multiple WMMA instructions that together compute a tile. The store kernel depends on this information to be implemented correctly, so the register tile needs to carry it explicitly.
https://github.com/TiledTensor/TiledCUDA/blob/master/include/cell/copy/constants.hpp#L16
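To make the request concrete, here is a minimal sketch of what a layout descriptor could look like in place of the enumerated values. Every name in it (`AccRegTileLayout`, `TileOrder`, `Coord`, `coord`) is hypothetical and not part of TiledCUDA. It also rests on an assumption: since the CUDA C++ WMMA API leaves the fragment layout unspecified, the sketch uses the accumulator layout that the PTX documentation gives for `mma.sync` with an m16n8 output (4 values per thread, `groupID = lane / 4`, `threadID_in_group = lane % 4`). It is plain C++ so the mapping can be checked on the host.

```cpp
#include <cstddef>

// Hypothetical names throughout; this is not TiledCUDA's actual API.

// Order in which the base tiles of a larger register tile are computed.
enum class TileOrder { kRowMajor, kColMajor };

// Coordinate of one accumulator element inside the register tile.
struct Coord {
    int row;
    int col;
};

// Describes a register tile made of kTilesM_ x kTilesN_ base tiles, where
// each base tile is ASSUMED to be one 16x8 mma.sync accumulator.
template <typename Element_, int kTilesM_, int kTilesN_, TileOrder kOrder_>
struct AccRegTileLayout {
    using Element = Element_;  // output (accumulator) data type

    static constexpr int kBaseM = 16;
    static constexpr int kBaseN = 8;
    static constexpr int kElemsPerBase = 4;  // values per thread per base tile
    static constexpr int kTilesM = kTilesM_;
    static constexpr int kTilesN = kTilesN_;
    static constexpr TileOrder kOrder = kOrder_;

    static constexpr int kRows = kBaseM * kTilesM;
    static constexpr int kCols = kBaseN * kTilesN;
    static constexpr int kRegsPerThread = kElemsPerBase * kTilesM * kTilesN;
    static constexpr std::size_t kBytesPerThread =
        kRegsPerThread * sizeof(Element);

    // Maps (lane id, per-thread register index) to a tile coordinate.
    static constexpr Coord coord(int lane, int reg) {
        const int tile = reg / kElemsPerBase;  // which base tile
        const int i = reg % kElemsPerBase;     // which value inside it

        // Execution order decides which base tile a register belongs to.
        const bool row_major = (kOrder == TileOrder::kRowMajor);
        const int tm = row_major ? tile / kTilesN : tile % kTilesM;
        const int tn = row_major ? tile % kTilesN : tile / kTilesM;

        // Assumed m16n8 accumulator layout within one base tile.
        const int group = lane / 4;
        const int tid = lane % 4;
        const int row = group + (i >= 2 ? 8 : 0);
        const int col = tid * 2 + (i & 1);

        return Coord{tm * kBaseM + row, tn * kBaseN + col};
    }
};
```

With a descriptor along these lines, a store kernel could iterate over `kRegsPerThread` and ask `coord(lane, reg)` where each register belongs, so the output data type and the execution order become template parameters rather than cases of an enum.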