TiledTensor / TiledCUDA

TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.

Add a straightforward implementation for TileIterator

haruhi55 opened this issue

By dividing a large tile into smaller sub-tiles, we create a 2D array of sub-tiles. The TileIterator provides a logical view of these sub-tiles, allowing us to use logical indices to iterate over them.
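The index arithmetic behind such a logical view can be sketched in plain C++. This is only an illustration under assumptions: `SubTileGrid`, `kGridRows`, `kGridCols`, and `offset` are hypothetical names, not part of TiledCUDA, and the parent tile is assumed to be row-major with dimensions that divide evenly by the sub-tile shape.

```cpp
// Hypothetical helper, NOT part of TiledCUDA: describes how a row-major
// parent tile of kRows x kCols elements is chunked into sub-tiles of
// kSubRows x kSubCols elements.
template <int kRows, int kCols, int kSubRows, int kSubCols>
struct SubTileGrid {
    static_assert(kRows % kSubRows == 0, "rows must divide evenly");
    static_assert(kCols % kSubCols == 0, "columns must divide evenly");

    // Number of sub-tiles along each dimension of the logical 2D grid.
    static constexpr int kGridRows = kRows / kSubRows;
    static constexpr int kGridCols = kCols / kSubCols;

    // Offset (in elements) of the first element of sub-tile (i, j)
    // inside the parent tile's row-major buffer.
    static constexpr int offset(int i, int j) {
        return i * kSubRows * kCols + j * kSubCols;
    }
};
```

For a 4 x 12 parent tile and 2 x 4 sub-tiles, `SubTileGrid<4, 12, 2, 4>` describes a 2 x 3 grid, and sub-tile (1, 0) starts at element offset 24. An iterator only needs this mapping to turn a logical index into a pointer into the parent tile; no data has to be copied.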

  1. Suppose a shared-memory tile has a row-major layout like this:

    using Tile = SharedTile<float, RowMajor<4, 12>>;
    
    // Suppose a `Tile` has contents like this:
    0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 
    12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 
    24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 
    36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0, 
  2. Chunk the large Tile along both dimensions to get a grid of sub-tiles. With `TileShape<2, 4>`, the 4 x 12 tile becomes a 2 x 3 grid of 2 x 4 sub-tiles. Use the iterator to iterate over the sub-tiles along the first dimension:

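    using Iterator = SharedTileIterator<Tile, TileShape<2, 4>>;
    
    // `tiles` is an `Iterator` instance over the shared-memory `Tile` (construction omitted).
    // `Iterator::sc0` is the number of sub-tiles along the first dimension (2 here).
    for (int i = 0; i < Iterator::sc0; ++i) {
        // `_` selects every sub-tile along the second dimension.
        tiles(i, _).to_tile().dump_value();
    }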
    

    The expected output is:

    // Iteration [0, _]: the returned type is a 2 x 12 `SharedTile`
    0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 
    12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 
    
    // Iteration [1, _]: the returned type is a 2 x 12 `SharedTile`
    24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 
    36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0,
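
To check the expected values without a GPU, the `(i, _)` slice can be emulated on the host. The following is a rough sketch, not the proposed implementation: `make_example_tile`, `row_of_subtiles`, and this free-function `dump_value` are hypothetical stand-ins. Unlike a real iterator, which would presumably return a view into shared memory, this version copies the data.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Builds a row-major rows x cols tile filled with 0, 1, 2, ...
std::vector<float> make_example_tile(std::size_t rows, std::size_t cols) {
    std::vector<float> tile(rows * cols);
    for (std::size_t k = 0; k < tile.size(); ++k) {
        tile[k] = static_cast<float>(k);
    }
    return tile;
}

// Hypothetical host-side stand-in for `tiles(i, _).to_tile()`: copies the
// i-th row of sub-tiles out of a row-major buffer with `cols` columns.
// Each sub-tile is `sub_rows` tall, so the slice spans `sub_rows` full rows.
std::vector<float> row_of_subtiles(const std::vector<float>& tile,
                                   std::size_t cols, std::size_t sub_rows,
                                   std::size_t i) {
    const std::size_t begin = i * sub_rows * cols;
    const std::size_t end = begin + sub_rows * cols;
    assert(end <= tile.size() && "sub-tile row index out of range");
    return std::vector<float>(
        tile.begin() + static_cast<std::ptrdiff_t>(begin),
        tile.begin() + static_cast<std::ptrdiff_t>(end));
}

// Prints a row-major buffer, `cols` comma-separated values per line.
void dump_value(const std::vector<float>& data, std::size_t cols) {
    for (std::size_t k = 0; k < data.size(); ++k) {
        std::printf("%.1f, ", data[k]);
        if ((k + 1) % cols == 0) std::printf("\n");
    }
}
```

Calling `dump_value(row_of_subtiles(make_example_tile(4, 12), 12, 2, i), 12)` for `i = 0` and `i = 1` walks the same two 2 x 12 slices as the loop in step 2.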