Better heuristic in getDefaultDistributedLoopTileSizes
pzread opened this issue · comments
Currently, getDefaultDistributedLoopTileSizes can produce non-divisible tile sizes and relies on getMaxTileSize to find the closest divisible size. However, this sometimes yields a non-ideal tiling, because the shape size is already divided by 2 when the workgroup size is chosen. We want to divide the size by 2 as late as possible, so that the inner tile size can be a multiple of the vector size.
For example, on the feature dim of a depthwise_conv2d in MobileNetV3, it is tiled as:
- Shape: 240
- Workgroup: 60
- Inner tiling: 30
It can perform better if we tile it as:
- Shape: 240
- Workgroup: 48
- Inner tiling: 16
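To make the difference between the two tilings concrete, here is a small sketch that checks divisibility and reports the largest power of two dividing each inner tile. The helper name `pow2_factor` is made up for illustration, and this is not code from the compiler.

```python
def pow2_factor(n: int) -> int:
    """Largest power of two that divides n (n > 0)."""
    return n & -n  # isolates the lowest set bit


SHAPE = 240  # feature dim from the example above

for name, workgroup, inner in [("current", 60, 30), ("proposed", 48, 16)]:
    print(
        name,
        "shape % workgroup == 0:", SHAPE % workgroup == 0,
        "workgroup % inner == 0:", workgroup % inner == 0,
        "pow2 factor of inner:", pow2_factor(inner),
    )
```

Both tilings divide evenly, but the inner tile of 30 keeps only a single factor of 2, while 16 keeps all four factors of 2 available in 240. Assuming a vector size of 4, 8, or 16 elements (an assumption, the issue does not state it), 30 is not a multiple of any of them, whereas 16 is a multiple of all three.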
The better tiling can be obtained by aggressively preserving factors of 2 while searching for the workgroup size, and by making a best effort to keep the resulting size divisible (so that getMaxTileSize won't kick in).
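A minimal sketch of that idea, assuming a simple search over divisors: among the divisors of the size that fit under a limit, prefer the one that keeps the most factors of 2, then the largest. The function `pick_tile` and the limits 64 and 32 are hypothetical and chosen only to reproduce the example. This is not the actual getDefaultDistributedLoopTileSizes logic, which would also need to weigh things such as the number of workgroups.

```python
def pow2_factor(n: int) -> int:
    """Largest power of two that divides n (n > 0)."""
    return n & -n


def pick_tile(size: int, max_tile: int) -> int:
    """Pick a divisor of `size` that is <= `max_tile`.

    Prefers the divisor that keeps the most factors of 2, and breaks
    ties by taking the larger divisor. The result always divides `size`,
    so no later rounding to a divisible size is needed.
    """
    candidates = [d for d in range(1, min(size, max_tile) + 1) if size % d == 0]
    return max(candidates, key=lambda d: (pow2_factor(d), d))


# Hypothetical limits, chosen only for illustration.
workgroup = pick_tile(240, max_tile=64)
inner = pick_tile(workgroup, max_tile=32)
print(workgroup, inner)  # prints: 48 16
```

Because every candidate is a divisor, the chosen sizes are divisible by construction, which is the condition under which the fallback to getMaxTileSize is not needed.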