Better heuristic in getDefaultDistributedLoopTileSizes

Question

pzread opened this issue 2 years ago

Currently, getDefaultDistributedLoopTileSizes can produce tile sizes that do not evenly divide the shape, and it relies on getMaxTileSize to find the closest divisible size. However, this sometimes yields a non-ideal tile size, because the shape size is divided by 2 when the workgroup size is chosen. We want to divide the size by 2 as late as possible, so that the inner tile size can be a multiple of the vector size.

For example, on the feature dim of a depthwise_conv2d in MobileNetV3, it is tiled as:

Shape: 240
Workgroup: 60
Inner tiling: 30

It can perform better if we tile it as:

Shape: 240
Workgroup: 48
Inner tiling: 16

This can be done by aggressively preserving the factors of 2 while searching for the workgroup size, and by making a best effort to keep the resulting size divisible (so that getMaxTileSize won't kick in).
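
The idea can be illustrated with a small Python sketch. It is not the IREE implementation: the helper names (`pow2_factor`, `pick_workgroup_tile`, `pick_inner_tile`), the maximum workgroup tile size of 64, and the vector size of 16 are all assumptions chosen only to reproduce the 240 / 48 / 16 example above.

```python
def pow2_factor(n: int) -> int:
    """Return the largest power of two that divides n (n > 0)."""
    return n & -n


def pick_workgroup_tile(shape: int, max_tile: int, vector_size: int) -> int:
    """Hypothetical heuristic: pick a divisor of `shape` no larger than
    `max_tile`, preferring divisors that keep as many factors of 2 as the
    vector size needs, and breaking ties by taking the largest divisor."""
    divisors = [d for d in range(1, min(shape, max_tile) + 1) if shape % d == 0]
    return max(divisors, key=lambda d: (min(pow2_factor(d), vector_size), d))


def pick_inner_tile(workgroup_tile: int, vector_size: int) -> int:
    """Hypothetical inner tile: the largest power of two dividing the
    workgroup tile, capped at the vector size."""
    return min(pow2_factor(workgroup_tile), vector_size)


# Assumed parameters: max workgroup tile 64, vector size 16.
workgroup = pick_workgroup_tile(240, max_tile=64, vector_size=16)
inner = pick_inner_tile(workgroup, vector_size=16)
print(workgroup, inner)
```

Because every candidate is a divisor of the shape, the result is divisible by construction, so no later rounding step is needed. Ranking candidates by their power-of-two factor first is what makes 48 (which keeps 2^4) win over 60 (which keeps only 2^2) under these assumed parameters.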