[CPU] Explore ukernels for winograd transform ops
hanhanW opened this issue · comments
One of the performance issues we learned from recent convolution work is about Winograd transform ops. After decomposition, they become a chain of small matmuls whose shapes are derived from the filter sizes. We've seen sequences of `8x8x8xf32` or `6x8x6xf32` matmuls in some convolutions. These are not efficient on AVX-512, where the registers are wide enough (16 x f32) to hold and compute more data per instruction. Due to accuracy issues, we can not simply increase some tiling factors to make the decomposition generate `<*x16xf32>` matmuls. However, we still have the advantage of 2x-wider registers compared to AVX2.
Since the Winograd transform ops are critical for convolution, we plan to investigate the use of ukernels and see if we can get faster kernels. @bjacob please fill in more context if I'm missing something from our discussion.