Use fma (fused multiply-add) on architectures supporting fma
SuperFluffy opened this issue · comments
Modern Intel architectures that support the FMA instruction set can perform the first loop, which computes the matrix-matrix product between panels `a` and `b`, in one go using `_mm256_fmadd_pd`. We should implement this and see how it affects performance.
Let's land the pure AVX dgemm first.
These performance gains are just lovely:
| name | dgemm_avx (ns/iter) | dgemm_fma (ns/iter) | diff (ns/iter) | diff % | speedup |
|---|---:|---:|---:|---:|---|
| layout_f64_032::ccc | 3,258 | 2,396 | -862 | -26.46% | x 1.36 |
| layout_f64_032::ccf | 3,212 | 2,320 | -892 | -27.77% | x 1.38 |
| layout_f64_032::cfc | 3,384 | 2,525 | -859 | -25.38% | x 1.34 |
| layout_f64_032::cff | 3,341 | 2,457 | -884 | -26.46% | x 1.36 |
| layout_f64_032::fcc | 3,144 | 2,276 | -868 | -27.61% | x 1.38 |
| layout_f64_032::fcf | 3,090 | 2,189 | -901 | -29.16% | x 1.41 |
| layout_f64_032::ffc | 3,251 | 2,376 | -875 | -26.91% | x 1.37 |
| layout_f64_032::fff | 3,201 | 2,296 | -905 | -28.27% | x 1.39 |
| mat_mul_f64::m004 | 175 | 171 | -4 | -2.29% | x 1.02 |
| mat_mul_f64::m006 | 237 | 226 | -11 | -4.64% | x 1.05 |
| mat_mul_f64::m008 | 257 | 242 | -15 | -5.84% | x 1.06 |
| mat_mul_f64::m012 | 516 | 431 | -85 | -16.47% | x 1.20 |
| mat_mul_f64::m016 | 648 | 519 | -129 | -19.91% | x 1.25 |
| mat_mul_f64::m032 | 3,296 | 2,384 | -912 | -27.67% | x 1.38 |
| mat_mul_f64::m064 | 22,168 | 14,856 | -7,312 | -32.98% | x 1.49 |
| mat_mul_f64::m127 | 160,532 | 104,342 | -56,190 | -35.00% | x 1.54 |
Sgemm is not as impressive, but shows serious improvement as well:
| name | avx (ns/iter) | fma (ns/iter) | diff (ns/iter) | diff % | speedup |
|---|---:|---:|---:|---:|---|
| layout_f32_032::ccc | 2,050 | 1,788 | -262 | -12.78% | x 1.15 |
| layout_f32_032::ccf | 2,050 | 1,774 | -276 | -13.46% | x 1.16 |
| layout_f32_032::cfc | 2,317 | 2,042 | -275 | -11.87% | x 1.13 |
| layout_f32_032::cff | 2,316 | 2,046 | -270 | -11.66% | x 1.13 |
| layout_f32_032::fcc | 1,796 | 1,527 | -269 | -14.98% | x 1.18 |
| layout_f32_032::fcf | 1,799 | 1,513 | -286 | -15.90% | x 1.19 |
| layout_f32_032::ffc | 2,058 | 1,784 | -274 | -13.31% | x 1.15 |
| layout_f32_032::fff | 2,052 | 1,785 | -267 | -13.01% | x 1.15 |
| mat_mul_f32::m004 | 187 | 171 | -16 | -8.56% | x 1.09 |
| mat_mul_f32::m006 | 210 | 208 | -2 | -0.95% | x 1.01 |
| mat_mul_f32::m008 | 179 | 175 | -4 | -2.23% | x 1.02 |
| mat_mul_f32::m012 | 524 | 458 | -66 | -12.60% | x 1.14 |
| mat_mul_f32::m016 | 492 | 429 | -63 | -12.80% | x 1.15 |
| mat_mul_f32::m032 | 2,036 | 1,793 | -243 | -11.94% | x 1.14 |
| mat_mul_f32::m064 | 12,621 | 11,283 | -1,338 | -10.60% | x 1.12 |
| mat_mul_f32::m127 | 88,308 | 82,163 | -6,145 | -6.96% | x 1.07 |
That's amazing!