Use fma (fused multiply-add) on architectures supporting fma
SuperFluffy opened this issue · comments
Modern Intel architectures that support the FMA instruction set can perform the first loop, which computes the matrix-matrix product between panels `a` and `b`, in one go using `_mm256_fmadd_pd`. We should implement this and see how it affects performance.
Let's land the pure AVX dgemm first.
These performance gains are just lovely:
| name | dgemm_avx (ns/iter) | dgemm_fma (ns/iter) | diff (ns/iter) | diff % | speedup |
|---|---:|---:|---:|---:|---|
| layout_f64_032::ccc | 3,258 | 2,396 | -862 | -26.46% | x 1.36 |
| layout_f64_032::ccf | 3,212 | 2,320 | -892 | -27.77% | x 1.38 |
| layout_f64_032::cfc | 3,384 | 2,525 | -859 | -25.38% | x 1.34 |
| layout_f64_032::cff | 3,341 | 2,457 | -884 | -26.46% | x 1.36 |
| layout_f64_032::fcc | 3,144 | 2,276 | -868 | -27.61% | x 1.38 |
| layout_f64_032::fcf | 3,090 | 2,189 | -901 | -29.16% | x 1.41 |
| layout_f64_032::ffc | 3,251 | 2,376 | -875 | -26.91% | x 1.37 |
| layout_f64_032::fff | 3,201 | 2,296 | -905 | -28.27% | x 1.39 |
| mat_mul_f64::m004 | 175 | 171 | -4 | -2.29% | x 1.02 |
| mat_mul_f64::m006 | 237 | 226 | -11 | -4.64% | x 1.05 |
| mat_mul_f64::m008 | 257 | 242 | -15 | -5.84% | x 1.06 |
| mat_mul_f64::m012 | 516 | 431 | -85 | -16.47% | x 1.20 |
| mat_mul_f64::m016 | 648 | 519 | -129 | -19.91% | x 1.25 |
| mat_mul_f64::m032 | 3,296 | 2,384 | -912 | -27.67% | x 1.38 |
| mat_mul_f64::m064 | 22,168 | 14,856 | -7,312 | -32.98% | x 1.49 |
| mat_mul_f64::m127 | 160,532 | 104,342 | -56,190 | -35.00% | x 1.54 |
Sgemm is not as impressive, but shows serious improvement as well:
| name | avx (ns/iter) | fma (ns/iter) | diff (ns/iter) | diff % | speedup |
|---|---:|---:|---:|---:|---|
| layout_f32_032::ccc | 2,050 | 1,788 | -262 | -12.78% | x 1.15 |
| layout_f32_032::ccf | 2,050 | 1,774 | -276 | -13.46% | x 1.16 |
| layout_f32_032::cfc | 2,317 | 2,042 | -275 | -11.87% | x 1.13 |
| layout_f32_032::cff | 2,316 | 2,046 | -270 | -11.66% | x 1.13 |
| layout_f32_032::fcc | 1,796 | 1,527 | -269 | -14.98% | x 1.18 |
| layout_f32_032::fcf | 1,799 | 1,513 | -286 | -15.90% | x 1.19 |
| layout_f32_032::ffc | 2,058 | 1,784 | -274 | -13.31% | x 1.15 |
| layout_f32_032::fff | 2,052 | 1,785 | -267 | -13.01% | x 1.15 |
| mat_mul_f32::m004 | 187 | 171 | -16 | -8.56% | x 1.09 |
| mat_mul_f32::m006 | 210 | 208 | -2 | -0.95% | x 1.01 |
| mat_mul_f32::m008 | 179 | 175 | -4 | -2.23% | x 1.02 |
| mat_mul_f32::m012 | 524 | 458 | -66 | -12.60% | x 1.14 |
| mat_mul_f32::m016 | 492 | 429 | -63 | -12.80% | x 1.15 |
| mat_mul_f32::m032 | 2,036 | 1,793 | -243 | -11.94% | x 1.14 |
| mat_mul_f32::m064 | 12,621 | 11,283 | -1,338 | -10.60% | x 1.12 |
| mat_mul_f32::m127 | 88,308 | 82,163 | -6,145 | -6.96% | x 1.07 |
That's amazing!