Discrepancy in Haswell Results for Dgemm

Question

mert-kurttutan opened this issue a year ago

Hi,
In your performance page here, it is noted that the peak performance for single-threaded dgemm on Haswell is 56 GFLOPS. However, if one looks at the corresponding graph, it looks a lot closer to 41-42 GFLOPS.

On my local machine (which has the Haswell microarchitecture), I also ran a similar test, i.e., single-threaded dgemm with a similar matrix size. I do get 56-57 GFLOPS peak performance.

Is the performance plot outdated, or am I missing something?

Field G. Van Zee · Answer 1 · Thu Sep 07 2023 03:07:56 GMT+0800 (China Standard Time)

> Hi, in your performance page here, it is noted that the peak performance for single-threaded dgemm on Haswell is 56 GFLOPS. However, if one looks at the corresponding graph, it looks a lot closer to 41-42 GFLOPS.

Note that when we talk about peak performance (in GFLOPS), we are almost always talking about theoretical peak performance rather than the peak performance that is actually attainable/observed. This value is computed by multiplying (1) the number of elements in a vector register by (2) the number of flops executed per instruction (two for FMAs: one multiply + one add) by (3) the number of FMA instructions that can be issued per cycle by (4) the sustainable clock speed (in GHz). Perhaps this is the source of the confusion?
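As a rough illustration of that four-factor product, here is a minimal Python sketch. The function name and the specific Haswell inputs are assumptions made for this example (4 doubles per 256-bit vector register, 2 FMA instructions issued per cycle, and a 3.5 GHz sustained single-core clock); they are not quoted from the performance page. Under those assumptions the product works out to the 56 GFLOPS figure discussed above.

```python
def theoretical_peak_gflops(vector_elems, flops_per_instr, instrs_per_cycle, clock_ghz):
    """Theoretical per-core peak in GFLOPS: the product of the four factors."""
    return vector_elems * flops_per_instr * instrs_per_cycle * clock_ghz


# Assumed inputs for a Haswell core running double-precision FMAs.
haswell_peak = theoretical_peak_gflops(
    vector_elems=4,      # assumed: 256-bit register / 64-bit double
    flops_per_instr=2,   # FMA = one multiply + one add
    instrs_per_cycle=2,  # assumed: two FMA instructions issued per cycle
    clock_ghz=3.5,       # assumed: sustained single-core clock in GHz
)

print(f"theoretical peak: {haswell_peak} GFLOPS")
```

Changing only `clock_ghz` shows how two machines with the same microarchitecture can end up with different theoretical peaks.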

Also note that due to multicore frequency throttling, the peak performance per core when using many cores is sometimes a bit lower than single-core peak performance. (You may have already known or been aware of this, but I thought I'd mention it just in case.) And this multicore throttling is indeed present on the system I was testing at the time.

> On my local machine (which has the Haswell microarchitecture), I also ran a similar test, i.e., single-threaded dgemm with a similar matrix size. I do get 56-57 GFLOPS peak performance.

Since one of the inputs to determining peak performance is clock speed (which can vary between various models of a given processor), different machines will have different theoretical peak performance levels. In order to assess your observed performance against your theoretical peak, you'll need to compute the peak for your own system.
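One way to make that comparison concrete is sketched below, assuming the usual flop count of 2·m·n·k for a GEMM update. The helper names, the matrix size, and the 0.5 s timing are all hypothetical values chosen for illustration, not measurements; the 56.0 peak is simply the figure quoted in this thread and should be replaced with the value computed for your own system.

```python
def gemm_gflops(m, n, k, seconds):
    """Observed GFLOPS for C += A*B with A m-by-k and B k-by-n (2*m*n*k flops)."""
    return 2.0 * m * n * k / seconds / 1e9


def fraction_of_peak(observed_gflops, peak_gflops):
    """Observed rate as a fraction of the theoretical peak."""
    return observed_gflops / peak_gflops


# Hypothetical example: a 2000x2000x2000 dgemm that took 0.5 seconds.
observed = gemm_gflops(2000, 2000, 2000, seconds=0.5)

# Placeholder peak; substitute the theoretical peak of your own machine.
peak = 56.0

print(f"observed: {observed:.1f} GFLOPS, "
      f"{100 * fraction_of_peak(observed, peak):.1f}% of theoretical peak")
```

A fraction well below 1.0 is expected in practice, since the theoretical peak ignores memory traffic and other overheads.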

Hope this helps!

mert-kurttutan · Answer 2 · Thu Sep 07 2023 04:07:28 GMT+0800 (China Standard Time)

Oh, sorry for the confusion. I thought of it as the theoretical performance, which would not make much sense to use, even though the same term is used consistently in your papers.

Field G. Van Zee · Answer 3 · Thu Sep 07 2023 04:49:49 GMT+0800 (China Standard Time)

> Oh, sorry for the confusion. I thought of it as the theoretical performance, which would not make much sense to use, even though the same term is used consistently in your papers.

It's quite alright.

Just for the record, could you briefly explain where the miscommunication was? If we can improve our Performance document to reduce ambiguity for future readers, I'd be happy to integrate your feedback!

mert-kurttutan · Answer 4 · Thu Sep 07 2023 05:05:40 GMT+0800 (China Standard Time)

In the courses on gemm optimization, my lecturer used the term "theoretical performance" (I have also read other papers that use this terminology rather than "peak performance"). I guess it must have stuck with me.
Maybe you could also put "theoretical performance" in parentheses.
Also, "theoretical performance" could not be misinterpreted in the way that I just did (though it might just be me). But your usage seems to be standard/meaningful enough :)

Also, if available, the numerical values of the performance results would be nice (e.g. in the form of CSV files). Some people might want to test the current branch in case they have the same machine.

Field G. Van Zee · Answer 5 · Thu Sep 21 2023 06:30:45 GMT+0800 (China Standard Time)

> Also, if available, the numerical values of the performance results would be nice (e.g. in the form of CSV files). Some people might want to test the current branch in case they have the same machine.

This is great feedback. I'll try to make the raw results available the next time I update the performance figures.