karpathy / llama2.c

Inference Llama 2 in one file of pure C

runomp on Mac M1 Max is slower than runfast

tairov opened this issue · comments

Recently I did extensive benchmarks of llama2.c ports.
I found that the C version in runfast mode (single-threaded) runs faster than the runomp build (multi-threaded):

make runomp CC=/opt/homebrew/opt/llvm/bin/clang; OMP_NUM_THREADS=5 ./run ../models/stories15M.bin -t 0.0 -n 256
...
achieved tok/s: 529.976019

VS

make runfast; ./run ../models/stories15M.bin -t 0.0 -n 256
...
achieved tok/s: 657.738095
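
For context, both targets compile with the same optimization level; runomp only adds OpenMP and -march=native. The Makefile targets look roughly like this (paraphrased from memory, check the repo for the exact flags):

```make
# runfast: single-threaded, aggressive optimizations (-Ofast implies -ffast-math)
runfast: run.c
	$(CC) -Ofast -o run run.c -lm

# runomp: same optimizations plus OpenMP for the multi-threaded matmuls
runomp: run.c
	$(CC) -Ofast -fopenmp -march=native run.c -lm -o run
```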

Does anyone have insights into why this might be happening?

full benchmark report

I recently incorporated multithreading into my Zig port of this project and made some relevant findings. Essentially, the overhead associated with initializing and terminating multiple threads per matrix-vector multiplication can compromise efficiency with smaller 'tinystory' models.

Specifically, with a single M1 Pro performance core, I am able to achieve up to 724.59 tok/s on the 15M model. However, employing 5 threads for the multiplications drops the performance down to 225.971 tok/s. Although the OMP implementation is likely to be more sophisticated, and possibly reuses threads, it appears that it faces similar challenges.

In comparison, applying multithreading on the Llama 2 7B model nearly doubles performance, as the vectors in this case are significantly larger. Consequently, the overhead of thread spawning becomes negligible.
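
For reference, the hot path here is the matrix-vector multiply. In run.c it is parallelized roughly like this (a sketch, not the verbatim source); even though the OpenMP runtime typically keeps a thread pool, every call still pays a fork/join synchronization cost, and on the small models that cost can exceed the actual work:

```c
// Sketch of a run.c-style matmul: W (d,n) @ x (n,) -> xout (d,).
// The parallel region is entered once per call, so for tiny d and n the
// per-call fork/join overhead can outweigh the multiply-accumulate work.
void matmul(float* xout, const float* x, const float* w, int n, int d) {
    int i;
    #pragma omp parallel for private(i)
    for (i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += w[i * n + j] * x[j];
        }
        xout[i] = val;
    }
}
```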

Hi @clebert , thanks for your comment.

Do you mean 724 tok/s was achieved in single-threaded mode? If so, that looks amazing!
I was assuming it somehow spins up threads in the background.

Yes, in single-threaded mode. But that was my best measured run; normally it fluctuates between 680 and 700 tokens per second. I don't know why the variance is so big.

@clebert do you know which Zig features contributed the most to the overall performance? Alignment, SIMD?

According to the extensive benchmark, other llama2 implementations fluctuate as well. That's why it's better to run them in multiple rounds.

The use of @Vector (SIMD) had the biggest effect; without SIMD you couldn't get anywhere near these results. Aligning the vectors to the cache line, on the other hand, did not have the effect I had hoped for: if there was any, it was hardly measurable, though I only measured manually and not very systematically.

I forgot to mention one important optimization: @setFloatMode(.Optimized)

It has about the same effect as setting -ffast-math in the C version.
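
To illustrate in C terms why that flag matters (a simplified sketch, not code from run.c): strict IEEE semantics force the dot-product reduction into one serial dependency chain, while fast-math lets the compiler reassociate it into independent accumulators that map onto SIMD lanes.

```c
// Without -ffast-math the compiler must preserve this exact left-to-right
// summation order, which creates a single serial dependency chain.
float dot_strict(const float* a, const float* b, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

// Roughly the transformation that -ffast-math (or Zig's .Optimized float
// mode) permits the compiler to do automatically: several independent
// accumulators that vectorize well. Assumes n is a multiple of 4 for brevity.
float dot_reassociated(const float* a, const float* b, int n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```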

@tairov I have conducted extensive benchmarks with my improved Zig implementation, using an Apple M2 Pro equipped with 12 cores and an Apple M1 Pro equipped with 10 cores.

Benchmark Results

In summary:

The 15M model is fastest in single-threaded mode, while the 42M and 110M models are both fastest with 7 extra threads on the M2 Pro and with 5 extra threads on the M1 Pro.

I noticed that in your benchmarks you seem to have opted for 5 threads on an Apple M1 Max, which, from a CPU perspective, is identical to the M1 Pro 8/2. Have you found that 5 threads also gives the best performance with the other implementations, such as C++ and Mojo?

Hey @clebert, I really appreciate you taking the time to improve llama2.zig; I think the ziglang community & maintainers might get valuable insights from it.

Just curious, what does workers = 0 mean? 😄

Yes, I've gotten the best results with 5 threads for the cpp, mojo & c implementations.
Frankly speaking, I didn't use multiple rounds to determine the best thread count; I just ran inference a few times in the CLI.

And thank you for sharing results for different worker counts. I believe it will help me improve my benchmarking methodology as well. Now I'll try reproducing comparisons among the leading llama2 implementations while varying the thread count.

> Hey @clebert, I really appreciate you taking the time to improve llama2.zig; I think the ziglang community & maintainers might get valuable insights from it.

Thank you 👍🏻

> Just curious, what does workers = 0 mean? 😄

If the number of workers is set to zero, all computations are performed single-threaded in the main thread. Starting with one worker, the matrix-vector multiplication is distributed across additional threads: the output rows are divided into chunks, one per worker. The performance with a single worker (one extra thread) is therefore expected to be the worst, because there is no gain, only the additional synchronization overhead.
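
In C terms, the scheme looks roughly like this (an illustrative pthread sketch; the actual implementation is in Zig and manages its threads differently):

```c
#include <pthread.h>

// Rows of the output are split into one chunk per worker; workers == 0
// means everything runs in the main thread with no synchronization at all.
typedef struct {
    float* xout;
    const float *x, *w;
    int n, row_begin, row_end;
} Chunk;

static void* run_chunk(void* arg) {
    Chunk* c = (Chunk*)arg;
    for (int i = c->row_begin; i < c->row_end; i++) {
        float val = 0.0f;
        for (int j = 0; j < c->n; j++) val += c->w[i * c->n + j] * c->x[j];
        c->xout[i] = val;
    }
    return NULL;
}

void matmul_workers(float* xout, const float* x, const float* w,
                    int n, int d, int workers) {
    if (workers == 0) {  // single-threaded fast path: no spawn, no join
        Chunk c = { xout, x, w, n, 0, d };
        run_chunk(&c);
        return;
    }
    pthread_t threads[workers];
    Chunk chunks[workers];
    int rows = (d + workers - 1) / workers;  // rows per chunk
    for (int t = 0; t < workers; t++) {
        int begin = t * rows;
        int end = begin + rows > d ? d : begin + rows;
        chunks[t] = (Chunk){ xout, x, w, n, begin, end };
        // spawning threads per matmul call is exactly the overhead discussed above
        pthread_create(&threads[t], NULL, run_chunk, &chunks[t]);
    }
    for (int t = 0; t < workers; t++) pthread_join(threads[t], NULL);
}
```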

> Yes, I've gotten the best results with 5 threads for the cpp, mojo & c implementations. Frankly speaking, I didn't use multiple rounds to determine the best thread count; I just ran inference a few times in the CLI.

As my tests have shown, with both the 42M and the 110M model, the version with 5 additional threads (aka workers) is the fastest on the M1. I'm curious whether it is the same for Mojo; then this would seem to be the sweet spot, although the type of parallelization is certainly completely different...