rayon-rs / rayon

Rayon: A data parallelism library for Rust

rayon works very slowly after updating to macOS Sonoma 14.4

xiyu1984 opened this issue · comments

commented

My program is highly parallel, and I use rayon v1.9.0 for data-parallel processing.
It worked well on earlier versions of macOS Sonoma, but after I updated to Sonoma 14.4, everything slowed down.
The underlying scheduling of parallel work seems to have changed in Sonoma 14.4.

This may not be a rayon problem. Has anyone else run into this?

commented

I found that the cause may be contention for CPU resources between rayon's work-stealing strategy and the system.

Again, note that everything worked well on versions of macOS before Sonoma 14.4.

When num_threads(...) is not set manually, the default num_threads is 14 (m3 max 14 + 16). In this case, the system "robs" the CPU resources back, and the "robbing" itself is costly.

Then I limited the num_threads to 4 as follows:

rayon::ThreadPoolBuilder::new().num_threads(4).build_global().unwrap();

The system still "robs" CPU time, but the user process gets to use these 4 threads most of the time. In my program's case, performance improves, although it is still much slower than before because the CPU cores are not fully used. This is only a temporary workaround.

That's worrisome, but I'm afraid I don't have any Apple hardware to test this myself. Hopefully others in the community can share their experience and help debug what's going on.

One small tip -- if you haven't set num_threads manually, the RAYON_NUM_THREADS environment variable will also override the default setting.
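To make the precedence concrete, here is a minimal std-only sketch of how the thread count ends up being chosen, based on rayon's documented behavior: an explicit non-zero num_threads wins, then a positive RAYON_NUM_THREADS, then the number of logical CPUs. The helpers `resolve_num_threads` and `default_for_this_machine` are hypothetical names for illustration, not part of rayon's API, and this is not rayon's actual implementation.

```rust
use std::thread;

/// Hypothetical helper (not a rayon API) that mirrors the documented precedence:
/// explicit num_threads > RAYON_NUM_THREADS > number of logical CPUs.
/// `explicit == 0` means "num_threads was not set".
fn resolve_num_threads(explicit: usize, env_value: Option<&str>, logical_cpus: usize) -> usize {
    if explicit > 0 {
        return explicit;
    }
    // An unset, unparsable, or zero value falls through to the CPU count.
    match env_value.and_then(|v| v.parse::<usize>().ok()) {
        Some(n) if n > 0 => n,
        _ => logical_cpus,
    }
}

/// What the default would be on this machine when num_threads is not set.
fn default_for_this_machine() -> usize {
    let cpus = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let env = std::env::var("RAYON_NUM_THREADS").ok();
    resolve_num_threads(0, env.as_deref(), cpus)
}
```

In practice this means a run such as `RAYON_NUM_THREADS=4 cargo run --release` should cap the global pool at 4 threads without any code change, as long as the program does not call num_threads itself.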

commented

I think the problem may not be entirely related to rayon.

In my experience so far, the number of threads needs to be kept below the number of cores. The details are as follows:

  • if I limit num_threads to 4, the parallel work runs stably.
  • if I limit num_threads to 8, the parallel work sometimes runs stably, but there's a chance of being "robbed".
  • if I limit num_threads to 12, the parallel work sometimes runs stably, but there's a higher chance of being "robbed".

Maybe the larger the num_threads, the higher the probability of being "robbed".

And this is how the system "robbed" the CPU resources:
[image]

Maybe this is why num_threads of 4 works.

I think you'll need to figure out what that System time is actually doing, because that looks pathological. Does Xcode profiling or anything like that reveal System details?

commented

I think you'll need to figure out what that System time is actually doing, because that looks pathological. Does Xcode profiling or anything like that reveal System details?

So far I have only checked Activity Monitor, and as you said, what it shows looks pathological and contradictory.
The picture shows the system taking the CPU resources away, yet the per-process view shows my process using the most CPU. But I'm sure my process is slowed down, so that CPU time is not being spent on its computation.

Anyway, I will look into this more deeply soon, following your suggestion.

commented

Things are clearer.

I did some deeper profiling and found that with higher parallelism my process needs more memory, which triggers security checks in the kernel, and those checks are costly.

This might be confirmed by https://appleinsider.com/articles/24/03/21/apple-silicon-vulnerability-leaks-encryption-keys-and-cant-be-patched-easily

Luckily, rayon still works well.