rayon-rs / rayon

Rayon: A data parallelism library for Rust

Unable to parallelize properly using `par_iter` or `par_bridge`

arnabanimesh opened this issue · comments

I am using a complex recursion-based function. I want to run multiple instances of it across multiple cores and store the outputs in a vector of tuples. I have tested it on Windows 10 and 11, and the program is built with Rust 1.77.1.

The program starts by using multiple cores (50-60% CPU), but the usage gradually drops until only about 10% (roughly one core's worth of CPU) is being used. The task manager shows that all the cores are being used, but not at 100%, and the memory usage (2 GB) remains constant throughout the run. When I run the single-core version, one core reaches 100% and the memory usage is 100-200 MB. The single-core version is 2.5x faster than the multi-core version. I have tried both `par_iter` and `par_bridge`. I also tried playing around with the max stack size, but to no avail.
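Roughly, the parallel call has this shape (a minimal sketch with a hypothetical `solve` function and made-up tuple contents, not the proprietary code):

```rust
use rayon::prelude::*;

// Hypothetical stand-in for the proprietary recursive function; the real
// tuple contents and recursion are different.
fn solve(case: u64) -> (u64, Vec<u64>) {
    if case < 2 {
        (case, vec![case])
    } else {
        let (a, _) = solve(case - 1);
        let (b, _) = solve(case - 2);
        (a + b, vec![a, b])
    }
}

fn main() {
    let cases: Vec<u64> = (0..32).collect();

    // One instance of the recursive function per input, outputs collected
    // into a vector of tuples.
    let results: Vec<(u64, Vec<u64>)> = cases.par_iter().map(|&c| solve(c)).collect();
    println!("computed {} results", results.len());
}
```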

Is it possible for you to share your code? Or a similar reproducer? Otherwise, I can only guess at the cause...

The task manager shows that all the cores are being used, but not at 100%

With `par_bridge`, this could be the threads waiting on the Mutex around the sequential iterator input. If so, that means the iterator cannot produce items fast enough to keep the parallel part busy. Make sure you're doing as much as possible on the parallel side though -- prefer `iter.par_bridge().map(...)` over `iter.map(...).par_bridge()`, so the mapping work runs on the worker threads instead of inside the serialized iterator.
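For illustration, a minimal sketch of that ordering difference, with a hypothetical `expensive_work` function standing in for the real computation:

```rust
use rayon::prelude::*;

// Hypothetical expensive function standing in for the real work.
fn expensive_work(x: u64) -> u64 {
    (0..10_000u64).fold(x, |acc, i| acc.wrapping_mul(31).wrapping_add(i))
}

fn main() {
    // Slower pattern: the map runs inside the sequential iterator, so each
    // item's work is done while holding par_bridge's internal lock.
    let serialized: Vec<u64> = (0u64..10_000).map(expensive_work).par_bridge().collect();

    // Preferred pattern: par_bridge only hands out raw items; the expensive
    // map runs on rayon's worker threads. Note that par_bridge does not
    // preserve the input order in the collected output.
    let parallel: Vec<u64> = (0u64..10_000).par_bridge().map(expensive_work).collect();

    assert_eq!(serialized.len(), parallel.len());
}
```

The only difference between the two lines is where the mapping happens relative to `par_bridge`; in the first, every call to `expensive_work` is serialized behind the bridge's lock.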

With `par_iter`, rayon shouldn't be adding much blocking, but maybe you have a Mutex or similar synchronization of your own that's blocking progress?
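As a purely illustrative example (not taken from the reported code), this is the kind of user-side locking that can serialize an otherwise parallel `par_iter`:

```rust
use rayon::prelude::*;
use std::sync::Mutex;

fn main() {
    let inputs: Vec<u64> = (0..1_000).collect();

    // Anti-pattern: pushing every result through a shared Mutex makes the
    // worker threads queue on the lock instead of computing.
    let shared = Mutex::new(Vec::new());
    inputs.par_iter().for_each(|&x| {
        let result = x * x; // stand-in for the real computation
        shared.lock().unwrap().push(result);
    });

    // Lock-free alternative: let rayon collect the results directly.
    let results: Vec<u64> = inputs.par_iter().map(|&x| x * x).collect();
    assert_eq!(results.len(), shared.into_inner().unwrap().len());
}
```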

I don't have a Mutex or anything. The code is fairly simple; all the complexity is within the single-threaded recursive function. Even the cache for storing results of recursive calls is implemented inside the recursive function, so even that is not shared.

I can't share the original code as it is proprietary, but I'll try to create a minimal reproducible example where `par_iter` doesn't work by this weekend.

Is it possible you're running this in WSL? I recently tried running some code in WSL and it slowed down significantly, seemingly because of syscall overhead.

No. I am running on Windows only.

The most likely explanation is a severe load imbalance: a few function calls (matching the ~10% of cores in use) take very long.

Each function call takes ~0.2 seconds on average (measured with the single-core Rust app). The worst case is 1-2 seconds.

Currently I am running it on multiple cores using Python's multiprocessing module with the C++ code packaged via pybind11, and it is able to hit 100% on all cores (tested on 16-core and 20-core Intel systems).

I tried using tokio instead of rayon, but I hit the same problem there. I thought that maybe the antivirus was causing the issue, but I ran it inside Windows Sandbox and still saw the same problem.

My personal laptop has 8 cores, and the multi-core rayon-based Rust app sustains 30-40% CPU there, though it fluctuates a lot.

The most likely explanation is a severe load imbalance: a few function calls (matching the ~10% of cores in use) take very long.

That still doesn't explain the significant slowdown compared to the single-core version.

I rewrote the existing C++ code in Rust without giving it much thought. I finally got the time to investigate. The cache was implemented as a four-level nested vector whose elements were a tuple of an integer and a vector of structs; basically, the index at each level corresponded to one of the function arguments. I think it was bottlenecking on memory bandwidth. Enough memory was allocated early on, so repeated allocations were not required. After I removed the cache, the program used all CPU cores. I then rewrote the caching logic using a fast hashmap at a global level (kind of like Redis), and now it is performing as it should.
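The new cache looks roughly like the following (a minimal sketch; I'm assuming a concurrent map such as the `dashmap` crate here, and the key/value types and `compute` function are placeholders, not the real ones):

```rust
use std::sync::OnceLock;

use dashmap::DashMap;

// Placeholder key/value types: one key component per function argument,
// and a value shaped like the cached result.
type Key = (u32, u32, u32, u32);
type Value = (i64, Vec<i64>);

// Process-wide cache, initialized lazily on first use.
fn cache() -> &'static DashMap<Key, Value> {
    static CACHE: OnceLock<DashMap<Key, Value>> = OnceLock::new();
    CACHE.get_or_init(DashMap::new)
}

// Placeholder for the real recursive computation.
fn compute(key: Key) -> Value {
    (i64::from(key.0 + key.1), vec![i64::from(key.2), i64::from(key.3)])
}

// Check the cache first; compute and insert on a miss. Two threads missing
// on the same key may compute it twice, which is harmless here.
fn cached_compute(key: Key) -> Value {
    if let Some(hit) = cache().get(&key) {
        return hit.value().clone();
    }
    let value = compute(key);
    cache().insert(key, value.clone());
    value
}

fn main() {
    let a = cached_compute((1, 2, 3, 4));
    let b = cached_compute((1, 2, 3, 4)); // served from the cache
    assert_eq!(a, b);
}
```

A sharded concurrent map like this lets the rayon workers read and write the cache without funneling through a single global lock or walking a deeply nested vector.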

TL;DR: Caching with a deeply nested vector was hurting performance. I replaced it with a hashmap-based key-value store, and now it is performing as it should.

I think this matter is done and dusted. Hence closing the issue.