Unbalanced inner loops
yuki-koyama opened this issue · comments
In the `parallel_for` function, the current algorithm for assigning inner loops to threads is not well designed and can produce unbalanced assignments. For example, consider the following case:
- 1050 loops
- 100 threads
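Threads no. 1 to no. 99 will each be assigned 10 inner loops (1050 / 100, rounded down), but the last thread, no. 100, will be assigned the remaining 1050 - 99 × 10 = 60 inner loops. The last thread then becomes the bottleneck, since it has six times as much work as any other thread.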
This should be fixed somehow.
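
One possible direction, as a minimal sketch rather than a proposal for the actual implementation: hand out the remainder one loop at a time instead of giving it all to the last thread. The helper name `balanced_ranges` and its signature are hypothetical and not part of the library. It only computes the index ranges and does not spawn any threads.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not part of the library): splits the index range [0, n)
// into num_threads contiguous half-open ranges [begin, end) whose sizes differ
// by at most one.
std::vector<std::pair<std::size_t, std::size_t>> balanced_ranges(std::size_t n, std::size_t num_threads)
{
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    if (num_threads == 0)
    {
        return ranges; // Nothing to assign to
    }
    ranges.reserve(num_threads);

    const std::size_t base  = n / num_threads; // Every thread gets at least this many loops
    const std::size_t extra = n % num_threads; // The first `extra` threads get one more

    std::size_t begin = 0;
    for (std::size_t t = 0; t < num_threads; ++t)
    {
        const std::size_t size = base + (t < extra ? 1 : 0);
        ranges.emplace_back(begin, begin + size);
        begin += size;
    }
    return ranges;
}
```

With 1050 loops and 100 threads, this gives 11 loops to each of the first 50 threads and 10 loops to each of the remaining 50, so the largest assignment drops from 60 to 11. If the loop bodies have very uneven costs, dynamic scheduling (e.g., threads pulling indices from a shared atomic counter) may be a better fit, but that is a separate design choice.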