sum() does not parallelize well if it only produces 1 result

Question

marty1885 opened this issue 5 years ago

The current sum() implementation in both the CPU and OpenCL backends parallelizes based on how many results it needs to generate. So it uses only 1 thread when we sum an entire tensor into a single number, which is slow and inefficient.

Martin Chang · Answer 1 · Mon Dec 09 2019 10:20:28 GMT+0800 (China Standard Time)

Hand-wavy analysis

I have come up with a reasonable parallelization strategy. For CPU: if there are more than num_threads/2 results to generate, we assign 1 thread per result. Otherwise we assign num_threads/num_result threads to each result.
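A minimal sketch of the CPU rule above, in case it helps. The function name `threads_per_result` is made up for illustration and is not part of the library; the zero-input guard is my own assumption.

```cpp
#include <cstddef>

// Hypothetical helper (not the library's API): how many threads to spend
// on each result, following the heuristic described above.
std::size_t threads_per_result(std::size_t num_threads, std::size_t num_result)
{
    // Assumption: nothing to schedule when either count is zero.
    if (num_threads == 0 || num_result == 0)
        return 0;

    // Many results: one thread per result is enough to keep the cores busy.
    if (num_result > num_threads / 2)
        return 1;

    // Few results: split the available threads evenly among them.
    return num_threads / num_result;
}
```

With 8 threads, 1 result gets all 8 threads, 4 results get 2 threads each, and 5 or more results get 1 thread each.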

For GPU: if num_result > num_compute_unit, we assign 1 processing element per result. Otherwise we assign one compute unit per result.
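The GPU rule can be sketched as a launch-size calculation. This assumes the usual OpenCL mapping where one work-item runs on one processing element and one work-group runs on one compute unit. `LaunchConfig`, `plan_sum_launch` and the `work_group_size` parameter are hypothetical names for illustration, not the backend's actual code.

```cpp
#include <cstddef>

// Hypothetical launch description: total work-items and work-items per group.
struct LaunchConfig {
    std::size_t global_size;
    std::size_t local_size;
};

// Hypothetical helper implementing the heuristic described above.
// Assumes work_group_size > 0.
LaunchConfig plan_sum_launch(std::size_t num_result,
                             std::size_t num_compute_unit,
                             std::size_t work_group_size)
{
    if (num_result > num_compute_unit) {
        // One work-item (processing element) per result. Round the global
        // size up to a multiple of the work-group size; the extra
        // work-items would simply do nothing in the kernel.
        std::size_t groups = (num_result + work_group_size - 1) / work_group_size;
        return {groups * work_group_size, work_group_size};
    }
    // One work-group (compute unit) per result: every work-item in the
    // group cooperates on the same result.
    return {num_result * work_group_size, work_group_size};
}
```

For example, with 16 compute units and a work-group size of 64, summing to a single result launches one group of 64 work-items, while 100 results launch 128 work-items in total.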

Martin Chang · Answer 2 · Mon Dec 09 2019 12:52:31 GMT+0800 (China Standard Time)

Since we can't access the number of threads in TBB, I made it so that if we are generating only 1 result, all the cores are used for it.
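For reference, here is a rough sketch of the single-result case. It uses plain std::thread instead of TBB so that it stays self-contained, so it is not the actual change. `parallel_sum_all` is a hypothetical name, and the chunking scheme is my own assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sum the whole buffer into one value using all cores.
long long parallel_sum_all(const std::vector<int>& data)
{
    // hardware_concurrency() may return 0 when unknown, so fall back to 1.
    std::size_t num_threads =
        std::max<std::size_t>(1, std::thread::hardware_concurrency());
    // Never start more threads than there are elements.
    num_threads = std::min(num_threads, std::max<std::size_t>(1, data.size()));

    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> workers;

    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t first = std::min(data.size(), t * chunk);
        const std::size_t last = std::min(data.size(), first + chunk);
        // Each thread sums its own slice into its own slot, so no locking is needed.
        workers.emplace_back([&data, &partial, t, first, last] {
            partial[t] = std::accumulate(
                data.begin() + static_cast<std::ptrdiff_t>(first),
                data.begin() + static_cast<std::ptrdiff_t>(last), 0LL);
        });
    }
    for (auto& w : workers)
        w.join();

    // Combine the per-thread partial sums on the calling thread.
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Each thread reduces a contiguous slice and the partial sums are combined at the end, which is the same two-stage shape a TBB-based reduction would take.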