etaler / Etaler

A flexible HTM (Hierarchical Temporal Memory) framework with full GPU support.

sum() does not parallelize well if it only produces 1 result

marty1885 opened this issue

The current sum() implementation in both the CPU and OpenCL backends parallelizes based on how many results it needs to generate. So it uses only 1 thread when we sum the entire tensor into a single number, which is slow and leaves the rest of the hardware idle.

Hand wavy analysis

I have come up with a reasonable parallelization strategy. For CPU: if there are more than num_threads/2 results to generate, we assign 1 thread per result. Otherwise we assign num_threads/num_results threads to each result.
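The CPU rule above can be written down as a small helper. This is only a sketch of the heuristic as described in this issue: `threads_per_result`, `num_threads`, and `num_results` are illustrative names, not part of Etaler's API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not Etaler API): how many threads cooperate on each result.
std::size_t threads_per_result(std::size_t num_threads, std::size_t num_results)
{
    assert(num_threads > 0 && num_results > 0);
    // Many results: one thread per result already keeps the cores busy.
    if (num_results > num_threads / 2)
        return 1;
    // Few results: split the available threads evenly among them.
    return std::max<std::size_t>(1, num_threads / num_results);
}
```

With 8 threads, a full reduction to 1 result gets all 8 threads, 4 results get 2 threads each, and 5 or more results fall back to 1 thread per result.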

For GPU: if num_results > num_compute_units, we assign 1 processing element per result. Otherwise we assign a whole compute unit per result.
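The GPU rule reduces to a single comparison. The sketch below only encodes that decision; the enum and function names are hypothetical, and the mapping to OpenCL terms (a work-item runs on a processing element, a work-group runs on a compute unit) is how I would expect a kernel launch to use the result, not something the current code does.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names (not Etaler API): the two mappings described above.
enum class GpuMapping {
    ProcessingElementPerResult, // one work-item computes one result
    ComputeUnitPerResult        // one work-group cooperates on one result
};

GpuMapping choose_gpu_mapping(std::size_t num_results, std::size_t num_compute_units)
{
    assert(num_results > 0 && num_compute_units > 0);
    // Enough results to occupy every compute unit: keep the simple 1:1 mapping.
    if (num_results > num_compute_units)
        return GpuMapping::ProcessingElementPerResult;
    // Too few results: give each one a whole compute unit to reduce in parallel.
    return GpuMapping::ComputeUnitPerResult;
}
```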

Since we can't access the number of threads on TBB, I simplified the CPU rule: if we are generating only 1 result, all the cores work on it.
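For the single-result case, the work split could look roughly like the two-stage reduction below. It is a plain C++ sketch, not the TBB code in the backend: `chunked_sum` and `num_chunks` are made-up names, and the chunk loop is run serially here only to show the decomposition. Each chunk reads a disjoint range, so in a real backend each iteration could go to a different core.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical sketch (not Etaler API): sum `data` as `num_chunks` independent
// partial sums, then combine the partial sums into the single result.
float chunked_sum(const std::vector<float>& data, std::size_t num_chunks)
{
    assert(num_chunks > 0);
    std::vector<float> partial(num_chunks, 0.0f);
    const std::size_t n = data.size();
    for (std::size_t c = 0; c < num_chunks; ++c) {
        // Chunk c covers [begin, end); chunks are disjoint and together cover all of data.
        const std::size_t begin = n * c / num_chunks;
        const std::size_t end = n * (c + 1) / num_chunks;
        partial[c] = std::accumulate(data.begin() + begin, data.begin() + end, 0.0f);
    }
    // Second stage: a short serial reduction over the per-chunk partial sums.
    return std::accumulate(partial.begin(), partial.end(), 0.0f);
}
```

Because floating-point addition is not associative, a chunked sum can differ slightly from a strictly left-to-right sum on general data.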