HSF / PyHEP.dev-workshops

PyHEP Developer workshops

Home Page: https://indico.cern.ch/e/PyHEP2023.dev


sharded histogram filling with dask-histogram

lgray opened this issue

Right now dask-histogram creates a complete copy of the histogram-to-be-filled on every worker that fills it.
This means we are presently limited in histogram size by the memory available to an individual dask worker.
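For context, a typical dask-histogram fill looks roughly like the sketch below (the exact `factory` signature may differ between versions); each partition is filled into a full-size copy of the histogram before the tree reduction, which is where the per-worker memory limit comes from.

```python
# Rough sketch of the current dask-histogram fill pattern; API details
# (e.g. the factory signature) may vary between dask-histogram versions.
import boost_histogram as bh
import dask.array as da
import dask_histogram as dh

x = da.random.uniform(0, 1, size=(100_000_000,), chunks=(10_000_000,))

# Reference histogram defining axes and storage; every worker that fills a
# partition materializes a complete copy of this bin structure in memory.
href = bh.Histogram(bh.axis.Regular(1000, 0, 1))

lazy_hist = dh.factory(x, histref=href)  # lazy per-partition fills + reduction
result = lazy_hist.compute()             # concrete boost_histogram.Histogram
```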

Within the Wmass analysis on CMS, boost-histogram has been extended (with little pain) to allow both multi-threaded fills and vector fills within the context of RDataFrame. Sequestering this functionality in a single CMS analysis, given the amount of data that will need to be processed for the HL-LHC, is strategically ill-advised. There is already an example for int64 bins in boost-histogram on the multi-threading side of this (see the sketch below).
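For reference, the existing multi-threading support in boost-histogram pairs the atomic int64 storage with the `threads=` argument to `fill`; something like the following sketch (this is the upstream boost-histogram feature, not the Wmass extension itself):

```python
# Sketch of the existing multi-threaded fill in boost-histogram:
# AtomicInt64 storage lets several threads fill the same bins safely,
# and fill(..., threads=N) splits the input across N threads.
import numpy as np
import boost_histogram as bh

h = bh.Histogram(
    bh.axis.Regular(1000, 0.0, 1.0),
    storage=bh.storage.AtomicInt64(),
)

data = np.random.uniform(0.0, 1.0, size=10_000_000)
h.fill(data, threads=4)  # threaded fill; atomic storage avoids lost updates
```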

We should find folks interested in following up @bendavid's work and propagating it up through boost-histogram and hist.

However, this suggests yet another direction, very much oriented towards HPC-like distributed computing, where we scatter the bins of the histogram across the memory of all workers in a dask-like cluster. This pushes the histogram-size limit to its logical extremum: nearly the total aggregate memory of the whole dask cluster.

The aim would be to discuss how to achieve this. I personally think this is important, since it means you can build multi-gigabyte histograms on worker nodes provisioned with 2 GB / 1 core, and it stands to dramatically flatten the advantage of the "big machine" analysis workflow.
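One way to picture the sharding (purely illustrative; none of the names below exist in dask-histogram): each worker owns a contiguous slice of the flattened bin array, a fill computes global bin indices locally, and only the (bin index, count) pairs belonging to a given shard are shipped to the worker that owns it.

```python
# Purely illustrative sketch of bin-sharded filling; this is not existing
# dask-histogram API. Each of `n_shards` workers would own one contiguous
# slice of the flattened bin array.
import numpy as np

edges = np.linspace(0.0, 1.0, 1_000_001)  # 1e6 bins, more than one worker needs to own
n_shards = 16
shard_size = (len(edges) - 1 + n_shards - 1) // n_shards

def route_fill(values: np.ndarray) -> dict[int, tuple[np.ndarray, np.ndarray]]:
    """Map a local chunk of data to per-shard (local bin index, count) updates."""
    global_bins = np.searchsorted(edges, values, side="right") - 1
    valid = (global_bins >= 0) & (global_bins < len(edges) - 1)
    global_bins = global_bins[valid]
    shard_ids = global_bins // shard_size
    updates = {}
    for shard in np.unique(shard_ids):
        local = global_bins[shard_ids == shard] - shard * shard_size
        idx, counts = np.unique(local, return_counts=True)
        updates[int(shard)] = (idx, counts)  # ship only this shard's updates
    return updates

# Each shard-owning worker would then apply its slice of updates in place:
#   shard_counts[idx] += counts
```

The point of routing (index, count) pairs rather than whole histograms is that the communication volume scales with the number of occupied bins per chunk, not with the total bin count.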

FYI: we solved this in our Dask cluster by using Dask's Actors (https://distributed.dask.org/en/stable/actors.html). We essentially offloaded our histograms from the Dask workers onto these Actors, so only the few workers that own the Actors need unrestricted RAM, while all other Dask workers in the cluster do not have to hold any histograms in memory. This is different from a sharded histogram, but it solves the same underlying issue.
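A minimal version of that Actor pattern, following the dask.distributed Actors documentation linked above (the `HistogramActor` class and its method names are illustrative, not an existing API):

```python
# Minimal sketch of the Actor approach described above, based on the
# dask.distributed Actors docs; HistogramActor itself is illustrative.
import boost_histogram as bh
import numpy as np
from dask.distributed import Client

class HistogramActor:
    """Stateful object pinned to one worker; only that worker holds the bins."""
    def __init__(self):
        self.hist = bh.Histogram(bh.axis.Regular(1000, 0.0, 1.0))

    def fill(self, values):
        self.hist.fill(values)

    def get_hist(self):
        return self.hist

if __name__ == "__main__":
    client = Client()
    # actor=True creates the object on a single worker and returns a proxy.
    actor = client.submit(HistogramActor, actor=True).result()
    # Actor method calls return ActorFutures; .result() blocks until done.
    actor.fill(np.random.uniform(0.0, 1.0, size=1_000_000)).result()
    filled = actor.get_hist().result()
    print(filled.sum())
```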

+1