pangeo-forge / pangeo-forge-recipes

Python library for building Pangeo Forge recipes.

Home Page: https://pangeo-forge.readthedocs.io/

Improve resource utilization/efficiency of file caching

jbusecke opened this issue:

Nothing super specific here, but wanted to brain dump and get a broader discussion going.

As part of my CMIP work, my recipes often download many files from servers that are sometimes slow. The caching step takes a very long time and frequently scales up to many workers, which increases cost.

Looking at the Dataflow resource metrics (screenshot not preserved), it seems like there is one worker spun up per file? There is a spike in CPU usage initially, but then the worker mostly sits idle.

Can we maybe modify the level of concurrency here and have one worker download/cache multiple files via threads to improve performance and/or save costs?

Perhaps something to chat about on Thursday, @ranchodeluxe @moradology?

> Can we maybe modify the level of concurrency here and have one worker download/cache multiple files via threads to improve performance and/or save costs?

Yes! I think this may do what you want:

https://github.com/google/xarray-beam/blob/main/xarray_beam/_src/threadmap.py
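To make the idea concrete, here is a minimal sketch of per-worker threading using only the Python standard library. It is not the xarray-beam code linked above and not how pangeo-forge-recipes currently caches files. `cache_file`, `cache_many`, and the example URLs are hypothetical stand-ins, and the `time.sleep` call only simulates waiting on a slow server.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def cache_file(url: str) -> str:
    """Hypothetical stand-in for a real download-and-cache call."""
    time.sleep(0.1)  # simulate an I/O-bound wait on a slow server
    return f"cached:{url}"


def cache_many(urls, num_threads=8):
    """Cache several files from one worker by overlapping the waits in threads."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(cache_file, urls))


# Hypothetical URLs, for illustration only.
urls = [f"https://example.org/file_{i}.nc" for i in range(8)]

start = time.perf_counter()
results = cache_many(urls)
elapsed = time.perf_counter() - start

print(f"cached {len(results)} files in {elapsed:.2f}s")
```

Because the work is mostly waiting on the network, the threads overlap that waiting instead of leaving the worker idle. In a real pipeline each worker would need to receive a batch of files rather than a single one for this to help, and a sensible thread count would depend on the server and on worker memory.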