pangeo-forge / pangeo-forge-recipes

Python library for building Pangeo Forge recipes.

Home Page: https://pangeo-forge.readthedocs.io/

High-level problem case: Files with A LOT OF VARIABLES

jbusecke opened this issue

I have been working on refactoring the community bakery at LEAP (#735) and have one interesting problem case here: https://github.com/leap-stc/wavewatch3_feedstock (in particular, see the code in leap-stc/wavewatch3_feedstock#1).

This dataset is different from many others in at least two ways, AFAICT:

  • The files are extremely heavily compressed (3GB file, 17GB in memory; see the size check after this list)
  • A TON of variables!
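
For reference, a minimal sketch of how one can check that compression ratio with xarray (the file path is a placeholder):

```python
import os
import xarray as xr

path = "ww3.something.nc"  # placeholder for one source file
on_disk_gb = os.path.getsize(path) / 1e9
ds = xr.open_dataset(path)
in_memory_gb = ds.nbytes / 1e9  # uncompressed in-memory footprint
print(f"{on_disk_gb:.1f} GB on disk -> {in_memory_gb:.1f} GB in memory "
      f"({in_memory_gb / on_disk_gb:.0f}x compression)")
print(f"{len(ds.data_vars)} data variables")
```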

Together these blow up the memory. I have tested running the recipe with every variable but one dropped, and it works (it still consumes a lot of memory, but it succeeds).
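
Roughly what that test looked like, as a sketch (the path and the `keep` variable name are placeholders; `drop_variables` is a standard `xr.open_dataset` kwarg):

```python
import xarray as xr

path = "ww3.something.nc"  # placeholder for one source file
keep = "hs"                # placeholder: the one variable to keep

# Find every data variable except the one we want to keep...
all_vars = list(xr.open_dataset(path).data_vars)
drop = [v for v in all_vars if v != keep]

# ...then reopen with those variables dropped, so they are never
# even represented in the in-memory Dataset.
ds = xr.open_dataset(path, drop_variables=drop)
print(f"{ds.nbytes / 1e9:.1f} GB in memory with a single variable")
```

In the beam-style recipe the same list can presumably be forwarded through `OpenWithXarray`'s `xarray_open_kwargs`, though I have not verified that exact plumbing here.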

I think the root of the problem is that a fragment sized to a ~100MB chunk on a single variable is still extremely large once all variables are included (~2-3GB), so the workers eagerly load a bunch of these fragments and blow up.
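
Back-of-the-envelope version of that (the variable count here is an illustrative assumption):

```python
# One fragment spans a slice of the concat dim but carries ALL variables.
target_chunk_mb = 100   # per-variable target chunk size
n_vars = 25             # assumed count, for illustration only
fragment_gb = target_chunk_mb * n_vars / 1000
print(f"~{fragment_gb:.1f} GB uncompressed per fragment")  # ~2.5 GB
```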

I tried just throwing more RAM at the problem (800GB RAM was not enough!!!), but this dataset is very large in total, and I suspect I would eventually have to load the whole thing into memory, which really defeats the point of doing this.

My current suspicion is that for cases like this we might want to consider splitting fragments not only by dimension indices, but also across variables? I'm not at all sure how to achieve this, but I wanted to record it as an interesting failure case.
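
To make the idea concrete, here is a hand-rolled sketch of what "splitting across variables" could look like outside the recipe machinery (paths and chunking are placeholders, and this ignores the parallel region-writing details a real recipe handles):

```python
import xarray as xr

path = "ww3.something.nc"   # placeholder source file
store = "wavewatch3.zarr"   # placeholder target store

ds = xr.open_dataset(path, chunks={"time": 24})  # lazy, dask-backed

for i, name in enumerate(ds.data_vars):
    # Each "fragment" now holds a single variable, so its uncompressed
    # size stays near the per-variable chunk target instead of n_vars times it.
    single = ds[[name]]
    # mode="w" creates the store on the first variable; mode="a" then
    # adds each subsequent variable to the same store.
    single.to_zarr(store, mode="w" if i == 0 else "a")
```

A real implementation inside pangeo-forge-recipes would presumably key fragments by (dimension index, variable) so that a worker never holds more than one variable's chunk at a time.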

"800GB RAM was not enough!!!" 😲