pangeo-forge / pangeo-forge-recipes

Python library for building Pangeo Forge recipes.

Home Page: https://pangeo-forge.readthedocs.io/


Passing preprocessor to MultiZarrToZarr raises PicklingError

chuckwondo opened this issue · comments

To possibly bypass the memory issue I reported in #614, I've attempted to use kerchunk instead, based upon the HDF Reference Recipe for CMIP6, but also to no avail:

import os
from tempfile import TemporaryDirectory

import apache_beam as beam
import pandas as pd
import xarray as xr
from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.transforms import (
    CombineReferences,
    OpenWithKerchunk,
    WriteCombinedReference,
)

url_pattern = (
    "https://dap.ceda.ac.uk/neodc/esacci/land_surface_temperature/data/"
    "MULTISENSOR_IRCDR/L3S/0.01/v2.00/monthly/{time:%Y}/{time:%m}/"
    "ESACCI-LST-L3S-LST-IRCDR_-0.01deg_1MONTHLY_DAY-{time:%Y%m}01000000-fv2.00.nc"
)
months = pd.date_range("1995-08", "2020-12", freq=pd.offsets.MonthBegin())
urls = tuple(url_pattern.format(time=month) for month in months)
pattern = pattern_from_file_sequence(urls, "time", nitems_per_file=1).prune(2)

temp_dir = TemporaryDirectory()
target_root = temp_dir.name
store_name = "ceda-monthly-daytime-lst"
target_store = os.path.join(target_root, store_name)
os.mkdir(target_store)

transforms = (
    beam.Create(pattern.items())
    | OpenWithKerchunk(file_type=pattern.file_type)
    | CombineReferences(
        # Same error, regardless of inclusion/exclusion of "length_scale"
        # concat_dims=["time"], identical_dims=["lat", "lon", "channel", "length_scale"]
        concat_dims=["time"], identical_dims=["lat", "lon", "channel"]
    )
    | WriteCombinedReference(target_root=target_root, store_name=store_name)
)

print(f"{pattern=}")
print(f"{target_store=} (exists: {os.path.isdir(target_store)})")
print(f"{transforms=}")

with beam.Pipeline() as p:
    p | transforms  # type: ignore[reportUnusedExpression]

with xr.open_zarr(target_store) as ds:
    print(ds)

Running this version results in the following error. I don't know if there's a way to eliminate the problematic length_scale coordinate that lst_unc_sys uses, and I also don't understand why I'm seeing the UserWarning about the time coord:

$ time python recipe-kerchunk.py
pattern=<FilePattern {'time': 2}>
target_store='/var/folders/v_/q9ql2x2n3dlg2td_b6xkcjzw0000gn/T/tmpdt_pu6g5/ceda-monthly-daytime-lst' (exists: True)
transforms=<_ChainedPTransform(PTransform) label=[Create|OpenWithKerchunk|CombineReferences|WriteCombinedReference] at 0x10e458510>
.../python3.11/site-packages/kerchunk/combine.py:269: UserWarning: Concatenated coordinate 'time' contains less than expectednumber of values across the datasets: [4.6008e+08]
...
.../python3.11/site-packages/kerchunk/combine.py", line 403, in second_pass
    raise ValueError(
ValueError: Found chunk size mismatch:
                        at prefix lst_unc_sys in iteration 1 (file None)
                        new chunk: [1, 1]
                        chunks so far: [1]

real    12m39.490s
user    0m53.177s
sys     0m15.788s

Originally posted by @chuckwondo in #614 (comment)

@chuckwondo Thanks for opening up a separate issue! I'll see if I can replicate your error and provide some help.

I found what I believe should be a valid means of removing the offending "lst_unc_sys" variable, but this leads to a PicklingError:

import os
from tempfile import TemporaryDirectory

import apache_beam as beam
import pandas as pd
import xarray as xr
from kerchunk.combine import drop  # import handy-dandy drop function
from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.transforms import (
    CombineReferences,
    OpenWithKerchunk,
    WriteCombinedReference,
)

url_pattern = (
    "https://dap.ceda.ac.uk/neodc/esacci/land_surface_temperature/data/"
    "MULTISENSOR_IRCDR/L3S/0.01/v2.00/monthly/{time:%Y}/{time:%m}/"
    "ESACCI-LST-L3S-LST-IRCDR_-0.01deg_1MONTHLY_DAY-{time:%Y%m}01000000-fv2.00.nc"
)
months = pd.date_range("1995-08", "2020-12", freq=pd.offsets.MonthBegin())
urls = tuple(url_pattern.format(time=month) for month in months)
pattern = pattern_from_file_sequence(urls, "time").prune(2)

temp_dir = TemporaryDirectory()
target_root = temp_dir.name
store_name = "ceda-monthly-daytime-lst"
target_store = os.path.join(target_root, store_name)
os.mkdir(target_store)

transforms = (
    beam.Create(pattern.items())
    | OpenWithKerchunk(file_type=pattern.file_type)
    | CombineReferences(
        concat_dims=["time"],
        identical_dims=["lat", "lon", "channel"],
        mzz_kwargs={"preprocess": drop("lst_unc_sys")},  # drop offending var
    )
    | WriteCombinedReference(target_root=target_root, store_name=store_name)
)

print(f"{pattern=}")
print(f"{pattern.file_type=}")
print(f"{target_store=} (exists: {os.path.isdir(target_store)})")
print(f"{transforms=}")

with beam.Pipeline() as p:
    p | transforms  # type: ignore[reportUnusedExpression]

with xr.open_zarr(target_store) as ds:
    print(ds)

Running the above produces the following:

$ time python recipe-kerchunk-1.py 
pattern=<FilePattern {'time': 2}>
pattern.file_type=<FileType.netcdf4: 'netcdf4'>
target_store='/var/folders/v_/q9ql2x2n3dlg2td_b6xkcjzw0000gn/T/tmp4bvimm78/ceda-monthly-daytime-lst' (exists: True)
transforms=<_ChainedPTransform(PTransform) label=[Create|OpenWithKerchunk|CombineReferences|WriteCombinedReference] at 0x167df1d90>
.../python3.11/site-packages/kerchunk/combine.py:269: UserWarning: Concatenated coordinate 'time' contains less than expectednumber of values across the datasets: [4.6008e+08]
  warnings.warn(
Traceback (most recent call last):
...
  File "apache_beam/coders/coder_impl.py", line 270, in apache_beam.coders.coder_impl.CallbackCoderImpl.encode_to_stream
  File ".../python3.11/site-packages/apache_beam/coders/coders.py", line 869, in <lambda>
    lambda x: dumps(x, protocol), pickle.loads)
              ^^^^^^^^^^^^^^^^^^
AttributeError: Can't pickle local object 'drop.<locals>.preproc'

real    22m11.543s
user    0m21.818s
sys     0m6.495s

Since the pickling error indicates that the issue concerns a nested function (drop.<locals>.preproc), I duplicated the (minimal) nested logic into my own top-level function in an attempt to allow pickling, but to my dismay this still produces a pickling error.
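
For context, the first failure is inherent to the standard-library pickle: it serializes functions by reference (module plus qualified name), so a function defined inside another function has no importable name and cannot be pickled. A minimal sketch, independent of Beam and kerchunk (the factory below only mirrors the shape of kerchunk's drop, it is not the real implementation):

```python
import pickle


def drop(prefix):
    # factory returning a closure, the same shape as kerchunk.combine.drop
    def preproc(refs):
        return {k: v for k, v in refs.items() if not k.startswith(prefix)}

    return preproc


fn = drop("lst_unc_sys")
assert fn({"lst_unc_sys/0.0": b"", "lst/0.0": b""}) == {"lst/0.0": b""}

# stdlib pickle stores functions by qualified name; a nested function has
# no importable name ('drop.<locals>.preproc'), so serialization fails
try:
    pickle.dumps(fn)
except (AttributeError, pickle.PicklingError) as err:
    print(f"pickling failed: {err}")
```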

However, I also enabled kerchunk debug logging in an attempt to gain a bit of insight. It seems to reveal that the function successfully drops the offending var from the first of the two pruned datasets, presumably because that happens in the main process, but the attempt to pickle the function to send it to a separate process for handling the second dataset fails.

import logging
import os
from tempfile import TemporaryDirectory

import apache_beam as beam
import pandas as pd
import xarray as xr
from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.transforms import (
    CombineReferences,
    OpenWithKerchunk,
    WriteCombinedReference,
)

logging.getLogger("kerchunk").addHandler(logging.StreamHandler())
logging.getLogger("kerchunk").setLevel(logging.DEBUG)

url_pattern = (
    "https://dap.ceda.ac.uk/neodc/esacci/land_surface_temperature/data/"
    "MULTISENSOR_IRCDR/L3S/0.01/v2.00/monthly/{time:%Y}/{time:%m}/"
    "ESACCI-LST-L3S-LST-IRCDR_-0.01deg_1MONTHLY_DAY-{time:%Y%m}01000000-fv2.00.nc"
)
months = pd.date_range("1995-08", "2020-12", freq=pd.offsets.MonthBegin())
urls = tuple(url_pattern.format(time=month) for month in months)
pattern = pattern_from_file_sequence(urls, "time").prune(2)

temp_dir = TemporaryDirectory()
target_root = temp_dir.name
store_name = "ceda-monthly-daytime-lst"
target_store = os.path.join(target_root, store_name)
os.mkdir(target_store)


def drop_lst_unc_sys(refs):
    for k in list(refs):
        if k.startswith("lst_unc_sys"):
            refs.pop(k)
    return refs


transforms = (
    beam.Create(pattern.items())
    | OpenWithKerchunk(file_type=pattern.file_type)
    | CombineReferences(
        concat_dims=["time"],
        identical_dims=["lat", "lon", "channel"],
        mzz_kwargs={"preprocess": drop_lst_unc_sys},  # drop offending var
    )
    | WriteCombinedReference(target_root=target_root, store_name=store_name)
)

print(f"{pattern=}")
print(f"{pattern.file_type=}")
print(list(pattern.items()))
print(f"{target_store=} (exists: {os.path.isdir(target_store)})")
print(f"{transforms=}")

with beam.Pipeline() as p:
    p | transforms  # type: ignore[reportUnusedExpression]

with xr.open_zarr(target_store) as ds:
    print(ds)

Here is the output from running this modified version, with kerchunk debug logging:

$ time python recipe-kerchunk-2.py 
pattern=<FilePattern {'time': 2}>
pattern.file_type=<FileType.netcdf4: 'netcdf4'>
[({Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=0, indexed=False)}, 'https://dap.ceda.ac.uk/neodc/esacci/land_surface_temperature/data/MULTISENSOR_IRCDR/L3S/0.01/v2.00/monthly/1995/08/ESACCI-LST-L3S-LST-IRCDR_-0.01deg_1MONTHLY_DAY-19950801000000-fv2.00.nc'), ({Dimension(name='time', operation=<CombineOp.CONCAT: 2>): Position(value=1, indexed=False)}, 'https://dap.ceda.ac.uk/neodc/esacci/land_surface_temperature/data/MULTISENSOR_IRCDR/L3S/0.01/v2.00/monthly/1995/09/ESACCI-LST-L3S-LST-IRCDR_-0.01deg_1MONTHLY_DAY-19950901000000-fv2.00.nc')]
target_store='/var/folders/v_/q9ql2x2n3dlg2td_b6xkcjzw0000gn/T/tmps7o4aobj/ceda-monthly-daytime-lst' (exists: True)
transforms=<_ChainedPTransform(PTransform) label=[Create|OpenWithKerchunk|CombineReferences|WriteCombinedReference] at 0x16a7c6fd0>
Concat dims: ['time']
DEBUG:kerchunk.combine:Concat dims: ['time']
Coord map: {'time': 'data:time'}
DEBUG:kerchunk.combine:Coord map: {'time': 'data:time'}
setup filesystems
DEBUG:kerchunk.combine:setup filesystems
First pass: 0
DEBUG:kerchunk.combine:First pass: 0
Decode: ('data:time', 0, 'time', None) -> [4.6008e+08]
DEBUG:kerchunk.combine:Decode: ('data:time', 0, 'time', None) -> [4.6008e+08]
.../python3.11/site-packages/kerchunk/combine.py:269: UserWarning: Concatenated coordinate 'time' contains less than expectednumber of values across the datasets: [4.6008e+08]
  warnings.warn(
Created coordinates map
DEBUG:kerchunk.combine:Created coordinates map
Written coordinates
DEBUG:kerchunk.combine:Written coordinates
Written global metadata
DEBUG:kerchunk.combine:Written global metadata
Decode: ('data:time', 0, 'time', None) -> [4.6008e+08]
DEBUG:kerchunk.combine:Decode: ('data:time', 0, 'time', None) -> [4.6008e+08]
Second pass: 0, dtime
DEBUG:kerchunk.combine:Second pass: 0, dtime
Second pass: 0, lcc
DEBUG:kerchunk.combine:Second pass: 0, lcc
Second pass: 0, lst
DEBUG:kerchunk.combine:Second pass: 0, lst
Second pass: 0, lst_unc_loc_atm
DEBUG:kerchunk.combine:Second pass: 0, lst_unc_loc_atm
Second pass: 0, lst_unc_loc_cor
DEBUG:kerchunk.combine:Second pass: 0, lst_unc_loc_cor
Second pass: 0, lst_unc_loc_sfc
DEBUG:kerchunk.combine:Second pass: 0, lst_unc_loc_sfc
Second pass: 0, lst_unc_ran
DEBUG:kerchunk.combine:Second pass: 0, lst_unc_ran
Second pass: 0, lst_uncertainty
DEBUG:kerchunk.combine:Second pass: 0, lst_uncertainty
Second pass: 0, n
DEBUG:kerchunk.combine:Second pass: 0, n
Second pass: 0, sataz
DEBUG:kerchunk.combine:Second pass: 0, sataz
Second pass: 0, satze
DEBUG:kerchunk.combine:Second pass: 0, satze
Second pass: 0, solaz
DEBUG:kerchunk.combine:Second pass: 0, solaz
Second pass: 0, solze
DEBUG:kerchunk.combine:Second pass: 0, solze
Concat dims: ['time']
DEBUG:kerchunk.combine:Concat dims: ['time']
Coord map: {'time': 'data:time'}
DEBUG:kerchunk.combine:Coord map: {'time': 'data:time'}
Traceback (most recent call last):
...
  File "apache_beam/coders/coder_impl.py", line 270, in apache_beam.coders.coder_impl.CallbackCoderImpl.encode_to_stream
  File ".../python3.11/site-packages/apache_beam/coders/coders.py", line 869, in <lambda>
    lambda x: dumps(x, protocol), pickle.loads)
              ^^^^^^^^^^^^^^^^^^
_pickle.PicklingError: Can't pickle <function drop_lst_unc_sys at 0x16b7e1260>: it's not the same object as __main__.drop_lst_unc_sys

real    12m27.586s
user    0m28.634s
sys     0m8.164s
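
The "not the same object" variant is a different pickle check: top-level functions are stored by name, and before storing one, pickle looks the name up again and verifies it yields the identical object. If the module-level name has been rebound (as can happen when a module is re-executed), that identity check fails. One way to reproduce the exact message with nothing but the standard library:

```python
import pickle


def drop_lst_unc_sys(refs):
    return {k: v for k, v in refs.items() if not k.startswith("lst_unc_sys")}


captured = drop_lst_unc_sys


# rebind the module-level name, as a re-executed module would do
def drop_lst_unc_sys(refs):  # noqa: F811
    return refs


# pickle's identity check now fails: the name no longer resolves to the
# object being pickled
try:
    pickle.dumps(captured)
except pickle.PicklingError as err:
    print(err)  # ...it's not the same object as __main__.drop_lst_unc_sys
```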

@cisaacstern is this a case where cloudpickle would help?

I'm not sure. FWIW, this exact use case is captured in our integration tests here:

mzz_kwargs=dict(preprocess=drop_unknown),

But I am not running that test currently precisely because I hit this same pickling issue:

"hrrr-kerchunk-concat-valid-time": "Can't serialize drop_unknown callback function.",

I suspect cloudpickle may not be supported on the local direct runner (but it's worth further investigation). It's possible this is also solvable by changing the return type used in our reference combiner:

return self.to_mzz(references)

Rather than returning MultiZarrToZarr objects, we could call .translate on those objects before returning them, to get a more easily serializable dict representation of the same data.
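
That suggestion makes sense from a serialization standpoint: a translated reference set is just nested dicts of strings and [url, offset, length] entries, which any pickler handles without any function-pickling involved. A minimal sketch with a hypothetical reference dict (the keys and values below are illustrative, not taken from the real dataset):

```python
import json
import pickle

# hypothetical, minimal kerchunk-style reference set (version 1 layout):
# plain dicts, strings, and [url, offset, length] lists
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "lst/0.0.0": ["https://example.com/file.nc", 30487, 120340],
    },
}

# plain data round-trips through pickle, and even through JSON
assert pickle.loads(pickle.dumps(refs)) == refs
assert json.loads(json.dumps(refs)) == refs
```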

Any more thoughts on how to address this? I'd love to get this working, but failing that, I may simply proceed with the approach in #614, now that I've got that one working (thank you @rabernat).

@chuckwondo unfortunately I do not have a good sense of how to fix this at the moment. I've renamed this issue to reflect the root error here, which will take some greater effort to resolve. In the meantime, I suggest proceeding with #614 as you suggest.

@cisaacstern, I believe I found how this is supposed to be addressed, but it still seems to have no effect. My initial sense of why this still doesn't work is that there may be a bug in Apache Beam, but I have further digging to do.

When constructing a Pipeline, you can pass options, and one such option is save_main_session, which is supposed to deal with exactly this pickling scenario, yet when I set this option to True, I still get the same pickling error.

Here are a couple of examples (I suspect there are more) showing the use of the save_main_session option in the beam repo itself:

In addition, I found some other beam issues that seem related and indicate that using "cloudpickle" as the pickle library, when "save_main_session" is True, might do the trick. Unfortunately, specifying cloudpickle seems to have no effect either: the pickling exception still occurs at the same point, so cloudpickle doesn't appear to be used at all.

Oddly, when I specify an unknown pickling library, beam complains (e.g., ValueError: Unknown pickler library: gerkin), so I don't know why specifying "cloudpickle" causes no ValueError yet appears to make no difference (pickling still fails, and cloudpickle isn't used).

Here is the latest version of my code that makes use of these options, but still fails with the identical pickling error:

from pathlib import Path
from typing import Any, Mapping

import apache_beam as beam
import pandas as pd
from apache_beam.options.pipeline_options import PipelineOptions
from kerchunk.combine import drop
from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.transforms import (
    CombineReferences,
    OpenWithKerchunk,
    WriteCombinedReference,
)


def drop_lst_unc_sys(refs: Mapping[str, Any]) -> Mapping[str, Any]:
    return {k: v for k, v in refs.items() if not k.startswith("lst_unc_sys")}


# Monthly data spans from 1995-08 through 2020-12, with the exception of two
# gaps: (1) 1996-01 through 1996-06 and (2) 2001-02 through 2001-06.
months = (
    pd.date_range("1995-08", "2020-12", freq=pd.offsets.MonthBegin())
    .difference(pd.date_range("1996-01", "1996-06", freq=pd.offsets.MonthBegin()))
    .difference(pd.date_range("2001-02", "2001-06", freq=pd.offsets.MonthBegin()))
)

url_pattern = (
    "https://dap.ceda.ac.uk/neodc/esacci/land_surface_temperature/data/"
    "MULTISENSOR_IRCDR/L3S/0.01/v2.00/monthly/{time:%Y}/{time:%m}/"
    "ESACCI-LST-L3S-LST-IRCDR_-0.01deg_1MONTHLY_DAY-{time:%Y%m}01000000-fv2.00.nc"
)
urls = tuple(url_pattern.format(time=month) for month in months)
pattern = pattern_from_file_sequence(urls, "time").prune(1)

target_root = Path(__file__).parent.absolute()
store_name = "ceda-monthly-daytime-lst"
(target_root / store_name).mkdir(parents=True, exist_ok=True)

transforms = (
    beam.Create(pattern.items())
    | OpenWithKerchunk(file_type=pattern.file_type)
    | CombineReferences(
        concat_dims=["time"],
        identical_dims=["lat", "lon", "channel"],
        mzz_kwargs={"preprocess": drop_lst_unc_sys},
        # Using next line instead of previous line produces same pickling error
        # mzz_kwargs={"preprocess": drop("lst_unc_sys")},
    )
    | WriteCombinedReference(str(target_root), store_name)
)

options = PipelineOptions(pickle_library="cloudpickle", save_main_session=True)

with beam.Pipeline(options=options) as p:
    _ = p | transforms

Further, I'm using the following (via conda-forge):

  • python 3.11.5
  • apache_beam 2.50.0
  • cloudpickle 2.2.1
  • pangeo-forge-recipes 0.10.0 (NOTE: 0.10.1 [3 weeks ago] and 0.10.2 [yesterday] are absent from conda-forge -- is this an oversight?)

Given that this seems to be an Apache Beam problem, shall I open a new issue over there?

@chuckwondo thanks for digging into this further. Responses below.

(NOTE: 0.10.1 [3 weeks ago] and 0.10.2 [yesterday] are absent from conda-forge -- is this an oversight?)

  • Yes 0.10.1 was an oversight, forgot to merge conda-forge/pangeo-forge-recipes-feedstock#15 🤭 (just did this, so 0.10.1 should be available shortly)
  • 0.10.2 errored on release, I fixed that just now, so it's currently available via pip and will be on conda-forge in a day or so.

and one such option is save_main_session, which is supposed to deal with exactly this pickling scenario

When I attended the Beam Summit 2022 I spoke with a number of Beam maintainers who privately expressed skepticism about this option, describing it as a "hack", which I mention to say that if it doesn't perform as expected, that's perhaps expected.

These same maintainers pointed to using a setup.py file as a more robust way of capturing the local context for a pipeline. Perhaps this is worth a try. Though the docs describe it as a way of managing "multiple file dependencies", IIUC it may also help with capturing the module-level namespace. We have an open issue to support this in the Pangeo Forge CLI.
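
For reference, the setup.py approach looks roughly like this: a trivial setup.py at the pipeline root, which workers install so that your top-level functions become importable by name. This is a minimal sketch; the package name "my_recipe" is hypothetical, and the exact layout will depend on your project:

```python
# setup.py -- minimal sketch of a pipeline-root package definition
from setuptools import find_packages, setup

setup(
    name="my_recipe",
    version="0.1.0",
    packages=find_packages(),
)
```

The file is then pointed to via the standard Beam pipeline option, e.g. PipelineOptions(setup_file="./setup.py"), so that remote workers install the package before executing the pipeline.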

specifying "cloudpickle" appears to make no difference (i.e., pickling still fails, but cloudpickle isn't used)

Not all pipeline options are supported by all runners, so I suspect the issue here may be that the local direct runner does not support cloudpickle? We do use cloudpickle in the CLI for GCP Dataflow deployments, where I can confirm that it definitely does get used, and has solved certain pickling issues for us there.

Given that this seems to be an Apache Beam problem, shall I open a new issue over there?

You're certainly welcome to, and it probably is a good idea to bring this issue to their attention, though at the risk of stating the obvious, I wouldn't expect things to move quickly there, given the scale of the project. If you do want to open the issue, IMHO:

  • The title would be something along the lines of "Is cloudpickle supported by the local direct runner?"
  • I would distill your example above into an MRE that has nothing to do with Pangeo Forge, and is as simple as possible, using pure Beam.

In terms of ideas for unblocking this without some miraculous fix by the Beam maintainers, I would suggest as a next step trying this with a setup.py as described above, which should be a low-ish effort experiment to run. Very curious what you find there!

Aha! It is indeed a bug in Apache Beam, which I just reported (and have a fix, as described in the bug): apache/beam#28558

When constructing a Pipeline, you can pass options, and one such option is save_main_session, which is supposed to deal with exactly this pickling scenario, yet when I set this option to True, I still get the same pickling error.

In addition, I found some other beam issues that seem related and indicate that using "cloudpickle" as the pickle library, when "save_main_session" is True, might do the trick. Unfortunately, specifying cloudpickle seems to have no effect either: the pickling exception still occurs at the same point, so cloudpickle doesn't appear to be used at all.

Hi @chuckwondo and @cisaacstern, I think I had a similar problem with cloudpickle being ignored when save_main_session is True: #384 (comment)

I used a local workaround which resolved it at the time: apache/beam#21615 (comment), where I modified https://github.com/apache/beam/blob/master/sdks/python/apache_beam/pipeline.py#L570 as follows:

try:
  pickle_library = self._options.view_as(SetupOptions).pickle_library
  if pickle_library:
    pickler.set_library(pickle_library)
  pickler.dump_session(os.path.join(tmpdir, 'main_session.pickle'))

The issue has apparently been fixed so the above line number 570 in pipeline.py is likely no longer accurate, but it's probably close enough. If the fix isn't in your local Beam version then this workaround might help.

Hi @derekocallaghan, thanks for jumping in. That fix appears to have made it into the version I'm using (2.50.0), although it shows up early on in Pipeline.__init__ in a slightly different form than what you shared above.

What I encountered is that when Pipeline.run is executed, the value that is set for the pickle library is completely unused. See more details in the bug I filed on Apache Beam: apache/beam#28558