pangeo-forge / pangeo-forge-recipes

Python library for building Pangeo Forge recipes.

Home Page:https://pangeo-forge.readthedocs.io/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

`DetermineSchema` throws `UnicodeEncodeError` if en-dash in ds.attrs

jbusecke opened this issue · comments

@cisaacstern and I just debugged a class of failed CMIP6 recipes and came up with a relatively easy reproducer:

from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.transforms import (
    OpenURLWithFSSpec, OpenWithXarray, DetermineSchema
)
import apache_beam as beam

urls = [
    'https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/MPI-M/MPI-ESM1-2-LR/historical/r17i1p1f1/SImon/sifb/gn/v20210901/sifb_SImon_MPI-ESM1-2-LR_historical_r17i1p1f1_gn_199001-200912.nc',
]

pattern = pattern_from_file_sequence(urls,concat_dim='time')
recipe = (
    beam.Create(pattern.items())
        | OpenURLWithFSSpec()
        | OpenWithXarray(xarray_open_kwargs={"use_cftime":True})
        | DetermineSchema(combine_dims = pattern.combine_dim_keys)
               )
with beam.Pipeline() as p:
    p | recipe

gives:

/srv/conda/envs/cmip6-leap-feedstock/lib/python3.9/site-packages/pangeo_forge_recipes/openers.py:53: UserWarning: Unknown file type specified without xarray engine, backend engine will be automatically selected by xarray
  warnings.warn(
Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 1423, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 625, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam/runners/common.py", line 1607, in apache_beam.runners.common._OutputHandler.handle_process_outputs
  File "apache_beam/runners/common.py", line 1720, in apache_beam.runners.common._OutputHandler._write_value_to_tag
  File "apache_beam/runners/worker/operations.py", line 263, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 208, in apache_beam.runners.worker.operations.ConsumerSet.update_counters_start
  File "apache_beam/runners/worker/opcounters.py", line 213, in apache_beam.runners.worker.opcounters.OperationCounters.update_from
  File "apache_beam/runners/worker/opcounters.py", line 265, in apache_beam.runners.worker.opcounters.OperationCounters.do_sample
  File "apache_beam/coders/coder_impl.py", line 1495, in apache_beam.coders.coder_impl.WindowedValueCoderImpl.get_estimated_size_and_observables
  File "apache_beam/coders/coder_impl.py", line 1506, in apache_beam.coders.coder_impl.WindowedValueCoderImpl.get_estimated_size_and_observables
  File "apache_beam/coders/coder_impl.py", line 1055, in apache_beam.coders.coder_impl.AbstractComponentCoderImpl.get_estimated_size_and_observables
  File "apache_beam/coders/coder_impl.py", line 379, in apache_beam.coders.coder_impl.FastPrimitivesCoderImpl.get_estimated_size_and_observables
  File "apache_beam/coders/coder_impl.py", line 443, in apache_beam.coders.coder_impl.FastPrimitivesCoderImpl.encode_to_stream
  File "apache_beam/coders/coder_impl.py", line 443, in apache_beam.coders.coder_impl.FastPrimitivesCoderImpl.encode_to_stream
  File "apache_beam/coders/coder_impl.py", line 413, in apache_beam.coders.coder_impl.FastPrimitivesCoderImpl.encode_to_stream
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 61-63: surrogates not allowed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jovyan/AREAS/CMIP6-PGF/reproducer.py", line 24, in <module>
    p | recipe
  File "/srv/conda/envs/cmip6-leap-feedstock/lib/python3.9/site-packages/apache_beam/pipeline.py", line 600, in __exit__
    self.result = self.run()
  File "/srv/conda/envs/cmip6-leap-feedstock/lib/python3.9/site-packages/apache_beam/pipeline.py", line 577, in run
    return self.runner.run_pipeline(self, self._options)
  File "/srv/conda/envs/cmip6-leap-feedstock/lib/python3.9/site-packages/apache_beam/runners/direct/direct_runner.py", line 129, in run_pipeline
    return runner.run_pipeline(pipeline, options)
  File "/srv/conda/envs/cmip6-leap-feedstock/lib/python3.9/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 202, in run_pipeline
    self._latest_run_result = self.run_via_runner_api(
  File "/srv/conda/envs/cmip6-leap-feedstock/lib/python3.9/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 224, in run_via_runner_api
    return self.run_stages(stage_context, stages)
  File "/srv/conda/envs/cmip6-leap-feedstock/lib/python3.9/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 455, in run_stages
    bundle_results = self._execute_bundle(
  File "/srv/conda/envs/cmip6-leap-feedstock/lib/python3.9/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 783, in _execute_bundle
    self._run_bundle(
  File "/srv/conda/envs/cmip6-leap-feedstock/lib/python3.9/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 1012, in _run_bundle
    result, splits = bundle_manager.process_bundle(
  File "/srv/conda/envs/cmip6-leap-feedstock/lib/python3.9/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 1348, in process_bundle
    result_future = self._worker_handler.control_conn.push(process_bundle_req)
  File "/srv/conda/envs/cmip6-leap-feedstock/lib/python3.9/site-packages/apache_beam/runners/portability/fn_api_runner/worker_handlers.py", line 379, in push
    response = self.worker.do_instruction(request)
  File "/srv/conda/envs/cmip6-leap-feedstock/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker.py", line 625, in do_instruction
    return getattr(self, request_type)(
  File "/srv/conda/envs/cmip6-leap-feedstock/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker.py", line 663, in process_bundle
    bundle_processor.process_bundle(instruction_id))
  File "/srv/conda/envs/cmip6-leap-feedstock/lib/python3.9/site-packages/apache_beam/runners/worker/bundle_processor.py", line 1040, in process_bundle
    input_op_by_transform_id[element.transform_id].process_encoded(
  File "/srv/conda/envs/cmip6-leap-feedstock/lib/python3.9/site-packages/apache_beam/runners/worker/bundle_processor.py", line 232, in process_encoded
    self.output(decoded_value)
  File "apache_beam/runners/worker/operations.py", line 568, in apache_beam.runners.worker.operations.Operation.output
  File "apache_beam/runners/worker/operations.py", line 570, in apache_beam.runners.worker.operations.Operation.output
  File "apache_beam/runners/worker/operations.py", line 261, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 264, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 951, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 952, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 1425, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 1513, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam/runners/common.py", line 1423, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 625, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam/runners/common.py", line 1607, in apache_beam.runners.common._OutputHandler.handle_process_outputs
  File "apache_beam/runners/common.py", line 1720, in apache_beam.runners.common._OutputHandler._write_value_to_tag
  File "apache_beam/runners/worker/operations.py", line 264, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 951, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 952, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 1425, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 1513, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam/runners/common.py", line 1423, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 625, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam/runners/common.py", line 1607, in apache_beam.runners.common._OutputHandler.handle_process_outputs
  File "apache_beam/runners/common.py", line 1720, in apache_beam.runners.common._OutputHandler._write_value_to_tag
  File "apache_beam/runners/worker/operations.py", line 264, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 951, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 952, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 1425, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 1513, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam/runners/common.py", line 1423, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 839, in apache_beam.runners.common.PerWindowInvoker.invoke_process
  File "apache_beam/runners/common.py", line 983, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
  File "apache_beam/runners/common.py", line 1607, in apache_beam.runners.common._OutputHandler.handle_process_outputs
  File "apache_beam/runners/common.py", line 1720, in apache_beam.runners.common._OutputHandler._write_value_to_tag
  File "apache_beam/runners/worker/operations.py", line 264, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 951, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 952, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 1425, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 1513, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam/runners/common.py", line 1423, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 839, in apache_beam.runners.common.PerWindowInvoker.invoke_process
  File "apache_beam/runners/common.py", line 983, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
  File "apache_beam/runners/common.py", line 1607, in apache_beam.runners.common._OutputHandler.handle_process_outputs
  File "apache_beam/runners/common.py", line 1720, in apache_beam.runners.common._OutputHandler._write_value_to_tag
  File "apache_beam/runners/worker/operations.py", line 264, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 951, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 952, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 1425, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 1533, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam/runners/common.py", line 1423, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 625, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam/runners/common.py", line 1607, in apache_beam.runners.common._OutputHandler.handle_process_outputs
  File "apache_beam/runners/common.py", line 1720, in apache_beam.runners.common._OutputHandler._write_value_to_tag
  File "apache_beam/runners/worker/operations.py", line 263, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 208, in apache_beam.runners.worker.operations.ConsumerSet.update_counters_start
  File "apache_beam/runners/worker/opcounters.py", line 213, in apache_beam.runners.worker.opcounters.OperationCounters.update_from
  File "apache_beam/runners/worker/opcounters.py", line 265, in apache_beam.runners.worker.opcounters.OperationCounters.do_sample
  File "apache_beam/coders/coder_impl.py", line 1495, in apache_beam.coders.coder_impl.WindowedValueCoderImpl.get_estimated_size_and_observables
  File "apache_beam/coders/coder_impl.py", line 1506, in apache_beam.coders.coder_impl.WindowedValueCoderImpl.get_estimated_size_and_observables
  File "apache_beam/coders/coder_impl.py", line 1055, in apache_beam.coders.coder_impl.AbstractComponentCoderImpl.get_estimated_size_and_observables
  File "apache_beam/coders/coder_impl.py", line 379, in apache_beam.coders.coder_impl.FastPrimitivesCoderImpl.get_estimated_size_and_observables
  File "apache_beam/coders/coder_impl.py", line 443, in apache_beam.coders.coder_impl.FastPrimitivesCoderImpl.encode_to_stream
  File "apache_beam/coders/coder_impl.py", line 443, in apache_beam.coders.coder_impl.FastPrimitivesCoderImpl.encode_to_stream
  File "apache_beam/coders/coder_impl.py", line 413, in apache_beam.coders.coder_impl.FastPrimitivesCoderImpl.encode_to_stream
UnicodeEncodeError: 'utf-8 [while running 'Create|OpenURLWithFSSpec|OpenWithXarray|DetermineSchema/DetermineSchema/Map(wrapper)']' codec can't encode characters in position 61-63: surrogates not allowed

We narrowed this down to some issue with the dataset attributes, more specifically the reference attribute:

'MPI-ESM: Mauritsen, T. et al. (2019), Developments in the MPI‐M Earth System Model version 1.2 (MPI‐ESM1.2) and Its Response to Increasing CO2, J. Adv. Model. Earth Syst.,11, 998-1038, doi:10.1029/2018MS001400,Mueller, W.A. et al. (2018): A high‐resolution version of the Max Planck Institute Earth System Model MPI‐ESM1.2‐HR. J. Adv. Model. EarthSyst.,10,1383–1413, doi:10.1029/2017MS001217'

There seems to be different types of dashes used here which trip up the encoding.

We concluded this needs a custom sanitizer stage in the CMIP6 recipe, but wanted to leave this issue up if other PGF users came across it.

Here's a way to sanitize en-dashes:

"MPI‐M".encode("utf-8").replace(b"\xe2\x80\x90", b"-").decode()

The above did not work, it basically triggered the original error on a different line.

    wrapper = lambda x: [fn(x)]
  File "/home/jovyan/AREAS/CMIP6-PGF/reproducer.py", line 20, in _strip_attrs
    new_value = att_value.encode("utf-8").replace(b"\xe2\x80\x90", b"-").decode()
UnicodeEncodeError: 'utf-8 [while running 'Create|OpenURLWithFSSpec|OpenWithXarray|StripAttrs|DetermineSchema/StripAttrs/Strip Attrs']' codec can't encode characters in position 61-63: surrogates not allowed
Exception ignored in: <function File.close at 0x7f1dca683940>
Traceback (most recent call last):
  File "/srv/conda/envs/cmip6-leap-feedstock/lib/python3.9/site-packages/h5netcdf/core.py", line 1200, in close
  File "/srv/conda/envs/cmip6-leap-feedstock/lib/python3.9/site-packages/h5py/_hl/files.py", line 585, in close
TypeError: bad operand type for unary ~: 'NoneType'

What worked is either

.encode("utf-8", 'ignore').decode()
.encode("utf-8", 'replace').decode()

The latter fills ??? for offending characters. Aside from this triggering some serious cravings for my favorite childhood audiobooks I think in the example above that this sanitized version (with (.., 'replace')):

Sanitized datasets attributes field references: 
 MPI-ESM: Mauritsen, T. et al. (2019), Developments in the MPI‐M Earth System Model version 1.2 (MPI‐ESM1.2) and Its Response to Increasing CO2, J. Adv. Model. Earth Syst.,11, 998-1038, doi:10.1029/2018MS001400,
Mueller, W.A. et al. (2018): A high‐resolution version of the Max Planck Institute Earth System Model MPI‐ESM1.2‐HR. J. Adv. Model. EarthSyst.,10,1383–1413, doi:10.1029/2017MS001217 
 ----> 
 MPI-ESM: Mauritsen, T. et al. (2019), Developments in the MPI???M Earth System Model version 1.2 (MPI???ESM1.2) and Its Response to Increasing CO2, J. Adv. Model. Earth Syst.,11, 998-1038, doi:10.1029/2018MS001400,
Mueller, W.A. et al. (2018): A high???resolution version of the Max Planck Institute Earth System Model MPI???ESM1.2???HR. J. Adv. Model. EarthSyst.,10,1383???1413, doi:10.1029/2017MS001217

more clearly indicates that things were replaced than this (with(..., 'ignore'):

Sanitized datasets attributes field references: 
 MPI-ESM: Mauritsen, T. et al. (2019), Developments in the MPI‐M Earth System Model version 1.2 (MPI‐ESM1.2) and Its Response to Increasing CO2, J. Adv. Model. Earth Syst.,11, 998-1038, doi:10.1029/2018MS001400,
Mueller, W.A. et al. (2018): A high‐resolution version of the Max Planck Institute Earth System Model MPI‐ESM1.2‐HR. J. Adv. Model. EarthSyst.,10,1383–1413, doi:10.1029/2017MS001217 
 ----> 
 MPI-ESM: Mauritsen, T. et al. (2019), Developments in the MPIM Earth System Model version 1.2 (MPIESM1.2) and Its Response to Increasing CO2, J. Adv. Model. Earth Syst.,11, 998-1038, doi:10.1029/2018MS001400,
Mueller, W.A. et al. (2018): A highresolution version of the Max Planck Institute Earth System Model MPIESM1.2HR. J. Adv. Model. EarthSyst.,10,13831413, doi:10.1029/2017MS001217

you've got two different versions of dash that make use of surrogates: b"\xe2\x80\x90" and b"\xe2\x80\x93". To replace both, you'd have to:

s = "..."
s.encode("utf-8").replace(b"\xe2\x80\x90", b"-").replace(b"\xe2\x80\x93", b"-").decode("utf-8")

This also works for me:

s.replace("‐", "-").replace("–", "-")

(use .encode("utf-8") to verify that any multi-byte characters are gone)

Edit: here's some more details of what those three bytes mean, in case you're interested (i.e. the encoding rules for UTF-8). If we look at the binary representation of the first byte (\xe2), we get 11100010, where the number of set bits at the beginning followed by a 0 tells us the number of total bytes (the only exception is 1 byte which starts with just 0 to be compatible with ascii – the 10 is the data byte prefix, see below). In other words:

nbytes prefix data bits
1 0 7
2 110 5
3 1110 4
4 11110 3

In this case the prefix is 1110, so three bytes, leaving us with 0010 as data. The other bytes always start with 10, so after removing that from \x80 (10000000) we get 000000, and from \x90 we get 010000. That gives us 0010000000010000, or U+2010, which is the unicode code point for "hyphen". b"\xe2\x80\x93" is then U+2013, or "en dash".

(Deleted last comment because I realized you were saying that there are two forms of the dash that use surrogates!)

I think for now I am ok with replacing all weird characters with ??? (I hope I am understanding my solution up top correctly) in my use-case, but this is super helpful @keewis.
@cisaacstern I wonder if this is something for -recipes to take care of properly? Maybe I am just too tired to think about string en/decoding rn and will pick that up again tomorrow, haha.