fsspec / kerchunk

Cloud-friendly access to archival data

Home Page: https://fsspec.github.io/kerchunk/

NOAA NCEP Grib2 GFS & HRRR: levels, steps and duplicate variables!

emfdavid opened this issue

It seems NCEP adds several custom encodings to the WMO Grib2 standard that the cfgrib library can't decode.

In the HRRR SubHourly product the step is encoded as a range, which breaks the cfgrib reader. Reading these variables with scan_grib produces warnings like:

2024-01-09T16:59:33.060Z MainProcess MainThread WARNING:grib2-to-zarr:Ignoring coordinate 'step' for varname 'vbdsf', raises: eccodes.WrongStepUnitError(Wrong units for step (step must be integer))

The warning appears for the variables dswrf, vbdsf, tp, sdwe and unknown. Some of the grib messages fail to decode entirely, which is where the unknown variables come from. The step can be inferred from the run time and the valid time of the model, but I think NCEP was trying to encode the duration of the average for the variables with stepType avg.
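As a minimal sketch of inferring the step from the run time and the valid time, the subtraction below uses only the standard library. The function name `infer_step` and the timestamps are illustrative assumptions, not part of kerchunk or cfgrib. Note that this recovers only the end of the forecast interval; it says nothing about the averaging duration that the range encoding may be trying to express.

```python
from datetime import datetime, timedelta


def infer_step(run_time: datetime, valid_time: datetime) -> timedelta:
    """Return the forecast step as the offset of valid_time from run_time.

    Hypothetical helper: it does not read any grib metadata itself.
    """
    if valid_time < run_time:
        raise ValueError("valid_time precedes run_time")
    return valid_time - run_time


# Illustrative timestamps: a model run at 00Z valid six hours later.
run_time = datetime(2023, 10, 1, 0, 0)
valid_time = datetime(2023, 10, 1, 6, 0)
step = infer_step(run_time, valid_time)
print(step)
```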

By comparing the output of scan_grib with the idx files provided by NCEP, I was able to identify a few more edge cases. The table below shows some of the variables from gs://global-forecast-system/gfs.20231001/00/atmos/gfs.t00z.pgrb2.0p25.f006 that share the same variable name, step type, level type and level. Currently, the grib_tree method assumes this combination is unique and silently takes the data from the last matching message in the file.
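One way to surface these collisions is to group the decoded messages by the four fields and report any group with more than one member. The sketch below is an assumption about how one might do this with plain dictionaries; `find_duplicates` and the record layout are hypothetical, and the offsets are copied from the acpcp and cape rows of the table in this issue. NaN levels are normalised to None first, because NaN never compares equal to itself and would otherwise hide the duplicates.

```python
import math
from collections import defaultdict


def find_duplicates(records):
    """Group records by (varname, stepType, typeOfLevel, level).

    Returns only the keys that occur more than once, mapped to the
    byte offsets of the colliding messages.
    """
    groups = defaultdict(list)
    for rec in records:
        level = rec["level"]
        # NaN != NaN, so replace it with None to get a stable dictionary key.
        if isinstance(level, float) and math.isnan(level):
            level = None
        key = (rec["varname"], rec["stepType"], rec["typeOfLevel"], level)
        groups[key].append(rec["offset"])
    return {key: offsets for key, offsets in groups.items() if len(offsets) > 1}


nan = float("nan")
records = [
    {"varname": "acpcp", "stepType": "accum", "typeOfLevel": "surface", "level": 0.0, "offset": 426078582},
    {"varname": "acpcp", "stepType": "accum", "typeOfLevel": "surface", "level": 0.0, "offset": 426358213},
    {"varname": "cape", "stepType": "instant", "typeOfLevel": "pressureFromGroundLayer", "level": nan, "offset": 515902483},
    {"varname": "cape", "stepType": "instant", "typeOfLevel": "pressureFromGroundLayer", "level": nan, "offset": 526644614},
    {"varname": "cape", "stepType": "instant", "typeOfLevel": "pressureFromGroundLayer", "level": nan, "offset": 527482311},
]

duplicates = find_duplicates(records)
for key, offsets in duplicates.items():
    print(key, offsets)
```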

There are two types of duplicates I have found so far:

  1. The GFS grib2 files include two accumulations each for Convective Precipitation and Total Precipitation. One is the accumulation during the current model step and the other is the total accumulation during the forecast run so far. With the step value parsed by cfgrib this is ambiguous for all model horizons (0 to 240 hour forecast files). The idx file (gs://global-forecast-system/gfs.20231001/00/atmos/gfs.t00z.pgrb2.0p25.f006.idx) gives a bit more metadata, for example ACPCP:surface:0-6 hour acc fcst, but for the first few timesteps of the model even the idx values appear to be duplicates, because the total is equal to the step accumulation.
  2. Several other variables have a level range, such as 180-0 mb above ground and 0.44-1 sigma layer, whose level decodes as NaN with cfgrib (via kerchunk scan_grib). These result in additional duplicates, which can confuse grib_tree (and anybody using it).

| varname | typeOfLevel | stepType | level | offset_idx | date | attrs | length_idx | idx_uri | grib_uri | idx_indexed_at | grib_crc32 | grib_updated_at | idx_crc32 | idx_updated_at | name | step | time | valid_time | uri | offset_grib | length_grib | inline_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| acpcp | surface | accum | 0.0 | 426078582 | d=2023100100 | ACPCP:surface:0-6 hour acc fcst:\n | 279631 | gs://global-forecast-system/gfs.20231001/00/at... | gs://global-forecast-system/gfs.20231001/00/at... | 2024-01-11 01:38:57.368924 | iT+Wyg== | 2023-10-01 03:34:14.440 | fmnXTA== | 2023-10-01 03:33:41.914 | Convective precipitation (water) | 0 days 06:00:00 | 2023-10-01 | 2023-10-01 06:00:00 | gs://global-forecast-system/gfs.20231001/00/at... | 426078582 | 279631 | None |
| acpcp | surface | accum | 0.0 | 426358213 | d=2023100100 | ACPCP:surface:0-6 hour acc fcst:\n | 279631 | gs://global-forecast-system/gfs.20231001/00/at... | gs://global-forecast-system/gfs.20231001/00/at... | 2024-01-11 01:38:57.368924 | iT+Wyg== | 2023-10-01 03:34:14.440 | fmnXTA== | 2023-10-01 03:33:41.914 | Convective precipitation (water) | 0 days 06:00:00 | 2023-10-01 | 2023-10-01 06:00:00 | gs://global-forecast-system/gfs.20231001/00/at... | 426358213 | 279631 | None |
| cape | pressureFromGroundLayer | instant | NaN | 515902483 | d=2023100100 | CAPE:180-0 mb above ground:6 hour fcst:\n | 530643 | gs://global-forecast-system/gfs.20231001/00/at... | gs://global-forecast-system/gfs.20231001/00/at... | 2024-01-11 01:38:57.368924 | iT+Wyg== | 2023-10-01 03:34:14.440 | fmnXTA== | 2023-10-01 03:33:41.914 | Convective available potential energy | 0 days 06:00:00 | 2023-10-01 | 2023-10-01 06:00:00 | gs://global-forecast-system/gfs.20231001/00/at... | 515902483 | 530643 | None |
| cape | pressureFromGroundLayer | instant | NaN | 526644614 | d=2023100100 | CAPE:90-0 mb above ground:6 hour fcst:\n | 479705 | gs://global-forecast-system/gfs.20231001/00/at... | gs://global-forecast-system/gfs.20231001/00/at... | 2024-01-11 01:38:57.368924 | iT+Wyg== | 2023-10-01 03:34:14.440 | fmnXTA== | 2023-10-01 03:33:41.914 | Convective available potential energy | 0 days 06:00:00 | 2023-10-01 | 2023-10-01 06:00:00 | gs://global-forecast-system/gfs.20231001/00/at... | 526644614 | 479705 | None |
| cape | pressureFromGroundLayer | instant | NaN | 527482311 | d=2023100100 | CAPE:255-0 mb above ground:6 hour fcst:\n | 514093 | gs://global-forecast-system/gfs.20231001/00/at... | gs://global-forecast-system/gfs.20231001/00/at... | 2024-01-11 01:38:57.368924 | iT+Wyg== | 2023-10-01 03:34:14.440 | fmnXTA== | 2023-10-01 03:33:41.914 | Convective available potential energy | 0 days 06:00:00 | 2023-10-01 | 2023-10-01 06:00:00 | gs://global-forecast-system/gfs.20231001/00/at... | 527482311 | 514093 | None |
| cin | pressureFromGroundLayer | instant | NaN | 516433126 | d=2023100100 | CIN:180-0 mb above ground:6 hour fcst:\n | 343271 | gs://global-forecast-system/gfs.20231001/00/at... | gs://global-forecast-system/gfs.20231001/00/at... | 2024-01-11 01:38:57.368924 | iT+Wyg== | 2023-10-01 03:34:14.440 | fmnXTA== | 2023-10-01 03:33:41.914 | Convective inhibition | 0 days 06:00:00 | 2023-10-01 | 2023-10-01 06:00:00 | gs://global-forecast-system/gfs.20231001/00/at... | 516433126 | 343271 | None |
| cin | pressureFromGroundLayer | instant | NaN | 527124319 | d=2023100100 | CIN:90-0 mb above ground:6 hour fcst:\n | 357992 | gs://global-forecast-system/gfs.20231001/00/at... | gs://global-forecast-system/gfs.20231001/00/at... | 2024-01-11 01:38:57.368924 | iT+Wyg== | 2023-10-01 03:34:14.440 | fmnXTA== | 2023-10-01 03:33:41.914 | Convective inhibition | 0 days 06:00:00 | 2023-10-01 | 2023-10-01 06:00:00 | gs://global-forecast-system/gfs.20231001/00/at... | 527124319 | 357992 | None |
| cin | pressureFromGroundLayer | instant | NaN | 527996404 | d=2023100100 | CIN:255-0 mb above ground:6 hour fcst:\n | 306931 | gs://global-forecast-system/gfs.20231001/00/at... | gs://global-forecast-system/gfs.20231001/00/at... | 2024-01-11 01:38:57.368924 | iT+Wyg== | 2023-10-01 03:34:14.440 | fmnXTA== | 2023-10-01 03:33:41.914 | Convective inhibition | 0 days 06:00:00 | 2023-10-01 | 2023-10-01 06:00:00 | gs://global-forecast-system/gfs.20231001/00/at... | 527996404 | 306931 | None |
| r | sigmaLayer | instant | NaN | 518249965 | d=2023100100 | RH:0.33-1 sigma layer:6 hour fcst:\n | 727263 | gs://global-forecast-system/gfs.20231001/00/at... | gs://global-forecast-system/gfs.20231001/00/at... | 2024-01-11 01:38:57.368924 | iT+Wyg== | 2023-10-01 03:34:14.440 | fmnXTA== | 2023-10-01 03:33:41.914 | Relative humidity | 0 days 06:00:00 | 2023-10-01 | 2023-10-01 06:00:00 | gs://global-forecast-system/gfs.20231001/00/at... | 518249965 | 727263 | None |
| r | sigmaLayer | instant | NaN | 518977228 | d=2023100100 | RH:0.44-1 sigma layer:6 hour fcst:\n | 714324 | gs://global-forecast-system/gfs.20231001/00/at... | gs://global-forecast-system/gfs.20231001/00/at... | 2024-01-11 01:38:57.368924 | iT+Wyg== | 2023-10-01 03:34:14.440 | fmnXTA== | 2023-10-01 03:33:41.914 | Relative humidity | 0 days 06:00:00 | 2023-10-01 | 2023-10-01 06 | | | | |
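The attrs column carries the level range and accumulation window that cfgrib loses, so it may be possible to recover them by parsing the idx text directly. The following is a rough sketch, assuming the attrs string always has the shape `VAR:level:step` seen in the table; `parse_idx_attrs` and both regular expressions are hypothetical and only cover hour-based steps. Anything that does not match is returned as None rather than guessed.

```python
import re

# "180-0 mb above ground" or "0.44-1 sigma layer": two numbers, then a unit.
LEVEL_RANGE = re.compile(r"^(?P<first>\d+(?:\.\d+)?)-(?P<second>\d+(?:\.\d+)?) (?P<unit>.+)$")
# "6 hour fcst" or "0-6 hour acc fcst": optional start, end, optional kind.
STEP = re.compile(r"^(?:(?P<start>\d+)-)?(?P<end>\d+) hour (?:(?P<kind>\w+) )?fcst$")


def parse_idx_attrs(attrs):
    """Split an idx attrs string into variable, level and step details."""
    parts = attrs.strip().rstrip(":").split(":", 2)
    if len(parts) != 3:
        raise ValueError(f"unexpected idx attrs: {attrs!r}")
    varname, level, step = parts
    level_match = LEVEL_RANGE.match(level)
    step_match = STEP.match(step)

    level_range = None
    if level_match:
        level_range = (
            float(level_match["first"]),
            float(level_match["second"]),
            level_match["unit"],
        )

    step_hours = None
    step_kind = None
    if step_match:
        start = step_match["start"]
        step_hours = (int(start) if start is not None else None, int(step_match["end"]))
        step_kind = step_match["kind"]

    return {
        "varname": varname,
        "level": level,
        "level_range": level_range,
        "step_hours": step_hours,
        "step_kind": step_kind,
    }


cape = parse_idx_attrs("CAPE:180-0 mb above ground:6 hour fcst:\n")
acpcp = parse_idx_attrs("ACPCP:surface:0-6 hour acc fcst:\n")
rh = parse_idx_attrs("RH:0.44-1 sigma layer:6 hour fcst:\n")
print(cape["level_range"], acpcp["step_hours"], rh["level_range"])
```

With the level range included in the key, the three cape messages in the table would no longer collide. The two acpcp messages would still collide at this horizon, because their idx text is identical.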

Fixing the actual decoding of the variables is hard. It may be possible by adding custom ecCodes definitions.

In the meantime, I want this issue to exist in the world for anyone else wondering what is going on.

Suggestions on improving the behavior of grib_tree in the meantime would be welcome. At present it silently picks the last matching grib message and uses its data (offset and length) for the given variable. This might be more than a little surprising for some users.
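One possible interim behavior, offered only as a sketch, is to keep the current last-message-wins result but make the collision visible. The helper below is hypothetical and is not how grib_tree is implemented; `add_message`, the `on_duplicate` parameter and the key layout are all assumptions for illustration. The offsets and lengths come from the two acpcp rows in the table.

```python
import warnings


def add_message(store, key, ref, on_duplicate="warn"):
    """Store ref under key, warning or raising if the key is already taken.

    With on_duplicate="warn" the last message still wins, matching the
    behavior described in this issue, but the overwrite is reported.
    """
    if key in store:
        msg = f"duplicate grib message for {key}: replacing {store[key]} with {ref}"
        if on_duplicate == "raise":
            raise ValueError(msg)
        warnings.warn(msg)
    store[key] = ref


store = {}
key = ("acpcp", "accum", "surface", 0.0)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    add_message(store, key, (426078582, 279631))
    add_message(store, key, (426358213, 279631))

print(len(caught), store[key])
```

Passing `on_duplicate="raise"` would instead stop at the first collision, which may suit users who prefer an explicit failure over ambiguous data.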

The NCEP team would like to expose their grib tables in a machine-readable form!
See NOAA-EMC/NCEPLIBS-grib_util#293 (comment)
This would provide the data needed to generate the custom ecCodes definitions.