ecmwf / cfgrib

A Python interface to map GRIB files to the NetCDF Common Data Model following the CF Convention using ecCodes

Opening grb files fails in xarray

HelixPiano opened this issue

What happened?

Hello everyone,
I am not sure whether this is a bug in xarray or a bug in cfgrib, so I am cross-posting it.

I have a GRIB file with dimensions 30316 x 160 x 392, dtype float32, and a file size of around 3.7 GB.

df = xr.open_dataset("129.grb", engine="cfgrib") works initially.
The problem is that calling df.max() maxes out my PC's RAM and fails to return a result.
RAM usage before the df.max() call: 3.5/16 GB.

If I run df = xr.load_dataset("129.grb", engine="cfgrib") instead, I get the error shown in the log output below.
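
Put together, the sequence looks roughly like this (a minimal sketch; the file name, dimensions, and both calls are taken from the description above, nothing else is assumed):

import xarray as xr

# Lazy open works: variables are read on demand through the cfgrib backend.
df = xr.open_dataset("129.grb", engine="cfgrib")

# Reducing over the full (30316, 160, 392) float32 array exhausts the available 16 GB of RAM.
df.max()

# Eagerly loading everything instead fails with OSError: [Errno 22] Invalid argument (see log output below).
df = xr.load_dataset("129.grb", engine="cfgrib")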

What are the steps to reproduce the bug?

Version

0.9.10.3

Platform (OS and architecture)

Windows 10 Pro

Relevant log output

  File "C:\Program Files\JetBrains\PyCharm Community Edition 2022.3.2\plugins\python-ce\helpers\pydev\pydevconsole.py", line 364, in runcode
    coro = func()
  File "<input>", line 1, in <module>
  File "C:\Users\xxx\miniconda3\envs\cnn\lib\site-packages\xarray\backends\api.py", line 264, in load_dataset
    return ds.load()
  File "C:\Users\xxx\miniconda3\envs\cnn\lib\site-packages\xarray\core\dataset.py", line 760, in load
    v.load()
  File "C:\Users\xxx\miniconda3\envs\cnn\lib\site-packages\xarray\core\variable.py", line 539, in load
    self._data = self._data.get_duck_array()
  File "C:\Users\xxx\miniconda3\envs\cnn\lib\site-packages\xarray\core\indexing.py", line 695, in get_duck_array
    self._ensure_cached()
  File "C:\Users\xxx\miniconda3\envs\cnn\lib\site-packages\xarray\core\indexing.py", line 689, in _ensure_cached
    self.array = as_indexable(self.array.get_duck_array())
  File "C:\Users\xxx\miniconda3\envs\cnn\lib\site-packages\xarray\core\indexing.py", line 663, in get_duck_array
    return self.array.get_duck_array()
  File "C:\Users\xxx\miniconda3\envs\cnn\lib\site-packages\xarray\core\indexing.py", line 550, in get_duck_array
    array = self.array[self.key]
  File "C:\Users\xxx\miniconda3\envs\cnn\lib\site-packages\cfgrib\xarray_plugin.py", line 156, in __getitem__
    return xr.core.indexing.explicit_indexing_adapter(
  File "C:\Users\xxx\miniconda3\envs\cnn\lib\site-packages\xarray\core\indexing.py", line 857, in explicit_indexing_adapter
    result = raw_indexing_method(raw_key.tuple)
  File "C:\Users\xxx\miniconda3\envs\cnn\lib\site-packages\cfgrib\xarray_plugin.py", line 165, in _getitem
    return self.array[key]
  File "C:\Users\xxx\miniconda3\envs\cnn\lib\site-packages\cfgrib\dataset.py", line 354, in __getitem__
    message = self.index.get_field(message_ids[0])  # type: ignore
  File "C:\Users\xxx\miniconda3\envs\cnn\lib\site-packages\cfgrib\messages.py", line 484, in get_field
    return ComputedKeysAdapter(self.fieldset[message_id], self.computed_keys)
  File "C:\Users\xxx\miniconda3\envs\cnn\lib\site-packages\cfgrib\messages.py", line 344, in __getitem__
    return self.message_from_file(file, offset=item)
  File "C:\Users\xxx\miniconda3\envs\cnn\lib\site-packages\cfgrib\messages.py", line 340, in message_from_file
    return Message.from_file(file, offset, **kwargs)
  File "C:\Users\xxx\miniconda3\envs\cnn\lib\site-packages\cfgrib\messages.py", line 93, in from_file
    file.seek(offset)
OSError: [Errno 22] Invalid argument

Accompanying data

No response

Organisation

No response

Hello @HelixPiano,

Thanks for the report. I could reproduce the high memory usage of df.max() using a large GRIB file that I have here. However, I then converted that GRIB file to NetCDF and tried the same thing with the NetCDF file, which goes through plain xarray rather than cfgrib, and the memory profile was similar (in fact the NetCDF version used more memory than the GRIB version).

I used ecCodes to perform the conversion:

grib_to_netcdf global_wind_2020_12.grib -o global_wind_2020_12.nc
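
The comparison on the NetCDF side is then plain xarray with its default backend (a minimal sketch of what is described above; the file name is the one produced by the conversion):

import xarray as xr

# Same reduction against the converted file, no cfgrib involved.
ds = xr.open_dataset("global_wind_2020_12.nc")
ds.max()  # memory profile comparable to (in fact worse than) the GRIB case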

So from this I'd have to conclude that cfgrib is not the culprit here, but xarray itself might be loading all the value arrays into memory at once in order to compute the maximum. Are you able to confirm this? If so, we should close this issue, and maybe you can raise one against xarray itself (a possible chunked-open workaround is sketched after this message).

Cheers,
Iain
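
As a possible workaround (not something specific to cfgrib), xarray can evaluate the reduction blockwise if the dataset is opened with dask chunks. This is a sketch only: it assumes dask is installed, and the dimension name "time" and the chunk size are guesses about the file's layout:

import xarray as xr

# Chunked (lazy, dask-backed) open; the chunk size is illustrative only.
ds = xr.open_dataset("129.grb", engine="cfgrib", chunks={"time": 500})

# The maximum is now computed block by block instead of materialising
# the full (30316, 160, 392) float32 array in memory at once.
result = ds.max().compute()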

Thanks for reporting. I'm closing this now, and we can re-open it, or open a new one if we have a case where we can confirm that NetCDF does not show the same issue.

I think this problem is the same as #70. I also had this problem: when t > 8.00, the file offset becomes -5.
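
For context, a negative offset reproduces exactly the error in the traceback above, independently of cfgrib (a minimal sketch; the file name is arbitrary and only the seek matters):

# Seeking to a negative absolute position on a regular binary file raises the same
# error that cfgrib's file.seek(offset) call hits when the computed offset is negative.
with open("129.grb", "rb") as f:
    f.seek(-5)  # OSError: [Errno 22] Invalid argument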