Opening grb files fails in xarray
HelixPiano opened this issue · comments
What happened?
Hello everyone,
I am not sure whether this is a bug in xarray or in cfgrib, so I am cross-posting it.
I have a GRIB file with dimensions 30316x160x392, dtype float32, and a file size of around 3.7 GB.
df = xr.open_dataset("129.grb", engine="cfgrib")
works initially.
The problem is that calling df.max() maxes out my PC's RAM and never returns a result.
RAM usage before the df.max() call: 3.5/16 GB
If I run df = xr.load_dataset("129.grb", engine="cfgrib")
instead, I get an error message (see the log output below):
What are the steps to reproduce the bug?
Version
0.9.10.3
Platform (OS and architecture)
Windows 10 Pro
Relevant log output
File "C:\Program Files\JetBrains\PyCharm Community Edition 2022.3.2\plugins\python-ce\helpers\pydev\pydevconsole.py", line 364, in runcode
coro = func()
File "<input>", line 1, in <module>
File "C:\Users\xxx\miniconda3\envs\cnn\lib\site-packages\xarray\backends\api.py", line 264, in load_dataset
return ds.load()
File "C:\Users\xxx\miniconda3\envs\cnn\lib\site-packages\xarray\core\dataset.py", line 760, in load
v.load()
File "C:\Users\xxx\miniconda3\envs\cnn\lib\site-packages\xarray\core\variable.py", line 539, in load
self._data = self._data.get_duck_array()
File "C:\Users\xxx\miniconda3\envs\cnn\lib\site-packages\xarray\core\indexing.py", line 695, in get_duck_array
self._ensure_cached()
File "C:\Users\xxx\miniconda3\envs\cnn\lib\site-packages\xarray\core\indexing.py", line 689, in _ensure_cached
self.array = as_indexable(self.array.get_duck_array())
File "C:\Users\xxx\miniconda3\envs\cnn\lib\site-packages\xarray\core\indexing.py", line 663, in get_duck_array
return self.array.get_duck_array()
File "C:\Users\xxx\miniconda3\envs\cnn\lib\site-packages\xarray\core\indexing.py", line 550, in get_duck_array
array = self.array[self.key]
File "C:\Users\xxx\miniconda3\envs\cnn\lib\site-packages\cfgrib\xarray_plugin.py", line 156, in __getitem__
return xr.core.indexing.explicit_indexing_adapter(
File "C:\Users\xxx\miniconda3\envs\cnn\lib\site-packages\xarray\core\indexing.py", line 857, in explicit_indexing_adapter
result = raw_indexing_method(raw_key.tuple)
File "C:\Users\xxx\miniconda3\envs\cnn\lib\site-packages\cfgrib\xarray_plugin.py", line 165, in _getitem
return self.array[key]
File "C:\Users\xxx\miniconda3\envs\cnn\lib\site-packages\cfgrib\dataset.py", line 354, in __getitem__
message = self.index.get_field(message_ids[0]) # type: ignore
File "C:\Users\xxx\miniconda3\envs\cnn\lib\site-packages\cfgrib\messages.py", line 484, in get_field
return ComputedKeysAdapter(self.fieldset[message_id], self.computed_keys)
File "C:\Users\xxx\miniconda3\envs\cnn\lib\site-packages\cfgrib\messages.py", line 344, in __getitem__
return self.message_from_file(file, offset=item)
File "C:\Users\xxx\miniconda3\envs\cnn\lib\site-packages\cfgrib\messages.py", line 340, in message_from_file
return Message.from_file(file, offset, **kwargs)
File "C:\Users\xxx\miniconda3\envs\cnn\lib\site-packages\cfgrib\messages.py", line 93, in from_file
file.seek(offset)
OSError: [Errno 22] Invalid argument
Accompanying data
No response
Organisation
No response
Hello @HelixPiano,
Thanks for the report. I could reproduce the high memory usage of df.max() using a large GRIB file I have here. However, I then converted the GRIB file to NetCDF and tried the same thing with that file, which goes through plain xarray without cfgrib, and the memory profile was similar (in fact the NetCDF version used more memory than the GRIB one).
I used ecCodes to perform the conversion:
grib_to_netcdf global_wind_2020_12.grib -o global_wind_2020_12.nc
From this I'd conclude that cfgrib is not the culprit here; xarray itself may be loading all the value arrays into memory at once in order to compute the maximum. Are you able to confirm this? If so, we should close this issue, and perhaps you can raise one against xarray itself.
Cheers,
Iain
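For what it's worth, the memory-friendly pattern here is a chunked (out-of-core) reduction. The sketch below illustrates the idea in plain NumPy so it is self-contained; with cfgrib/xarray, the equivalent would be opening with something like xr.open_dataset("129.grb", engine="cfgrib", chunks={...}) and then calling ds.max().compute(), which requires dask and where the chunk sizes are placeholders you would tune to the file:

```python
import numpy as np

# The idea behind a chunked reduction: visit the array one slab at a
# time along the first axis and fold the partial maxima, so peak
# memory stays near one slab instead of the full array.
def chunked_max(arr, chunk=1000):
    best = -np.inf
    for start in range(0, arr.shape[0], chunk):
        best = max(best, float(arr[start:start + chunk].max()))
    return best

# Tiny stand-in for the 30316x160x392 array from the report.
data = np.arange(24.0).reshape(4, 3, 2)
print(chunked_max(data, chunk=2))  # 23.0
```

This is what dask does under the hood when xarray variables are chunked: max() builds a lazy task graph of per-chunk reductions, and compute() streams them instead of materialising the whole 3.7 GB array.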
Thanks for reporting. I'm closing this now; we can re-open it, or open a new one, if we find a case confirming that NetCDF does not show the same issue.
I think this problem is the same as #70. I also ran into it: when t > 8.00, the file offset becomes -5.
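One possible explanation for both symptoms (this is a hypothesis, not a confirmed diagnosis of cfgrib's internals): file.seek() raises OSError: [Errno 22] Invalid argument when given a negative offset, and a negative offset is exactly what you get if a message's byte offset past 2 GiB is pushed through a signed 32-bit integer somewhere in the stack. A minimal illustration of the wraparound, using ctypes to emulate the truncation:

```python
import ctypes

# A true byte offset just under 4 GiB...
offset = 2**32 - 5
# ...wraps to a small negative number when narrowed to a signed
# 32-bit integer, matching the "offset becomes -5" observation.
wrapped = ctypes.c_int32(offset).value
print(wrapped)  # -5

# file.seek(wrapped) would then fail with
# OSError: [Errno 22] Invalid argument, as in the traceback above.
```

If this is what's happening, the bug would only trigger on files larger than 2 GiB, which is consistent with it appearing on this 3.7 GB file.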