fsspec / kerchunk

Cloud-friendly access to archival data

Home Page: https://fsspec.github.io/kerchunk/

Single value variable of type int32 in NetCDF becomes float64 in Kerchunk

rsignell opened this issue · comments

@martindurant, looks like we still have a single-value variable problem.
In these AWS Open Data NetCDF files, the variable 'spherical' has a single int32 value but it becomes a float64 after kerchunk:
https://nbviewer.org/gist/rsignell-usgs/5971951d348496229ce121b52a2fb750

(I discovered this because the xroms package designed to work with these ROMS NetCDF files bombed -- took me a while to figure out this was the reason...)

I am fairly puzzled, the metadata says int:

>>> fs = fsspec.filesystem("reference", fo=single_json, remote_protocol="s3", remote_options=so)
>>> fs.cat("spherical/.zarray")
b'{"chunks":[],"compressor":null,"dtype":"<i4","fill_value":-2147483647,"filters":null,"order":"C","shape":[],"zarr_format":2}'

and zarr agrees:

>>> g = zarr.open(fs.get_mapper())
>>> g.spherical.dtype
dtype('int32')

xarray has a bunch of "decode*" flags in open_dataset, but I can't immediately see one that might do the right thing here.

The value, by the way, is just 1. This is actually a boolean?

I believe the reason is the fill_value. At the moment, float* is one of the few data types that can have missing values (using nan), while int* can't represent missing values, so a variable with a fill value gets promoted to float when it is masked. mask_and_scale=False should be what you're looking for, and I believe you can skip the decoding for just the affected variables using:

In [20]: import xarray as xr
    ...: 
    ...: ds = xr.Dataset(
    ...:     {
    ...:         "a": ("x", [0, 1, 2], {"_FillValue": 1}),
    ...:         "b": ("x", [0.1, 0.2, 1.0], {"_FillValue": 1.0}),
    ...:     }
    ...: )
    ...: skipped_variables = [
    ...:     name
    ...:     for name, var in ds.variables.items()
    ...:     if "_FillValue" in var.attrs and var.dtype.kind not in "cfmMO"
    ...: ]
    ...: 
    ...: 
    ...: def decode_with_skip(ds, skip=None):
    ...:     if not skip:
    ...:         return xr.decode_cf(ds)
    ...: 
    ...:     return ds[skip].merge(xr.decode_cf(ds.drop_vars(skip)))
    ...: 
    ...: 
    ...: display(ds)
    ...: display(ds.pipe(decode_with_skip, skip=skipped_variables).compute())
<xarray.Dataset> Size: 48B
Dimensions:  (x: 3)
Dimensions without coordinates: x
Data variables:
    a        (x) int64 24B 0 1 2
    b        (x) float64 24B 0.1 0.2 1.0
<xarray.Dataset> Size: 48B
Dimensions:  (x: 3)
Dimensions without coordinates: x
Data variables:
    a        (x) int64 24B 0 1 2
    b        (x) float64 24B 0.1 0.2 nan

(This might change with the custom dtypes in numpy, but it will take some effort to get working "nullable integer" dtypes)

@keewis: but the data here has an int fill_value and no _FillValue. Are you saying that having a fill value of any sort will cause a cast int->float even when there are actually no nulls?

Ah indeed, if I set the fill_value to null in the JSON, you get an int :|
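For anyone who wants to try the same experiment, here is a rough sketch of nulling the fill_value in a reference set. It assumes the usual version 1 reference layout (a top-level "refs" mapping whose ".zarray" entries are JSON strings); clear_fill_value is a hypothetical helper written for this illustration, not part of kerchunk, and the example dict only mirrors the .zarray shown above.

```python
import json


def clear_fill_value(references, variable):
    """Return a copy of `references` with `variable`'s zarr fill_value set to null.

    Hypothetical helper: assumes references["refs"] maps "<variable>/.zarray"
    to a JSON-encoded string, as in version 1 reference files.
    """
    refs = dict(references["refs"])  # shallow copy, the input is left untouched
    key = f"{variable}/.zarray"
    meta = json.loads(refs[key])
    meta["fill_value"] = None  # serialised as JSON null
    refs[key] = json.dumps(meta)
    return {**references, "refs": refs}


# Minimal stand-in for the combined JSON, using the .zarray from this thread.
example = {
    "version": 1,
    "refs": {
        "spherical/.zarray": json.dumps(
            {
                "chunks": [],
                "compressor": None,
                "dtype": "<i4",
                "fill_value": -2147483647,
                "filters": None,
                "order": "C",
                "shape": [],
                "zarr_format": 2,
            }
        )
    },
}

patched = clear_fill_value(example, "spherical")
print(patched["refs"]["spherical/.zarray"])
```

The patched dict can be passed as fo= to the reference filesystem in place of the original JSON. Note that this discards the fill value for that variable entirely, so it only makes sense for variables known to have no missing data.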

zarr's fill_value is translated to the _FillValue attribute. The masking is applied without checking the actual values (which is potentially expensive) using where, and the mask value and the promoted dtypes are decided in xarray.core.dtypes.maybe_promote.
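A small way to see this without touching S3, as a sketch: build an in-memory dataset with a scalar int32 variable that carries a _FillValue attribute (standing in for the translated zarr fill_value; the variable name and values just mirror the ones in this issue) and decode it with and without masking.

```python
import numpy as np
import xarray as xr

# Scalar int32 variable with a _FillValue attribute, mimicking what xarray
# sees once zarr's fill_value has been translated. The stored value (1) is
# not equal to the fill value, so nothing is actually missing.
raw = xr.Dataset(
    {"spherical": ((), np.int32(1), {"_FillValue": np.int32(-2147483647)})}
)

# Default decoding applies the mask, which requires a dtype that can hold NaN.
decoded = xr.decode_cf(raw)

# Skipping mask_and_scale leaves the stored dtype alone.
unmasked = xr.decode_cf(raw, mask_and_scale=False)

print("raw:     ", raw["spherical"].dtype)
print("decoded: ", decoded["spherical"].dtype)
print("unmasked:", unmasked["spherical"].dtype)
```

With the xarray versions I'm aware of, the decoded variable should come back as float64 even though no value matches the fill value, while the unmasked one stays int32. The same mask_and_scale=False keyword is accepted by open_dataset, at the cost of disabling masking and scaling for every variable.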