hypertidy / ncmeta

Tidy NetCDF metadata

Home Page:https://hypertidy.github.io/ncmeta/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

do Coordinates properly

mdsumner opened this issue · comments

We could have a new function nc_coords that identifies these relationships, similar to how nc_grids pulls out this implicit relation between variables on the shared spaces - it would identify links between pairs of grids.

What would you imagine the response from such a function would be? I need to build a function that will find the coordinate variables or auxiliary coordinate variables for a given data variable. -- not rocket science, but I'm not really sure what would be generally useful.

Just doodling,

nc_coords(nc, variable) { 

# logic logic logic

return(list(dim_name = coordvar_name, dim_name = coordvar_name, ... ))
}

Considerations:

  1. dimension order -- this is really important and can be a big stumbling block. I've developed a little trick to work with variable dimension order that basically enforces a package-standard dimension order. In RNetCDF parlance it looks something like:
dimid_order <- match(nc_var_info$dimids,
                     c(x_var_info$dimids,
                       y_var_info$dimids,
                       t_var_info$dimids))

data <- var.get.nc(nc, variable_name,
                   start = c(min(x_inds),
                             min(y_inds), i)[dimid_order],
                   count = c(length(x_inds),
                             length(y_inds), 1)[dimid_order])
  1. Some NetCDF files will have both 2d lat/lon (auxiliary) coordinate variables and 1d x/y coordinate variables (see daymet for example). My tendancy is to attempt to treat a grid as a raster and ignore the 2d lat/lon if possible, but there are some datasets that ONLY include the 2d lat/lon. (see stageiv precip for example)

Let me know what you think on these and I'll hack together a PR that gets this info pulled together for a few test datasets.

I don't generally know how to find the 2d coordinates of a 3d or 4d var, in practical terms I set a default and let the user override. This works really well in angstroms and I recently added ability to plot with "coords" to quadmesh::mesh_plot. Logic to find the coords is what I'm after. I have a transpose argument in angstroms too, not sure if it can be more automated.

Oh, I'd actually never noticed this little "coordinates" attribute before, it's all there (this from a ROMs model I have):

  float temp[xi_rho,eta_rho,s_rho,ocean_time]   
            long_name: potential temperature
            units: Celsius
            time: ocean_time
            coordinates: lon_rho lat_rho s_rho ocean_time
            field: temperature, scalar, series
            _FillValue: 9.99999993381581e+36
        float salt[xi_rho,eta_rho,s_rho,ocean_time]   
            long_name: salinity
            time: ocean_time
            coordinates: lon_rho lat_rho s_rho ocean_time
            field: salinity, scalar, series
            _FillValue: 9.99999993381581e+36
        

I'll have a bit more of a look and come back.

Yeah. Trick is, it's there when it is but many files don't include it.

Usually it's because the old COARDS specification said you could name a variable the same as a dimension to declare that it is a "coordinate variable". Looks like what you've got for nc_grids is right in line with what NetCDF Java has for their GRID data type -- I'm pretty sure xarray basically adopted this kind of model as well. Adding some smarts to look at the dimension name / variable name then the coordinates attributes of variables to figure out what all the nc_grids are, makes some sense.

Yet another side note here is that projections (when present) are always going to be associated with data variables and not the coordinate variables. So basically any code will need to start from data variables and spider out from there. Guessing you already know this, but it's worth mentioning.

All helpful! Fact is I have to think very hard about these topics, and always lean on using many examples to make sure I get it right. That's really interesting about projections.

Realizing that a grid is like a virtual table was a huge insight for me, the usual ncdump -h summary does not make that clear. GDAL has a lot this logic too, as must Panoply and Ferret - I don't think it exists in a clean module, but maybe xarray is something to look at. I don't have strong opinions about what should be done, yet.

I started playing around with this today. Opening a PR for review only -- I feel like I might be working around some of the stuff you've been doing otherwise. Still wrapping my head around the abstraction you've put in place here.

I realized I was over complicating things with the dimension order comment above... just did the X, Y, Z, T coordinate axes which are distinct from the actual dimensions of a netCDF file and should just be dealt with separately.

Re: #19 and #20, I come back to the last comment I made above. There is the basic relationship between a dimension and a coordinate-variable of the same name then there is canonical spatiotemporal axes that are required in order to map a given dataset onto some abstraction to be used for integration or whatever purpose (e.g. stars dimensions). CF "Coordinates" aim to support the latter using the former as one of a few tools. (noting that 2d coordinate variables don't work with dimension-name associations as an example)

My preference here is to have nc_coord_var address the canonical coordinates mapping to XYZT. Looking at nc_dims and nc_vars gets you to the fact that there is a 1d variable of the same name as a dimension so adding that information to the output of nc_coord_var seems superfluous?

In the interest of moving along I'll work that into a PR but feel free to decline if you want to take a different direction.

May want to consider bringing "../rasterwise/extdata/timeseries.nc" and "../rasterwise/extdata/high-dim/test-1.nc" in as just header info for tests in ncmeta. "../rasterwise/extdata/bad_examples_62/example3.nc" is also an edge case worth testing.

"../rasterwise/extdata/bad_examples_62/example3.nc"

is a situation where there are both auxiliary and normal coordinate variables. The way I support this now is by returning one row for the normal dimension-pair coordinate variables and another for the auxiliary coordinates. Seem reasonable?

# A tibble: 7 x 5
  variable X     Y     Z     T    
  <chr>    <chr> <chr> <chr> <chr>
1 time     NA    NA    NA    time 
2 lon      X     Y     NA    NA   
3 lat      X     Y     NA    NA   
4 X        X     NA    NA    NA   
5 Y        NA    Y     NA    NA   
6 pr       lon   lat   NA    NA   
7 pr       X     Y     NA    time