ai2cm / fv3config

Manipulate FV3GFS run directories

Writing to a symlink with xarray.to_netcdf has strange behavior

nbren12 opened this issue

cc @oliverwm1

I just figured out what was happening. Saving a dataset with xarray.to_netcdf to an existing symlink alters the target of the symlink, rather than removing the symlink and then saving a new file.

So what was happening was:

  1. Make run directories for "simple" and "complex". These contain symlinks to the same input data files.
  2. Patch in the "simple" surface data (this overwrites the data being pointed to).
  3. Patch in the "complex" surface data.

After step 3, the target of the symlinks contains the "complex" data, and both the "simple" and "complex" run directories point to this same data.
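The three steps above can be reproduced without xarray. This is a minimal sketch that uses a plain `open(path, "w")` as a stand-in for `xarray.to_netcdf` (an assumption: the point is only that a write to the symlink path follows the link). The file name `sfc_data.nc` and the function name are hypothetical.

```python
import os


def demo_symlink_overwrite(workdir):
    """Reproduce the shared-target overwrite using plain file writes."""
    # Hypothetical shared input file that both run directories link to.
    shared = os.path.join(workdir, "sfc_data.nc")
    with open(shared, "w") as f:
        f.write("original")

    # Step 1: make two run directories containing symlinks to the same file.
    links = {}
    for name in ("simple", "complex"):
        rundir = os.path.join(workdir, name)
        os.makedirs(rundir)
        links[name] = os.path.join(rundir, "sfc_data.nc")
        os.symlink(shared, links[name])

    # Steps 2 and 3: "patch in" data by writing to the symlink path.
    # Opening for writing follows the link and truncates the shared target.
    for name in ("simple", "complex"):
        with open(links[name], "w") as f:
            f.write(name)

    # Read back what each run directory now sees.
    seen = {}
    for name, path in links.items():
        with open(path) as f:
            seen[name] = f.read()
    return seen
```

Both run directories end up reading the data written last, because neither write replaced the link itself.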

While this appears to be an xarray issue, maybe we shouldn't be using symlinks for the input data since it opens up the opportunity for these kinds of side-effects. The initial condition data is pretty small, so I don't see much overhead for copying.

I'd vote for just copying instead of symlinking to keep things simple. It might add considerable overhead for our matrix of configuration data and run targets when doing regression testing, but it shouldn't be catastrophic.
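For reference, a sketch of what "copy instead of symlink" could look like for a file that is about to be modified. The helper name `replace_symlink_with_copy` is hypothetical and is not part of fv3config; it only uses standard-library calls.

```python
import os
import shutil


def replace_symlink_with_copy(path):
    """If `path` is a symlink, swap it for an independent copy of its target."""
    if os.path.islink(path):
        target = os.path.realpath(path)  # resolve the link to the real file
        os.unlink(path)                  # removes only the link, not the target
        shutil.copy2(target, path)       # copy contents and metadata
    return path
```

Calling this on a file before patching it means later writes only touch the run directory's own copy.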

The other option I'd suggest is to have a checksum step before symlinking when creating run directories (which errors if the data has changed, suggesting you refresh the downloaded data), and having a way to force copy on files when you explicitly want to change them or otherwise have them copied.
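A rough sketch of that checksum-then-link idea, assuming a known SHA-256 digest is stored somewhere alongside the downloaded data. The function names, the `force_copy` flag, and the choice of SHA-256 are all assumptions for illustration, not existing fv3config behavior.

```python
import hashlib
import os
import shutil


def sha256sum(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def link_or_copy(source, dest, expected_sha256, force_copy=False):
    """Verify `source` against a known checksum, then symlink or copy it."""
    if sha256sum(source) != expected_sha256:
        raise ValueError(
            f"{source} has changed since download; refresh the cached data"
        )
    if force_copy:
        shutil.copy2(source, dest)  # independent copy that is safe to modify
    else:
        os.symlink(os.path.abspath(source), dest)
```

This would not stop a write through a symlink, but the next run directory created from the modified data would fail loudly instead of silently picking it up.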

Hard links are another option provided the data is on the same filesystem. They are indistinguishable from a normal file.

I don't think I understand what a hard link versus a soft link is, but if you know it solves the problem and want us to go that route I'll trust you on it.

Nice catch on this issue @nbren12. What about copying the initial conditions, but still symlinking the forcing/grib files? For C48, the grib files are about 90% of the disk space for a run directory, so this would keep overhead down substantially.

Files all point to underlying data which is managed by the filesystem. A hard link is a new file pointing to the same data as another. It’s effectively a sibling to the original file. Because it is managed by the filesystem, you can mv it freely within the same filesystem without having to worry about breaking relative paths, for example.

OTOH, a symlink is basically a file that contains a path to another file, which could reside on another filesystem altogether, or could even be a directory.
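The difference can be checked with a few standard-library calls. This is a small sketch with hypothetical file names. It also includes one in-place write through the hard link, since both names share the same underlying data and that matters for the overwrite case discussed above.

```python
import os


def compare_links(workdir):
    """Create a hard link and a symlink to one file and compare them."""
    original = os.path.join(workdir, "original.txt")
    hard = os.path.join(workdir, "hard.txt")
    soft = os.path.join(workdir, "soft.txt")
    with open(original, "w") as f:
        f.write("data")

    os.link(original, hard)     # second name for the same underlying data
    os.symlink(original, soft)  # small file that stores a path

    result = {
        "same_inode": os.stat(original).st_ino == os.stat(hard).st_ino,
        "hard_is_link": os.path.islink(hard),  # looks like a regular file
        "soft_is_link": os.path.islink(soft),
    }

    # An in-place write through the hard link changes the shared data.
    with open(hard, "w") as f:
        f.write("changed")
    with open(original) as f:
        result["original_after_write"] = f.read()

    # Moving the original leaves the hard link intact but breaks the symlink.
    os.rename(original, os.path.join(workdir, "moved.txt"))
    result["hard_exists_after_move"] = os.path.exists(hard)
    result["soft_exists_after_move"] = os.path.exists(soft)
    return result
```

Whether a given writer truncates the existing file or replaces it with a new one decides whether the sibling name sees the change, so that would be worth checking for `to_netcdf` before relying on hard links.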

I think symlinking the grib files would work.

Another advantage of hard links is that they wouldn’t need any bind-mounting magic.