Extending xarray/pandas using accessors
pmav99 opened this issue
@zacharyburnett @SorooshMani-NOAA as mentioned in the meeting, these are some notes for extending xarray/pandas. It's just copy paste from a different issue on a private repo, so not everything might be relevant (e.g. my suggestion in the end is probably out of context) but it should give you enough to get going
Relevant links:
- https://xarray.pydata.org/en/stable/internals/extending-xarray.html
- pydata/xarray#1080
- pydata/xarray#1080 (comment)
Using dem and adjust() as an example, AFAIK, there are the following options when it comes to extending the upstream API:
dem_ds.adjust() # Monkey-patch / subclass
dem_ds.dem.adjust() # Register a `dem` specific accessor (i.e. different accessor per pyposeidon module)
dem_ds.poseidon.adjust() # Register a `poseidon` accessor (i.e. a single accessor for all pyposeidon modules)
dem_ds(adjust) # Monkey-patch `__call__()`
dem_ds.pipe(adjust) # Use `.pipe()`
adjust(dem_ds) # Just convert adjust to a function and be done with it
Directly subclassing/monkey-patching xarray objects should be relatively simple, but the xarray
devs generally discourage it and suggest that accessors are used instead (see next point).
import xarray as xr

class MyDem(xr.Dataset):
    def adjust(self):
        ...
The problem with registering accessors like e.g. ds.dem.adjust() is that the accessors are
global. Essentially, each accessor is a namespace. If we want to have different accessors for each
pyposeidon module, then we will be introducing multiple accessors. E.g.
ds.meteo.to_output()
ds.dem.adjust()
Registering a single accessor is IMHV also a problem, since all the methods will be available
on all the Dataset objects. What's the point of calling meteo.poseidon.adjust()?
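To make the point about global accessors concrete, here is a minimal sketch using `xr.register_dataset_accessor`. The `poseidon` accessor name comes from the options above; the `elevation` and `msl` variables, the `offset` parameter, and the body of `adjust()` are placeholders invented for illustration and are not pyposeidon's actual logic.

```python
import numpy as np
import xarray as xr


# Hypothetical accessor; the real adjust() would do something more involved.
@xr.register_dataset_accessor("poseidon")
class PoseidonAccessor:
    def __init__(self, ds: xr.Dataset) -> None:
        self._ds = ds

    def adjust(self, offset: float = 0.0) -> xr.Dataset:
        # Placeholder logic: shift an assumed "elevation" variable by offset.
        return self._ds.assign(elevation=self._ds["elevation"] + offset)


dem_ds = xr.Dataset({"elevation": ("x", np.array([1.0, 2.0, 3.0]))})
meteo_ds = xr.Dataset({"msl": ("x", np.array([1013.0, 1009.0]))})

# Works as intended on a DEM dataset.
adjusted = dem_ds.poseidon.adjust(offset=0.5)

# The accessor is attached to the Dataset class itself, so it is also
# present on a meteo dataset, where adjust() has nothing to work with.
has_accessor_on_meteo = hasattr(meteo_ds, "poseidon")
```

In this sketch `meteo_ds.poseidon` exists even though calling `adjust()` on it would fail for lack of an `elevation` variable, which is the namespace pollution described above.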
Monkey-patching __call__() I just plain dislike. No one expects it.
.pipe() could be a solution if someone is really keen on chaining function calls. E.g.
ds.pipe(func1, arg1, arg2).pipe(func2, kwarg1=1, kwarg2=2)
Another use case for pipe is if you want to dynamically decide which function to call (at runtime!). E.g.
def process(ds, func, *args, **kwargs):
    return ds.pipe(func, *args, **kwargs)
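A runnable sketch of both `.pipe()` uses, chaining and runtime selection, might look like the following. `adjust` and `scale` are hypothetical stand-ins with placeholder logic, and the `elevation` variable is assumed for illustration only; `Dataset.pipe` itself is the regular xarray method.

```python
import numpy as np
import xarray as xr


def adjust(ds: xr.Dataset, offset: float = 0.0) -> xr.Dataset:
    # Hypothetical stand-in: shift the assumed "elevation" variable.
    return ds.assign(elevation=ds["elevation"] + offset)


def scale(ds: xr.Dataset, factor: float = 1.0) -> xr.Dataset:
    # Hypothetical second step: multiply "elevation" by factor.
    return ds.assign(elevation=ds["elevation"] * factor)


def process(ds, func, *args, **kwargs):
    # Same helper as above: forwards everything to Dataset.pipe().
    return ds.pipe(func, *args, **kwargs)


dem_ds = xr.Dataset({"elevation": ("x", np.array([-2.0, 0.0, 3.0]))})

# Chained form: each step receives the result of the previous one.
chained = dem_ds.pipe(adjust, offset=1.0).pipe(scale, factor=2.0)

# Dynamic form: the function to run is looked up by name at runtime.
steps = {"adjust": adjust, "scale": scale}
dynamic = process(dem_ds, steps["adjust"], offset=1.0)
```

Note that `ds.pipe(func, *args, **kwargs)` is just `func(ds, *args, **kwargs)` spelled as a method call, so the functions stay ordinary, independently testable functions.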
Nevertheless, it is a somewhat obscure idiom that is also available in pandas. I guess that most
people don't know about it. In practical terms, it means that you convert adjust(), to_output()
etc. to functions (which is not a bad idea, since it will make writing tests for them somewhat
easier). When all things are considered, and since we don't chain a lot of calls, I don't really see it
as superior to a plain adjust(dem_ds).
All things considered, I would suggest proceeding either with subclassing or with plain functions.
To be more precise, if it was my call, I would just go for adjust(dem_ds). In my experience, keeping
things simple and explicit usually gives more of a benefit in the long run. Furthermore, it makes
testing easier, and nothing prevents you from exposing a more object-oriented API in the future.
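As a rough illustration of the testing argument, here is a sketch of the plain-function variant together with a pytest-style test. The `elevation` variable, the `offset` parameter, and the function body are assumptions made for the example, not the real pyposeidon implementation.

```python
import numpy as np
import xarray as xr


def adjust(dem_ds: xr.Dataset, offset: float = 0.0) -> xr.Dataset:
    """Return a new dataset with "elevation" shifted by offset (placeholder logic)."""
    return dem_ds.assign(elevation=dem_ds["elevation"] + offset)


def test_adjust_shifts_elevation_and_leaves_input_untouched() -> None:
    dem_ds = xr.Dataset({"elevation": ("x", np.array([1.0, 2.0]))})
    result = adjust(dem_ds, offset=0.5)
    # The result carries the shifted values...
    np.testing.assert_allclose(result["elevation"].values, [1.5, 2.5])
    # ...and the input dataset is not modified.
    np.testing.assert_allclose(dem_ds["elevation"].values, [1.0, 2.0])


# pytest would collect the test automatically; call it directly here.
test_adjust_shifts_elevation_and_leaves_input_untouched()
```

The test needs no accessor registration or subclass setup, only a small in-memory Dataset. If an object-oriented API is wanted later, an accessor or subclass method could simply delegate to this function.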
Thank you, this is very informative!