Extending xarray/pandas using accessors

Question

Extending xarray/pandas using accessors

pmav99 opened this issue 2 years ago · comments

@zacharyburnett @SorooshMani-NOAA as mentioned in the meeting, these are some notes for extending xarray/pandas. It's just copy paste from a different issue on a private repo, so not everything might be relevant (e.g. my suggestion in the end is probably out of context) but it should give you enough to get going

Relative links:

Using dem and adjust() as an example, AFAIK, there are the following options when it comes to extending the upstream API:

dem_ds.adjust()             # Monkey-patch / subclass
dem_ds.dem.adjust()         # Register a `dem` specific accessor (i.e. different accessor per pyposeidon module)
dem_ds.poseidon.adjust()    # Register a `poseidon` accessor (i.e. a single accessor for all pyposeidon modules)
dem_ds(adjust)              # Monkey-patch `__call__()`
dem_ds.pipe(adjust)         # Use `.pipe()`
adjust(dem_ds)              # Just convert adjust to a function and be done with it

Directly subclassing/monkey-patching xarray objects should be relatively simple, but the xarray
devs generally discourage it and suggest that accessors are used instead (see next point).

class MyDem(xr.Dataset)
    def adjust(self):
        ...

The problem with registering accessors like e.g. ds.dem.adjust(). is that the accessors are
global. Essentially, each accessor is a namespace. If we want to have different accessors for each
pyposeidon module, then we will be introducing multiple accessors. E.g.

ds.meteo.to_output()
ds.dem.adjust()

Registering a single accessor is IMHV also a problem since all the methods will be available
on all the Dataset objects. What's the point of calling meteo.poseidon.adjust() ?

Monkey-patching __call__() I just plain dislike. No one expects it.

.pipe().
could be a solution if someone is really keen on chaining function calls. E.g.

ds.pipe(func1, arg1, arg2).pipe(func, kwarg1=1, kwarg2=2)

Another use case for pipe is if you want to dynamically decide which function to call (on runtime!). E.g.

def process(ds, func, *args, **kwargs):
    return ds.pipe(func, *args, **kwargs)

Nevertheless, it is a somewhat obscure idiom that is also available in pandas. I guess that most
people don't know about it. In practical terms it means that you convert adjust(), to_output()
etc as functions (which is not a bad idea since it will make writing tests for them somewhat
easier). When all things are considered, and since we don't chain a lot of calls, I don't really see it
as superior to a plain adjust(dem_ds)

All things being considered, I would suggest to proceed either with subclassing or with plain functions.

To be more precise, if it was my call, I would just go for adjust(dem_ds). In my experience, keeping things simple and
explicit usually gives more of a benefit in the long run. Furthermore, it makes testing easier + nothing
prevents you to expose a more Object Oriented API in the future.

Deleted user · Answer 1 · Thu Mar 17 2022 01:32:12 GMT+0800 (China Standard Time)

Thank you, this is very informative!