Can I have a defined plot for a transform dataset?

Question

kthyng opened this issue a year ago

I'm defining a catalog that contains a "base" CSV source (ctd_base below) and a source called ctd, a transformed/derived version of ctd_base that does some processing to produce a more usable dataset. I would like to have a plot available for source ctd. Is there a way to do this? Below is a version of what I've tried, but I haven't been able to get it to work: calling cat.ctd.plot.example() fails with this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[18], line 1
----> 1 cat["ctd"].plot.example()

File ~/miniconda3/envs/ciofs/lib/python3.11/site-packages/hvplot/plotting/core.py:92, in hvPlotBase.__call__(self, x, y, kind, **kwds)
     89         plot = self._get_converter(x, y, kind, **kwds)(kind, x, y)
     90         return pn.panel(plot, **panel_dict)
---> 92 return self._get_converter(x, y, kind, **kwds)(kind, x, y)

File ~/miniconda3/envs/ciofs/lib/python3.11/site-packages/hvplot/plotting/core.py:99, in hvPlotBase._get_converter(self, x, y, kind, **kwds)
     97 y = y or params.pop("y", None)
     98 kind = kind or params.pop("kind", None)
---> 99 return HoloViewsConverter(self._data, x, y, kind=kind, **params)

File ~/miniconda3/envs/ciofs/lib/python3.11/site-packages/hvplot/converter.py:389, in HoloViewsConverter.__init__(self, data, x, y, kind, by, use_index, group_label, value_label, backlog, persist, use_dask, crs, fields, groupby, dynamic, grid, legend, rot, title, xlim, ylim, clim, symmetric, logx, logy, loglog, hover, subplots, label, invert, stacked, colorbar, datashade, rasterize, row, col, debug, framewise, aggregator, projection, global_extent, geo, precompute, flip_xaxis, flip_yaxis, dynspread, hover_cols, x_sampling, y_sampling, project, tools, attr_labels, coastline, tiles, sort_date, check_symmetric_max, transforms, stream, cnorm, features, rescale_discrete_levels, **kwds)
    387 self.value_label = value_label
    388 self.label = label
--> 389 self._process_data(
    390     kind, data, x, y, by, groupby, row, col, use_dask,
    391     persist, backlog, label, group_label, value_label,
    392     hover_cols, attr_labels, transforms, stream, kwds
    393 )
    395 self.dynamic = dynamic
    396 self.geo = any([geo, crs, global_extent, projection, project, coastline, features])

File ~/miniconda3/envs/ciofs/lib/python3.11/site-packages/hvplot/converter.py:800, in HoloViewsConverter._process_data(self, kind, data, x, y, by, groupby, row, col, use_dask, persist, backlog, label, group_label, value_label, hover_cols, attr_labels, transforms, stream, kwds)
    798     self.data = data
    799 else:
--> 800     raise ValueError('Supplied data type %s not understood' % type(data).__name__)
    802 if stream is not None:
    803     if streaming:

ValueError: Supplied data type DataFrameTransform not understood

CATALOG:


name: ctd
description: CTD

sources:
  ctd_base:
    description: Base
    driver: csv
    args:
      urlpath: /Users/kthyng/projects/ciofs-hindcast-report/ciofs_hindcast_report/inputs/data/CTD_KBNERR_301933/301933.csv

  ctd:
    description: CTD
    driver: process.DataFrameTransform
    args:
      targets:
        - ctd_base
      transform: "process.ctd"
      transform_kwargs:
        station: kacbcwq
    metadata:
      plots:
        example:
          kind: line
          x: DateTimeStamp
          y: Temp
          width: 800
          height: 600

Also, process.DataFrameTransform is the same as the one provided in intake, and process.ctd runs some processing on the DataFrame.

Thanks for any help!

Martin Durant · Answer 1 · Tue Mar 28 2023 05:00:30 GMT+0800 (China Standard Time)

Is the output of source "ctd" (the result of .read() ) also a dataframe?

Martin Durant · Answer 2 · Tue Mar 28 2023 05:14:34 GMT+0800 (China Standard Time)

I tried the following:
Catalog

name: ctd
description: CTD

sources:
  ctd_base:
    description: Base
    driver: csv
    args:
      urlpath: data.csv

  ctd:
    description: CTD
    driver: intake.source.derived.DataFrameTransform
    args:
      targets:
        - ctd_base
      transform: "toolz.identity"
      transform_kwargs: {}
    metadata:
      plots:
        example:
          kind: line

data.csv

a,b
0,1
0.1, 1.1

and both cat.ctd.plot() and cat.ctd.plot.example() ran successfully.

Kristen Thyng · Answer 3 · Tue Mar 28 2023 19:32:17 GMT+0800 (China Standard Time)

Thank you @martindurant! I realize now that my problem is actually that the catalog isn't finding the transform after I recently rearranged the directory structure, not that the source isn't able to understand the plot. I'm not able to get it to recognize my version of DataFrameTransform, which has one difference (I have the dask dataframe compute earlier), but it must be something to do with my setup since it worked before I reorganized.

Kristen Thyng · Answer 4 · Tue Mar 28 2023 23:06:41 GMT+0800 (China Standard Time)

@martindurant OK, I see how I became confused: the catalog entry ctd works with my slightly changed version of DataFrameTransform when I'm just accessing the data with read. However, when I add a plot to the metadata and then try to plot, it is hvplot that does not recognize my version of DataFrameTransform, failing with this error:

    798     self.data = data
    799 else:
--> 800     raise ValueError('Supplied data type %s not understood' % type(data).__name__)
    802 if stream is not None:
    803     if streaming:

ValueError: Supplied data type DataFrameTransform not understood

Does hvplot have different rules for how it looks for inputs to catalog entries? The plot works when I use driver: intake.source.derived.DataFrameTransform but not when I use the location of my own DataFrameTransform even though it works with .read().

Martin Durant · Answer 5 · Tue Mar 28 2023 23:16:58 GMT+0800 (China Standard Time)

I'm not exactly sure how hvplot determines the data type, but you should ensure that your class is a subclass of at least intake.source.DataSource. Since I didn't get your exception, it's tricky for me to say what might be going on. You might want to enter the debugger and compare the value of data when passing a standard DataFrameTransform versus your version.

Kristen Thyng · Answer 6 · Wed Mar 29 2023 00:08:39 GMT+0800 (China Standard Time)

Thank you for the suggestion. I dug into the relevant code in hvplot and the problem occurs earlier: the incoming data is not identified as an intake source because the module path of my transform doesn't start with "intake".

https://github.com/holoviz/hvplot/blob/fe39eff256f031889f089b239489d33f88658fa0/hvplot/util.py#L322-L326

I'm going to see if I can just use the built-in DataFrameTransform to avoid this issue.

Kristen Thyng · Answer 7 · Wed Mar 29 2023 00:36:47 GMT+0800 (China Standard Time)

Hm, this will for sure be a problem as soon as I want to use my DatasetTransform, which will be soon. Dang. I guess I'll need to go post at hvplot.

Martin Durant · Answer 8 · Wed Mar 29 2023 00:51:58 GMT+0800 (China Standard Time)

isinstance(data, DataSource) is the actual check. Does that fail for your class? The "intake" check just looks to see whether the library can be imported - you shouldn't need a specific name.

Kristen Thyng · Answer 9 · Wed Mar 29 2023 01:13:36 GMT+0800 (China Standard Time)

When using my transform, the code doesn't make it far enough to check isinstance(data, DataSource); it returns early because of the "if not check_library(data, 'intake')" test. But you make a good point: my transform does pass the actual check of isinstance(data, DataSource).

Martin Durant · Answer 10 · Wed Mar 29 2023 01:15:54 GMT+0800 (China Standard Time)

Oh, you are right and I am wrong:

def check_library(obj, library):
    if not isinstance(library, list):
        library = [library]
    return any([obj.__module__.split('.')[0].startswith(l) for l in library])

requires the object to have a fully-qualified path in the intake namespace. This is a totally unnecessary requirement! Can you please make an issue with hvplot?
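
To see the effect in isolation, here is a minimal sketch that reuses check_library as quoted above and applies it to two stand-in classes. The class names and the my_package.process module path are made up for illustration, and neither intake nor hvplot is imported, so this only mimics the situation rather than reproducing it.

```python
def check_library(obj, library):
    # Same logic as the hvplot helper quoted above.
    if not isinstance(library, list):
        library = [library]
    return any([obj.__module__.split('.')[0].startswith(l) for l in library])


class BuiltinLikeTransform:
    """Stand-in for a class defined inside the intake package."""
    __module__ = "intake.source.derived"


class CustomTransform(BuiltinLikeTransform):
    """Stand-in for a user subclass living in another package (hypothetical path)."""
    __module__ = "my_package.process"


# The class defined "inside intake" passes the name-based check.
print(check_library(BuiltinLikeTransform(), "intake"))  # True

# The subclass fails it, because only its own module path is inspected...
print(check_library(CustomTransform(), "intake"))  # False

# ...even though an isinstance check against the base class would succeed.
print(isinstance(CustomTransform(), BuiltinLikeTransform))  # True
```

The last two lines show the mismatch discussed here: inheritance is intact, but the name-based gate runs first and rejects the object.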

Kristen Thyng · Answer 11 · Wed Mar 29 2023 01:15:57 GMT+0800 (China Standard Time)

> The "intake" check just looks to see whether the library can be imported - you shouldn't need a specific name

In my case, any([obj.__module__.split('.')[0].startswith(l) for l in library]) is evaluated with

ipdb>  obj.__module__
'ciofs_hindcast_report.src.process'

and compares that with library which in this case is "intake":

ipdb>  library
['intake']
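
One possible direction for a less strict check, sketched here purely as an illustration: look at the module of every class in the object's inheritance chain instead of only the object's own module. The function check_library_mro and the stand-in classes are hypothetical; this is not hvplot code and not necessarily how the project would address it. Note that it also uses an exact match on the top-level package name rather than startswith.

```python
def check_library_mro(obj, library):
    """Hypothetical variant: inspect every class in the object's MRO."""
    if not isinstance(library, list):
        library = [library]
    # Top-level package name of each class the object inherits from.
    top_levels = {cls.__module__.split('.')[0] for cls in type(obj).__mro__}
    return any(name in top_levels for name in library)


class DataSourceStandIn:
    """Stand-in for a base class defined inside intake."""
    __module__ = "intake.source.base"


class MyTransform(DataSourceStandIn):
    """Stand-in for a subclass defined in a user package."""
    __module__ = "ciofs_hindcast_report.src.process"


class Unrelated:
    """Stand-in for a class with no intake ancestry (hypothetical path)."""
    __module__ = "ciofs_hindcast_report.src.other"


print(check_library_mro(MyTransform(), "intake"))  # True
print(check_library_mro(Unrelated(), "intake"))    # False
```

With this approach a subclass defined outside the intake namespace is still recognized through its base class, while unrelated objects are rejected as before.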