intake / intake

Intake is a lightweight package for finding, investigating, loading and disseminating data.

Home Page: https://intake.readthedocs.io/

Making saving catalog more robust

kthyng opened this issue · comments

Hi! Right now, saving a catalog relies on serialize in base.py, which in turn relies on _captured_init_kwargs. This has worked for me in the past, but now, in what seems like the same situation, I cannot save a new batch of catalogs because all the important information is stored in _captured_init_args instead of _captured_init_kwargs. I have looked through many files and still cannot tell what controls which of the two the inputs end up in.

    def serialize(self):
        """
        Produce YAML version of this catalog.

        Note that this is not the same as ``.yaml()``, which produces a YAML
        block referring to this catalog.
        """
        import yaml
        output = {"metadata": self.metadata, "sources": {},
                  "name": self.name, "description": self.description}
        for key, entry in self._entries.items():
            kw = entry._captured_init_kwargs.copy()
            kw.pop('catalog', None)
            kw['parameters'] = {k.name: k.__getstate__()['kwargs'] for k in kw.get('parameters', [])}
            try:
                if issubclass(kw['driver'], DataSourceBase):
                    kw['driver'] = ".".join([kw['driver'].__module__, kw['driver'].__name__])
            except TypeError:
                pass  # ignore exception for a string input
            output["sources"][key] = kw
        return yaml.dump(output)

    def save(self, url, storage_options=None):
        """
        Output this catalog to a file as YAML

        Parameters
        ----------
        url : str
            Location to save to, perhaps remote
        storage_options : dict
            Extra arguments for the file-system
        """
        from fsspec import open_files
        with open_files([url], **(storage_options or {}), mode='wt')[0] as f:
            f.write(self.serialize())

Perhaps a better approach would be to make the serialize function more robust, so that it allows for the information being in either place (or in individually saved attributes instead of the "captured" ones)? The only suggestion I can think of at this point is a simple combination of the two. What do you think? I am not sure what all the related issues are.

Thanks.
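
For illustration, the "simple combination of the two" could look roughly like the sketch below. It is not code from intake: merged_init_kwargs and ToySource are hypothetical names, and the sketch assumes that the positional values can be matched to parameter names by inspecting the __init__ signature of the source class.

```python
import inspect


def merged_init_kwargs(cls, captured_args, captured_kwargs):
    """Name the captured positional arguments and merge them with the kwargs."""
    sig = inspect.signature(cls.__init__)
    self_name = next(iter(sig.parameters))  # first parameter, normally "self"
    # bind_partial tolerates missing arguments; None stands in for the instance.
    bound = sig.bind_partial(None, *captured_args, **captured_kwargs)
    merged = {}
    for name, value in bound.arguments.items():
        kind = sig.parameters[name].kind
        if name == self_name:
            continue
        if kind is inspect.Parameter.VAR_POSITIONAL:
            raise TypeError(f"cannot name the *{name} arguments of {cls.__name__}")
        if kind is inspect.Parameter.VAR_KEYWORD:
            merged.update(value)  # flatten **kwargs back into the mapping
        else:
            merged[name] = value
    return merged


class ToySource:
    # Hypothetical source class, used only to exercise the helper.
    def __init__(self, dataset_id, protocol="tabledap", metadata=None):
        self.dataset_id = dataset_id
        self.protocol = protocol
        self.metadata = metadata


print(merged_init_kwargs(ToySource, ("abc",), {"protocol": "griddap"}))
# {'dataset_id': 'abc', 'protocol': 'griddap'}
```

A serialize built this way would only ever write keyword arguments, so the YAML layout would stay as it is today. The cost is that a source whose __init__ accepts *args cannot be handled and raises instead.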

Actually, there's no real reason we can't support ordered args as well as kwargs. The YAML stub for a source has "args" which is a key-value map treated like a kwargs dict. One of its entries could be a special key that gets turned into *args.
This would take some development. It may make more sense to find the reason that your particular source is special.
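
A rough sketch of the special-key idea, to make it concrete. Nothing here exists in intake: the key name "__args__", the helpers pack_args and instantiate, and ToySource are all made up for illustration.

```python
POSITIONAL_KEY = "__args__"  # hypothetical key name, not something intake defines


def pack_args(captured_args, captured_kwargs):
    """Build the key-value map that would be written under "args" in the YAML."""
    packed = dict(captured_kwargs)
    if captured_args:
        packed[POSITIONAL_KEY] = list(captured_args)
    return packed


def instantiate(cls, entry_args):
    """Rebuild an object, turning the special key back into *args."""
    kwargs = dict(entry_args)  # copy so the caller's mapping is untouched
    positional = kwargs.pop(POSITIONAL_KEY, [])
    return cls(*positional, **kwargs)


class ToySource:
    # Hypothetical source class, used only for the round trip below.
    def __init__(self, dataset_id, protocol="tabledap"):
        self.dataset_id = dataset_id
        self.protocol = protocol


packed = pack_args(("abc",), {"protocol": "griddap"})
source = instantiate(ToySource, packed)
print(source.dataset_id, source.protocol)  # abc griddap
```

The development cost mentioned above would come from every reader of the "args" map needing to know about the reserved key, and from making sure no source already uses that name as a real keyword.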

The catalog I've been making is from intake-erddap: https://github.com/axiom-data-science/intake-erddap. I can't figure out what is different about it that makes it so I can't save it! I can add a MWE tomorrow.

I am assuming you have some sources, then, like https://github.com/axiom-data-science/intake-erddap/blob/main/intake_erddap/erddap.py#L43 .
This has some positional arguments (dataset_id, protocol) which are presumably generated by the parent catalog instance. I would suggest that they should be passed as dataset_id=, protocol= kwargs, and then you will have no problems. You could always edit the captured args, I suppose, to make sure this is the case before serialising (or that library could provide its own serialisation).
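
To show why the calling style matters, here is a toy sketch. It assumes the capture simply records whatever was passed at construction time, which is what the attribute names suggest. CaptureInitMixin and ToySource are stand-ins written for this example, not the actual intake or intake-erddap classes.

```python
class CaptureInitMixin:
    """Toy stand-in that records constructor inputs; not intake's own class."""

    def __new__(cls, *args, **kwargs):
        obj = super().__new__(cls)
        obj._captured_init_args = args      # everything passed positionally
        obj._captured_init_kwargs = kwargs  # everything passed by keyword
        return obj


class ToySource(CaptureInitMixin):
    # Hypothetical source with the two parameters discussed above.
    def __init__(self, dataset_id, protocol="tabledap"):
        self.dataset_id = dataset_id
        self.protocol = protocol


by_position = ToySource("abc", "tabledap")
by_keyword = ToySource(dataset_id="abc", protocol="tabledap")

print(by_position._captured_init_kwargs)  # {}
print(by_keyword._captured_init_kwargs)   # {'dataset_id': 'abc', 'protocol': 'tabledap'}
```

Both objects behave identically afterwards, but a serialize that reads only _captured_init_kwargs would find nothing to write for the first one.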

Ah, the ERDDAP catalog currently builds catalog entries using LocalCatalogEntry, but then the inputs don't end up in _captured_init_kwargs. I am playing around with it, and I see that if I instead build up the entries using TableDAPSource from erddap.py, I am able to pass the inputs as keyword arguments like you suggest. They are then present in _captured_init_kwargs, so the catalog can be saved.

Is it incorrect to use LocalCatalogEntry instead of making our own entries with TableDAPSource?

Using the entries is fine, and usually expected. A catalog always has entries of some sort, which resolve to source instances only on access. Normally, all the kwargs you are after are in the ._open_args attribute of a LocalCatalogEntry. As far as I can tell, the erddap cat does make a normal kwargs dict here, so I really can't tell where the *args you are struggling with are coming from. Could it be because some attributes of the entry are assigned after creation in that same code block?

@lukecampbell has been working on this, and indeed all he had to do was switch to keyword arguments for it to work:

https://github.com/axiom-data-science/intake-erddap/blob/76ec33e638d87797c041392b9ace8a40a5173128/intake_erddap/erddap_cat.py#L261-L268

I'll close this now. Thank you!