stac-utils / pystac

Python library for working with any SpatioTemporal Asset Catalog (STAC)

Home Page: https://pystac.readthedocs.io

New pickling methods load all absolute links

emmanuelmathot opened this issue

Since PR #1285, and more specifically this change, it seems that a deepcopy of an item loads all links that have an absolute href.
Is this intended, or am I missing something?
This causes a problem when loading assets with the get_assets method, which first makes a deep copy of the STAC object, and that deep copy uses pickling.
When I load an item with unreachable links (e.g. an s3 URL with no custom IO reader set) and try to list the assets, it raises an error:

    self.assets = list(
rio_tiler/io/stac.py:149: in _get_assets
    for asset, asset_info in stac_item.get_assets().items():
venv/lib/python3.11/site-packages/pystac/asset.py:300: in get_assets
    return {
venv/lib/python3.11/site-packages/pystac/asset.py:301: in <dictcomp>
    k: deepcopy(v)
/usr/lib/python3.11/copy.py:172: in deepcopy
    y = _reconstruct(x, memo, *rv)
/usr/lib/python3.11/copy.py:271: in _reconstruct
    state = deepcopy(state, memo)
/usr/lib/python3.11/copy.py:146: in deepcopy
    y = copier(x, memo)
/usr/lib/python3.11/copy.py:231: in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
/usr/lib/python3.11/copy.py:161: in deepcopy
    rv = reductor(4)
venv/lib/python3.11/site-packages/pystac/item.py:179: in __getstate__
    d["links"] = [
venv/lib/python3.11/site-packages/pystac/item.py:180: in <listcomp>
    link.to_dict() if link.get_href() else link for link in d["links"]
venv/lib/python3.11/site-packages/pystac/link.py:181: in get_href
    and self.owner.get_root()
venv/lib/python3.11/site-packages/pystac/stac_object.py:326: in get_root
    root_link.resolve_stac_object()
venv/lib/python3.11/site-packages/pystac/link.py:330: in resolve_stac_object
    obj = stac_io.read_stac_object(target_href, root=root)
venv/lib/python3.11/site-packages/pystac/stac_io.py:234: in read_stac_object
    d = self.read_json(source, *args, **kwargs)
venv/lib/python3.11/site-packages/pystac/stac_io.py:205: in read_json
    txt = self.read_text(source, *args, **kwargs)
venv/lib/python3.11/site-packages/pystac/stac_io.py:282: in read_text
    return self.read_text_from_href(href)
venv/lib/python3.11/site-packages/pystac/stac_io.py:300: in read_text_from_href
    with urlopen(req) as f:
/usr/lib/python3.11/urllib/request.py:216: in urlopen
    return opener.open(url, data, timeout)
/usr/lib/python3.11/urllib/request.py:519: in open
    response = self._open(req, data)
/usr/lib/python3.11/urllib/request.py:541: in _open
    return self._call_chain(self.handle_open, 'unknown',
/usr/lib/python3.11/urllib/request.py:496: in _call_chain
    result = func(*args)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <urllib.request.UnknownHandler object at 0x74a76915ba50>
req = <urllib.request.Request object at 0x74a768cd8dd0>

    def unknown_open(self, req):
        type = req.type
>       raise URLError('unknown url type: %s' % type)
E       urllib.error.URLError: <urlopen error unknown url type: s3>
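
The upper frames of the traceback (`rv = reductor(4)` followed by `__getstate__`) reflect how `copy.deepcopy` works in general: for an object without a `__deepcopy__` method it calls `__reduce_ex__(4)`, which in turn calls the object's `__getstate__`. Here is a minimal standard-library sketch of that chain. The `Tracked` class is a hypothetical stand-in used only to observe the call; it is not part of pystac.

```python
import copy


class Tracked:
    """Hypothetical class used only to observe what deepcopy calls."""

    calls = []  # class-level log of __getstate__ invocations

    def __init__(self, name):
        self.name = name

    def __getstate__(self):
        # deepcopy() reaches this method through __reduce_ex__(4),
        # so any work done here runs on every deep copy.
        Tracked.calls.append(self.name)
        return self.__dict__.copy()


original = Tracked("item")
clone = copy.deepcopy(original)

print(Tracked.calls)      # names of the objects whose state was requested
print(clone is original)  # the clone is a distinct object
```

Anything a custom `__getstate__` does, including I/O, therefore happens each time the object is deep-copied.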

It looks to me like we've gotten bit by get_link()'s default to transform_hrefs=True again (for previous art, see #960). I'll open a PR with a fix.

I'm not sure this is true, see follow-on comment for more info.

@emmanuelmathot can you provide a minimum-reproducible example so I can be sure I'm testing against the same problem? I was not able to reproduce the behavior you described with this test:

import copy
import pystac
from pystac import Item

def test_non_existent_link_during_deepcopy(item: Item) -> None:
    item.add_link(pystac.Link("non-existent-asset", "../not-a-dir/not-a-file"))
    item = copy.deepcopy(item)
    assert item.get_single_link("non-existent-asset").href == "../not-a-dir/not-a-file"

@emmanuelmathot do you have an example that includes creating that test file? I'd like to be able to dig into the process that's actually doing the href modifications.

No, I do not, but a very simple item with a single link that has an absolute s3 href triggers the error.

This is really similar to what you mentioned here:

It looks to me like we've gotten bit by get_link()'s default to transform_hrefs=True again (for previous art, see #960).

When I pass transform_href=False in the __getstate__ method:

    d["links"] = [
        link.to_dict(transform_href=False)
        if link.get_href(transform_href=False)
        else link
        for link in d["links"]
    ]

the error goes away.
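
To illustrate why that flag matters, here is a small standard-library sketch. It assumes, as the traceback suggests, that the transforming path of `get_href` needs the owner's root and that this lookup is what triggers the read. `ToyOwner`, `ToyLink`, and `ToyItem` are hypothetical stand-ins rather than pystac's classes, and the toy state stores plain href strings instead of link dictionaries to keep it short.

```python
import copy


class ToyOwner:
    """Hypothetical owner whose root can only be reached through I/O."""

    def __init__(self):
        self.reads = 0  # counts simulated read attempts

    def get_root(self):
        self.reads += 1
        raise OSError("unknown url type: s3")


class ToyLink:
    """Hypothetical stand-in for a link; not pystac's Link class."""

    def __init__(self, href, owner):
        self.href = href
        self.owner = owner

    def get_href(self, transform_href=True):
        # Only the transforming path needs the root, so only it does I/O.
        if transform_href:
            self.owner.get_root()
        return self.href


class ToyItem:
    """Hypothetical item that customizes the state used by deepcopy."""

    def __init__(self):
        self.owner = ToyOwner()
        self.links = [ToyLink("s3://bucket/item.json", self.owner)]

    def __getstate__(self):
        state = self.__dict__.copy()
        # transform_href=False returns the stored href without touching the root.
        state["links"] = [
            link.get_href(transform_href=False) for link in self.links
        ]
        return state


item = ToyItem()
clone = copy.deepcopy(item)
reads_during_copy = item.owner.reads  # no simulated reads happened

# For contrast, the default (transforming) call does hit the simulated reader.
try:
    item.links[0].get_href()
    raised = False
except OSError:
    raised = True
```

In this sketch the deep copy completes without any read attempt, while the default `get_href()` call fails in the same way as the `s3` read in the traceback.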

@emmanuelmathot got it, thanks. Fix in #1337 which we'll release after merging.