stac-utils / pystac

Python library for working with any SpatioTemporal Asset Catalog (STAC)

Home Page: https://pystac.readthedocs.io

Multiple saves of the collection.json (saves collection for every item)

bellie888 opened this issue

Hi, we are building a STAC catalog with PySTAC over roughly 1,000,000 COG items. Each collection holds 20 to 100 items. I noticed that every time a new item is added to a collection, the collection is saved again, so a collection may be saved 100 times while it is being built. Since saving a file to S3 is relatively time consuming, is there some way to build the whole collection.json file and then save it just once? As far as I can see at the moment (still learning), this behaviour seems to be built into PySTAC. Cheers from Australia.
PS: we have a custom StacIO class. Maybe it is not written very well; any suggestions?

Following your description, I cannot reproduce this (the example requires shapely):

import datetime
from typing import Any, Dict

import shapely.geometry

from pystac import Collection, Extent, Item, SpatialExtent, TemporalExtent
from pystac.stac_io import DefaultStacIO
from pystac.utils import HREF

BBOX = [-180, -90, 180, 90]
GEOMETRY = shapely.geometry.mapping(shapely.geometry.box(*BBOX))


class ReproducingStacIO(DefaultStacIO):
    def save_json(
        self, dest: HREF, json_dict: Dict[str, Any], *args: Any, **kwargs: Any
    ) -> None:
        print(f"Saving JSON to {dest}")
        return super().save_json(dest, json_dict, *args, **kwargs)


collection = Collection(
    "issue-1251",
    "Attempting to reproduce",
    extent=Extent(
        spatial=SpatialExtent(bboxes=[BBOX]),
        temporal=TemporalExtent(intervals=[[datetime.datetime.utcnow(), None]]),
    ),
)
collection._stac_io = ReproducingStacIO()

collection.normalize_and_save("/tmp")

for i in range(10):
    item = Item(
        id=f"item-{i}",
        geometry=GEOMETRY,
        bbox=BBOX,
        datetime=datetime.datetime.utcnow(),
        properties={},
    )
    print(f"Adding item {item.id}")
    collection.add_item(item)  # Does this trigger a write?

collection.normalize_and_save("/tmp")

Produces:

$ python issue_1251.py
Saving JSON to /tmp/collection.json
Adding item item-0
Adding item item-1
Adding item item-2
Adding item item-3
Adding item item-4
Adding item item-5
Adding item item-6
Adding item item-7
Adding item item-8
Adding item item-9
Saving JSON to /tmp/item-0/item-0.json
Saving JSON to /tmp/item-1/item-1.json
Saving JSON to /tmp/item-2/item-2.json
Saving JSON to /tmp/item-3/item-3.json
Saving JSON to /tmp/item-4/item-4.json
Saving JSON to /tmp/item-5/item-5.json
Saving JSON to /tmp/item-6/item-6.json
Saving JSON to /tmp/item-7/item-7.json
Saving JSON to /tmp/item-8/item-8.json
Saving JSON to /tmp/item-9/item-9.json
Saving JSON to /tmp/collection.json

So I'd look into your custom StacIO as the culprit. Closing as cant-reproduce, but please re-open if you find otherwise. 🍻
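
If it helps with tracking that down, here is a minimal sketch of one way to check whether a custom StacIO is being asked to write the same file more than once. It is only an illustration: `CountingSaveMixin`, `FakeS3IO`, and the `s3://bucket/...` paths are hypothetical names, not part of PySTAC, and `FakeS3IO` is just an in-memory stand-in for your own class. The only thing it assumes about your class is a `save_json(dest, json_dict, *args, **kwargs)` method like the one overridden in the example above.

```python
from collections import Counter
from typing import Any, Dict


class CountingSaveMixin:
    """Hypothetical mixin: counts save_json calls per destination, then delegates."""

    def __init__(self, *args: Any, **kwargs: Any) -> None:
        super().__init__(*args, **kwargs)
        self.save_counts: Counter = Counter()

    def save_json(
        self, dest: Any, json_dict: Dict[str, Any], *args: Any, **kwargs: Any
    ) -> None:
        # Record the write before handing it to the real implementation.
        self.save_counts[str(dest)] += 1
        return super().save_json(dest, json_dict, *args, **kwargs)

    def repeated_saves(self) -> Dict[str, int]:
        """Return only the destinations that were written more than once."""
        return {dest: n for dest, n in self.save_counts.items() if n > 1}


class FakeS3IO:
    """Hypothetical stand-in for a custom StacIO; keeps documents in a dict."""

    def __init__(self) -> None:
        self.store: Dict[str, Dict[str, Any]] = {}

    def save_json(
        self, dest: Any, json_dict: Dict[str, Any], *args: Any, **kwargs: Any
    ) -> None:
        self.store[str(dest)] = json_dict


class CountingFakeS3IO(CountingSaveMixin, FakeS3IO):
    """The mixin goes first so its save_json wraps the real one."""


io = CountingFakeS3IO()

# Simulate the reported symptom: the collection is rewritten after every item.
for i in range(3):
    io.save_json(f"s3://bucket/item-{i}/item-{i}.json", {"id": f"item-{i}"})
    io.save_json("s3://bucket/collection.json", {"id": "collection"})

print(io.repeated_saves())
```

In practice you would put your own StacIO class where `FakeS3IO` is, run your normal build, and look at `repeated_saves()` afterwards. If collection.json shows up there with a count close to the number of items, the extra writes are being requested somewhere in your build loop or IO class rather than by a single final save.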