stac-utils / pystac

Python library for working with any SpatioTemporal Asset Catalog (STAC)

Home Page: https://pystac.readthedocs.io

Collections with many items saving time issue

santilland opened this issue · comments

Using pystac[validation] 1.8.3

I am creating collections with a large number of items and was surprised by how long it takes to save them. In some very preliminary tests, the save time seems to grow much faster than linearly with the number of items in a collection.
For example, saving a catalog with one collection takes, depending on item count:

Items    Time
200      0.225s
2000     5.439s
10000    105.975s

If I create 5 collections with 2000 items each, the saving time is 25s. So the same total number of items is being saved, but it takes about a quarter of the time when they are split across multiple collections.

Any ideas why this could be happening?

Here is a very rough testing script:


import time
from datetime import (
    datetime,
    timedelta,
)

from pystac import (
    Item,
    Catalog,
    CatalogType,
    Collection,
    Extent,
    SpatialExtent,
    TemporalExtent,
)
from pystac.layout import TemplateLayoutStrategy

# One item per day; numdays controls how many items end up in each collection.
numdays = 10000
number_of_collections = 1
base = datetime.today()
times = [base - timedelta(days=x) for x in range(numdays)]

catalog = Catalog(
    id="test",
    description="catalog to test performance",
    title="performance test catalog",
    catalog_type=CatalogType.RELATIVE_PUBLISHED,
)

spatial_extent = SpatialExtent([
    [-180.0, -90.0, 180.0, 90.0],
])
temporal_extent = TemporalExtent([[datetime.now()]])
extent = Extent(spatial=spatial_extent, temporal=temporal_extent)

# Build the collections and attach one item per timestamp.
for idx in range(number_of_collections):
    collection = Collection(
        id=f"big_collection{idx}",
        title="collection for items",
        description="some desc",
        extent=extent,
    )
    for t in times:
        item = Item(
            id=t.isoformat(),
            bbox=[-180.0, -90.0, 180.0, 90.0],
            properties={},
            geometry=None,
            datetime=t,
        )
        collection.add_item(item)
    collection.update_extent_from_items()

    catalog.add_child(collection)

# Lay items out as <collection>/<year> and compute all HREFs before saving.
strategy = TemplateLayoutStrategy(item_template="${collection}/${year}")
catalog.normalize_hrefs("https://exampleurl.com/", strategy=strategy)

# Time only the save step.
start_time = time.perf_counter()
catalog.save(dest_href="../test_build/")
end_time = time.perf_counter()
print(f"Saving Time : {end_time - start_time:0.6f}")

My timings on pystac (master branch) are as follows:

Collections  Items  Time
1            200    1.247s
1            2000   12.880s
1            10000  70.173s
5            2000   61.693s

It's weird that the times look reasonable (nearly linear) in my tests, but not in yours. The differences are a different machine, and that you installed the validation requirements from PyPI, while I'm running from master.

Strange, I was getting the same slow saving in the GitHub Actions workflow (which uses ubuntu-latest). My machine is also Ubuntu; what OS are you on?
Maybe some file-handling issue?

I'm running Ubuntu 22.04.2 LTS through WSL.

I also just ran pystac 1.8.3 installed from PyPI with the validation requirements, and the timings are similar to the ones I reported above. Weird...

Thanks for the report, and thanks @m-mohr for also taking a look. My timings (macOS), working from main with Python 3.11:

Collections  Items  Time
1            200    0.67s
1            2000   9.36s
1            10000  102.65s
5            2000   45.48s

So I'm seeing performance similar to @santilland.

I will note that there are some known issues around serializing a STACObject to a dictionary, specifically around link resolution and the network requests it can fire off: #960. This could be part of the issue, though I don't quite know how yet.
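If it is link resolution, one way to check would be to swap in an I/O implementation that counts reads and writes before calling save. A rough diagnostic sketch (DefaultStacIO, StacIO, read_text, write_text, and StacIO.set_default are real pystac APIs; the CountingStacIO name and class-level counters are just my own invention):

from pystac.stac_io import DefaultStacIO, StacIO

class CountingStacIO(DefaultStacIO):
    # Hypothetical diagnostic wrapper that counts how often pystac reads and writes.
    reads = 0
    writes = 0

    def read_text(self, source, *args, **kwargs):
        CountingStacIO.reads += 1
        return super().read_text(source, *args, **kwargs)

    def write_text(self, dest, txt, *args, **kwargs):
        CountingStacIO.writes += 1
        return super().write_text(dest, txt, *args, **kwargs)

# Route all subsequent pystac I/O through the counting implementation.
StacIO.set_default(CountingStacIO)

# ...build the catalog as in the script above, then:
# catalog.save(dest_href="../test_build/")
# print(f"reads: {CountingStacIO.reads}, writes: {CountingStacIO.writes}")

If the read count stays near zero while the write count roughly matches the number of items, the time is going into serialization itself rather than into resolving links over the network.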

Weird. We were just wondering whether it could be an issue with memory consumption / swapping.
My machine is not maxed out (one of the CPU cores is close, but memory is not an issue at all thanks to 128 GB), but @santilland is maxing out both memory and one of the CPU cores.

My timings are from Python 3.10.6

Yeah, I'm seeing my run get CPU-bound, but not memory-bound. 🤔

I think @m-mohr mentioned a possible swap issue because my machine is generally close to the edge already, but I did not see a notable increase in memory use. One core is always at 100% while saving, though.

Yeah, maybe it's something else. I'm on WSL and not on native Ubuntu, so maybe some kind of I/O thing? Or something completely different :-) I guess profiling on an affected machine may give some insights...
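In case it helps, here is a minimal way to profile just the save step of the test script above with the standard-library cProfile (the sort key and the number of printed rows are arbitrary choices):

import cProfile
import pstats

# Profile only catalog.save(), using the catalog built in the script above.
profiler = cProfile.Profile()
profiler.enable()
catalog.save(dest_href="../test_build/")
profiler.disable()

# Show the 20 functions with the highest cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)

Seeing whether the top entries are JSON encoding, link resolution, or filesystem writes should narrow down where the CPU time is going.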