open-space-collective / open-space-toolkit-physics

Physical units, time, reference frames, environment modeling.

Home Page: https://open-space-collective.github.io/open-space-toolkit-physics/

[feat] do not pull manifest file every time something is run

kyle-cochran opened this issue · comments

Is your feature request related to a problem? Please describe.
We pull the manifest file from remote every single time we need to access data. This is so we can check the age of any local data against what's on the remote to know if we need to update anything.

This is way overkill. We can pull the manifest like once per day and be fine. Implement a manifest age concept so we don't have to constantly do IO.

Starting this one today. Thoughts on implementation:

We pull the manifest file because it records when each file in OSTk Data was last updated. This is the explicit trade-off we make for data file management: instead of pulling the actual data files [large] to check whether they are newer than the local copies, we pull manifest.json [small], get the same information for cheaper, and only then pull the data if needed.
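A minimal sketch of that comparison, assuming the manifest has already been parsed into a dictionary whose entries carry a "last_update" ISO 8601 timestamp (as in the example entry further down this thread). The function name `needs_refresh`, the UTC assumption, and the use of the local file's modification time are illustrative choices, not OSTk's actual implementation.

```python
from datetime import datetime, timezone
from pathlib import Path


def needs_refresh(manifest: dict, entry_name: str, local_file: Path) -> bool:
    """Return True if the manifest reports a newer remote copy than the local file."""
    if not local_file.exists():
        return True  # nothing local yet, so the data has to be fetched

    # Assumption: manifest timestamps are naive ISO 8601 strings expressed in UTC.
    remote_update = datetime.fromisoformat(
        manifest[entry_name]["last_update"]
    ).replace(tzinfo=timezone.utc)

    # Assumption: the local file's mtime reflects when it was last downloaded.
    local_update = datetime.fromtimestamp(local_file.stat().st_mtime, tz=timezone.utc)

    return remote_update > local_update
```

Only the small manifest is consulted here; the large data file is touched solely to read its modification time.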

However, this simply moves the problem up one level. How do we know when it's worth it to pull the manifest file to check for updates? The current strategy is to pull it every time we load any data into OSTk. This is overkill for most situations where we will already have the necessary data files downloaded.

So how do we decide when to update the manifest file?

I can think of two options immediately:

1.) Add a new environment configuration: OSTK_PHYSICS_DATA_UPDATE_MAX_FREQUENCY which limits how often we check for new data updates. This is a simple throttle on the manifest check. We check the manifest age against the current time and only update if it's been long enough.

2.) Add an entry to the manifest that represents the manifest itself:

    "manifest": {
        "path": "",
        "filenames": "manifest.json",
        "file_meta": "This file. Entry tracks the expected manifest update time on the remote.",
        "last_update": "2023-10-31T12:02:25.052229",
        "next_update_check": "2023-10-31T18:02:25.052742",
        "check_frequency": "6 hours"
    },

From this entry, we can use the "next_update_check" field to predict when we can expect the file to be updated on the remote (i.e. OSTk Data repo). Normally, the "next_update_check" field is what the OSTk Data automation uses to know when it should next reach out to the primary data sources, but it also serves as a decent prediction for when a file might be updated.
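As a rough illustration of option 2, the sketch below reads the "next_update_check" field from a manifest shaped like the entry above and reports whether the remote is expected to have changed. The function name `remote_update_expected` is hypothetical, and the timestamps are assumed to be naive values in a single common time zone.

```python
import json
from datetime import datetime

# Trimmed copy of the example entry, wrapped in an object so it parses as JSON.
MANIFEST_TEXT = """
{
    "manifest": {
        "path": "",
        "filenames": "manifest.json",
        "last_update": "2023-10-31T12:02:25.052229",
        "next_update_check": "2023-10-31T18:02:25.052742",
        "check_frequency": "6 hours"
    }
}
"""


def remote_update_expected(manifest: dict, now: datetime) -> bool:
    """Return True once `now` has reached the manifest's own "next_update_check" time."""
    next_check = datetime.fromisoformat(manifest["manifest"]["next_update_check"])
    return now >= next_check


manifest = json.loads(MANIFEST_TEXT)
print(remote_update_expected(manifest, datetime(2023, 10, 31, 15, 0)))  # False
print(remote_update_expected(manifest, datetime(2023, 10, 31, 19, 0)))  # True
```

Until "next_update_check" has passed, the locally cached manifest can be reused without any network IO.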

#2 allows for better default behavior I think, but #1 allows people who know what they're doing to limit the amount of IO.

My vote would be for having a field in the manifest that specifies when it was last updated, and an environment variable to prescribe the update frequency as a duration, with a sensible default value (24 hours?).

This is implemented and merged. Ended up implementing both. The managers now look for a "manifest" entry within the manifest and use its "next_update_check" key to determine when to expect the next update on the remote. However, we still will not fetch unless DATA_REFRESH_RATE_H hours have passed.
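The combined rule might look roughly like the following. This is a sketch, not the merged code: the function name `should_fetch_manifest` is hypothetical, the throttle is assumed to be measured from the last local fetch, and the 24-hour default is only the value floated earlier in this thread.

```python
from datetime import datetime, timedelta


def should_fetch_manifest(
    manifest: dict,
    last_fetch: datetime,
    now: datetime,
    refresh_rate_h: float = 24.0,  # assumed default; stands in for DATA_REFRESH_RATE_H
) -> bool:
    """Fetch only when the remote is expected to have changed AND the throttle has elapsed."""
    next_check = datetime.fromisoformat(manifest["manifest"]["next_update_check"])
    remote_update_expected = now >= next_check

    # Assumption: the throttle counts hours since the manifest was last fetched locally.
    throttle_elapsed = now - last_fetch >= timedelta(hours=refresh_rate_h)

    return remote_update_expected and throttle_elapsed


manifest = {"manifest": {"next_update_check": "2023-10-31T18:02:25.052742"}}
last_fetch = datetime(2023, 10, 31, 12, 5)

# Past "next_update_check", but fewer than 24 hours since the last fetch.
print(should_fetch_manifest(manifest, last_fetch, datetime(2023, 10, 31, 19, 0)))  # False
# Past "next_update_check" and more than 24 hours since the last fetch.
print(should_fetch_manifest(manifest, last_fetch, datetime(2023, 11, 1, 13, 0)))  # True
```

Requiring both conditions keeps the manifest's own schedule as the default signal while letting the refresh rate act as a hard cap on IO.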