google / weather-tools

Tools to make weather data accessible and useful.

Home Page: https://weather-tools.readthedocs.io/


`weather-dl`: Implement BigQuery manifest

alxmrs opened this issue · comments

Similar to #13, except the manifest should be a BigQuery table. This makes sense if users choose to use weather-mv and want their data queryable from one place.
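A minimal sketch of what a BigQuery-backed manifest record might look like. The field names (`location`, `status`, `user`, etc.) are hypothetical, chosen here for illustration; the actual schema used by `weather-dl` would be defined in the implementation:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


# Hypothetical record shape for a BigQuery-backed manifest; the real
# schema adopted by weather-dl may differ.
@dataclass
class ManifestRecord:
    location: str  # target URI of the downloaded shard
    status: str    # e.g. 'scheduled', 'in-progress', 'success', 'failure'
    user: str
    error: str = ""
    updated: str = ""

    def to_row(self) -> dict:
        """Convert to a JSON-serializable row for BigQuery streaming inserts."""
        row = asdict(self)
        row["updated"] = row["updated"] or datetime.now(timezone.utc).isoformat()
        return row


# Writing rows would then use the google-cloud-bigquery client, e.g.:
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   errors = client.insert_rows_json("project.dataset.manifest",
#                                    [record.to_row()])
record = ManifestRecord(location="gs://bucket/era5/2020-01-01.nc",
                        status="in-progress", user="alice")
```

The appeal over a key-value store is that status rollups become plain SQL over this table rather than per-document reads.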

Since we're taking the time to fix this, wouldn't Cloud Logging be a better substitute for BQ, GCS, and Firebase?

I'm less familiar with Cloud Logging. How would it suit our needs here, which are primarily to have a central record of our download status?

Cloud Logging is Stackdriver's new name, where Dataflow pipeline logs show up too. Since it's very unlikely that the users are interested in any complex analytics on the manifest logs, BQ is not a great option. Stackdriver is cheaper, easier to use, and the permissions are less complex to manage.

I've created and tested a tiny Stackdriver manifest implementation here. The logs will show up in Cloud Console like this:
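To illustrate the idea (the linked implementation is authoritative), a structured manifest entry could be emitted as a JSON payload via Cloud Logging's `log_struct`. The payload field names below are hypothetical:

```python
from datetime import datetime, timezone


def manifest_log_entry(location: str, status: str,
                       user: str, error: str = "") -> dict:
    """Build a structured (JSON) payload for a manifest log entry.

    Field names here are hypothetical; the actual Stackdriver manifest
    implementation linked above defines its own schema.
    """
    return {
        "location": location,
        "status": status,
        "user": user,
        "error": error,
        "updated": datetime.now(timezone.utc).isoformat(),
    }


# Emitting the entry would use the google-cloud-logging client, e.g.:
#   import google.cloud.logging
#   client = google.cloud.logging.Client()
#   logger = client.logger("weather-dl-manifest")
#   logger.log_struct(manifest_log_entry("gs://bucket/x.nc",
#                                        "success", "alice"),
#                     severity="INFO")
```

Structured payloads keep the entries filterable in the Logs Explorer (e.g. by `jsonPayload.status`), which is what makes them usable as a lightweight manifest.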

[Screenshot: manifest log entries in Cloud Console, 2022-09-18]

Whatever the service / implementation, I ideally want to be able to answer the following questions quickly (say, with a local script or even a dashboard):

  • What data is currently in progress? What's queued for download?
  • What data have we already ingested?
  • Approximately, how long will it take to finish this job?
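Assuming a BigQuery manifest table with hypothetical `location`, `status`, and `updated` columns (names invented here for illustration), the three questions above translate directly into queries along these lines:

```python
# Hypothetical table and column names; the real manifest schema may differ.
MANIFEST = "project.dataset.manifest"

# 1. What data is in progress or queued for download?
IN_FLIGHT = f"""
SELECT location, status, updated
FROM `{MANIFEST}`
WHERE status IN ('in-progress', 'scheduled')
ORDER BY updated DESC
"""

# 2. What data have we already ingested?
INGESTED = f"""
SELECT location, updated
FROM `{MANIFEST}`
WHERE status = 'success'
"""

# 3. A rough ETA: remaining shards divided by the completion rate
#    over the last hour (NULL if nothing completed in that window).
ETA = f"""
SELECT
  COUNTIF(status IN ('in-progress', 'scheduled')) /
    NULLIF(COUNTIF(status = 'success'
             AND updated > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)),
           0) AS hours_remaining
FROM `{MANIFEST}`
"""
```

Each query is a single table scan, so a small CLI or dashboard could run them on demand without the per-document read limits a key-value store imposes.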

Do you think that Cloud Logging would help us answer these questions? The Manifest is not really a logging system, but rather a database that we intend to query to answer these questions.

To put this feature in terms of a problem: Right now, our default implementation for the manifest is Firestore. We have a CLI script we include with weather-dl that checks what the current status is:
https://github.com/google/weather-tools/blob/main/weather_dl/download-status

Unfortunately, due to our database choice and the current number of records, this script is not performant and hits API limits right away. The inspiration for this issue was:

  • it would make much more sense if we used a traditional RDBMS instead of a key-value store
  • we'd prefer to use BigQuery rather than other DB implementations to start, since our project is pretty wedded to BQ.

How about using custom metrics and having a monitoring dashboard?
This would require modifying the existing pipeline code so that it updates our custom metrics, and the dashboard would then show everything in one place.
This monitoring could also be extended into an alerting mechanism.

I like the idea of using metrics to answer some of these questions. However, it looks like custom metrics only last 30 days, so they wouldn't solve the problem of having a record of what was downloaded.

Further, it looks like we could run into quota limits when storing values as metrics, whereas a traditional DB would allow us effectively unlimited writes.

Closing this issue as it has been addressed by the changes in PR #295, which has been merged.