google / weather-tools

Tools to make weather data accessible and useful.

Home Page: https://weather-tools.readthedocs.io/


`weather-mv`: Add columns to Schema for faster geolocation queries

alxmrs opened this issue

A suggestion from @lakshmanok:

It would be good if weather-mv were to add an ST_GeogPoint(longitude, latitude) column instead of (or in addition to) the longitude and latitude columns. This precomputes the S2 location and makes querying faster.

Even better if the column was a polygon with the extent of the pixel. That way ST_INTERSECTS etc. will also work

ticket updated: 2022-03-24

Acceptance Criteria

  • Users of the resulting BigQuery table produced by weather-mv can make use of the Geography capabilities of BigQuery (see the query sketch after this list).
    • Every row of the BigQuery table (and thus, all the available variables of that row) has a column of type GEOGRAPHY, containing a point geography built from the raw data's latitude/longitude values.
  • Users can still acquire the underlying float values for latitude and longitude.
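
For illustration, here is a minimal sketch of how a user might exercise such a GEOGRAPHY column once it exists, via the google-cloud-bigquery Python client. The dataset/table name (`my_dataset.era5`), the column names (`geo_point`, `temperature`), and the coordinates are hypothetical placeholders, not part of weather-mv.

```python
# A minimal sketch of how a user might query the resulting table once the
# GEOGRAPHY column exists. The dataset/table (`my_dataset.era5`), the column
# names (`geo_point`, `temperature`), and the coordinates are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT temperature, ST_ASTEXT(geo_point) AS location
FROM `my_dataset.era5`
WHERE ST_DWITHIN(geo_point, ST_GEOGPOINT(@lng, @lat), @meters)
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("lng", "FLOAT64", -122.08),
        bigquery.ScalarQueryParameter("lat", "FLOAT64", 37.42),
        bigquery.ScalarQueryParameter("meters", "FLOAT64", 10_000),
    ]
)
for row in client.query(query, job_config=job_config):
    print(row.location, row.temperature)
```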

Implementation Notes

  • One approach for implementing this feature would be to use BQ's SQL functions to perform the conversion to the GEOGRAPHY type. Namely, ST_GEOGPOINT is a function that takes lat/lng FLOAT64 values and creates a GEOGRAPHY point from them.
  • Another simple approach would be to handle the Geography conversion from lat/lng in Python (instead of SQL). The BQ Geospatial docs list at least two pathways to do this: either by writing WKT / WKB data or by writing in the GeoJSON spec (see the sketch after this list).
  • Please note that the GEOGRAPHY values must fall within valid lat/lng ranges (longitude in [-180, 180], latitude in [-90, 90]). This means that data outside of these ranges will need to be converted to those ranges (e.g. something like this – see the example section).
  • This will likely require a two-part code change.
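
Here is a minimal sketch, not the pipeline's actual code, of the Python-side pathway mentioned above: normalize a grid longitude into the valid range, then render either a WKT or GeoJSON point string that BigQuery accepts for a GEOGRAPHY column. The function names and the ERA5-style 0–360 longitude convention are assumptions for illustration.

```python
# A minimal sketch (not the pipeline's actual code) of the Python-side
# conversion described above: normalize longitude into [-180, 180] and emit a
# WKT (or GeoJSON) point string that BigQuery accepts for a GEOGRAPHY column.
import json


def to_valid_lng(lng: float) -> float:
    """Map a longitude in [0, 360) (common in weather grids) to [-180, 180]."""
    return ((lng + 180.0) % 360.0) - 180.0


def to_wkt_point(lat: float, lng: float) -> str:
    """Render a WKT POINT; note that WKT order is (longitude latitude)."""
    return f'POINT({to_valid_lng(lng)} {lat})'


def to_geojson_point(lat: float, lng: float) -> str:
    """Alternative pathway: a GeoJSON geometry string, also accepted by BQ."""
    return json.dumps({'type': 'Point', 'coordinates': [to_valid_lng(lng), lat]})


# Example: an ERA5-style longitude of 359.75 becomes -0.25.
assert to_wkt_point(51.5, 359.75) == 'POINT(-0.25 51.5)'
```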

Extended feature to delight users

Even better if the column was a polygon with the extent of the pixel. That way ST_INTERSECTS etc. will also work

Implementing extra columns to include the polygon of the (grid) area where the values are relevant is a great bonus feature. However, it will not be possible for all types of input data. This experience is only possible if the xarray Dataset includes extra information, like a coordinate or variable attribute.

For now, while we implement the first part of this ticket, let's also investigate whether such metadata exists on our happy-path data sources. If it does, we'll create a follow-up ticket to implement this.
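
As one hedged example of what that investigation could look like, the sketch below checks an xarray Dataset for CF-convention cell-bounds metadata (a `bounds` attribute on a coordinate). The input file name is hypothetical, and real sources may expose area information differently.

```python
# A rough sketch of the proposed investigation: check whether an xarray
# Dataset carries CF-style cell-bounds metadata (a `bounds` attribute on a
# coordinate) that could back a per-pixel polygon column.
import xarray as xr


def find_cell_bounds(ds: xr.Dataset) -> dict:
    """Return {coordinate name: bounds variable name} for coords declaring bounds."""
    bounds = {}
    for name, coord in ds.coords.items():
        bounds_var = coord.attrs.get('bounds')
        if bounds_var and bounds_var in ds.variables:
            bounds[name] = bounds_var
    return bounds


ds = xr.open_dataset('era5_sample.nc')  # hypothetical happy-path input
print(find_cell_bounds(ds) or 'No cell-bounds metadata found.')
```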

Acceptance Criteria

  • Investigate the underlying real-time data source in XArray to determine if variables along coordinates are associated with a specific area.
  • Demonstrate results in an interactive report (e.g. a Colab notebook).
  • If this is possible with the data, create a design doc to further extend the BQ ingestion to include this feature (a rough construction sketch follows this list), taking into account edge cases (the metadata is not available, users may want to override these columns, complexity in types of grid, etc.).
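
For illustration only, here is a hypothetical sketch of what the extended column might contain: a WKT polygon covering one grid cell, derived from its center and a regular spacing. It assumes a regular, axis-aligned grid; irregular or rotated grids are among the edge cases a design doc would have to address.

```python
# A hypothetical sketch of what the extended column could contain: a WKT
# polygon covering one grid cell, built from the cell center and a regular
# grid spacing. Assumes a regular, axis-aligned grid.
def cell_polygon_wkt(lat: float, lng: float, dlat: float, dlng: float) -> str:
    """WKT POLYGON for the grid cell centered at (lat, lng)."""
    south, north = lat - dlat / 2, lat + dlat / 2
    west, east = lng - dlng / 2, lng + dlng / 2
    # WKT rings list (lng lat) pairs and must close on the first vertex.
    ring = [(west, south), (east, south), (east, north), (west, north), (west, south)]
    return 'POLYGON((' + ', '.join(f'{x} {y}' for x, y in ring) + '))'


print(cell_polygon_wkt(51.5, -0.25, 0.25, 0.25))
# POLYGON((-0.375 51.375, -0.125 51.375, -0.125 51.625, -0.375 51.625, -0.375 51.375))
```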

Hi @alxmrs, I plan on working on this but I am still in the process of trying to understand the code.
https://github.com/google/weather-tools/blob/82971d0559c0d567133e8191d4112a0879a0d888/weather_mv/loader_pipeline/pipeline.py#L250
I have a feeling that this is the function that I should be working on and that I'll add the value of ST_GeogPoint(longitude,latitude) to the existing column, but I'm not too sure. Would you say that trying to understand the code while fixing this issue would be a good starting point? Also, I would greatly appreciate it if you could share any good practices I should follow or keep in mind when contributing to a preexisting project.

I have a feeling that this is the function that I should be working on
I'll add the value of ST_GeogPoint(longitude,latitude) to the existing column

This sounds right to me. I recommend looking at the logic here (https://github.com/google/weather-tools/blob/main/weather_mv/loader_pipeline/pipeline.py#L370). I think the column should be added around this area, and to the associated functions. Check out this BQ documentation, too: https://cloud.google.com/bigquery/docs/geospatial-data
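
To make that concrete, here is a hedged sketch of the kind of two-part change being discussed, assuming the schema is built from google.cloud.bigquery SchemaFields and each output row is a plain dict (which may not match the pipeline code exactly). The `geo_point` column name is a placeholder, not an existing identifier in weather-mv.

```python
# A hedged sketch (not the actual pipeline code) of the two-part change:
# (1) add a GEOGRAPHY field to the generated table schema, and (2) populate it
# per row from the lat/lng values. Assumes the schema is a list of
# google.cloud.bigquery SchemaFields and each row is a dict.
from google.cloud import bigquery

GEO_POINT_COLUMN = 'geo_point'  # placeholder column name


def with_geography_field(schema: list) -> list:
    """Part 1: append a GEOGRAPHY column to the table schema."""
    return schema + [bigquery.SchemaField(GEO_POINT_COLUMN, 'GEOGRAPHY', mode='NULLABLE')]


def with_geography_value(row: dict) -> dict:
    """Part 2: add a WKT point built from the row's lat/lng (longitude normalized to [-180, 180])."""
    lng = ((row['longitude'] + 180.0) % 360.0) - 180.0
    return {**row, GEO_POINT_COLUMN: f"POINT({lng} {row['latitude']})"}
```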

Would you say that trying to understand the code while fixing this issue would be a good starting point?

Yes, I think this is a great starting point. If you'd like additional help, maybe we can set up a 1:1 meeting.

Also, I would greatly appreciate it if you could share any good practices I should follow or keep in mind when contributing to a preexisting project.

Sure thing. The general flow will look like:

  • Fork the project
  • Make the change in a new branch (not the main branch)
  • Test your code (unit tests, plus a manual end-to-end run)
  • Send a draft pull request against the main branch (see the GitHub docs for how to do this)
  • Review the code yourself and try to catch any errors before it goes out for final review. Also, check whether any errors have occurred in the automated checks (to run them locally, use bin/post-push)
  • Once it's ready, send the code out for review
  • Iterate on the branch until all outstanding feedback is addressed
    • If you need help testing the pipelines end-to-end, please say so.
  • The change will be merged into main.

How does all that sound?

Heads up: The files changed around a lot in #101.