google / weather-tools

Tools to make weather data accessible and useful.

Home Page: https://weather-tools.readthedocs.io/

Improve speed of extracting rows

alxmrs opened this issue

With a global forecast dataset, I need to extract data for ~3,244.6k coordinate values (time x lat x lng). As of today, it takes about ~9 seconds to extract 1k rows. So: 9 * 3,244.6 / 60 / 60 = ~8 hours per file.

While streaming pipelines write data as soon as it's available, we'd ideally like all of the data to be processed within ~1 hour, so that the first forecast is actionable.

Here's an idea that should help speed up the extraction of rows: orient the extraction around (coordinate, URI) pairs, not just URIs.

In the BQ pipeline...

def expand(self, paths):
break up the extract_rows step into two steps. The first of the new steps should open the dataset, filter by area, and then produce chunks of (coordinate, URI) pairs. (The chunks should be ranges of the output of get_coordinates(data_ds, uri); maybe ~1k coordinates is a good unit? We'll likely have to verify experimentally.)
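
A minimal sketch of that first step, assuming the get_coordinates signature above and a hypothetical chunk_coordinates name (the ~1k chunk size is just the starting guess):

    import itertools
    import typing as t

    import xarray as xr


    def chunk_coordinates(
            uri: str,
            chunk_size: int = 1_000,
    ) -> t.Iterator[t.Tuple[str, t.List[t.Dict]]]:
        """First step: open the dataset and emit (URI, coordinate-chunk) pairs."""
        with xr.open_dataset(uri) as ds:
            # Area filtering would happen here, as extract_rows does today;
            # get_coordinates is the pipeline's existing helper.
            coords = iter(get_coordinates(ds, uri))
            while chunk := list(itertools.islice(coords, chunk_size)):
                yield uri, chunk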

The second extract step should consume these pairs (doing preprocessing for the rows, such as filtering out variables) and produce rows:

def to_row(it: t.Dict) -> t.Dict:
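
Putting the two steps together, a sketch of the consuming step and the re-wired expand(); the names here (extract_rows_from_chunk, the lat/lng dimension labels) are placeholders, not the pipeline's final ones:

    import typing as t

    import apache_beam as beam
    import xarray as xr


    def extract_rows_from_chunk(
            uri_and_chunk: t.Tuple[str, t.List[t.Dict]],
    ) -> t.Iterator[t.Dict]:
        """Second step: turn one (URI, coordinate-chunk) pair into output rows."""
        uri, chunk = uri_and_chunk
        with xr.open_dataset(uri) as ds:
            # Per-file preprocessing (dropping unwanted variables, etc.)
            # happens once per chunk here, instead of once per row.
            for it in chunk:
                row_ds = ds.loc[it]  # one (time, lat, lng) point; .loc takes a dict
                # The existing to_row() would format row_ds; a bare dict stands in.
                yield {name: row_ds[name].item() for name in row_ds.data_vars}


    def expand(self, paths):
        return (
            paths
            | 'ChunkCoordinates' >> beam.FlatMap(chunk_coordinates)
            # Reshuffle breaks fusion so chunks are rebalanced across workers.
            | 'Rebalance' >> beam.Reshuffle()
            | 'ExtractRows' >> beam.FlatMap(extract_rows_from_chunk)
        )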

Other tactics to investigate:

  • Move these lines outside of the to_row loop; instead, perform this operation once on the whole xarray dataset (see the first sketch after this list):
    temp_row = row_ds.to_pandas().apply(to_json_serializable_type)
  • Determine if XArray's native parallelism capabilities are a good fit for producing rows with multiple threads (https://xarray.pydata.org/en/stable/user-guide/dask.html); see the second sketch after this list.
  • Investigate whether there's a light-weight way to get coordinate information for the first of the two processing steps; e.g., can we get just the coordinates and not the data? Can we open the dataset without reading it into memory? See the third sketch after this list.
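
On the first tactic, a sketch of converting the whole (already filtered) dataset in one shot; data_ds stands for the opened dataset, and to_json_serializable_type is the existing helper:

    # One dataset-wide, element-wise conversion instead of a
    # to_pandas().apply() per row:
    df = data_ds.to_dataframe().applymap(to_json_serializable_type)

    # Each record is one (time, lat, lng) row, ready to hand to BigQuery.
    rows = df.reset_index().to_dict('records')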
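
On the second tactic, a tiny sketch of dask-backed parallelism; the chunk size, the 'time' dimension name, and `uri` are placeholders:

    import xarray as xr

    # `chunks=` backs each variable with dask arrays, so downstream work is
    # deferred and can be computed in parallel.
    ds = xr.open_dataset(uri, chunks={'time': 1})

    # .load() forwards its kwargs to dask.compute; 'threads' is dask's
    # multithreaded scheduler.
    ds = ds.load(scheduler='threads')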
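
On the third tactic: xarray already loads data variables lazily by default, so reading just the coordinate index values should be cheap. A sketch, assuming time/lat/lng dimension names and `uri` as the input file:

    import itertools

    import xarray as xr

    with xr.open_dataset(uri) as ds:
        # Coordinate indexes are small and loaded eagerly; the data variables
        # stay on disk until they are actually accessed.
        coords = itertools.product(
            ds['time'].values, ds['lat'].values, ds['lng'].values)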