google / weather-tools

Tools to make weather data accessible and useful.

Home Page: https://weather-tools.readthedocs.io/

Improve speed of extracting rows

alxmrs opened this issue

With a global forecast dataset, I need to extract data for ~3,244.6k coordinate values (time x lat x lng). As of today, it takes about ~9 seconds to extract 1k rows. So: 9 * 3,244.6 / 60 / 60 = ~8 hours per file.

While streaming pipelines write data as soon as it's available, we'd ideally like all of the data to be processed within ~1 hour, so that the first forecast is actionable.

Here's an idea that should help speed up the extraction of rows: orient the extraction around (coordinate, URI) pairs, not just URIs.

In the BQ pipeline...

def expand(self, paths):
break up the extract_rows step into two steps. The first of the new steps should open the dataset, filter by area, and then produce chunks of (coordinate, URI) pairs. (The chunks should be ranges of the output of get_coordinates(data_ds, uri); maybe ~1k coordinates is a good unit? We'll likely have to verify experimentally.)
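
A minimal sketch of that first step, assuming the get_coordinates signature above and a hypothetical chunk_coordinates name (the ~1k chunk size is just the starting guess):

    import itertools
    import typing as t

    import xarray as xr


    def chunk_coordinates(
            uri: str,
            chunk_size: int = 1_000,
    ) -> t.Iterator[t.Tuple[str, t.List[t.Dict]]]:
        """First step: open the dataset and emit (URI, coordinate-chunk) pairs."""
        with xr.open_dataset(uri) as ds:
            # Area filtering would happen here, as extract_rows does today;
            # get_coordinates is the pipeline's existing helper.
            coords = iter(get_coordinates(ds, uri))
            while chunk := list(itertools.islice(coords, chunk_size)):
                yield uri, chunk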

The second extract step should consume these pairs (doing preprocessing for the rows, such as filtering out variables) and produce rows:

def to_row(it: t.Dict) -> t.Dict:
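
Putting the two steps together, a sketch of the consuming step and the re-wired expand(); the names here (extract_rows_from_chunk, the lat/lng dimension labels) are placeholders, not the pipeline's final ones:

    import typing as t

    import apache_beam as beam
    import xarray as xr


    def extract_rows_from_chunk(
            uri_and_chunk: t.Tuple[str, t.List[t.Dict]],
    ) -> t.Iterator[t.Dict]:
        """Second step: turn one (URI, coordinate-chunk) pair into output rows."""
        uri, chunk = uri_and_chunk
        with xr.open_dataset(uri) as ds:
            # Per-file preprocessing (dropping unwanted variables, etc.)
            # happens once per chunk here, instead of once per row.
            for it in chunk:
                row_ds = ds.loc[it]  # one (time, lat, lng) point; .loc takes a dict
                # The existing to_row() would format row_ds; a bare dict stands in.
                yield {name: row_ds[name].item() for name in row_ds.data_vars}


    def expand(self, paths):
        return (
            paths
            | 'ChunkCoordinates' >> beam.FlatMap(chunk_coordinates)
            # Reshuffle breaks fusion so chunks are rebalanced across workers.
            | 'Rebalance' >> beam.Reshuffle()
            | 'ExtractRows' >> beam.FlatMap(extract_rows_from_chunk)
        )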

Other tactics to investigate:

  • Move these lines outside of the to_row loop; instead, perform this operation once on the whole xarray dataset (see the first sketch after this list):
    temp_row = row_ds.to_pandas().apply(to_json_serializable_type)
  • Determine if XArray's native parallelism capabilities are a good fit for producing rows with multiple threads (https://xarray.pydata.org/en/stable/user-guide/dask.html); see the second sketch after this list.
  • Investigate whether there's a light-weight way to get coordinate information for the first of the two processing steps; e.g., can we get just the coordinates and not the data? Can we open the dataset without reading it into memory? See the third sketch after this list.
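
On the first tactic, a sketch of converting the whole (already filtered) dataset in one shot; data_ds stands for the opened dataset, and to_json_serializable_type is the existing helper:

    # One dataset-wide, element-wise conversion instead of a
    # to_pandas().apply() per row:
    df = data_ds.to_dataframe().applymap(to_json_serializable_type)

    # Each record is one (time, lat, lng) row, ready to hand to BigQuery.
    rows = df.reset_index().to_dict('records')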
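
On the second tactic, a tiny sketch of dask-backed parallelism; the chunk size, the 'time' dimension name, and `uri` are placeholders:

    import xarray as xr

    # `chunks=` backs each variable with dask arrays, so downstream work is
    # deferred and can be computed in parallel.
    ds = xr.open_dataset(uri, chunks={'time': 1})

    # .load() forwards its kwargs to dask.compute; 'threads' is dask's
    # multithreaded scheduler.
    ds = ds.load(scheduler='threads')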
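
On the third tactic: xarray already loads data variables lazily by default, so reading just the coordinate index values should be cheap. A sketch, assuming time/lat/lng dimension names and `uri` as the input file:

    import itertools

    import xarray as xr

    with xr.open_dataset(uri) as ds:
        # Coordinate indexes are small and loaded eagerly; the data variables
        # stay on disk until they are actually accessed.
        coords = itertools.product(
            ds['time'].values, ds['lat'].values, ds['lng'].values)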