Add geoparquet as an optional file format for the export

Question

Add geoparquet as an optional file format for the export

datadavev opened this issue 2 months ago · comments

Geoparquet ¹² is a compressed spatial data format that is convenient for consumers and is becoming widely supported.

The task here is to enable geoparquet as an export file format for iSamples.

Tooling for creating geoparquet is still a bit dynamic, but the following approach worked for me (there are likely optimizations that could be done).

1. Retrieve the records as JSON Lines
2. Load the JSON Lines file into geopandas ³
3. Export from geopandas to geoparquet ⁴

This worked for me. I could not determine whether it requires loading the entire dataset into memory for processing, which may be an issue if it runs on the server:

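import pandas as pd
import geopandas as gpd

src = "smithsonian"
json_src = f"{src}.jsonl"

# Load the JSON Lines export; the columns are expected to be already flattened.
with open(json_src, "r") as json_file:
    df = pd.read_json(json_file, lines=True)

# Build point geometries from the longitude/latitude columns (WGS 84).
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(
        df.producedBy_samplingSite_location_longitude,
        df.producedBy_samplingSite_location_latitude,
    ),
    crs="EPSG:4326",
)

# Write the GeoDataFrame out as geoparquet.
gdf.to_parquet(f"{src}_geo.parquet")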

I think dependencies were:

pip install pandas
pip install geopandas
pip install geoarrow-pyarrow geoarrow-pandas
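
On the memory question above, one option that might help is reading the JSON Lines file in chunks rather than all at once. The following is only a sketch, not something tested against the iSamples export: `iter_jsonl_chunks`, `rows_per_chunk`, and the demo file with `id`/`lon`/`lat` fields are made-up names for illustration. It relies on `pd.read_json(..., lines=True, chunksize=...)`, which returns an iterator of DataFrames instead of a single DataFrame.

```python
import json
import os
import tempfile

import pandas as pd


def iter_jsonl_chunks(path, rows_per_chunk=1000):
    """Yield DataFrames of at most rows_per_chunk rows from a JSON Lines file."""
    # With lines=True and chunksize set, read_json returns a JsonReader
    # that parses the file piece by piece instead of all at once.
    with pd.read_json(path, lines=True, chunksize=rows_per_chunk) as reader:
        for chunk in reader:
            yield chunk


# Tiny demo file with hypothetical fields, just to show the chunking.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.jsonl")
with open(demo_path, "w") as demo_file:
    for i in range(5):
        demo_file.write(json.dumps({"id": i, "lon": float(i), "lat": float(-i)}) + "\n")

chunk_sizes = [len(chunk) for chunk in iter_jsonl_chunks(demo_path, rows_per_chunk=2)]
print(chunk_sizes)
```

Each chunk could then be turned into a GeoDataFrame and written to its own parquet file, so that only one chunk is held in memory at a time. Whether a set of part files is acceptable as an export is a separate question.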

Dave Vieglais · Answer 1 · Wed May 08 2024 01:27:28 GMT+0800 (China Standard Time)

Adding this reference for completeness - a general overview on the state of geospatial data sharing with parquet, arrow, and similar tooling. https://dewey.dunnington.ca/post/2022/building-bridges-arrow-parquet-and-geospatial-computing/

Danny Mandel · Answer 2 · Wed May 08 2024 06:59:38 GMT+0800 (China Standard Time)

I'm running into problems attempting to adapt the sample code you've provided, @datadavev. It looks like your jsonl file has the structure flattened. e.g. producedBy_samplingSite_location_latitude vs.

"producedBy": {
  "samplingSite": {
    "location": {
      "latitude"…

etc. It looks like in Pandas I can go to the first level structure (df.producedBy), but after that it errors out. I'm not sure how to go about getting the nested data out in an efficient manner. I've not really used Pandas before. Any ideas? Another approach I thought about was using duckdb to add those nested fields as top-level keys in the JSON, but that's obviously a hack.

Danny Mandel · Answer 3 · Wed May 08 2024 07:00:27 GMT+0800 (China Standard Time)

This is specifically on the

gpd.points_from_xy(
      df.producedBy_samplingSite_location_longitude,
      df.producedBy_samplingSite_location_latitude),

part. How do I read the nested data out of the dictionary on those two lines?
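
One possible way to get at the nested values is `pd.json_normalize`, which flattens nested dictionaries into top-level columns. This is a sketch under assumptions: the two sample records and the `sample_identifier` field are invented for illustration, and `flatten_records` is a hypothetical helper, not part of the original code. Passing `sep="_"` produces column names in the same form the original snippet uses, such as `producedBy_samplingSite_location_latitude`.

```python
import json

import pandas as pd


def flatten_records(lines):
    """Parse JSON Lines text and flatten nested objects into columns."""
    records = [json.loads(line) for line in lines if line.strip()]
    # sep="_" joins nested keys with underscores instead of the default ".".
    return pd.json_normalize(records, sep="_")


# Invented sample records with the nested structure described above.
sample_lines = [
    '{"sample_identifier": "a1", "producedBy": {"samplingSite": '
    '{"location": {"latitude": 10.5, "longitude": -20.25}}}}',
    '{"sample_identifier": "a2", "producedBy": {"samplingSite": '
    '{"location": {"latitude": -3.0, "longitude": 45.0}}}}',
]

df = flatten_records(sample_lines)
print(sorted(df.columns))
```

With a real file, the lines would come from `open(json_src)` instead of the inline list, and the resulting `df` should be usable in the `gpd.points_from_xy(...)` call as written. Note that this still reads every record into memory, and fields holding lists are left as-is rather than flattened.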


Footnotes