isamplesorg / export_client

Client for the iSamples export service

Add geoparquet as an optional file format for the export

datadavev opened this issue

GeoParquet [1][2] is a compressed spatial data format that is convenient for consumers and is becoming widely supported.

Task here is to enable geoparquet as an export file format for iSamples.

Tooling for creating geoparquet is still a bit dynamic, but the following approach worked for me (there are likely optimizations that could be done).

  1. Retrieve the records as JSON lines
  2. Load the JSON lines into geopandas [3]
  3. Export from geopandas to geoparquet [4]

This worked for me (I could not determine whether it requires loading the entire dataset into memory for processing, which may be an issue if it is run on the server):

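import pandas as pd
import geopandas as gpd

src = "smithsonian"
json_src = f"{src}.jsonl"
# Read the JSON lines export; this assumes the location fields are already flattened columns
with open(json_src, "r") as json_file:
    df = pd.read_json(json_file, lines=True)
# Build point geometries from longitude (x) and latitude (y) in WGS 84
gdf = gpd.GeoDataFrame(
    df, geometry=gpd.points_from_xy(
      df.producedBy_samplingSite_location_longitude,
      df.producedBy_samplingSite_location_latitude),
    crs="EPSG:4326"
)
# Write the GeoDataFrame out as GeoParquet
gdf.to_parquet(f"{src}_geo.parquet")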

I think dependencies were:

pip install pandas
pip install geopandas
pip install geoarrow-pyarrow geoarrow-pandas
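
On the memory question: as written, pd.read_json builds the whole DataFrame before anything is exported, so the full dataset sits in memory. One possible way around that is to pass chunksize to pd.read_json and process the file in pieces. The sketch below is only that, a sketch: the helper name iter_record_chunks and the default chunk size are made up for illustration, and writing one parquet file per chunk (rather than a single file) is an assumption about what consumers would accept, not something verified here.

```python
import pandas as pd


def iter_record_chunks(jsonl_path, chunksize=50_000):
    """Yield DataFrames holding at most `chunksize` records each.

    The function name and default chunk size are illustrative choices,
    not part of the export client.
    """
    # With lines=True and a chunksize, read_json returns an iterator
    # (a JsonReader) instead of one large DataFrame.
    with pd.read_json(jsonl_path, lines=True, chunksize=chunksize) as reader:
        for chunk in reader:
            yield chunk


# Possible use (assumes the flattened column names from the snippet above
# and writes one file per chunk; the output file naming is hypothetical):
#
#   import geopandas as gpd
#   for i, df in enumerate(iter_record_chunks("smithsonian.jsonl")):
#       gdf = gpd.GeoDataFrame(
#           df,
#           geometry=gpd.points_from_xy(
#               df.producedBy_samplingSite_location_longitude,
#               df.producedBy_samplingSite_location_latitude),
#           crs="EPSG:4326",
#       )
#       gdf.to_parquet(f"smithsonian_geo_{i:04d}.parquet")
```

Only one chunk is held in memory at a time this way, at the cost of ending up with several output files.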

Footnotes

  1. https://geoparquet.org/

  2. https://getindata.com/blog/introducing-geoparquet-data-format/

  3. https://geopandas.org/en/stable/gallery/create_geopandas_from_pandas.html

  4. https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.to_parquet.html

Adding this reference for completeness - a general overview on the state of geospatial data sharing with parquet, arrow, and similar tooling. https://dewey.dunnington.ca/post/2022/building-bridges-arrow-parquet-and-geospatial-computing/

I'm running into problems attempting to adapt the sample code you've provided, @datadavev. It looks like your jsonl file has the structure flattened. e.g. producedBy_samplingSite_location_latitude vs.

"producedBy": {
  "samplingSite": {
    "location": {
      "latitude"…

etc. It looks like in Pandas I can go to the first level structure (df.producedBy), but after that it errors out. I'm not sure how to go about getting the nested data out in an efficient manner. I've not really used Pandas before. Any ideas? Another approach I thought about was using duckdb to add those nested fields as top-level keys in the JSON, but that's obviously a hack.

This is specifically on the

gpd.points_from_xy(
      df.producedBy_samplingSite_location_longitude,
      df.producedBy_samplingSite_location_latitude), 

part. How do I read the nested data out of the dictionary on those two lines?
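
One option that may help with the nested structure is pd.json_normalize, which expands nested dictionaries into top-level columns and lets you choose the separator. With sep="_" the resulting column names should match the flattened names used in the sample code (producedBy_samplingSite_location_latitude and so on). This is a minimal sketch, assuming each line of the file is one JSON record shaped like the fragment above; the helper name flatten_jsonl is made up for illustration.

```python
import json

import pandas as pd


def flatten_jsonl(lines, sep="_"):
    """Parse JSON lines and flatten nested dicts into top-level columns.

    `lines` is any iterable of strings (for example an open file).
    `flatten_jsonl` is a hypothetical helper, not part of the export client.
    """
    # Skip blank lines, parse each remaining line as one JSON record.
    records = [json.loads(line) for line in lines if line.strip()]
    # Nested keys are joined with `sep`, e.g.
    # producedBy -> samplingSite -> location -> latitude becomes
    # producedBy_samplingSite_location_latitude
    return pd.json_normalize(records, sep=sep)


# Possible use with the original snippet:
#
#   with open("smithsonian.jsonl", "r") as json_file:
#       df = flatten_jsonl(json_file)
#   # df.producedBy_samplingSite_location_longitude and
#   # df.producedBy_samplingSite_location_latitude can then be passed
#   # to gpd.points_from_xy as before.
```

Records that have no location end up with NaN in those columns, so they may need to be dropped or handled before building the geometry. Note that this parses the whole file into a list first, so it does nothing for the memory concern raised earlier.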