Add geoparquet as an optional file format for the export
datadavev opened this issue · comments
Geoparquet [1][2] is a compressed spatial data format that is convenient for consumers and is becoming widely supported.
The task here is to enable geoparquet as an export file format for iSamples.
Tooling for creating geoparquet is still evolving, but the following approach worked for me (further optimization is likely possible).
- Retrieve the records as JSON lines
- Load the JSON lines into geopandas [3]
- Export from geopandas to geoparquet [4]
This worked for me (I could not determine whether it requires loading the entire dataset into memory for processing, which may be an issue if this runs on the server):
import pandas as pd
import geopandas as gpd

src = "smithsonian"
json_src = f"{src}.jsonl"

# Read the JSON lines export: one record per line.
with open(json_src, "r") as json_file:
    df = pd.read_json(json_file, lines=True)

# Build point geometries from the flattened longitude/latitude columns (WGS 84).
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(
        df.producedBy_samplingSite_location_longitude,
        df.producedBy_samplingSite_location_latitude,
    ),
    crs="EPSG:4326",
)

# Write GeoParquet; to_parquet requires pyarrow.
gdf.to_parquet(f"{src}_geo.parquet")
I think dependencies were:
pip install pandas
pip install geopandas
pip install geoarrow-pyarrow geoarrow-pandas
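On the memory question: `pd.read_json(..., lines=True)` called without `chunksize` returns a single DataFrame, so the approach above does hold the whole dataset in memory. One possible way around that, sketched below with only the standard library, is to read the JSON lines file in fixed-size batches and convert each batch separately. The helper name `iter_batches` and the default batch size are hypothetical, not part of iSamples or geopandas.

```python
import json
from itertools import islice
from typing import Any, Dict, Iterator, List


def iter_batches(path: str, batch_size: int = 50_000) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of at most batch_size parsed records from a JSON lines file."""
    with open(path, "r", encoding="utf-8") as fh:
        # Skip blank lines so json.loads never sees an empty string.
        lines = (line for line in fh if line.strip())
        while True:
            # islice pulls the next batch_size lines without reading the rest.
            batch = [json.loads(line) for line in islice(lines, batch_size)]
            if not batch:
                return
            yield batch
```

Each batch could then go through the same `GeoDataFrame` and `to_parquet` steps and be written to its own part file (for example `f"{src}_geo_{i}.parquet"`). Splitting the output across several files is an assumption of this sketch; whether that suits the export depends on how consumers read it.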
Footnotes
Adding this reference for completeness - a general overview on the state of geospatial data sharing with parquet, arrow, and similar tooling. https://dewey.dunnington.ca/post/2022/building-bridges-arrow-parquet-and-geospatial-computing/
I'm running into problems attempting to adapt the sample code you've provided, @datadavev. It looks like your jsonl file has the structure flattened, e.g. producedBy_samplingSite_location_latitude vs.
"producedBy": {
  "samplingSite": {
    "location": {
      "latitude"…
etc. It looks like in Pandas I can go to the first level structure (df.producedBy), but after that it errors out. I'm not sure how to go about getting the nested data out in an efficient manner. I've not really used Pandas before. Any ideas? Another approach I thought about was using duckdb to add those nested fields as top-level keys in the JSON, but that's obviously a hack.
This is specifically about the
gpd.points_from_xy(
    df.producedBy_samplingSite_location_longitude,
    df.producedBy_samplingSite_location_latitude),
part. How do I read the nested data out of the dictionary on those two lines?
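One way to get from the nested records to the flattened column names is to flatten each record before building the DataFrame. The sketch below uses only the standard library; the `flatten` helper and the sample record (including its id and coordinate values) are hypothetical and only mirror the nested structure quoted above. pandas also provides `pandas.json_normalize(records, sep="_")`, which should produce the same underscore-joined column names from a list of nested dicts.

```python
import json
from typing import Any, Dict


def flatten(record: Dict[str, Any], parent: str = "", sep: str = "_") -> Dict[str, Any]:
    """Flatten nested dicts into one level, joining keys with sep."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse so producedBy -> samplingSite -> location -> latitude
            # becomes producedBy_samplingSite_location_latitude.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# Hypothetical record shaped like the nested structure in the question;
# the id and coordinates are made up for illustration.
line = (
    '{"id": "x1", "producedBy": {"samplingSite": '
    '{"location": {"latitude": 38.9, "longitude": -77.0}}}}'
)
row = flatten(json.loads(line))
print(row["producedBy_samplingSite_location_latitude"])
```

With every line of the file flattened this way, `pd.DataFrame([flatten(r) for r in records])` gives a frame whose columns match the names used in `gpd.points_from_xy(...)`. Records with a missing or null `producedBy` simply lack the coordinate columns and end up as missing values in the frame.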