planetlabs / gpq

Utility for working with GeoParquet

Home Page: https://planetlabs.github.io/gpq/

Specify the dst crs in convert?

FlorisCalkoen opened this issue

I don't think gpq currently contains a method to specify the target crs. Also I see that by default you use "OGC:CRS84", what is your rationale for that? Why not, for example, use "EPSG:4326"?

I'll add a little context on my use case. I just used gpq to convert a 'big' collection of parquet files to GeoParquet by simply running gpq convert non-geo.parquet valid-geo.parquet in a for loop. Further along in my processing chain I load these GeoParquet files using GeoPandas, but I ran into an issue: when the crs is "OGC:CRS84", it cannot be converted to an EPSG code. Although that's expected behaviour, I'm mostly just curious why you use "OGC:CRS84" instead of "EPSG:4326".

import geopandas as gpd

gdf = gpd.read_parquet("valid-geo.parquet")
print(gdf.crs.to_epsg())  # None
print(gdf.to_crs(4326).crs.to_epsg())  # 4326

I'll probably change my routines from gdf.crs.to_epsg() to gdf.crs.to_string(), but I guess that several others rely on to_epsg() as well when using GeoPandas, so I thought it's worth opening a discussion point here.
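For anyone who wants to keep an EPSG label where one exists, here is a minimal sketch of the fallback I have in mind. The helper name crs_label is hypothetical (not part of GeoPandas or pyproj); it only assumes the object passed in exposes to_epsg() and to_string(), as the pyproj CRS behind gdf.crs does.

```python
def crs_label(crs):
    """Return 'EPSG:<code>' when the CRS maps to an EPSG code,
    otherwise fall back to the CRS's own string form.

    `crs` is assumed to behave like a pyproj CRS (e.g. gdf.crs):
    it must provide to_epsg() and to_string(). None means "no CRS set".
    """
    if crs is None:
        return None
    epsg = crs.to_epsg()  # None when no EPSG code matches
    if epsg is not None:
        return f"EPSG:{epsg}"
    return crs.to_string()  # hypothetical fallback choice, e.g. for OGC:CRS84
```

Usage would be crs_label(gdf.crs) wherever the routines currently call gdf.crs.to_epsg() directly.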

Following up, my guess is that dask_geopandas also struggles to read GeoParquet files that have been converted with gpq, due to a similar decision about how gpq specifies the crs. See the example below:

import dask_geopandas

storage_options = {"account_name": <account_name>, "credential": <token>}
href = "<protocol>/<container>/<prefix>/valid-geo.parquet"
gdf = dask_geopandas.read_parquet(href, storage_options=storage_options)
Python traceback
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask/backends.py:136, in CreationDispatch.register_inplace.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    135 try:
--> 136     return func(*args, **kwargs)
    137 except Exception as e:

File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py:538, in read_parquet(path, columns, filters, categories, index, storage_options, engine, use_nullable_dtypes, dtype_backend, calculate_divisions, ignore_metadata_file, metadata_task_size, split_row_groups, blocksize, aggregate_files, parquet_file_extension, filesystem, **kwargs)
    536     blocksize = None
--> 538 read_metadata_result = engine.read_metadata(
    539     fs,
    540     paths,
    541     categories=categories,
    542     index=index,
    543     use_nullable_dtypes=use_nullable_dtypes,
    544     dtype_backend=dtype_backend,
    545     gather_statistics=calculate_divisions,
    546     filters=filters,
    547     split_row_groups=split_row_groups,
    548     blocksize=blocksize,
    549     aggregate_files=aggregate_files,
    550     ignore_metadata_file=ignore_metadata_file,
    551     metadata_task_size=metadata_task_size,
    552     parquet_file_extension=parquet_file_extension,
    553     dataset=dataset_options,
    554     read=read_options,
    555     **other_options,
    556 )
    558 # In the future, we may want to give the engine the
    559 # option to return a dedicated element for `common_kwargs`.
    560 # However, to avoid breaking the API, we just embed this
    561 # data in the first element of `parts` for now.
    562 # The logic below is inteded to handle backward and forward
    563 # compatibility with a user-defined engine.

File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask_geopandas/io/parquet.py:57, in GeoArrowEngine.read_metadata(cls, fs, paths, **kwargs)
     55 @classmethod
     56 def read_metadata(cls, fs, paths, **kwargs):
---> 57     meta, stats, parts, index = super().read_metadata(fs, paths, **kwargs)
     59     gather_spatial_partitions = kwargs.pop("gather_spatial_partitions", True)

File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py:549, in ArrowDatasetEngine.read_metadata(cls, fs, paths, categories, index, use_nullable_dtypes, dtype_backend, gather_statistics, filters, split_row_groups, blocksize, aggregate_files, ignore_metadata_file, metadata_task_size, parquet_file_extension, **kwargs)
    548 # Stage 2: Generate output `meta`
--> 549 meta = cls._create_dd_meta(dataset_info)
    551 # Stage 3: Generate parts and stats

File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask_geopandas/io/parquet.py:103, in GeoArrowEngine._create_dd_meta(cls, dataset_info, use_nullable_dtypes)
     99         raise ValueError(
    100             "No dataset parts discovered. Use dask.dataframe.read_parquet "
    101             "to read it as an empty DataFrame"
    102         )
--> 103 meta = cls._update_meta(meta, schema)
    104 return meta

File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask_geopandas/io/parquet.py:77, in GeoArrowEngine._update_meta(cls, meta, schema)
     74 """
     75 Convert meta to a GeoDataFrame and update with potential GEO metadata
     76 """
---> 77 return _update_meta_to_geodataframe(meta, schema.metadata)

File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask_geopandas/io/arrow.py:36, in _update_meta_to_geodataframe(meta, schema_metadata)
     35 geometry_column_name = geo_meta["primary_column"]
---> 36 crs = geo_meta["columns"][geometry_column_name]["crs"]
     37 geometry_columns = geo_meta["columns"]

KeyError: 'crs'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[36], line 6
      3 with fsspec.open(href, mode="rb", **storage_options) as f:
      4     gpd.read_parquet(f)
----> 6 dask_geopandas.read_parquet(href, storage_options=storage_options)

File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask_geopandas/io/parquet.py:112, in read_parquet(*args, **kwargs)
    111 def read_parquet(*args, **kwargs):
--> 112     result = dd.read_parquet(*args, engine=GeoArrowEngine, **kwargs)
    113     # check if spatial partitioning information was stored
    114     spatial_partitions = result._meta.attrs.get("spatial_partitions", None)

File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask/backends.py:138, in CreationDispatch.register_inplace.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    136     return func(*args, **kwargs)
    137 except Exception as e:
--> 138     raise type(e)(
    139         f"An error occurred while calling the {funcname(func)} "
    140         f"method registered to the {self.backend} backend.\n"
    141         f"Original Message: {e}"
    142     ) from e

KeyError: "An error occurred while calling the read_parquet method registered to the pandas backend.\nOriginal Message: 'crs'"

It looks like you are running into an issue with dask-geopandas. The crs is optional in a GeoParquet geometry column. It looks like dask-geopandas assumes it will be present here https://github.com/geopandas/dask-geopandas/blob/3489a1cbafbeda3c0d4493133112969268e58d66/dask_geopandas/io/arrow.py#L36
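As a rough illustration of reading the column metadata without assuming the key exists, here is a minimal sketch. The function name geometry_crs and the DEFAULT_CRS constant are hypothetical, not taken from dask-geopandas; the sketch assumes schema_metadata is the Parquet schema metadata mapping with a b"geo" key holding the GeoParquet JSON, and it relies on the GeoParquet rule that a missing crs key means OGC:CRS84 while an explicit null means the CRS is unknown.

```python
import json

# Per the GeoParquet spec, a geometry column without a "crs" key is OGC:CRS84.
DEFAULT_CRS = "OGC:CRS84"


def geometry_crs(schema_metadata, column=None):
    """Return the CRS entry for a geometry column, tolerating a missing key.

    schema_metadata: mapping with a b"geo" entry containing GeoParquet JSON
                     (assumed shape, e.g. a pyarrow schema's .metadata).
    column: geometry column name; defaults to the file's primary column.
    """
    geo_meta = json.loads(schema_metadata[b"geo"])
    column = column or geo_meta["primary_column"]
    column_meta = geo_meta["columns"][column]
    if "crs" not in column_meta:
        # Key omitted entirely: fall back to the spec default.
        return DEFAULT_CRS
    # Key present: may be a PROJJSON object, or None for an unknown CRS.
    return column_meta["crs"]
```

Distinguishing "key absent" from "key set to null" matters here, because a plain dict.get("crs") would return None in both cases.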

I think this is the same issue as geopandas/dask-geopandas#270