Specify the dst crs in convert?
FlorisCalkoen opened this issue · comments
I don't think gpq currently contains a method to specify the target crs. Also I see that by default you use "OGC:CRS84", what is your rationale for that? Why not, for example, use "EPSG:4326"?
I'll add a little bit of context on my use case. So I just used gpq
to convert a 'big' collection of parquet files to geoparquet by simply doing gpq convert non-geo.parquet valid-geo.parquet
in a for loop. Further in my processing chain I load these geoparquet files using GeoPandas
, but I ran into an issue because when the crs == "OGC:CRS84"
it cannot be converted to epgs. Although it's expected behaviour I'm mostly just curious why you use "OGC:CRS84" instead of "EPSG:4326".
gdf = gpd.read_parquet("valid-geo.parquet")
print(gdf.crs.to_epsg()) # None
print(gdf.to_crs(4326).to_epsg()) # 4326
I'll probably change my routines from gdf.crs.to_epsg()
to gdf.crs.to_string()
, but I guess that several others rely on to_epsg() as well when using GeoPandas, so I thought it's worth opening a discussion point here.
Following up, my guess is that dask_geopandas
is also struggling to read GeoParquet files thave have been converted with gpq
due to a similar issue/decision that has been made in the gpq
crs conversion/specification. See example below:
storage_options = {"account_name": <account_name> "credential": <token>}
href = "<protocol>/<container>/<prefix>/valid-geo.parquet"
gdf = dask_geopandas.read_parquet(href, storage_options=storage_options)
Python traceback
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask/backends.py:136, in CreationDispatch.register_inplace.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
135 try:
--> 136 return func(*args, **kwargs)
137 except Exception as e:
File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py:538, in read_parquet(path, columns, filters, categories, index, storage_options, engine, use_nullable_dtypes, dtype_backend, calculate_divisions, ignore_metadata_file, metadata_task_size, split_row_groups, blocksize, aggregate_files, parquet_file_extension, filesystem, **kwargs)
536 blocksize = None
--> 538 read_metadata_result = engine.read_metadata(
539 fs,
540 paths,
541 categories=categories,
542 index=index,
543 use_nullable_dtypes=use_nullable_dtypes,
544 dtype_backend=dtype_backend,
545 gather_statistics=calculate_divisions,
546 filters=filters,
547 split_row_groups=split_row_groups,
548 blocksize=blocksize,
549 aggregate_files=aggregate_files,
550 ignore_metadata_file=ignore_metadata_file,
551 metadata_task_size=metadata_task_size,
552 parquet_file_extension=parquet_file_extension,
553 dataset=dataset_options,
554 read=read_options,
555 **other_options,
556 )
558 # In the future, we may want to give the engine the
559 # option to return a dedicated element for `common_kwargs`.
560 # However, to avoid breaking the API, we just embed this
561 # data in the first element of `parts` for now.
562 # The logic below is inteded to handle backward and forward
563 # compatibility with a user-defined engine.
File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask_geopandas/io/parquet.py:57, in GeoArrowEngine.read_metadata(cls, fs, paths, **kwargs)
55 @classmethod
56 def read_metadata(cls, fs, paths, **kwargs):
---> 57 meta, stats, parts, index = super().read_metadata(fs, paths, **kwargs)
59 gather_spatial_partitions = kwargs.pop("gather_spatial_partitions", True)
File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py:549, in ArrowDatasetEngine.read_metadata(cls, fs, paths, categories, index, use_nullable_dtypes, dtype_backend, gather_statistics, filters, split_row_groups, blocksize, aggregate_files, ignore_metadata_file, metadata_task_size, parquet_file_extension, **kwargs)
548 # Stage 2: Generate output `meta`
--> 549 meta = cls._create_dd_meta(dataset_info)
551 # Stage 3: Generate parts and stats
File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask_geopandas/io/parquet.py:103, in GeoArrowEngine._create_dd_meta(cls, dataset_info, use_nullable_dtypes)
99 raise ValueError(
100 "No dataset parts discovered. Use dask.dataframe.read_parquet "
101 "to read it as an empty DataFrame"
102 )
--> 103 meta = cls._update_meta(meta, schema)
104 return meta
File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask_geopandas/io/parquet.py:77, in GeoArrowEngine._update_meta(cls, meta, schema)
74 """
75 Convert meta to a GeoDataFrame and update with potential GEO metadata
76 """
---> 77 return _update_meta_to_geodataframe(meta, schema.metadata)
File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask_geopandas/io/arrow.py:36, in _update_meta_to_geodataframe(meta, schema_metadata)
35 geometry_column_name = geo_meta["primary_column"]
---> 36 crs = geo_meta["columns"][geometry_column_name]["crs"]
37 geometry_columns = geo_meta["columns"]
KeyError: 'crs'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[36], line 6
3 with fsspec.open(href, mode="rb", **storage_options) as f:
4 gpd.read_parquet(f)
----> 6 dask_geopandas.read_parquet(href, storage_options=storage_options)
File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask_geopandas/io/parquet.py:112, in read_parquet(*args, **kwargs)
111 def read_parquet(*args, **kwargs):
--> 112 result = dd.read_parquet(*args, engine=GeoArrowEngine, **kwargs)
113 # check if spatial partitioning information was stored
114 spatial_partitions = result._meta.attrs.get("spatial_partitions", None)
File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask/backends.py:138, in CreationDispatch.register_inplace.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
136 return func(*args, **kwargs)
137 except Exception as e:
--> 138 raise type(e)(
139 f"An error occurred while calling the {funcname(func)} "
140 f"method registered to the {self.backend} backend.\n"
141 f"Original Message: {e}"
142 ) from e
KeyError: "An error occurred while calling the read_parquet method registered to the pandas backend.\nOriginal Message: 'crs'"
It looks like you are running into an issue with dask-geopandas. The crs
is optional in a GeoParquet geometry column. It looks like dask-geopandas assumes it will be present here https://github.com/geopandas/dask-geopandas/blob/3489a1cbafbeda3c0d4493133112969268e58d66/dask_geopandas/io/arrow.py#L36
I think this is the same issue geopandas/dask-geopandas#270