planetlabs / gpq

Utility for working with GeoParquet

Home Page: https://planetlabs.github.io/gpq/

Support Overture parquet conversion to GeoParquet

cholmes opened this issue · comments

The new Overture Maps release ships Parquet files with WKB-encoded geometries, but when I try to convert one I get:

% gpq convert 20230725_211237_00132_5p54t_25816df1-b864-49c0-a9a3-a13da4f37a90 out2.parquet --from=parquet --to=geoparquet
gpq: error: encoding parquet data page: encoding not supported for type BYTE_ARRAY

Sample data is at https://storage.googleapis.com/open-geodata/ch/20230725_211237_00132_5p54t_3b7d7eb3-dd9c-442a-a9b9-404dc936c5d9
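
For readers unfamiliar with WKB (well-known binary), the geometry encoding mentioned above, here is a minimal Python sketch of decoding a 2D point. It is not gpq's code and only illustrates the byte layout: a byte-order flag, a 4-byte geometry type, then the coordinates as doubles. The function name `decode_wkb_point` and the sample coordinates are invented for illustration.

```python
import struct
from typing import Tuple


def decode_wkb_point(wkb: bytes) -> Tuple[float, float]:
    """Decode a 2D WKB point into (x, y). Hypothetical helper for illustration."""
    # Byte 0 is the byte-order flag: 1 = little endian, 0 = big endian.
    byte_order = "<" if wkb[0] == 1 else ">"
    # Bytes 1-4 hold the geometry type as an unsigned 32-bit integer (1 = Point).
    (geom_type,) = struct.unpack(byte_order + "I", wkb[1:5])
    if geom_type != 1:
        raise ValueError(f"expected a WKB Point (type 1), got type {geom_type}")
    # Bytes 5-20 hold x and y as two 64-bit floats.
    x, y = struct.unpack(byte_order + "2d", wkb[5:21])
    return x, y


# Build a little-endian sample point; the coordinates are made up.
sample = struct.pack("<BI2d", 1, 1, -122.4, 37.8)
point = decode_wkb_point(sample)
```

A plain Parquet file can store these bytes in an ordinary BYTE_ARRAY column; what it lacks is the file-level metadata that tells readers the column is a geometry.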

@cholmes

I've downloaded the admin data and parsed it through DuckDB

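import duckdb

# Assumes db is an in-memory DuckDB connection; the original snippet did not show how it was created.
db = duckdb.connect()
db.execute("""
COPY (
    SELECT *
    FROM '**/*.parquet'
    WHERE adminLevel = 2
      AND isocountrycodealpha2 IS NOT NULL
) TO 'admin-countries.parquet'
""")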

With this I can then convert to geoparquet using gpq.

I guess this should just work without the need to use DuckDB though?

@mtravis - funny, I just came here to make the same comment, as I had noticed that too.

Yeah, running it through DuckDB in almost any way seems to work fine, so there doesn't seem to be anything fundamentally wrong with the structure of that data.

I get an error trying to read this file using the Arrow libs directly. I've ticketed this as apache/arrow#37968.

I'll work on trying to narrow it down.

This now works in the latest release. If using brew, you can brew update && brew install planetlabs/tap/gpq to install the latest. And you can run gpq version to see what version you have installed.

# the file above is now converted to valid geoparquet
gpq convert overture.parquet --to geoparquet | gpq validate
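
For context on what a conversion like this adds: the GeoParquet 1.0 specification stores a JSON document under the `geo` key of the Parquet file metadata, naming the primary geometry column and its encoding. The Python sketch below builds a minimal version of that JSON. It illustrates the spec's required fields, not how gpq constructs the metadata internally; the helper name `build_geo_metadata`, the column name `geometry`, and the geometry types are assumptions.

```python
import json


def build_geo_metadata(geometry_column="geometry", geometry_types=None):
    """Build a minimal GeoParquet 1.0 "geo" metadata dict. Hypothetical helper."""
    return {
        "version": "1.0.0",
        "primary_column": geometry_column,
        "columns": {
            geometry_column: {
                # WKB is the encoding the Overture files already use for geometries.
                "encoding": "WKB",
                # An empty list means the geometry types are unknown or mixed.
                "geometry_types": list(geometry_types or []),
            }
        },
    }


# The serialized string is what would be stored as the value of the "geo" key.
geo_value = json.dumps(build_geo_metadata(geometry_types=["Polygon", "MultiPolygon"]))
```

A validator can then check that the key is present, that the JSON parses, and that the named column exists and holds valid WKB.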

In case it is of interest to Overture users, I opened a discussion about the Parquet schema here: OvertureMaps/schema#55

Basically, the current Parquet schema for names and sources is not as specific as it could be. For example, it allows arbitrary properties for names instead of restricting them to the common, official, alternate, and short properties described in the JSON Schema. If you think a more specific schema would be harmful or helpful, please chime in.