Support Overture parquet conversion to GeoParquet
cholmes opened this issue
The new Overture Maps release ships Parquet with WKB geometries, but when I try to convert it I get:
% gpq convert 20230725_211237_00132_5p54t_25816df1-b864-49c0-a9a3-a13da4f37a90 out2.parquet --from=parquet --to=geoparquet
gpq: error: encoding parquet data page: encoding not supported for type BYTE_ARRAY
Sample data is at https://storage.googleapis.com/open-geodata/ch/20230725_211237_00132_5p54t_3b7d7eb3-dd9c-442a-a9b9-404dc936c5d9
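I've downloaded the admin data and parsed it through DuckDB:
import duckdb

db = duckdb.connect()  # assumed: db is an in-memory DuckDB connection
# Rewrite the country-level admin rows into a single Parquet file.
db.execute("""
COPY (
    SELECT *
    FROM '**/*.parquet'
    WHERE adminLevel = 2
      AND isocountrycodealpha2 IS NOT NULL
) TO 'admin-countries.parquet'
""")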
With this I can then convert to geoparquet using gpq.
I guess this should just work without the need to use DuckDB though?
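For anyone who wants to script the workaround in the meantime, here is a minimal sketch that chains the two steps: rewrite the files through DuckDB, then hand the result to gpq. The function names and file paths are made up for illustration; the SQL filter and the gpq flags are the ones already used in this thread, and it assumes the duckdb Python package and the gpq binary are installed.

```python
import subprocess


def build_copy_sql(source_glob: str, target: str) -> str:
    """Return a DuckDB COPY statement that rewrites the matching Parquet files."""
    return (
        "COPY (\n"
        f"    SELECT * FROM '{source_glob}'\n"
        "    WHERE adminLevel = 2\n"
        "      AND isocountrycodealpha2 IS NOT NULL\n"
        f") TO '{target}'"
    )


def build_gpq_convert(source: str, target: str) -> list[str]:
    """Return the gpq argv, using the same flags as the command in this issue."""
    return ["gpq", "convert", source, target, "--from=parquet", "--to=geoparquet"]


def rewrite_and_convert(source_glob: str, intermediate: str, output: str) -> None:
    """Rewrite via DuckDB, then convert the rewritten file with gpq."""
    import duckdb  # third-party; imported lazily so the helpers above work without it

    duckdb.connect().execute(build_copy_sql(source_glob, intermediate))
    # check=True raises CalledProcessError if gpq exits with a non-zero status.
    subprocess.run(build_gpq_convert(intermediate, output), check=True)
```

Calling rewrite_and_convert('**/*.parquet', 'admin-countries.parquet', 'admin-countries.geoparquet') would reproduce the steps above in one go; the output filename is just a placeholder.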
@mtravis - funny, I just came here to make the same comment, as I had noticed that too.
Yeah, running it through DuckDB in most any way seems to work fine, so it seems to not be anything fundamental with the structure of that data.
I get an error trying to read this file using the Arrow libs directly. I've ticketed this as apache/arrow#37968.
I'll work on trying to narrow it down.
This now works in the latest release. If using brew, you can run brew update && brew install planetlabs/tap/gpq to install the latest. And you can run gpq version to see what version you have installed.
# the file above is now converted to valid geoparquet
gpq convert overture.parquet --to geoparquet | gpq validate
In case it is of interest to Overture users, I opened a discussion about the Parquet schema here: OvertureMaps/schema#55
Basically, the current schema for names and sources is not as specific as it could be (allowing arbitrary properties for names, for example, instead of restricting it to the common, official, alternate, and short described in the JSON Schema). If you think a more specific schema would be harmful or helpful, please chime in.
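To gauge whether a stricter schema would affect your own data, a rough check along these lines may help. It only looks at the top-level keys of a names value; the function name and the sample records are hypothetical, and the allowed set is simply the four properties listed above.

```python
# The four properties named in the JSON Schema, per the comment above.
ALLOWED_NAME_KEYS = {"common", "official", "alternate", "short"}


def unexpected_name_keys(names: dict) -> list[str]:
    """Return the keys of a names value that a stricter schema would reject."""
    return sorted(set(names) - ALLOWED_NAME_KEYS)


# Hypothetical records; the shape of the values is not inspected here.
strict_ok = {"common": [], "official": []}
too_loose = {"common": [], "nickname": []}

print(unexpected_name_keys(strict_ok))
print(unexpected_name_keys(too_loose))
```

An empty result means the record would already fit the more specific schema; anything else lists the properties that would need to move or be dropped.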