planetlabs / gpq

Utility for working with GeoParquet

Home Page: https://planetlabs.github.io/gpq/

gpq convert output of Overture parquet files cannot be read by GDAL

geographika opened this issue

I was testing the Overture Maps data and realised it is only available in Parquet, not GeoParquet, format. As I understand it, this is a use case for gpq, as mentioned in #57.

The tool runs fine and seems to produce output, but I cannot read the output using GDAL. Apologies if this is user error or should be a GDAL issue instead - please close if this is the case.

Full steps to recreate below (note I was using gpq on a Windows machine, and testing the output on both Windows and Linux).

Download data:

aws s3 cp --region us-west-2 --no-sign-request --recursive s3://overturemaps-us-west-2/release/2023-10-19-alpha.0/theme=buildings C:\Temp\buildings.parquet

Run conversion:

$env:PATH += ";D:\Tools\gpq-windows-amd64"
gpq version
# 0.20.0

gpq convert part-00769-87dd7d19-acc8-4d4f-a5ba-20b407a79638.c000.zstd.parquet test.geo.parquet --from="parquet" --to="geoparquet"

# also tried without compression (no difference in terms of validity)

gpq convert part-00769-87dd7d19-acc8-4d4f-a5ba-20b407a79638.c000.zstd.parquet test.geo.parquet --from="parquet" --to="geoparquet" --compression="uncompressed"

gpq validate test.geo.parquet 

Summary: Passed 20 checks.

 ✓ file must include a "geo" metadata key
 ✓ metadata must be a JSON object
 ✓ metadata must include a "version" string
 ✓ metadata must include a "primary_column" string
 ✓ metadata must include a "columns" object
 ✓ column metadata must include the "primary_column" name
 ✓ column metadata must include a valid "encoding" string
 ✓ column metadata must include a "geometry_types" list
 ✓ optional "crs" must be null or a PROJJSON object
 ✓ optional "orientation" must be a valid string
 ✓ optional "edges" must be a valid string
 ✓ optional "bbox" must be an array of 4 or 6 numbers
 ✓ optional "epoch" must be a number
 ✓ geometry columns must not be grouped
 ✓ geometry columns must be stored using the BYTE_ARRAY parquet type
 ✓ geometry columns must be required or optional, not repeated
 ✓ all geometry values match the "encoding" metadata
 ✓ all geometry types must be included in the "geometry_types" metadata (if not empty)
 ✓ all polygon geometries must follow the "orientation" metadata (if present)
 ✓ all geometries must fall within the "bbox" metadata (if present)
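
For readers unfamiliar with what the file-level checks above look at, here is a minimal Python sketch of a few of them, applied to the JSON text stored under the "geo" metadata key. It is a simplified illustration and not gpq's implementation: the function name `check_geo_metadata` and the example metadata are made up, and only a subset of the 20 checks is covered.

```python
import json


# Hypothetical helper: a simplified subset of the metadata checks listed above.
# This is NOT how gpq implements validation.
def check_geo_metadata(raw):
    """Return a list of problems found in the JSON text of a "geo" metadata value."""
    try:
        meta = json.loads(raw)
    except json.JSONDecodeError:
        return ["metadata is not valid JSON"]
    if not isinstance(meta, dict):
        return ["metadata must be a JSON object"]

    problems = []
    if not isinstance(meta.get("version"), str):
        problems.append('metadata must include a "version" string')
    primary = meta.get("primary_column")
    if not isinstance(primary, str):
        problems.append('metadata must include a "primary_column" string')
    columns = meta.get("columns")
    if not isinstance(columns, dict):
        problems.append('metadata must include a "columns" object')
        return problems
    if isinstance(primary, str) and primary not in columns:
        problems.append('column metadata must include the "primary_column" name')

    for name, column in columns.items():
        if not isinstance(column, dict):
            problems.append(f'column "{name}" metadata must be an object')
            continue
        if not isinstance(column.get("encoding"), str):
            problems.append(f'column "{name}" must include an "encoding" string')
        if not isinstance(column.get("geometry_types"), list):
            problems.append(f'column "{name}" must include a "geometry_types" list')
        bbox = column.get("bbox")
        if bbox is not None:
            is_number_list = isinstance(bbox, list) and all(
                isinstance(v, (int, float)) and not isinstance(v, bool) for v in bbox
            )
            if not (is_number_list and len(bbox) in (4, 6)):
                problems.append(f'column "{name}" "bbox" must be an array of 4 or 6 numbers')
    return problems


# Example metadata, made up for illustration.
example = json.dumps({
    "version": "1.0.0",
    "primary_column": "geometry",
    "columns": {"geometry": {"encoding": "WKB", "geometry_types": []}},
})
print(check_geo_metadata(example))  # prints []
```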

QGIS opens the file but the attribute table is empty. Testing with ogrinfo:

ogrinfo --version
# GDAL 3.7.2, released 2023/09/05
ogrinfo test.geo.parquet

Warning 1: Field brand.names.common of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field brand.names.official of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field brand.names.alternate of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field brand.names.short of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field addresses of unhandled type list<element: struct<freeform: string, locality: string, postCode: string, region: string, country: string>> ignored
Warning 1: Field names.common of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field names.official of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field names.alternate of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field names.short of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field sources of unhandled type list<element: struct<property: string, dataset: string, recordId: string, confidence: double>> ignored
INFO: Open of `test.geo.parquet'
      using driver `Parquet' successful.
1: test.geo

Trying to read the data gives the likely cause of the issue: ERROR 1: ReadNext() failed: Malformed levels. min: 2 max: 2 out of range. Max Level: 1.

ogrinfo test.geo.parquet -al

Warning 1: Field brand.names.common of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field brand.names.official of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field brand.names.alternate of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field brand.names.short of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field addresses of unhandled type list<element: struct<freeform: string, locality: string, postCode: string, region: string, country: string>> ignored
Warning 1: Field names.common of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field names.official of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field names.alternate of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field names.short of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field sources of unhandled type list<element: struct<property: string, dataset: string, recordId: string, confidence: double>> ignored
INFO: Open of `test.geo.parquet'
      using driver `Parquet' successful.

Layer name: test.geo
Geometry: Unknown (any)
Feature Count: 815104
ERROR 1: ReadNext() failed: Malformed levels. min: 2 max: 2 out of range.  Max Level: 1
Layer SRS WKT:
GEOGCRS["WGS 84",
    ENSEMBLE["World Geodetic System 1984 ensemble",
        MEMBER["World Geodetic System 1984 (Transit)"],
        MEMBER["World Geodetic System 1984 (G730)"],
        MEMBER["World Geodetic System 1984 (G873)"],
        MEMBER["World Geodetic System 1984 (G1150)"],
        MEMBER["World Geodetic System 1984 (G1674)"],
        MEMBER["World Geodetic System 1984 (G1762)"],
        MEMBER["World Geodetic System 1984 (G2139)"],
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]],
        ENSEMBLEACCURACY[2.0]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    USAGE[
        SCOPE["Horizontal component of 3D system."],
        AREA["World."],
        BBOX[-90,-180,90,180]],
    ID["EPSG",4326]]
Data axis to CRS axis mapping: 2,1
Geometry Column = geometry
categories.main: String (0.0)
categories.alternate: StringList (0.0)
level: Integer (0.0)
socials: StringList (0.0)
subType: String (0.0)
numFloors: Integer (0.0)
entityId: String (0.0)
class: String (0.0)
sourceTags: String(JSON) (0.0)
localityType: String (0.0)
emails: StringList (0.0)
drivingSide: String (0.0)
adminLevel: Integer (0.0)
road: String (0.0)
isoCountryCodeAlpha2: String (0.0)
isoSubCountryCode: String (0.0)
updateTime: String (0.0)
wikidata: String (0.0)
confidence: Real (0.0)
defaultLanguage: String (0.0)
brand.wikidata: String (0.0)
isIntermittent: Integer(Boolean) (0.0)
connectors: StringList (0.0)
surface: String (0.0)
version: Integer (0.0)
phones: StringList (0.0)
id: String (0.0)
context: String (0.0)
height: Real (0.0)
maritime: Integer(Boolean) (0.0)
websites: StringList (0.0)
isSalt: Integer(Boolean) (0.0)
bbox.minx: Real (0.0)
bbox.maxx: Real (0.0)
bbox.miny: Real (0.0)
bbox.maxy: Real (0.0)
ERROR 1: ReadNext() failed: Malformed levels. min: 2 max: 2 out of range.  Max Level: 1
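
Some background on the "Malformed levels" message may help here. In Parquet, every stored value carries a definition level, and the maximum level for a column is the number of optional or repeated fields on its path from the root. A reader rejects any level above that maximum. The Python sketch below computes those maxima for a small schema. The schema is hypothetical and simplified (it is not the actual Overture schema), and the thread does not say which column triggered the error, so this only illustrates what "level 2 with Max Level 1" means.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SchemaNode:
    name: str
    repetition: str  # "required", "optional", or "repeated"
    children: List["SchemaNode"] = field(default_factory=list)


def max_levels(root_nodes, path):
    """Return (max_definition_level, max_repetition_level) for a dotted column path."""
    definition = repetition = 0
    nodes = root_nodes
    for part in path.split("."):
        node = next((n for n in nodes if n.name == part), None)
        if node is None:
            raise KeyError(f"no field {part!r} in path {path!r}")
        if node.repetition != "required":
            definition += 1  # optional and repeated fields both add a definition level
        if node.repetition == "repeated":
            repetition += 1  # only repeated fields add a repetition level
        nodes = node.children
    return definition, repetition


def out_of_range(levels, max_level):
    """Return the levels a reader would reject for a column with this maximum."""
    return [lvl for lvl in levels if lvl < 0 or lvl > max_level]


# Hypothetical schema, for illustration only.
schema = [
    SchemaNode("id", "optional"),
    SchemaNode("bbox", "required", [SchemaNode("minx", "optional")]),
    SchemaNode("names", "optional", [
        SchemaNode("common", "optional", [
            SchemaNode("list", "repeated", [
                SchemaNode("element", "optional", [SchemaNode("value", "optional")]),
            ]),
        ]),
    ]),
]

print(max_levels(schema, "bbox.minx"))                        # prints (1, 0)
print(max_levels(schema, "names.common.list.element.value"))  # prints (5, 1)
print(out_of_range([0, 1, 2], max_level=1))                   # prints [2]
```

In this sketch, a column whose schema allows a maximum definition level of 1 but whose data pages contain a level of 2 would be rejected, which is the shape of the error reported above.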

Testing with GDAL's validate_geoparquet.py script:

apt-get install python3-pip --fix-missing
python3 -m pip install jsonschema
python3 validate_geoparquet.py --check-data test.geo.parquet

Warning 1: Field brand.names.common of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field brand.names.official of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field brand.names.alternate of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field brand.names.short of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field addresses of unhandled type list<element: struct<freeform: string, locality: string, postCode: string, region: string, country: string>> ignored
Warning 1: Field names.common of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field names.official of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field names.alternate of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field names.short of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field sources of unhandled type list<element: struct<property: string, dataset: string, recordId: string, confidence: double>> ignored
Segmentation fault

I also tried the 0.20.0 release on a Linux box directly with the same issue/GDAL errors as above.

Thanks for the report, @geographika. It may be that GDAL cannot read Parquet files with v2 data pages. I can try writing an older version.

Thanks for the reply @tschaub and for the tool!
The field warnings may have been addressed by OSGeo/gdal#8262 (with link to discussion in OSGeo/gdal#8227), although there is no mention of the Malformed levels error.

@geographika - This may or may not be the same issue you are experiencing, but it looks to me like OGR/GDAL cannot read a field that is a list of structs. I'll ticket this on the GDAL repo to get more info, but it may be that GDAL doesn't handle all of Parquet's logical field types.
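
To see at a glance which fields GDAL skipped, the warnings can be collected from captured ogrinfo output. The Python sketch below assumes the warning wording shown in this thread (GDAL 3.7.2); other GDAL versions may phrase it differently. The function name `unhandled_fields` and the shortened sample are made up for illustration.

```python
import re

# Matches the warning wording seen in this thread; other GDAL versions may differ.
WARNING_RE = re.compile(
    r"^Warning 1: Field (?P<field>\S+) of unhandled type (?P<type>.+) ignored$"
)


def unhandled_fields(ogrinfo_output):
    """Map each ignored field name to the type string GDAL reported for it."""
    found = {}
    for line in ogrinfo_output.splitlines():
        match = WARNING_RE.match(line.strip())
        if match:
            found[match.group("field")] = match.group("type")
    return found


# Shortened sample taken from the output above.
sample = """\
Warning 1: Field brand.names.common of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field sources of unhandled type list<element: struct<property: string, dataset: string, recordId: string, confidence: double>> ignored
INFO: Open of `test.geo.parquet'
"""

for name, type_string in unhandled_fields(sample).items():
    print(name, "->", type_string)
```

In the output posted in this thread, every ignored field has a type beginning with `list<element: struct<`, which matches the list-of-structs observation.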

I see that OSGeo/gdal#8262 is related. I ticketed this issue as OSGeo/gdal#8606 to get some more info.

Thanks @tschaub for following up on this. Just to note GDAL reads the raw Overture parquet files fine - as a table with records, but once converted to GeoParquet the file "loads" but is empty.

Parquet:

image

File converted using gpq to Geoparquet:

image

@geographika - I agree that there is something odd going on. But I'm tempted to believe that it has to do with OGR trying to ignore those logical field types that it cannot handle. Specifically, it does not currently read logical lists where the elements are groups. The brand.names.common field is one example of this (the elements of the list are group fields).

I downloaded one of the building Parquet files and named it input.parquet. It looks like ogr2ogr will strip out the columns it cannot handle if I do this:

ogr2ogr no-unhandled.parquet input.parquet

After this, I can create a new, valid GeoParquet file with this:

gpq convert no-unhandled.parquet no-unhandled-geo.parquet

And then I can verify that OGR can read it with this:

ogrinfo no-unhandled-geo.parquet -al
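
The three steps above can also be scripted. The Python sketch below only wraps the exact commands shown in this comment. The function names are made up, and it assumes ogr2ogr, gpq, and ogrinfo are all on PATH. Bear in mind that the ogr2ogr step drops the columns GDAL cannot handle, so the result is not a complete copy of the input.

```python
import shutil
import subprocess


def workaround_commands(src, stripped, geo):
    """Build the three commands from the comment above, in order."""
    return [
        ["ogr2ogr", stripped, src],         # rewrite without the unhandled columns
        ["gpq", "convert", stripped, geo],  # add GeoParquet metadata
        ["ogrinfo", geo, "-al"],            # confirm OGR can read the features
    ]


def run_workaround(src, stripped, geo):
    """Run the commands, stopping at the first failure."""
    commands = workaround_commands(src, stripped, geo)
    missing = sorted({cmd[0] for cmd in commands if shutil.which(cmd[0]) is None})
    if missing:
        raise RuntimeError("missing tools on PATH: " + ", ".join(missing))
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    for cmd in workaround_commands(
        "input.parquet", "no-unhandled.parquet", "no-unhandled-geo.parquet"
    ):
        print(" ".join(cmd))
```

Run as a script, this only prints the commands; call `run_workaround(...)` to execute them.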

OSGeo/gdal#8608 adds support to OGR to read columns that have a list of structs (like the Overture data).

I've created apache/arrow#38503 in hopes of tracking down the remaining incompatibility.

@geographika - The v0.21.0 release has a fix that should address the issue converting Overture data (brew update && brew install planetlabs/tap/gpq if you use Homebrew).

With one of the above Overture files (named input.parquet below), here is what I get:

# gpq version
0.21.0

# gpq convert input.parquet input-geo.parquet

# gpq describe input-geo.parquet
╭──────────────────────┬─────────┬─────────────────────────────────┬────────────┬─────────────┬──────────┬────────────────┬────────┬────────╮
│ COLUMN               │ TYPE    │ ANNOTATION                      │ REPETITION │ COMPRESSION │ ENCODING │ GEOMETRY TYPES │ BOUNDS │ DETAIL │
├──────────────────────┼─────────┼─────────────────────────────────┼────────────┼─────────────┼──────────┼────────────────┼────────┼────────┤
│ categories           │         │ group                           │ 0..1       │             │          │                │        │        │
│ level                │ int32   │ int(bitwidth=32, issigned=true) │ 0..1       │ zstd        │          │                │        │        │
│ socials              │         │ list                            │ 0..1       │             │          │                │        │        │
│ subType              │ binary  │ string                          │ 0..1       │ zstd        │          │                │        │        │
│ numFloors            │ int32   │ int(bitwidth=32, issigned=true) │ 0..1       │ zstd        │          │                │        │        │
│ entityId             │ binary  │ string                          │ 0..1       │ zstd        │          │                │        │        │
│ class                │ binary  │ string                          │ 0..1       │ zstd        │          │                │        │        │
│ sourceTags           │         │ map                             │ 0..1       │             │          │                │        │        │
│ localityType         │ binary  │ string                          │ 0..1       │ zstd        │          │                │        │        │
│ emails               │         │ list                            │ 0..1       │             │          │                │        │        │
│ drivingSide          │ binary  │ string                          │ 0..1       │ zstd        │          │                │        │        │
│ adminLevel           │ int32   │ int(bitwidth=32, issigned=true) │ 0..1       │ zstd        │          │                │        │        │
│ road                 │ binary  │ string                          │ 0..1       │ zstd        │          │                │        │        │
│ isoCountryCodeAlpha2 │ binary  │ string                          │ 0..1       │ zstd        │          │                │        │        │
│ isoSubCountryCode    │ binary  │ string                          │ 0..1       │ zstd        │          │                │        │        │
│ updateTime           │ binary  │ string                          │ 0..1       │ zstd        │          │                │        │        │
│ wikidata             │ binary  │ string                          │ 0..1       │ zstd        │          │                │        │        │
│ confidence           │ double  │                                 │ 0..1       │ zstd        │          │                │        │        │
│ defaultLanguage      │ binary  │ string                          │ 0..1       │ zstd        │          │                │        │        │
│ brand                │         │ group                           │ 0..1       │             │          │                │        │        │
│ addresses            │         │ list                            │ 0..1       │             │          │                │        │        │
│ names                │         │ group                           │ 0..1       │             │          │                │        │        │
│ isIntermittent       │ boolean │                                 │ 0..1       │ zstd        │          │                │        │        │
│ connectors           │         │ list                            │ 0..1       │             │          │                │        │        │
│ surface              │ binary  │ string                          │ 0..1       │ zstd        │          │                │        │        │
│ version              │ int32   │ int(bitwidth=32, issigned=true) │ 0..1       │ zstd        │          │                │        │        │
│ phones               │         │ list                            │ 0..1       │             │          │                │        │        │
│ id                   │ binary  │ string                          │ 0..1       │ zstd        │          │                │        │        │
│ geometry             │ binary  │                                 │ 0..1       │ zstd        │ WKB      │                │        │        │
│ context              │ binary  │ string                          │ 0..1       │ zstd        │          │                │        │        │
│ height               │ double  │                                 │ 0..1       │ zstd        │          │                │        │        │
│ maritime             │ boolean │                                 │ 0..1       │ zstd        │          │                │        │        │
│ sources              │         │ list                            │ 0..1       │             │          │                │        │        │
│ websites             │         │ list                            │ 0..1       │             │          │                │        │        │
│ isSalt               │ boolean │                                 │ 0..1       │ zstd        │          │                │        │        │
│ bbox                 │         │ group                           │ 1          │             │          │                │        │        │
├──────────────────────┼─────────┴─────────────────────────────────┴────────────┴─────────────┴──────────┴────────────────┴────────┴────────┤
│ Rows                 │ 815104                                                                                                             │
│ Row Groups           │ 1                                                                                                                  │
│ GeoParquet Version   │ 1.0.0                                                                                                              │
╰──────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

# ogrinfo input-geo.parquet -al
Warning 1: Field brand.names.common of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field brand.names.official of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field brand.names.alternate of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field brand.names.short of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field addresses of unhandled type list<element: struct<freeform: string, locality: string, postCode: string, region: string, country: string>> ignored
Warning 1: Field names.common of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field names.official of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field names.alternate of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field names.short of unhandled type list<element: struct<value: string, language: string>> ignored
Warning 1: Field sources of unhandled type list<element: struct<property: string, dataset: string, recordId: string, confidence: double>> ignored
INFO: Open of `input-geo.parquet'
      using driver `Parquet' successful.

Layer name: input-geo
Geometry: Unknown (any)
Feature Count: 815104
Extent: (-179.760039, -62.216330) - (179.962804, 72.784394)
Layer SRS WKT:
GEOGCRS["WGS 84",
    ENSEMBLE["World Geodetic System 1984 ensemble",
        MEMBER["World Geodetic System 1984 (Transit)"],
        MEMBER["World Geodetic System 1984 (G730)"],
        MEMBER["World Geodetic System 1984 (G873)"],
        MEMBER["World Geodetic System 1984 (G1150)"],
        MEMBER["World Geodetic System 1984 (G1674)"],
        MEMBER["World Geodetic System 1984 (G1762)"],
        MEMBER["World Geodetic System 1984 (G2139)"],
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]],
        ENSEMBLEACCURACY[2.0]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    USAGE[
        SCOPE["Horizontal component of 3D system."],
        AREA["World."],
        BBOX[-90,-180,90,180]],
    ID["EPSG",4326]]
Data axis to CRS axis mapping: 2,1
Geometry Column = geometry
categories.main: String (0.0)
categories.alternate: StringList (0.0)
level: Integer (0.0)
socials: StringList (0.0)
subType: String (0.0)
numFloors: Integer (0.0)
entityId: String (0.0)
class: String (0.0)
sourceTags: String(JSON) (0.0)
localityType: String (0.0)
emails: StringList (0.0)
drivingSide: String (0.0)
adminLevel: Integer (0.0)
road: String (0.0)
isoCountryCodeAlpha2: String (0.0)
isoSubCountryCode: String (0.0)
updateTime: String (0.0)
wikidata: String (0.0)
confidence: Real (0.0)
defaultLanguage: String (0.0)
brand.wikidata: String (0.0)
isIntermittent: Integer(Boolean) (0.0)
connectors: StringList (0.0)
surface: String (0.0)
version: Integer (0.0)
phones: StringList (0.0)
id: String (0.0)
context: String (0.0)
height: Real (0.0)
maritime: Integer(Boolean) (0.0)
websites: StringList (0.0)
isSalt: Integer(Boolean) (0.0)
bbox.minx: Real (0.0)
bbox.maxx: Real (0.0)
bbox.miny: Real (0.0)
bbox.maxy: Real (0.0)
OGRFeature(input-geo):0
  categories.main (String) = (null)
  categories.alternate (StringList) = (null)
  level (Integer) = (null)
  socials (StringList) = (null)
  subType (String) = (null)
  numFloors (Integer) = (null)
  entityId (String) = (null)
  class (String) = (null)
  sourceTags (String(JSON)) = (null)
  localityType (String) = (null)
  emails (StringList) = (null)
  drivingSide (String) = (null)
  adminLevel (Integer) = (null)
  road (String) = (null)
  isoCountryCodeAlpha2 (String) = (null)
  isoSubCountryCode (String) = (null)
  updateTime (String) = 2020-03-20T18:08:51.000Z
  wikidata (String) = (null)
  confidence (Real) = (null)
  defaultLanguage (String) = (null)
  brand.wikidata (String) = (null)
  isIntermittent (Integer(Boolean)) = (null)
  connectors (StringList) = (null)
  surface (String) = (null)
  version (Integer) = 0
  phones (StringList) = (null)
  id (String) = w783118772@1
  context (String) = (null)
  height (Real) = (null)
  maritime (Integer(Boolean)) = (null)
  websites (StringList) = (null)
  isSalt (Integer(Boolean)) = (null)
  bbox.minx (Real) = 56.6205337
  bbox.maxx (Real) = 56.6207643
  bbox.miny (Real) = 54.3153349
  bbox.maxy (Real) = 54.3154768
  POLYGON ((56.6205337 54.3154585,56.6205677 54.3153349,56.6207643 54.3153532,56.6207304 54.3154768,56.6205337 54.3154585))

# ... etc.
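
As a quick sanity check on the first feature printed above, the bbox.* attribute values can be compared with the bounds of the polygon itself. The Python sketch below does this with a simple regular expression rather than a real WKT parser, so it only suits plain decimal coordinates like these. The function name `polygon_bounds` is made up for illustration.

```python
import re


def polygon_bounds(wkt):
    """Return (minx, miny, maxx, maxy) for a WKT string with plain decimal coordinates."""
    pairs = re.findall(r"(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)", wkt)
    if not pairs:
        raise ValueError("no coordinates found")
    xs = [float(x) for x, _ in pairs]
    ys = [float(y) for _, y in pairs]
    return min(xs), min(ys), max(xs), max(ys)


# Geometry and bbox values copied from the first feature in the output above.
wkt = (
    "POLYGON ((56.6205337 54.3154585,56.6205677 54.3153349,"
    "56.6207643 54.3153532,56.6207304 54.3154768,56.6205337 54.3154585))"
)
reported = {"minx": 56.6205337, "maxx": 56.6207643, "miny": 54.3153349, "maxy": 54.3154768}

minx, miny, maxx, maxy = polygon_bounds(wkt)
matches = (minx, miny, maxx, maxy) == (
    reported["minx"], reported["miny"], reported["maxx"], reported["maxy"]
)
print(matches)  # prints True
```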

It looks like after OSGeo/gdal#8608 is released, those warnings will go away and the additional columns will be read as well.

@tschaub - many thanks for following this up and releasing the new version - much appreciated.
I can confirm I now get GeoParquet files I can read with GDAL (and MapServer).

$env:PATH += ";D:\Tools\gpq-windows-amd64."
gpq version
# 0.21.0

gpq convert D:\Data\type=administrativeBoundary\part-00018-87dd7d19-acc8-4d4f-a5ba-20b407a79638.c000.zstd.parquet D:\Data\test.geo.parquet --from=parquet --to=geoparquet

gpq validate D:\Data\test.geo.parquet
# Summary: Passed 20 checks.

ogrinfo D:\Data\test.geo.parquet -al

#INFO: Open of `/data/overture/test.geo.parquet'
#      using driver `Parquet' successful.
#
#Layer name: test.geo
#Geometry: Unknown (any)
#Feature Count: 13455