planetlabs / gpq

Utility for working with GeoParquet

Home Page: https://planetlabs.github.io/gpq/

Dealing with parquet without geometry columns

cholmes opened this issue

I was checking a few files to see if they were compliant, but I wasn't looking very closely and converted one that had no geometries in it. GPQ happily converted it, and then 'describe' showed:

╭────────────────────────────────────────────┬────────┬────────────┬────────────┬─────────────┬──────────┬────────────────┬────────┬────────╮
│ COLUMN                                     │ TYPE   │ ANNOTATION │ REPETITION │ COMPRESSION │ ENCODING │ GEOMETRY TYPES │ BOUNDS │ DETAIL │
├────────────────────────────────────────────┼────────┼────────────┼────────────┼─────────────┼──────────┼────────────────┼────────┼────────┤
│ CBSA Code                                  │ binary │ string     │ 0..1       │ zstd        │          │                │        │        │
│ Metropolitan Division Code                 │ double │            │ 0..1       │ zstd        │          │                │        │        │
│ CSA Code                                   │ double │            │ 0..1       │ zstd        │          │                │        │        │
│ CBSA Title                                 │ binary │ string     │ 0..1       │ zstd        │          │                │        │        │
│ Metropolitan/Micropolitan Statistical Area │ binary │ string     │ 0..1       │ zstd        │          │                │        │        │
│ Metropolitan Division Title                │ binary │ string     │ 0..1       │ zstd        │          │                │        │        │
│ CSA Title                                  │ binary │ string     │ 0..1       │ zstd        │          │                │        │        │
│ County/County Equivalent                   │ binary │ string     │ 0..1       │ zstd        │          │                │        │        │
│ State Name                                 │ binary │ string     │ 0..1       │ zstd        │          │                │        │        │
│ FIPS State Code                            │ binary │ string     │ 0..1       │ zstd        │          │                │        │        │
│ FIPS County Code                           │ binary │ string     │ 0..1       │ zstd        │          │                │        │        │
│ Central/Outlying County                    │ binary │ string     │ 0..1       │ zstd        │          │                │        │        │
│ stcofips                                   │ binary │ string     │ 0..1       │ zstd        │          │                │        │        │
├────────────────────────────────────────────┼────────┼────────────┴────────────┴─────────────┴──────────┴────────────────┴────────┴────────┤
│ ROWS                                       │ 1916   │                                                                                     │
│ VERSION                                    │ 1.0.0  │                                                                                     │
╰────────────────────────────────────────────┴────────┴─────────────────────────────────────────────────────────────────────────────────────╯

The 1.0.0 version threw me off a bit. I think it's technically valid in the spec, and it looks like gpq writes out the metadata, but I'm not sure we should call a parquet file without geometries 1.0.0.

The file does not validate:

 ✓ file must include a "geo" metadata key
 ✓ metadata must be a JSON object
 ✓ metadata must include a "version" string
 ✓ metadata must include a "primary_column" string
 ✓ metadata must include a "columns" object
 ✓ column metadata must include the "primary_column" name
 ✓ column metadata must include a valid "encoding" string
 ✓ column metadata must include a "geometry_types" list
 ✓ optional "crs" must be null or a PROJJSON object
 ✓ optional "orientation" must be a valid string
 ✓ optional "edges" must be a valid string
 ✓ optional "bbox" must be an array of 4 or 6 numbers
 ✓ optional "epoch" must be a number
 ✗ geometry columns must not be grouped
   ↳ missing geometry column "geometry"
 ! geometry columns must be stored using the BYTE_ARRAY parquet type
   ↳ not checked
 ! geometry columns must be required or optional, not repeated
   ↳ not checked
 ! all geometry values match the "encoding" metadata
   ↳ not checked
 ! all geometry types must be included in the "geometry_types" metadata (if not empty)
   ↳ not checked
 ! all polygon geometries must follow the "orientation" metadata (if present)
   ↳ not checked
 ! all geometries must fall within the "bbox" metadata (if present)
   ↳ not checked

It could be nice to do a 'has geometry column' check first, and just inform people that the data they're validating does not have a geometry.
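To make the 'has geometry column' idea concrete, here is a minimal sketch in Python. It is not how gpq implements anything; `geometry_precheck` and its arguments are hypothetical, and it assumes the caller has already read the file's key-value metadata and the top-level column names with some Parquet reader.

```python
import json


def geometry_precheck(key_value_metadata, column_names):
    """Return None if the declared primary geometry column exists,
    otherwise a short message explaining what is missing.

    key_value_metadata: dict of the Parquet file-level metadata (str -> str).
    column_names: top-level column names from the Parquet schema.
    Both are assumed to be supplied by the caller.
    """
    raw = key_value_metadata.get("geo")
    if raw is None:
        return 'not GeoParquet: missing the "geo" metadata key'
    try:
        geo = json.loads(raw)
    except json.JSONDecodeError:
        return '"geo" metadata is not valid JSON'
    if not isinstance(geo, dict):
        return '"geo" metadata must be a JSON object'
    primary = geo.get("primary_column")
    if not isinstance(primary, str):
        return 'metadata must include a "primary_column" string'
    if primary not in column_names:
        # This is the case from the report above: metadata is present,
        # but the column it points at does not exist in the data.
        return f'no geometry column: "{primary}" is not in the schema'
    return None
```

Running a check like this before the full rule list would let the tool print one clear message instead of a failed rule followed by a run of "not checked" entries.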

It also might be nice to put in some 'warning' when you try to convert a file that does not have a geometry. Or it could even say that it's not allowed (maybe with some 'force' option to override).

Anyway, I think the situation is ok now, but we could likely help people a bit more. I think for a while we're going to see parquet files that aren't geoparquet, and it'd be nice to help people along.

This should be addressed in the latest release (0.19.0).

I grabbed some random userdata1.parquet and tried this:

# gpq describe userdata1.parquet 
╭───────────────────┬────────┬────────────┬────────────┬──────────────╮
│ COLUMN            │ TYPE   │ ANNOTATION │ REPETITION │ COMPRESSION  │
├───────────────────┼────────┼────────────┼────────────┼──────────────┤
│ registration_dttm │ int96  │            │ 0..1       │ uncompressed │
│ id                │ int32  │            │ 0..1       │ uncompressed │
│ first_name        │ binary │ string     │ 0..1       │ uncompressed │
│ last_name         │ binary │ string     │ 0..1       │ uncompressed │
│ email             │ binary │ string     │ 0..1       │ uncompressed │
│ gender            │ binary │ string     │ 0..1       │ uncompressed │
│ ip_address        │ binary │ string     │ 0..1       │ uncompressed │
│ cc                │ binary │ string     │ 0..1       │ uncompressed │
│ country           │ binary │ string     │ 0..1       │ uncompressed │
│ birthdate         │ binary │ string     │ 0..1       │ uncompressed │
│ salary            │ double │            │ 0..1       │ uncompressed │
│ title             │ binary │ string     │ 0..1       │ uncompressed │
│ comments          │ binary │ string     │ 0..1       │ uncompressed │
├───────────────────┼────────┴────────────┴────────────┴──────────────┤
│ Rows              │ 1000                                            │
│ Row Groups        │ 1                                               │
╰───────────────────┴─────────────────────────────────────────────────╯
 ⚠️  Not a valid GeoParquet file (missing the "geo" metadata key). Run convert to try to convert it to GeoParquet.

So then I tried to convert it:

# gpq convert userdata1.parquet maybe-geo.parquet
gpq: error: expected a geometry column named "geometry", use the --input-primary-column to supply a different primary geometry

And then followed the suggestion to try --input-primary-column:

# gpq convert userdata1.parquet maybe-geo.parquet --input-primary-column first_name
gpq: error: wkt: unsupported geometry

All of that is expected (first_name is neither WKT nor WKB). As described in #87 (comment), this unfortunately would have worked with a non-string binary column (trusting that the data was WKB), but then validate would fail.
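For illustration, a cheap sanity check on a binary value could look something like the sketch below. This is only an assumption about what such a check might do, not something gpq does; `looks_like_wkb` is a hypothetical helper that inspects only the 5-byte ISO WKB header (byte order flag plus geometry type code) and does not parse the coordinates.

```python
import struct

# ISO WKB base geometry codes: Point (1) through GeometryCollection (7).
_BASE_TYPES = range(1, 8)


def looks_like_wkb(value: bytes) -> bool:
    """Header-only check: does this value start like an ISO WKB geometry?"""
    if len(value) < 5:
        return False
    byte_order = value[0]
    if byte_order == 0:
        fmt = ">I"  # big endian (XDR)
    elif byte_order == 1:
        fmt = "<I"  # little endian (NDR)
    else:
        return False
    (code,) = struct.unpack_from(fmt, value, 1)
    # ISO WKB adds 1000 (Z), 2000 (M) or 3000 (ZM) to the base type code.
    # EWKB-style flag bits are deliberately not accepted here.
    dimension, base = divmod(code, 1000)
    return dimension in (0, 1, 2, 3) and base in _BASE_TYPES
```

A header check like this would reject most arbitrary binary columns early, but passing it does not mean the value is valid WKB, so it would not replace validate.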

A --strict option could be added that either applies validation while writing or validates after writing in the convert command. But that would be kind of involved.
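Until something like that exists, the "validate after writing" variant can be approximated from the outside with a small wrapper. The sketch below is hypothetical: `convert_strict` is not part of gpq, and it assumes that `gpq validate` signals a failed validation through a non-zero exit status.

```python
import os
import subprocess


def convert_strict(src, dst, run=subprocess.run):
    """Run `gpq convert`, then `gpq validate` on the result.

    Returns True only if both steps succeed. If validation fails,
    the freshly written output is removed so it cannot be mistaken
    for a valid GeoParquet file. `run` is injectable for testing.
    """
    converted = run(["gpq", "convert", src, dst])
    if converted.returncode != 0:
        return False
    validated = run(["gpq", "validate", dst])
    if validated.returncode != 0:
        # Assumption: non-zero exit status means the file did not validate.
        if os.path.exists(dst):
            os.remove(dst)
        return False
    return True
```

This only covers the simpler of the two options. Validating while writing would have to live inside the convert command itself.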