planetlabs / gpq

Utility for working with GeoParquet

Home Page: https://planetlabs.github.io/gpq/

Support for convert to stdout

bdon opened this issue

I'd like to do something like this:

gpq convert Cairo_Governorate.parquet --stdout --to=geojson | tippecanoe -o Cairo_Governorate.pmtiles --drop-densest-as-needed

Would this functionality be useful? It would require some changes in convert.go to allow for a blank positional output argument.

Hey @bdon - nice idea. I put together #79 to make all the commands optionally work with stdin/stdout.

If you omit the output argument in the convert command, it writes to stdout. That's not as explicit as a --stdout flag, but hopefully it isn't too tricky.
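For anyone curious about the shape of the change, the core of it is just treating the positional output as optional and falling back to stdout. A minimal sketch using kong-style optional positional arguments; the field and method names here are illustrative, not necessarily what convert.go actually does:

```go
package main

import (
	"io"
	"os"
)

// ConvertCmd sketches an optional positional output argument.
// Field names and tags are illustrative, not gpq's actual convert.go.
type ConvertCmd struct {
	Input  string `arg:"" optional:"" help:"Input file (stdin if omitted)."`
	Output string `arg:"" optional:"" help:"Output file (stdout if omitted)."`
}

// output falls back to stdout when the positional output arg is omitted.
func (c *ConvertCmd) output() (io.WriteCloser, error) {
	if c.Output == "" {
		return os.Stdout, nil // caller should avoid closing stdout
	}
	return os.Create(c.Output)
}
```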

All together!

curl https://data.source.coop/cholmes/google-open-buildings/geoparquet-admin1/country=EGY/Cairo_Governorate.parquet | ./gpq convert --from=geoparquet --to=geojson | tippecanoe -o buildings.pmtiles --force --drop-densest-as-needed

Included in the v0.15.0 release (brew update && brew install planetlabs/tap/gpq or download from the release page).

@bdon - you'll probably notice that this needs to buffer the whole file since the Parquet metadata is in the footer. But that suggests another enhancement - to accept a URL for the input. Then if ranged reads are supported, the metadata could be read first (and then maybe only buffer one data page at a time).
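For context on why ranged reads would work well here: a Parquet file ends with the file metadata, then a 4-byte little-endian metadata length, then the 4-byte magic "PAR1", so two small range requests are enough to get the metadata without touching the body. A hedged sketch over plain HTTP (the file size would come from a HEAD request's Content-Length; error handling trimmed, and the server must honor Range headers):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"net/http"
)

// fetchRange issues an HTTP Range request for bytes [start, end] inclusive.
func fetchRange(url string, start, end int64) ([]byte, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", start, end))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusPartialContent {
		return nil, fmt.Errorf("ranged reads unsupported: %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

// footerMetadata reads a Parquet file's metadata without downloading the
// body. The file layout ends: <metadata> <4-byte LE length> "PAR1".
func footerMetadata(url string, size int64) ([]byte, error) {
	tail, err := fetchRange(url, size-8, size-1)
	if err != nil {
		return nil, err
	}
	if string(tail[4:]) != "PAR1" {
		return nil, fmt.Errorf("not a parquet file")
	}
	metaLen := int64(binary.LittleEndian.Uint32(tail[:4]))
	return fetchRange(url, size-8-metaLen, size-9)
}
```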

@tschaub have you looked into using https://gocloud.dev for reading Parquet?

For https://github.com/protomaps/go-pmtiles/blob/main/pmtiles/extract.go#L276 I use only the blob functionality, which means it supports GCP, Azure, and S3-compatible blob storage with credentials out of the box. I had to add a layer of abstraction to handle public unauthenticated HTTP URLs, but it was otherwise simple.
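For reference, the blob package boils down to very little code. A minimal sketch; the s3:// bucket URL and object key are placeholders, and credentials are picked up from the environment by the registered driver:

```go
package main

import (
	"context"
	"io"
	"log"
	"os"

	"gocloud.dev/blob"
	_ "gocloud.dev/blob/gcsblob" // register gs:// URLs
	_ "gocloud.dev/blob/s3blob"  // register s3:// URLs
)

func main() {
	ctx := context.Background()
	// The driver is chosen by URL scheme at runtime.
	bucket, err := blob.OpenBucket(ctx, "s3://my-bucket")
	if err != nil {
		log.Fatal(err)
	}
	defer bucket.Close()

	// Read an arbitrary byte range of an object (offset 0, length 1024 here).
	r, err := bucket.NewRangeReader(ctx, "Cairo_Governorate.parquet", 0, 1024, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer r.Close()
	if _, err := io.Copy(os.Stdout, r); err != nil {
		log.Fatal(err)
	}
}
```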

I've used similar libraries, but not gocloud.dev yet; I'll check it out.

My ideal would be a multi-cloud blob reader that implements io.ReadSeeker and io.ReaderAt (I know this isn't efficient for all providers, but it is possible, with lots of guessing about how much to buffer for the seek-driven reads).
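The io.ReaderAt half is the mechanical part, since each ReadAt maps cleanly onto one ranged read. A sketch of that adapter over a gocloud.dev bucket (my own illustration, not tested against every provider; io.ReadSeeker is the harder half, because a bare Seek gives no hint of how much to buffer for the next Read):

```go
package main

import (
	"context"
	"io"

	"gocloud.dev/blob"
)

// blobReaderAt adapts a blob object to io.ReaderAt: every ReadAt call
// becomes one range request against the bucket.
type blobReaderAt struct {
	ctx    context.Context
	bucket *blob.Bucket
	key    string
}

func (b *blobReaderAt) ReadAt(p []byte, off int64) (int, error) {
	r, err := b.bucket.NewRangeReader(b.ctx, b.key, off, int64(len(p)), nil)
	if err != nil {
		return 0, err
	}
	defer r.Close()
	// ReadFull returns a non-nil error on short reads, which is what
	// the io.ReaderAt contract requires when n < len(p).
	return io.ReadFull(r, p)
}
```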

For PMTiles, it uses bucket.NewRangeReader without any guessing: it downloads the relevant (compressed) part of the index up front, then pre-merges request ranges to avoid thousands of small requests before fetching any actual "features" (tiles).
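A sketch of that pre-merging, for reference (my own paraphrase of the idea, not the actual go-pmtiles code): sort the wanted ranges, then coalesce neighbors whose gap is under a threshold, trading a little over-fetching for far fewer requests:

```go
package main

import "sort"

// byteRange is a half-open range [Start, Start+Length).
type byteRange struct {
	Start, Length int64
}

// coalesce merges ranges whose gap is at most maxGap bytes, so thousands
// of small reads collapse into a handful of larger ones.
func coalesce(ranges []byteRange, maxGap int64) []byteRange {
	if len(ranges) == 0 {
		return nil
	}
	sort.Slice(ranges, func(i, j int) bool { return ranges[i].Start < ranges[j].Start })
	merged := []byteRange{ranges[0]}
	for _, r := range ranges[1:] {
		last := &merged[len(merged)-1]
		if r.Start <= last.Start+last.Length+maxGap {
			// Extend the previous range to cover this one.
			if end := r.Start + r.Length; end > last.Start+last.Length {
				last.Length = end - last.Start
			}
		} else {
			merged = append(merged, r)
		}
	}
	return merged
}
```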

Is similar batching behavior needed to be effective for GeoParquet? I haven't delved deeply into actual reader implementations yet.