planetlabs / gpq

Utility for working with GeoParquet

Home Page: https://planetlabs.github.io/gpq/

Support control over number of row groups as an option

cholmes opened this issue

When converting to GeoParquet, it can be useful to write more (and therefore smaller) row groups, which allows more efficient querying of large files. See opengeospatial/geoparquet#183

GDAL's equivalent option is 'ROW_GROUP_SIZE=<integer>: Defaults to 65536. Maximum number of rows per group.'

That seems reasonable, though I was using a default of around 20k rows per group in my experiments, so we could consider a smaller default. I didn't see any negative effects, but I read somewhere that if you have lots of Parquet files, a smaller row group size can slow down getting stats on the whole set. I have about 500 individual Parquet files, so perhaps the effect only shows up with thousands or tens of thousands of files?
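To make the trade-off concrete, here is a rough sketch of the arithmetic, using only the Python standard library. The 1,000,000 rows per file is a made-up figure for illustration (the real files may be any size), and the function names are hypothetical, not part of gpq or GDAL.

```python
def row_group_count(num_rows: int, rows_per_group: int) -> int:
    """Row groups needed when each group holds at most rows_per_group rows."""
    if rows_per_group <= 0:
        raise ValueError("rows_per_group must be positive")
    # Integer ceiling division: a final partial group still counts as a group.
    return (num_rows + rows_per_group - 1) // rows_per_group


def total_row_groups(rows_per_file, rows_per_group: int) -> int:
    """Total row groups (and so sets of column stats) across a set of files."""
    return sum(row_group_count(n, rows_per_group) for n in rows_per_file)


# Assumption for illustration only: 500 files with 1,000,000 rows each.
files = [1_000_000] * 500

for size in (65_536, 20_000):
    per_file = row_group_count(1_000_000, size)
    total = total_row_groups(files, size)
    print(f"rows_per_group={size}: {per_file} groups per file, {total} in total")
```

Under those assumed numbers, moving from 65536 to 20k rows per group roughly triples the number of row groups whose statistics a reader would have to look at across the whole set.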

Oh, another thing that would be nice is to preserve the number of row groups in a Parquet-to-GeoParquet conversion. I tried this and the row groups didn't seem to be preserved.
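As a sketch of what "preserving row groups" could mean, the idea is to take the per-group row counts from the input file and cut the output at the same boundaries. The snippet below shows only that regrouping logic in plain Python. It does not read or write Parquet, `regroup` is a hypothetical helper, and nothing here reflects how gpq is actually implemented.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def regroup(rows: Iterable[T], group_sizes: Sequence[int]) -> Iterator[List[T]]:
    """Yield rows in batches whose sizes match the source row groups."""
    it = iter(rows)
    for size in group_sizes:
        group = list(islice(it, size))
        if len(group) != size:
            raise ValueError("fewer rows than the source row groups describe")
        yield group
    # Any leftover row means the sizes did not describe the whole input.
    leftover = object()
    if next(it, leftover) is not leftover:
        raise ValueError("more rows than the source row groups describe")


# Assumption for illustration: the input file had row groups of 3, 3 and 1 rows.
source_group_sizes = [3, 3, 1]
for batch in regroup(range(7), source_group_sizes):
    print(len(batch), batch)
```

Each yielded batch would then be written as one output row group, so the converted file ends up with the same number of row groups as the input.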