Example H3 data sets

ajfriend opened this issue 2 months ago · comments

It would be nice to have a collection of data sets using H3 that folks can use for examples or are just generally useful.

Some ideas:

* countries, US states, US zip codes as H3 cells at various resolutions
* water vs land cells

https://geodatasets.readthedocs.io/en/latest/introduction.html is a Python package that does something similar, but for general geographic datasets.

Aside from what examples we want, I think we'd also need to decide:

* what data format we'd use, or if we'd use multiple
* how we store the examples---in the repo, or point to external hosting

Isaac Brodsky · Answer 1 · Sun May 19 2024 21:00:38 GMT+0800 (China Standard Time)

> It would be nice to have a collection of data sets using H3 that folks can use for examples or are just generally useful.

This seems like it could be helpful as a reference dataset.

> Some ideas:

Another one that comes to mind is the set of US census geometries (essentially, anything in the TIGER dataset).

> https://geodatasets.readthedocs.io/en/latest/introduction.html is a Python package that does something similar, but for general geographic datasets.

> Aside from what examples we want, I think we'd also need to decide:
> * what data format we'd use, or if we'd use multiple

I think it would make sense to have multiple formats. Some users might want a simple text-based format like CSV or JSON, while others may prefer an efficient binary format like Parquet (with cells stored as uint64).
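To make the text-versus-binary trade-off concrete, here is a minimal sketch using only the Python standard library. It assumes H3 cells arrive as hex strings and converts them to unsigned 64-bit integers, which is the representation a Parquet uint64 column would hold. Parquet itself is not written here; the packed `struct` buffer only stands in for a binary layout. The sample cell strings and the helper names `to_uint64` and `to_hex` are illustrative, and the strings are not checked for H3 validity.

```python
import csv
import io
import json
import struct

# Illustrative sample values in H3 hex-string form (not validated as H3 cells).
cells_hex = ["8928308280fffff", "8928308280bffff", "89283082873ffff"]


def to_uint64(cell_hex):
    """Parse a hex-string cell index into an integer that fits in uint64."""
    value = int(cell_hex, 16)
    if not 0 <= value < 2**64:
        raise ValueError(f"{cell_hex!r} does not fit in 64 bits")
    return value


def to_hex(cell_int):
    """Format an integer cell index back into lowercase hex, no prefix."""
    return format(cell_int, "x")


cells_int = [to_uint64(c) for c in cells_hex]

# Text format 1: CSV with a single "cell" column of hex strings.
csv_buf = io.StringIO()
writer = csv.writer(csv_buf)
writer.writerow(["cell"])
writer.writerows([c] for c in cells_hex)
csv_text = csv_buf.getvalue()

# Text format 2: JSON with the same hex strings.
json_text = json.dumps({"cells": cells_hex})

# Binary stand-in: little-endian uint64 values, 8 bytes per cell.
packed = struct.pack(f"<{len(cells_int)}Q", *cells_int)

print(len(csv_text.encode()), len(json_text.encode()), len(packed))
```

The hex strings are easier to read and diff, while the integer form costs a fixed 8 bytes per cell before any compression, which is part of why a columnar binary format can be attractive for large data sets.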

> * how we store the examples---in the repo, or point to external hosting

Considering the format duplication, the fact that the text files can be very large, and the relatively independent maintenance concerns, I recommend hosting them outside of the repo. I believe we already do that in master for the country geometries used in testing.

AJ Friend · Answer 2 · Mon May 20 2024 00:46:30 GMT+0800 (China Standard Time)

> I think it would make sense to have multiple formats. Some users might want a simple text-based format like CSV or JSON, while others may prefer an efficient binary format like Parquet (with cells stored as uint64).

Agreed.

> Considering the format duplication, the fact that the text files can be very large, and the relatively independent maintenance concerns, I recommend hosting them outside of the repo. I believe we already do that in master for the country geometries used in testing.

Yes, I definitely agree we should host these through a separate repo (maybe something like h3datasets?). My question was more about whether that repo should host the raw data itself or point to some other storage location. The geodatasets package uses the latter strategy. With the former strategy, I was curious whether we might run into GitHub file and repo size limits (the repo we point to here comes in at 17 GB). Maybe we can start with the in-repo approach and pivot to external hosting if necessary. If we do end up needing external storage, any ideas on what services we might use?
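One way to gauge the in-repo approach before committing data is to scan a candidate directory for files that would hit GitHub's per-file limits. The sketch below is only an illustration: the thresholds reflect my understanding of GitHub's documented behavior (a warning above 50 MiB and a hard block above 100 MiB per file, without Git LFS) and should be checked against the current docs, and `classify_size` and `scan_tree` are hypothetical helper names.

```python
import os

# Assumed thresholds; verify against GitHub's current documentation.
WARN_BYTES = 50 * 1024 * 1024    # pushes above this size produce a warning
BLOCK_BYTES = 100 * 1024 * 1024  # pushes above this size are rejected


def classify_size(num_bytes):
    """Return 'blocked', 'warning', or 'ok' for a single file size in bytes."""
    if num_bytes > BLOCK_BYTES:
        return "blocked"
    if num_bytes > WARN_BYTES:
        return "warning"
    return "ok"


def scan_tree(root):
    """Map relative path -> status for every file under root that is not 'ok'."""
    report = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            status = classify_size(os.path.getsize(path))
            if status != "ok":
                report[os.path.relpath(path, root)] = status
    return report


if __name__ == "__main__":
    # Scan the current directory and list any files that are too large.
    for rel_path, status in sorted(scan_tree(".").items()):
        print(f"{status:8s} {rel_path}")
```

This only covers per-file limits. Total repository size would need a separate check, and any file flagged here is a candidate for external hosting or a more compact format.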

Isaac Brodsky · Answer 3 · Mon May 20 2024 01:10:53 GMT+0800 (China Standard Time)

> Yes, I definitely agree we should host these through a separate repo (maybe something like h3datasets?). My question was more about whether that repo should host the raw data itself or point to some other storage location. The geodatasets package uses the latter strategy. With the former strategy, I was curious whether we might run into GitHub file and repo size limits (the repo we point to here comes in at 17 GB). Maybe we can start with the in-repo approach and pivot to external hosting if necessary. If we do end up needing external storage, any ideas on what services we might use?

Ah, I see. The two options I'd suggest are S3 and Cloudflare R2. R2 is cheaper and more modern (which incidentally can cause issues if you happen to use HTTP-only software, as it enforces SSL). In the meantime, the repo seems like an OK place to start.