uber / h3

Hexagonal hierarchical geospatial indexing system

Home Page: https://h3geo.org

Add additional modes for polygonToCells

warnes opened this issue · comments

TLDR: Allow specification of 'contained', 'intersects', 'covers' for polygonToCells.

Details:

For my project, I need to be able to convert arbitrary multipolygons to the set of hexes which completely cover the enclosed area.

The current implementation of polygonToCells only generates a list of cells whose centroids are located inside of the (multi)polygon.

My current workaround is:

  1. Expand the original multipolygon by edge_length(res) using st_buffer
  2. Use polygonToCells to generate the list of cell ids
  3. Use st_overlaps and st_contains to flag all the cells that cover the original multipolygon

Unfortunately, this is somewhat slow, so I'm looking for a faster way to accomplish this.
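For reference, a minimal Python sketch of this workaround, assuming the v4 h3-py bindings (h3.LatLngPoly, h3.polygon_to_cells, h3.average_hexagon_edge_length) and shapely; the metre-to-degree conversion and the intersection test are crude stand-ins for the st_buffer/st_overlaps/st_contains calls, and holes are ignored:

```python
import h3
from shapely.geometry import Polygon

def covering_cells_via_buffer(poly: Polygon, res: int) -> set[str]:
    """Workaround: buffer the polygon by ~one edge length, polyfill the
    buffered shape, then keep only cells that touch the original polygon."""
    # Rough metres -> degrees conversion; a real implementation should
    # buffer in a projected CRS instead (as st_buffer would).
    edge_m = h3.average_hexagon_edge_length(res, unit='m')
    buffered = poly.buffer(edge_m / 111_320.0)

    # Polyfill the buffered shape (h3-py v4 API; v3 used h3.polyfill).
    shell = [(lat, lng) for lng, lat in buffered.exterior.coords]
    candidates = h3.polygon_to_cells(h3.LatLngPoly(shell), res)

    covering = set()
    for cell in candidates:
        cell_poly = Polygon((lng, lat) for lat, lng in h3.cell_to_boundary(cell))
        if cell_poly.intersects(poly):   # analogous to st_overlaps/st_contains
            covering.add(cell)
    return covering
```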

(Note that the algorithm described at #608 (comment) won't work if any area of the original polygon is too small/narrow to contain a cell centroid.)

One somewhat crude option (but effective and avoids some extra st_ calls) would be the following:

  1. If desired h3 resolution is R, first use R+2 for generating the indexes for "covering" the polygon
  2. For each of the cell indexes identified in step 1, find the unique set of indexes at resolution R
  3. Review what you end up with: it might be a bit "extra chunky", with some cells that barely touch the original shape, but this gets around the problem of losing coverage cells that were excluded because their centroid was not within the shape
  4. If the original polygon/multipolygon contains holes, review the result as well; you may have custom considerations about when to retain or eliminate the holes

"Your mileage may vary" with this approach, but hopefully it helps

This approach would work, but caveat emptor:

  • You could still end up with missed cells that intersect the polygon
  • You'll use roughly 49x the memory that you'd use at res R, which can be a problem for large polygons

The alternative I generally use is still computationally expensive (though maybe cheaper than the above), but not as memory-intensive:

  • Start with the set returned from polygonToCells
  • Trace all of the line segments of the polygon, sampling points at some distance (usually I think I take an edge length from one of the cells in my starter set), then take gridDisk(cell, 1) of every sampled cell and add those to a candidate set
  • Remove any cells in the candidate set already in the starter set
  • Remove any cells in the candidate set that do not intersect the polygon
  • Add candidates to starter set

This catches a bunch of cases that buffering and other approaches might not (imagine for example a polygon with a long thin peninsula sticking out, which might miss cells even at R+2 if narrow enough).
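A rough Python rendering of that procedure, with the same assumed h3-py names as earlier and shapely for the geometry tests; it samples in degrees along the exterior ring only, so it simplifies both the distance handling and the treatment of holes:

```python
import h3
from shapely.geometry import Polygon

def cells_covering(poly: Polygon, res: int) -> set[str]:
    """Starter set from polygon_to_cells, then add boundary cells that intersect."""
    shell = [(lat, lng) for lng, lat in poly.exterior.coords]
    cells = set(h3.polygon_to_cells(h3.LatLngPoly(shell), res))

    # Sample the outline roughly one edge length apart and take gridDisk(cell, 1)
    # of each sampled cell as candidates. (Hole rings would need the same pass.)
    step = h3.average_hexagon_edge_length(res, unit='m') / 111_320.0  # crude deg conversion
    candidates: set[str] = set()
    ring, d = poly.exterior, 0.0
    while d < ring.length:
        pt = ring.interpolate(d)                      # shapely points are (x=lng, y=lat)
        candidates.update(h3.grid_disk(h3.latlng_to_cell(pt.y, pt.x, res), 1))
        d += step

    # Keep only candidates that actually intersect the polygon.
    for cell in candidates - cells:
        cell_poly = Polygon((lng, lat) for lat, lng in h3.cell_to_boundary(cell))
        if cell_poly.intersects(poly):
            cells.add(cell)
    return cells
```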

Actually, since the current polyfill algorithm starts by tracing outlines and then iteratively fills the shape toward the center, it's possible to adapt the first step (outline calculation) to use another predicate (e.g. intersection vs. centroid) and keep the second half of the algorithm (filling inward) unchanged.

At least that's the approach I used in h3o, and so far the results are looking good (example), but I could have missed some corner cases.
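To make the predicate swap concrete, here is roughly what the per-cell tests for the requested modes could look like using shapely predicates; the mode names and the helper itself are illustrative, not part of H3:

```python
import h3
from shapely.geometry import Point, Polygon

def cell_passes(cell: str, poly: Polygon, mode: str) -> bool:
    """Per-cell test under different polyfill modes (names are illustrative)."""
    cell_poly = Polygon((lng, lat) for lat, lng in h3.cell_to_boundary(cell))
    if mode == "center":        # current polygonToCells behaviour: cell center inside
        lat, lng = h3.cell_to_latlng(cell)
        return poly.contains(Point(lng, lat))
    if mode == "intersects":    # any overlap between the cell and the polygon
        return cell_poly.intersects(poly)
    if mode == "contained":     # the whole cell lies within the polygon
        return poly.contains(cell_poly)
    raise ValueError(f"unknown mode: {mode}")
```

A full cover of the polygon (the original request) is then the set of cells passing the "intersects" test.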

That's basically the plan for the multi-mode option, unless the alternative algorithm I've been considering turns out to be faster (I have a plan for polygonToCompactCells that would allow results to be streamed instead of returned in a giant memory block, but it might be slower). But that's only different from what I'm describing if you're implementing/updating the algo in the library itself, not if you're a library consumer.

Ha interesting!

I also started to think about it when I was trying to optimize the polyfill, but didn't get very far.

In the end I was able to mitigate the memory usage by streaming the cells as they are generated using iterators (now I can polyfill Russia at resolution 10 using only ~100 MB of RAM), so it was less of an issue.
But if you want to compact it afterward you need to load the whole set into memory, so it would be great to have the ability to generate a compacted set from the get-go 🙂

What was the approach/algorithm you had in mind?

Taking this as a good excuse to document my thinking here 😁

  • New function: polygonToCompactCells
  • Input: Polygon, res R, Memory for candidate set, memory for output set, status flag (START, IN_PROGRESS, DONE)
  • If status flag is START:
    • Populate candidate set with the 122 base cells
  • While there are cells in the candidate set and output set is not full:
    • Pop the next cell from the candidate set
    • For cells with res < R, test cell convex hull against polygon
      • If convex hull is contained, add to output set
      • If convex hull intersects, add all immediate children to candidate set
    • For cells with res R, test cell according to polyfill mode (intersection, containment, centerpoint, etc)
      • If test passes, add to output set
  • If candidate set is empty, set status to DONE
  • Else set status to IN_PROGRESS

Some notes here:

  • The convex hull allows us to cheaply test all children for some parent, avoiding issues of hierarchical non-containment
  • Output is compact by default, though it might not be perfect - it's possible that all children of some parent will pass the test when the parent does not. I'm inclined to think that's acceptable; callers that really don't like it could run the result through compactCells (I think - I'd need to check whether compactCells can take multi-res input). The other option would be to always test a cell's children within the loop and check for the all-passing case.
  • The memory for the output set can be of arbitrary size. If it's too small for the whole output set, the caller can repeat the call with the current candidate set as frequently as needed until the status is DONE. This allows for streaming chunks of output with very little memory usage.
  • The size of the candidate set can be fixed:
    • The first uint64 would be the index of the last cell in the set, so we don't have to send that separately
    • The rest of the set would be of size 122 + 6 * R + 1 (122 for the base cells, 6 for each res before R, because we cut out the parent before adding the children, and 7 for the cells at R), so max 214 * 64 bits ~= 1.7 KB.
    • If we cared, we could use two uint64 slots as a bitmap for the base cells, bringing this size down to 94 * 64 bits ~= 752 bytes, but it's probably not worth the extra complexity.
  • Assuming there isn't a significant performance hit, we can replace polygonToCells with a call to polygonToCompactCells followed by uncompactCells. One downside here is that we'll need memory for the compact set + memory for the uncompacted set. I think it's possible to output uncompacted cells directly, but maintaining state is a little trickier, because you'd add cellToChildren(cell, R) to the output set and you might not have enough room, so you'd have to pass the child iterator back to the caller as well.

One advantage(?) of this approach is that we have to implement polygonContainsPolygon and polygonIntersectsPolygon in order to run the algorithm, so adding support for these modes should be trivial.
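A condensed Python sketch of the traversal outlined above, without the chunking/state machinery; h3-py v4 names are assumed, the shapely predicates stand in for polygonContainsPolygon/polygonIntersectsPolygon, and the cell's own boundary is used in place of a proper convex hull (see the comment in the code):

```python
from collections import deque
import h3
from shapely.geometry import Polygon

def polygon_to_compact_cells(poly: Polygon, res: int) -> set[str]:
    """Coarse-to-fine traversal: cheap geometric test per cell, recursing
    only into subtrees that might overlap the polygon."""
    out = set()
    candidates = deque(h3.get_res0_cells())       # the 122 base cells
    while candidates:
        cell = candidates.popleft()
        cell_res = h3.get_resolution(cell)
        # Stand-in for the convex hull: the cell's own boundary. A real
        # implementation needs a hull/bbox guaranteed to contain every
        # descendant, since children overhang their parent slightly.
        hull = Polygon((lng, lat) for lat, lng in h3.cell_to_boundary(cell))
        if cell_res < res:
            if poly.contains(hull):
                out.add(cell)                      # whole subtree inside: keep compact
            elif poly.intersects(hull):
                candidates.extend(h3.cell_to_children(cell, cell_res + 1))
        elif poly.intersects(hull):                # res-R cell: apply the polyfill mode
            out.add(cell)
    return out
```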

Improvement from @ajfriend - the candidate set can be a single cell! Simpler, and the memory requirement is down to almost nothing:

  • New function: polygonToCompactCells(polygon, res, *currentCell, *out)
  • If currentCell is H3_NULL:
    • Set currentCell to base cell 0
  • While currentCell exists and output set is not full:
    • For cells with res < R, test cell convex hull against polygon
      • If convex hull is contained, add to output set
      • If convex hull intersects, set currentCell to first child and continue
    • For cells with res R, test cell according to polyfill mode (intersection, containment, centerpoint, etc)
      • If test passes, add to output set
    • Set currentCell to next cell:
      • while true:
        • If not the last sibling of parent, set currentCell to next sibling and break
        • If cell res == 0, set currentCell to next base cell and break, or H3_NULL and break if no more base cells
        • Set currentCell to cellToParent(cell, res - 1)

No need for a flag: if currentCell is H3_NULL you're done; otherwise you can pass the current cell back in with more memory.

Really interesting, thanks for sharing 🙏

I'll try to implement this approach in h3o when I get the bandwidth to do so; curious to see how it performs in practice.

One part I'm not sure to understand is this branch:

Set currentCell to next cell:

  • Set currentCell to cellToParent(cell, res - 1)

If a cell at R=1 intersects with the shape, we will explore its children at R=2.
Once we've tested the last sibling, we skip the first branch (next sibling) and the second branch (next base cell) and set currentCell back to the parent cell at R=1 => aren't we stuck in a loop here, or did I miss something?

That's why we have the while true loop - this branch doesn't break, so it continues the loop and looks for the next sibling of the parent. That way we can walk all the way back up the hierarchy until we get to an ancestor that has a next sibling.

There's almost certainly a nicer way to write this, but that was the first version I thought of.
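A small Python sketch of that successor walk may make the control flow clearer; next_cell is purely illustrative (a real implementation would manipulate child digits directly rather than materialize sibling lists), and base_cells would be the ordered list of res-0 cells, e.g. list(h3.get_res0_cells()):

```python
import h3

def next_cell(cell: str, base_cells: list[str]) -> str | None:
    """Depth-first successor: climb back up the hierarchy until we find an
    ancestor that still has a next sibling (or run out of base cells)."""
    while True:
        res = h3.get_resolution(cell)
        if res == 0:
            # Base cells have no parent: move to the next base cell, or stop.
            i = base_cells.index(cell)
            return base_cells[i + 1] if i + 1 < len(base_cells) else None
        parent = h3.cell_to_parent(cell, res - 1)
        siblings = h3.cell_to_children(parent, res)
        i = siblings.index(cell)
        if i + 1 < len(siblings):
            return siblings[i + 1]     # not the last sibling: take the next one
        cell = parent                  # last sibling: continue the loop one level up
```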

Update on this (sorry, I've taken over this issue as a thread on the new polyfill algo):

  • I have this working with iterators, which allows us to also offer the non-compact version with very low overhead. I ended up using simple bounding boxes instead of the convex hull - less accurate, much faster.
  • Memory usage is small and fixed size, which is a big win.
  • The performance benefit is murkier. It isn't better or worse than the current algorithm - it just has different characteristics. In benchmarks, it's faster than the current algorithm for large numbers of contiguous cells (because uncompact is faster than flood fill) but slower for large numbers of polygon vertices (because the containment check scales with the number of vertices).
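For illustration, a bounding-box pruning test along those lines could look like the following; it ignores antimeridian-crossing cells and the slight overhang of children beyond their parent, both of which a real implementation has to handle:

```python
import h3
from shapely.geometry import Polygon, box

def cell_bbox(cell: str) -> Polygon:
    """Axis-aligned lat/lng bounding box of a cell boundary.
    (Ignores antimeridian wrap and the overhang of descendant cells.)"""
    boundary = h3.cell_to_boundary(cell)
    lats = [lat for lat, _ in boundary]
    lngs = [lng for _, lng in boundary]
    return box(min(lngs), min(lats), max(lngs), max(lats))

def classify(cell: str, poly: Polygon) -> str:
    """Classify a cell's subtree: skip it, keep it whole, or descend into it."""
    b = cell_bbox(cell)
    if not b.intersects(poly):
        return "skip"       # no descendant can intersect the polygon
    if poly.contains(b):
        return "keep"       # every descendant is inside the polygon
    return "descend"        # mixed: test the children
```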

On the plus side (relevant to the original ticket) I had to implement cellBoundaryInsidePolygon for the algorithm, so containment and intersection modes should be a very easy lift once this is merged.

The additional modes have been implemented in #796