Using in PySpark
sabman opened this issue
Any thoughts on how we could use the library in PySpark?
For spatial joins at least, I was thinking of changing the class to work with WKT and WKB strings instead of only JTS Geometry objects. That should allow us to write something like:
```python
from shapely.geometry import Polygon, Point

jvm = sc._jvm

rectangleA = Polygon([(0, 0), (0, 10), (10, 10), (10, 0)])
rectangleB = Polygon([(-4, -4), (-4, 4), (4, 4), (4, -4)])
rectangleC = Polygon([(7, 7), (7, 8), (8, 8), (8, 7)])
pointD = Point((1, -1))

def geomABWithId():
    return sc.parallelize([
        (0, rectangleA.wkt),
        (1, rectangleB.wkt),
    ])

def geomCWithId():
    return sc.parallelize([(0, rectangleC.wkt)])

def geomABCWithId():
    return sc.parallelize([
        (0, rectangleA.wkt),
        (1, rectangleB.wkt),
        (2, rectangleC.wkt),
    ])

def geomDWithId():
    return sc.parallelize([(0, pointD.wkt)])

predicate = jvm.spatialspark.operator.SpatialOperator.Within()
jvm.spatialspark.join.BroadcastSpatialJoin(
    sc, geomABWithId(), geomABCWithId(), predicate).collect()
```
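To make the interchange idea above concrete: the `.wkt` strings that cross the Python/JVM boundary are just plain text. Here's a tiny pure-Python sketch (no shapely or Spark required) of what the WKT for `rectangleA` looks like; `polygon_wkt` is a hypothetical helper for illustration, not part of any library:

```python
# Hypothetical sketch: geometries travel between Python and the JVM as
# WKT text rather than as JTS objects.

def polygon_wkt(coords):
    """Build a WKT POLYGON string from a list of (x, y) vertices."""
    ring = list(coords) + [coords[0]]  # WKT rings must be explicitly closed
    body = ", ".join("%g %g" % (x, y) for x, y in ring)
    return "POLYGON ((%s))" % body

print(polygon_wkt([(0, 0), (0, 10), (10, 10), (10, 0)]))
# POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0))
```

That output matches what shapely's `rectangleA.wkt` produces, so the JVM side only ever needs a WKT reader (e.g. JTS's `WKTReader`) to reconstruct the geometry.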
Thoughts?
I don't have much experience with PySpark, or even Python, so I may need to spend some time to make it work. My current focus is still the core part, which is the implementation on top of the DataFrame/Dataset APIs.
OK cool! I can help with the Python API; I've started working on it. Today I hacked around a bit just to see if I could make it run, and it works pretty well! https://github.com/sabman/SpatialSpark/tree/python-hack I'll close this for now. Let's connect on Gitter.