syoummer / SpatialSpark

Big Spatial Data Processing using Spark

Home Page: http://simin.me/projects/spatialspark/


Using in PySpark

sabman opened this issue

Any thoughts on how we could use the library from PySpark?
I was thinking that, at least for spatial joins, we could change the class to accept WKT and WKB strings instead of only JTS Geometry objects. That would let us write something like:

jvm = sc._jvm  # py4j gateway into the JVM running alongside PySpark

from shapely.geometry import Polygon, Point

# Test geometries: rectangleC lies inside rectangleA; pointD lies inside rectangleB
rectangleA = Polygon([(0, 0), (0, 10), (10, 10), (10, 0)])
rectangleB = Polygon([(-4, -4), (-4, 4), (4, 4), (4, -4)])
rectangleC = Polygon([(7, 7), (7, 8), (8, 8), (8, 7)])
pointD = Point((1, -1))

def geomABWithId():
    return sc.parallelize([
        (0, rectangleA.wkt),
        (1, rectangleB.wkt)
    ])

def geomCWithId():
    return sc.parallelize([(0, rectangleC.wkt)])

def geomABCWithId():
    return sc.parallelize([
        (0, rectangleA.wkt),
        (1, rectangleB.wkt),
        (2, rectangleC.wkt)
    ])

def geomDWithId():
    return sc.parallelize(
        [(0, pointD.wkt)]
    )

# Proposed usage: drive the Scala join through py4j (in practice the
# Python RDDs would still need converting to their JVM counterparts)
predicate = jvm.spatialspark.operator.SpatialOperator.Within()
jvm.spatialspark.join.BroadcastSpatialJoin(sc,
    geomABWithId(), geomABCWithId(), predicate).collect()
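
For reference, here is what a Within join over those inputs should return, checked locally with plain shapely (the naive_within_join helper below is just for illustration, not part of SpatialSpark):

from shapely.wkt import loads

# Reference nested-loop join over (id, wkt) pairs: keep every pair
# where the left geometry is within the right geometry.
def naive_within_join(left, right):
    return [(lid, rid)
            for lid, lwkt in left
            for rid, rwkt in right
            if loads(lwkt).within(loads(rwkt))]

left = [(0, rectangleA.wkt), (1, rectangleB.wkt)]
right = [(0, rectangleA.wkt), (1, rectangleB.wkt), (2, rectangleC.wkt)]
print(naive_within_join(left, right))  # [(0, 0), (1, 1)] -- each rectangle is within itself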

Thoughts?

I don't have much experience with PySpark, or even Python, so I may need to spend some time to make this work. My current focus is still the core part, which is the implementation on top of the DataFrame/Dataset APIs.
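
For what it's worth, the WKT-string idea would carry over to DataFrames as well. Here is a minimal sketch, assuming nothing from SpatialSpark (the shapely-backed UDF and the naive cross join + filter are stand-ins for illustration; a real implementation would prune candidate pairs with a spatial index instead):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
from shapely.wkt import loads

spark = SparkSession.builder.getOrCreate()

left = spark.createDataFrame(
    [(0, rectangleA.wkt), (1, rectangleB.wkt)],
    ["left_id", "left_wkt"])
right = spark.createDataFrame(
    [(0, rectangleA.wkt), (1, rectangleB.wkt), (2, rectangleC.wkt)],
    ["right_id", "right_wkt"])

# Evaluate the spatial predicate on WKT strings with shapely
within_udf = udf(lambda a, b: loads(a).within(loads(b)), BooleanType())

pairs = (left.crossJoin(right)
             .where(within_udf("left_wkt", "right_wkt"))
             .select("left_id", "right_id"))
pairs.show()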

OK, cool! I can help with the Python API; I've started working on it. Today I hacked around a bit just to see if I could make it run, and it works pretty well: https://github.com/sabman/SpatialSpark/tree/python-hack. I'll close this for now. Let's connect on Gitter.