syoummer / SpatialSpark

Big Spatial Data Processing using Spark

Home Page: http://simin.me/projects/spatialspark/

Does this project have the R-tree feature currently?

ChenZhongPu opened this issue

For some operations, building an R-tree index on the MBRs is much more efficient. Does this project support it?

I have an R-tree implementation on Spark, but it does not seem very efficient based on my benchmarks, so I decided to remove it for now.

Can I directly bulk load data and build the index using JTS, and save the index to a file for storage? See http://stackoverflow.com/questions/29113702/strtree-in-jts-topology-suite-bulk-load-data-and-build-index for more.

Here is your code in BroadcastSpatialJoin.scala:

    // create R-tree on right dataset
    val strtree = new STRtree()
    val rightGeometryWithIdLocal = rightGeometryWithId.collect()
    rightGeometryWithIdLocal.foreach { x =>
      val y = x._2.getEnvelopeInternal
      y.expandBy(radius)
      strtree.insert(y, x)
    }
    val rtreeBroadcast = sc.broadcast(strtree)
    leftGeometryWithId.flatMap(x => queryRtree(rtreeBroadcast, x._1, x._2, joinPredicate, radius))

If the right dataset is large, will it (the strtree) still fit in memory?

I don't know much about parallel computing. Do RDD operations parallelize this automatically?

The broadcast-based join assumes that the right dataset fits in memory, as described in our tech report. If that is not the case, the partition-based join is the solution.
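To illustrate the general idea behind a partition-based join, here is a minimal sketch in plain Scala, with ordinary collections standing in for RDDs. It is not SpatialSpark's implementation: `Box`, `cellsOf`, `partitionJoin`, and the fixed square grid are all assumptions made for illustration. Both datasets are split by grid cell, only boxes that share a cell are compared, and duplicate pairs are removed at the end.

```scala
// Hypothetical names throughout; a fixed square grid is assumed for simplicity.
final case class Box(id: Int, minX: Double, minY: Double, maxX: Double, maxY: Double) {
  def intersects(o: Box): Boolean =
    minX <= o.maxX && o.minX <= maxX && minY <= o.maxY && o.minY <= maxY
}

object GridJoin {
  // All grid cells (column, row) that a box overlaps, for square cells of side cellSize.
  def cellsOf(b: Box, cellSize: Double): Seq[(Int, Int)] =
    for {
      cx <- math.floor(b.minX / cellSize).toInt to math.floor(b.maxX / cellSize).toInt
      cy <- math.floor(b.minY / cellSize).toInt to math.floor(b.maxY / cellSize).toInt
    } yield (cx, cy)

  def partitionJoin(left: Seq[Box], right: Seq[Box], cellSize: Double): Set[(Int, Int)] = {
    // Step 1: assign every box to each cell it overlaps (a box may be replicated).
    val leftByCell  = left.flatMap(b => cellsOf(b, cellSize).map(_ -> b)).groupBy(_._1)
    val rightByCell = right.flatMap(b => cellsOf(b, cellSize).map(_ -> b)).groupBy(_._1)

    // Step 2: compare only boxes that share a cell; each cell is independent work.
    val pairs = for {
      (cell, ls) <- leftByCell.toSeq
      (_, l)     <- ls
      (_, r)     <- rightByCell.getOrElse(cell, Seq.empty)
      if l.intersects(r)
    } yield (l.id, r.id)

    // Step 3: a pair sharing several cells is found several times; the Set removes duplicates.
    pairs.toSet
  }

  def main(args: Array[String]): Unit = {
    val left  = Seq(Box(1, 0, 0, 2, 2), Box(2, 10, 10, 11, 11))
    val right = Seq(Box(7, 1, 1, 3, 3), Box(8, 20, 20, 21, 21))
    println(partitionJoin(left, right, cellSize = 2.0)) // prints Set((1,7))
  }
}
```

In this sketch neither side has to fit in memory as a whole, because each cell only sees the boxes that overlap it. On Spark, the two `groupBy` steps would presumably become a keyed shuffle on the cell id, so that each cell can be processed by a different task.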

Regarding my second question:

Can I directly bulk load data and build the index using JTS, and save the index to a file for storage? See http://stackoverflow.com/questions/29113702/strtree-in-jts-topology-suite-bulk-load-data-and-build-index for more.

It seems that the R-tree in JTS does the bulk loading when the query method is first called. So saving the strtree object to a file for future use seems to make no sense, right?
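For a dataset small enough to index on one machine, the lazy build can be forced before saving. The following is a minimal sketch, not code from SpatialSpark. It assumes JTS 1.15 or later (package `org.locationtech.jts`; older releases use `com.vividsolutions.jts`), that the inserted items are `Serializable`, and that plain Java serialization is acceptable. `StrTreeRoundTrip`, `buildTree`, and `roundTrip` are hypothetical names.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import org.locationtech.jts.geom.Envelope
import org.locationtech.jts.index.strtree.STRtree

object StrTreeRoundTrip {
  // Build a small tree of unit squares on the diagonal; item i covers [i, i+1] x [i, i+1].
  def buildTree(n: Int): STRtree = {
    val tree = new STRtree()
    (0 until n).foreach { i =>
      val lo = i.toDouble
      tree.insert(new Envelope(lo, lo + 1, lo, lo + 1), Integer.valueOf(i))
    }
    tree.build() // force the bulk load now instead of on the first query
    tree
  }

  // Serialize with Java serialization and read the tree back (in memory here;
  // a FileOutputStream / FileInputStream pair would store it in a file instead).
  def roundTrip(tree: STRtree): STRtree = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(tree) finally out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[STRtree] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val restored = roundTrip(buildTree(100))
    // Query a small window that lies inside the square of item 10.
    println(restored.query(new Envelope(10.2, 10.8, 10.2, 10.8)))
  }
}
```

Because `build()` runs before the tree is written, the saved object should already carry the packed structure, so reloading it avoids repeating the bulk load. This only helps when the whole tree fits in one JVM, which is the same limit the broadcast join has.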

It depends.

From my understanding, you are trying to use JTS to bulk load an R-tree for a very large dataset, which I don't think is feasible. As I mentioned, I have an R-tree implementation on Spark that does not use JTS for this purpose, but its performance is not very good. I am now thinking about implementing an R-tree structure similar to the one in SpatialHadoop. I have already implemented several components, but I have not had time to finish it recently.