apache / sedona

A cluster computing framework for processing large-scale geospatial data

Home Page: https://sedona.apache.org/

try 1-N-N performance tuning with LATERAL subquery

MyqueWooMiddo opened this issue · comments

Expected behavior

References:

https://postgis.net/workshops/postgis-intro/knn.html

https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-lateral-subquery.html

I upgraded Spark to 3.5.1 and tried a LATERAL subquery to compute the 1-N-N (1-Nearest-Neighbour) of each point.

I want to get each point's 1-N-N within the same table, data_points(id, longitude, latitude), using Sedona.

Actual behavior

Spark does not support this type of LATERAL subquery.

Steps to reproduce the problem

with t_data as (
select id ,st_point(longitude,latitude) as point from data_points order by 1 limit 1000
)
select * from t_data t1, lateral (
select t2.id,ST_DistanceSpheroid(t1.point,t2.point) as distance from t_data t2
where t1.id!=t2.id order by 2 limit 1
)

Spark throws:
"org.apache.spark.sql.catalyst.ExtendedAnalysisException: [UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.ACCESSING_OUTER_QUERY_COLUMN_IS_NOT_ALLOWED] Unsupported subquery expression: Accessing outer query column is not allowed in this locationProject"

I just want to know how I can optimize 1-N-N on a large dataset, rather than using row_number(order by distance) = 1.

Settings

Sedona version = 1.5.1

Apache Spark version = 3.5.1

API type = Scala

Scala version = 2.12

JRE version = 1.8

Environment = Standalone

Neither all-NN join nor KNN join is currently supported in Apache Sedona. We will add support in one or two months.

I think the iterative H3 approach used by Databricks Mosaic is a good idea.
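
To make the grid-cell idea concrete, here is a minimal sketch in plain Python. It is not Mosaic's or Sedona's API. It assumes a square grid as a stand-in for H3 cells and planar Euclidean distance instead of ST_DistanceSpheroid. The names nearest_neighbours, ring_cells and cell_size are hypothetical and exist only for this illustration. Each point is bucketed into a cell, and the search widens one ring of cells at a time until no unvisited cell can hold a closer point.

```python
import math
from collections import defaultdict


def ring_cells(cx, cy, ring):
    """Yield the grid cells at Chebyshev distance exactly `ring` from (cx, cy)."""
    if ring == 0:
        yield (cx, cy)
        return
    # Top and bottom rows of the ring, corners included.
    for dx in range(-ring, ring + 1):
        yield (cx + dx, cy - ring)
        yield (cx + dx, cy + ring)
    # Left and right columns, corners excluded.
    for dy in range(-ring + 1, ring):
        yield (cx - ring, cy + dy)
        yield (cx + ring, cy + dy)


def nearest_neighbours(points, cell_size):
    """Map each id to (nearest_id, distance); `points` maps id -> (x, y).

    Assumption: planar coordinates and a square grid, not real H3 cells.
    """
    grid = defaultdict(list)
    cell_of = {}
    for pid, (x, y) in points.items():
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        cell_of[pid] = cell
        grid[cell].append(pid)

    # No occupied cell lies farther than this many rings from any other.
    xs = [c[0] for c in grid]
    ys = [c[1] for c in grid]
    max_ring = max(max(xs) - min(xs), max(ys) - min(ys)) if grid else 0

    result = {}
    for pid, (x, y) in points.items():
        cx, cy = cell_of[pid]
        best_id, best_dist = None, math.inf
        for ring in range(max_ring + 1):
            for cell in ring_cells(cx, cy, ring):
                for other in grid.get(cell, ()):
                    if other == pid:
                        continue
                    ox, oy = points[other]
                    d = math.hypot(x - ox, y - oy)
                    if d < best_dist:
                        best_id, best_dist = other, d
            # Points in unvisited rings are more than ring * cell_size away,
            # so the current best cannot be beaten once it is within that bound.
            if best_dist <= ring * cell_size:
                break
        result[pid] = (best_id, best_dist)
    return result
```

In a Spark job the cell id would presumably serve as an equi-join key, so each widening step becomes a join on cell ids rather than a full cross join followed by row_number. The choice of cell_size trades the number of iterations against the number of candidate pairs per cell.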