try 1-N-N performance tuning with LATERAL subquery

Question

MyqueWooMiddo opened this issue 2 months ago

MyqueWooMiddo commented 2 months ago

Expected behavior

References: https://postgis.net/workshops/postgis-intro/knn.html

https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-lateral-subquery.html

I upgraded Spark to 3.5.1 and tried to use LATERAL to calculate the 1-N-N (1-Nearest-Neighbour) of each point.

I want to find each point's 1-N-N within the same table, data_points(id, longitude, latitude), using Sedona.

Actual behavior

Spark does not support this type of LATERAL subquery.

Steps to reproduce the problem

with t_data as (
  select id, st_point(longitude, latitude) as point
  from data_points
  order by 1
  limit 1000
)
select *
from t_data t1, lateral (
  select t2.id, ST_DistanceSpheroid(t1.point, t2.point) as distance
  from t_data t2
  where t1.id != t2.id
  order by 2
  limit 1
)

Spark throws:
"org.apache.spark.sql.catalyst.ExtendedAnalysisException: [UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.ACCESSING_OUTER_QUERY_COLUMN_IS_NOT_ALLOWED] Unsupported subquery expression: Accessing outer query column is not allowed in this locationProject"

I just want to know how to optimize 1-N-N on a large dataset without falling back to a row_number() (order by distance) = 1 filter.
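For reference, the row_number() = 1 formulation has the same semantics as the brute-force search sketched below in plain Python: every point is compared against every other point, so the cost grows quadratically with the table size. This is only an illustration of what is being computed, not Sedona or Spark code. The function names are made up for this sketch, and the haversine formula is a spherical stand-in for ST_DistanceSpheroid, so the distances will differ slightly from the spheroid ones.

```python
import math


def haversine_m(lon1, lat1, lon2, lat2, radius_m=6371008.8):
    """Great-circle distance in metres on a sphere (an approximation of the spheroid)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))


def nearest_neighbours_bruteforce(points):
    """points: list of (id, longitude, latitude).

    Returns {id: (neighbour_id, distance_m)}. Hypothetical helper, O(n^2) comparisons.
    """
    result = {}
    for pid, lon, lat in points:
        best = None
        for qid, qlon, qlat in points:
            if qid == pid:
                continue  # same role as "t1.id != t2.id" in the query
            d = haversine_m(lon, lat, qlon, qlat)
            if best is None or d < best[1]:
                best = (qid, d)  # same role as "order by distance limit 1"
        if best is not None:
            result[pid] = best
    return result
```

With 1000 points this is already about a million distance evaluations, which is why a spatially partitioned approach is attractive for larger tables.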

Settings

Sedona version = 1.5.1

Apache Spark version = 3.5.1

API type = Scala

Scala version = 2.12

JRE version = 1.8

Environment = Standalone

Jia Yu · Answer 1 · Sun Mar 24 2024 15:49:23 GMT+0800 (China Standard Time)

All NN joins and KNN joins are not currently supported in Apache Sedona. We will add support in one or two months.

MyqueWooMiddo · Answer 2 · Thu Mar 28 2024 19:49:52 GMT+0800 (China Standard Time)

I think the iterative H3 solution from Databricks Mosaic is a good idea.
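As I understand the iterative idea, points are bucketed into grid cells, each point first searches its own cell, and the search then expands ring by ring until no unscanned cell can contain a closer point. Below is a minimal plain-Python sketch of that idea under some loud assumptions: it uses a square grid on already projected planar coordinates with Euclidean distance instead of H3 hexagons on longitude/latitude, it is not Mosaic's or Sedona's API, and the names nearest_neighbours_grid, ring_cells and cell_size are hypothetical.

```python
import math
from collections import defaultdict


def ring_cells(cx, cy, r):
    """Yield the cells at Chebyshev distance exactly r from cell (cx, cy)."""
    if r == 0:
        yield (cx, cy)
        return
    for i in range(-r, r + 1):          # bottom and top rows, corners included
        yield (cx + i, cy - r)
        yield (cx + i, cy + r)
    for j in range(-r + 1, r):          # left and right columns, corners excluded
        yield (cx - r, cy + j)
        yield (cx + r, cy + j)


def nearest_neighbours_grid(points, cell_size):
    """points: list of (id, x, y) in a planar CRS (assumption, not lon/lat).

    Returns {id: (neighbour_id, distance)}.
    """
    cells = defaultdict(list)
    for pid, x, y in points:
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        cells[key].append((pid, x, y))
    if not cells:
        return {}
    xs = [k[0] for k in cells]
    ys = [k[1] for k in cells]
    # No occupied cell is further than this many rings from any other occupied cell.
    max_ring = max(max(xs) - min(xs), max(ys) - min(ys))

    result = {}
    for pid, x, y in points:
        cx, cy = math.floor(x / cell_size), math.floor(y / cell_size)
        best = None
        for r in range(max_ring + 1):
            for cell in ring_cells(cx, cy, r):
                for qid, qx, qy in cells.get(cell, ()):
                    if qid == pid:
                        continue
                    d = math.hypot(qx - x, qy - y)
                    if best is None or d < best[1]:
                        best = (qid, d)
            # Any point in a ring beyond r is more than r * cell_size away,
            # so the current best cannot be beaten once it is within that bound.
            if best is not None and best[1] <= r * cell_size:
                break
        if best is not None:
            result[pid] = best
    return result
```

The choice of cell_size is a trade-off: small cells mean more empty rings to scan in sparse areas, large cells mean more candidates per cell. In a distributed setting the same idea would presumably become a join on the cell key repeated with a growing ring, but that part is not shown here.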