try 1-N-N performance tuning with LATERAL subquery
MyqueWooMiddo opened this issue · comments
Expected behavior
References: https://postgis.net/workshops/postgis-intro/knn.html and
https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-lateral-subquery.html
I upgraded Spark to 3.5.1 to try using LATERAL to compute 1-N-N (1-nearest-neighbour).
I want to find each point's 1-N-N within the same table, data_points(id, longitude, latitude), using Sedona.
Actual behavior
Spark does not support this kind of LATERAL subquery.
Steps to reproduce the problem
with t_data as (
    select id, st_point(longitude, latitude) as point
    from data_points
    order by 1
    limit 1000
)
select *
from t_data t1, lateral (
    select t2.id, ST_DistanceSpheroid(t1.point, t2.point) as distance
    from t_data t2
    where t1.id != t2.id
    order by 2
    limit 1
)
Spark throws:
"org.apache.spark.sql.catalyst.ExtendedAnalysisException: [UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.ACCESSING_OUTER_QUERY_COLUMN_IS_NOT_ALLOWED] Unsupported subquery expression: Accessing outer query column is not allowed in this locationProject"
I just want to know how to optimize 1-N-N on a large dataset, instead of using row_number(order by distance) = 1.
Settings
Sedona version = 1.5.1
Apache Spark version = 3.5.1
API type = Scala
Scala version = 2.12
JRE version = 1.8
Environment = Standalone
Neither all-NN join nor KNN join is currently supported in Apache Sedona. We will add support in one or two months.
I think the iterative H3 approach in Databricks Mosaic is a good idea.
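To make the idea of an iterative, cell-based search concrete, here is a minimal single-machine sketch in plain Python. It is not Mosaic's or Sedona's implementation. As loudly stated assumptions, it uses square grid cells instead of H3 hexagons and planar Euclidean distance instead of ST_DistanceSpheroid, and the names build_grid, ring_cells and nearest_neighbour are hypothetical. Each point is bucketed into a cell, and a query scans rings of cells outward, stopping once no farther ring can hold a closer point.

```python
import math
from collections import defaultdict


def build_grid(points, cell_size):
    """Bucket (id, x, y) points by square grid cell (a stand-in for H3 cells)."""
    grid = defaultdict(list)
    for pid, x, y in points:
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        grid[cell].append((pid, x, y))
    return grid


def ring_cells(cx, cy, k):
    """Yield the cells whose Chebyshev distance from (cx, cy) is exactly k."""
    if k == 0:
        yield (cx, cy)
        return
    for dx in range(-k, k + 1):        # bottom and top rows of the ring
        yield (cx + dx, cy - k)
        yield (cx + dx, cy + k)
    for dy in range(-k + 1, k):        # left and right columns, corners excluded
        yield (cx - k, cy + dy)
        yield (cx + k, cy + dy)


def nearest_neighbour(query, grid, cell_size, max_ring):
    """Return (distance, neighbour_id) of the closest other point, or None."""
    qid, qx, qy = query
    cx, cy = math.floor(qx / cell_size), math.floor(qy / cell_size)
    best = None
    for k in range(max_ring + 1):
        # Every point in ring k is more than (k - 1) * cell_size away,
        # so once the best candidate is within that bound we can stop.
        if best is not None and best[0] <= (k - 1) * cell_size:
            break
        for cell in ring_cells(cx, cy, k):
            for pid, x, y in grid.get(cell, ()):
                if pid == qid:
                    continue           # skip the query point itself
                d = math.hypot(x - qx, y - qy)
                if best is None or d < best[0]:
                    best = (d, pid)
    return best
```

The choice of cell_size is a trade-off: smaller cells mean fewer candidate points per ring but more rings to scan in sparse areas. In a distributed setting the same idea would presumably become an equi-join on cell id repeated per ring, which avoids the full cross join behind the row_number approach.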