GeometryType(geom) triggered an exception
ruanqizhen opened this issue
Expected behavior
When I call "GeometryType(geom)" on my table, it throws a java.lang.NullPointerException, but "ST_GeometryType(geom)" works as expected.
The job just crashed, so I couldn't find out which row caused the problem.
The stack trace:
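Roughly, the calls look like this (the table and column names here are placeholders, not my real ones):

```python
# Placeholder table/column names; the real query runs against my own table.
spark.sql("SELECT GeometryType(geom) FROM my_table").show()     # NullPointerException
spark.sql("SELECT ST_GeometryType(geom) FROM my_table").show()  # works fine
```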
File "<stdin>", line 1, in <module>
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 423, in show
print(self._jdf.showString(n, int_truncate, vertical))
File "/opt/amazon/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py", line 1321, in __call__
return_value = get_return_value(
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
return f(*a, **kw)
File "/opt/amazon/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o128.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 23.0 failed 4 times, most recent failure: Lost task 3.3 in stage 23.0 (TID 71) ([2600:1f13:b65:2706:d7b:30e2:5cb1:5cfa] executor 19): java.lang.NullPointerException
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2610)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2559)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2558)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2558)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1200)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1200)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1200)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2798)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2740)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2729)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.checkNoFailures(AdaptiveExecutor.scala:154)
at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.doRun(AdaptiveExecutor.scala:88)
at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.tryRunningAndGetFuture(AdaptiveExecutor.scala:66)
at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.execute(AdaptiveExecutor.scala:57)
at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:241)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:240)
at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:509)
at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:471)
at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3779)
at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2769)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3770)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3768)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2769)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2976)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:289)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:328)
at sun.reflect.GeneratedMethodAccessor157.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.NullPointerException
Settings
Sedona version = 1.5
Apache Spark version = 3.2
Apache Flink version = ?
API type = Python
Python version = 3.10
Environment = Amazon Athena
@ruanqizhen can you show me the full stacktrace? The one that is after Caused by: java.lang.NullPointerException?
This is the entire stacktrace it returned. Caused by: java.lang.NullPointerException is the last line.
@ruanqizhen I think GeometryType might have non-deterministic behavior over invalid geometries. We need to investigate this issue further.
Can you run ST_IsValid on your geometry column? Are there any invalid geometries? If you remove those and run GeometryType again, do you still have the same problem?
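Something along these lines should surface the invalid rows; "my_table" and "geom" are placeholders for your own table and geometry column:

```python
# List the rows whose geometries are invalid.
spark.sql("""
    SELECT *
    FROM my_table
    WHERE ST_IsValid(geom) = false
""").show()

# Re-run GeometryType only on the valid rows.
spark.sql("""
    SELECT GeometryType(geom) AS gtype
    FROM my_table
    WHERE ST_IsValid(geom) = true
""").show()
```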
It is not likely caused by an invalid geometry, because I've already called ST_MakeValid() on all of the geometries.
@ruanqizhen ST_MakeValid might not work for all cases. Another way to fix geometries is ST_Buffer(geom, 0).
If both of these methods still cannot fix the issue, I think you can stick to ST_GeometryType. If you don't want the ST_ prefix in the result, you can easily write a PySpark UDF to split the string by _ and keep only the second half.
I'm using ST_GeometryType now, but I'm wondering if there is a way to find out which row caused the problem. Is there a way I can "try", or skip the error, so the query can continue processing the other rows and I can then check which rows were skipped?
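For now I'm considering something like the sketch below: a plain Python UDF that catches the error per row, so bad rows come back as NULL instead of failing the whole job. This assumes Sedona hands the geometry column to Python UDFs as Shapely objects; "df" and "geom" are again placeholders.

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def safe_geom_type(g):
    # Return the Shapely geometry type, or NULL if the row cannot be processed.
    try:
        return g.geom_type if g is not None else None
    except Exception:
        return None

safe_geom_type_udf = F.udf(safe_geom_type, StringType())

flagged = df.withColumn("gtype", safe_geom_type_udf(F.col("geom")))
# Rows where gtype is NULL but geom is not are the suspects to inspect.
flagged.filter(F.col("gtype").isNull() & F.col("geom").isNotNull()).show()
```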