databricks / koalas

Koalas: pandas API on Apache Spark

something wrong with `rank` method?

RainFung opened this issue · comments

commented

[screenshot]

commented

I guess it might have something to do with the `ks.set_option("compute.ops_on_diff_frames", True)` mechanism.

commented

I found the cause: it is the index.

ks.set_option("compute.default_index_type", "distributed")
ks.set_option("compute.ops_on_diff_frames", True)

[screenshot]

When I use the `distributed` index type and compute the rank, `rank` generates a new DataFrame and joins it back with the original one. But the newly generated index is not consistent with the old index, so the join fails.

So when using `rank`, we cannot use the `distributed` index type.
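The failure mode can be sketched with plain pandas (a hypothetical illustration of index-based alignment, not koalas internals): if the ranked Series comes back with a freshly generated index that no longer matches the frame's index, assignment aligns by label and produces all-NaN values, analogous to the failed join above.

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2]})  # default index: 0, 1, 2

r = df["a"].rank()                   # rank result carries the same index 0, 1, 2
r.index = [10, 11, 12]               # simulate an inconsistent, freshly generated "distributed" index

df["r"] = r                          # alignment by label finds no matches -> all NaN
print(df["r"])
```

With a `distributed-sequence` index, by contrast, both sides carry the same deterministic 0..n-1 labels, so the join lines up.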

ks.set_option("compute.default_index_type", "distributed-sequence")
ks.set_option("compute.ops_on_diff_frames", True)

[screenshot]

commented

When I work with big data, I often hit the error below:

Traceback (most recent call last):
  File "/pythonenv/python3/lib/python3.6/site-packages/databricks/koalas/internal.py", line 713, in attach_distributed_sequence_column
    jrdd = jdf.localCheckpoint(False).rdd().zipWithIndex()
  File "/data1/yarnenv/local/appcache/application_1619753973429_65911389/container_e04_1619753973429_65911389_01_000001/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/data1/yarnenv/local/usercache/appcache/application_1619753973429_65911389/container_e04_1619753973429_65911389_01_000001/pyspark.zip/pyspark/sql/utils.py", line 67, in deco
  File "/data1/yarnenv/local/usercache/appcache/application_1619753973429_65911389/container_e04_1619753973429_65911389_01_000001/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o24209.localCheckpoint.
: java.lang.StackOverflowError
	at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:438)