[BUG] Cannot read files into dataframe in Databricks 11.3 LTS Runtime 3.3.0 Spark


james-miles-ccy opened this issue 2 years ago

James Miles commented 2 years ago

Is there an existing issue for this?

I have searched the existing issues

Current Behavior

When running the v2 Excel PySpark code below on Databricks Runtime 11.3 LTS:

df = (
    spark.read.format("excel")
    .option("header", True)
    .option("inferSchema", True)
    .load(fr"{folderpath}//.xlsx")
)
display(df)

I receive the following error upon attempting to display or use the resulting dataframe:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 101) (10.94.235.131 executor 1): java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory.options()Lorg/apache/spark/sql/catalyst/FileSourceOptions;

Expected Behavior

The resulting Dataframe should display correctly.

Steps To Reproduce

Set the folderpath variable to a location containing Excel files, then run the Python code below on the latest Databricks runtime:

df = (
    spark.read.format("excel")
    .option("header", True)
    .option("inferSchema", True)
    .load(fr"{folderpath}//.xlsx")
)
display(df)

Environment

- Spark version: 3.3.0
- Spark-Excel version: 0.18.5
- OS: Windows 10
- Cluster environment

Anything else?

No response

Martin Mauch · Answer 1 · Wed Nov 16 2022 21:18:08 GMT+0800 (China Standard Time)

Hey @james-miles-ccy, the Spark-Excel version should consist of the Spark version and the version of Spark-Excel itself.
You were only specifying the version of Spark-Excel. Can you check you were using 3.3.1_0.18.5?
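
As a rough illustration of the naming scheme described above, the full library version is the Spark version and the Spark-Excel version joined by an underscore. The sketch below builds the corresponding Maven coordinate. The helper name spark_excel_coordinate is hypothetical, and the Scala suffix 2.12 is an assumption about the cluster that should be checked against the actual runtime.

```python
def spark_excel_coordinate(spark_version: str,
                           excel_version: str,
                           scala_version: str = "2.12") -> str:
    """Build the Maven coordinate for a spark-excel release.

    The version part is "<spark version>_<spark-excel version>".
    scala_version defaults to 2.12, which is an assumption; use the
    Scala version your cluster actually runs.
    """
    version = f"{spark_version}_{excel_version}"
    return f"com.crealytics:spark-excel_{scala_version}:{version}"


# The combination discussed in this thread.
print(spark_excel_coordinate("3.3.1", "0.18.5"))
# com.crealytics:spark-excel_2.12:3.3.1_0.18.5
```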

James Miles · Answer 2 · Wed Nov 16 2022 21:33:53 GMT+0800 (China Standard Time)

Yes I am using 3.3.1_0.18.5

Martin Mauch · Answer 3 · Thu Nov 17 2022 00:13:09 GMT+0800 (China Standard Time)

Can you check the same thing with a local or other non-Databricks Spark 3.3.0?
We already had the case once where Databricks used a slightly different and not fully API-compatible version of Spark in their Runtime than the officially published one.

James Miles · Answer 4 · Tue Nov 22 2022 23:22:59 GMT+0800 (China Standard Time)

I have installed PySpark and spark-excel locally. The v1 format works fine and generates dataframes on Spark 3.3.1, but using a path that matches multiple files (i.e. the v2 format) causes cells to hang and never complete. I am using the same spark-excel version as stated above.

Martin Mauch · Answer 5 · Wed Nov 23 2022 00:53:32 GMT+0800 (China Standard Time)

Is it the same error/issue as on Databricks?

James Miles · Answer 6 · Thu Nov 24 2022 23:00:15 GMT+0800 (China Standard Time)

No, in Databricks you receive the error listed in my original comment, whereas running locally causes endless, non-completing execution.

FYI, this is only an issue for v2, v1 works in both Databricks and local.
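
Since v1 works in both environments, one possible stopgap is to read each file with the v1 format and union the results. This is only a sketch under assumptions: the helper name read_excel_v1 is hypothetical, the caller supplies the list of file paths, and all files are assumed to share the same columns. Note that input_file_name() is not available this way.

```python
from functools import reduce

# Format name of the v1 data source (the v2 source is just "excel").
V1_FORMAT = "com.crealytics.spark.excel"


def read_excel_v1(spark, paths):
    """Read each path with the v1 source and union the dataframes.

    `spark` is an existing SparkSession; `paths` is a list of Excel
    file paths collected by the caller. Assumes every file has the
    same column names, because unionByName matches columns by name.
    """
    if not paths:
        raise ValueError("paths must contain at least one file")
    frames = [
        spark.read.format(V1_FORMAT)
        .option("header", True)
        .option("inferSchema", True)
        .load(path)
        for path in paths
    ]
    return reduce(lambda left, right: left.unionByName(right), frames)
```

In a notebook this would be called as read_excel_v1(spark, paths) with paths gathered however suits the storage in use.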

snehawankhade · Answer 7 · Thu Dec 01 2022 04:25:41 GMT+0800 (China Standard Time)

I am facing the same issue with v2 (Spark version: 3.3.0, Spark-Excel: 3.3.1_0.18.5). v1 works, but not completely: input_file_name() returns an empty string.

Martin Mauch · Answer 8 · Thu Dec 01 2022 18:41:01 GMT+0800 (China Standard Time)

input_file_name is only supported in v2. Unfortunately, I didn't have time to look into the original issue.

Darren Fuller · Answer 9 · Sun Dec 11 2022 23:50:27 GMT+0800 (China Standard Time)

Hey @nightscape. This got mentioned in our implementation as well

I think I've traced the issue down to Databricks using a patched spark runtime in the 11.x runtimes (and 12.0 beta runtime) which includes a change from the master branch of Spark which isn't in the 3.3 support branch.

I'm looking into this further at the moment and I'll shout if I find anything

Darren Fuller · Answer 10 · Sat Dec 24 2022 02:45:24 GMT+0800 (China Standard Time)

Just to add an update: I've been talking with Databricks and there's a fix coming which will resolve this in the 11.x and 12.x runtimes. It should hopefully arrive in January.

Martin Mauch · Answer 11 · Sat Dec 24 2022 09:27:45 GMT+0800 (China Standard Time)

@dazfuller thanks a lot for pushing this forward and keeping us updated here!!
We had a similar issue before, so I guess Databricks breaking compatibility with the open-source Spark version is something we have to keep an eye on...

James Miles · Answer 12 · Thu Apr 13 2023 22:48:29 GMT+0800 (China Standard Time)

Hi All, FYI looks like this has all been resolved by Databricks on 12.1 runtime!