crealytics / spark-excel

A Spark plugin for reading and writing Excel files

[BUG] Cannot read files into dataframe in Databricks 11.3 LTS Runtime 3.3.0 Spark

james-miles-ccy opened this issue

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

When running the v2 Excel PySpark code below in the Databricks 11.3 LTS Runtime:

df = (
    spark.read.format("excel")
    .option("header", True)
    .option("inferSchema", True)
    .load(fr"{folderpath}//.xlsx")
)
display(df)

I receive the following error upon attempting to display or use the resulting dataframe:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 101) (10.94.235.131 executor 1): java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory.options()Lorg/apache/spark/sql/catalyst/FileSourceOptions;

Expected Behavior

The resulting Dataframe should display correctly.

Steps To Reproduce

Set the folderpath variable to a location containing Excel files, and run the Python code below in the latest Databricks runtime:

df = (
    spark.read.format("excel")
    .option("header", True)
    .option("inferSchema", True)
    .load(fr"{folderpath}//.xlsx")
)
display(df)

Environment

- Spark version: 3.3.0
- Spark-Excel version: 0.18.5
- OS: Windows 10
- Cluster environment

Anything else?

No response

Hey @james-miles-ccy, the Spark-Excel version should consist of the Spark version and the version of Spark-Excel itself.
You were only specifying the version of Spark-Excel. Can you check you were using 3.3.1_0.18.5?
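
To illustrate the naming scheme described above, here is a minimal sketch that splits a combined version string into its Spark part and its Spark-Excel part and builds a Maven coordinate from it. The helper names are hypothetical, and the Scala suffix 2.12 is an assumption; use whichever Scala version your cluster runs.

```python
from typing import Tuple


def split_spark_excel_version(combined: str) -> Tuple[str, str]:
    """Split '<spark version>_<spark-excel version>' into its two parts."""
    spark_version, sep, plugin_version = combined.partition("_")
    if not sep or not spark_version or not plugin_version:
        raise ValueError(
            f"expected '<spark version>_<spark-excel version>', got {combined!r}"
        )
    return spark_version, plugin_version


def maven_coordinate(combined: str, scala_version: str = "2.12") -> str:
    """Build the Maven coordinate for a combined version string.

    The default Scala version is an assumption made for this sketch.
    """
    split_spark_excel_version(combined)  # raises if the Spark part is missing
    return f"com.crealytics:spark-excel_{scala_version}:{combined}"


if __name__ == "__main__":
    print(split_spark_excel_version("3.3.1_0.18.5"))
    print(maven_coordinate("3.3.1_0.18.5"))
```

A bare "0.18.5" is rejected by the helper, which is the situation in the original report: the Spark part of the version was missing.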

Yes I am using 3.3.1_0.18.5

Can you check the same thing with a local or other non-Databricks Spark 3.3.0?
We already had the case once where Databricks used a slightly different and not fully API-compatible version of Spark in their Runtime than the officially published one.

I have installed PySpark and spark-excel locally. The v1 format works fine and generates dataframes on Spark 3.3.1, but using a path for multiple files (i.e. the v2 format) causes cells to hang and never complete. I am using the same spark-excel version as stated above.

Is it the same error/issue as on DataBricks?

No, in Databricks you receive the error listed in my original comment, whereas the local setup hangs and never completes execution.

FYI, this is only an issue for v2, v1 works in both Databricks and local.
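
For readers unsure about the v1/v2 distinction discussed here, this is a rough sketch of how the two readers are selected by format name: "com.crealytics.spark.excel" for v1 and "excel" for v2. The read_excel helper is hypothetical and only wraps the same options used in the report; it expects an existing SparkSession.

```python
V1_FORMAT = "com.crealytics.spark.excel"  # v1 data source name
V2_FORMAT = "excel"                       # v2 data source name (used in the report)


def read_excel(spark, path, use_v2=True):
    """Read Excel data with either the v1 or the v2 spark-excel reader.

    `spark` is an existing SparkSession; `path` is a file path for v1,
    or a file or multi-file path for v2.
    """
    fmt = V2_FORMAT if use_v2 else V1_FORMAT
    return (
        spark.read.format(fmt)
        .option("header", True)
        .option("inferSchema", True)
        .load(path)
    )
```

On an affected runtime, calling read_excel(spark, path, use_v2=False) corresponds to the v1 path that was reported to work in this thread.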

I am facing the same issue with v2 (Spark version: 3.3.0, Spark-Excel: 3.3.1_0.18.5). v1 works, but not completely: input_file_name() returns an empty string.

input_file_name is only supported in v2. Unfortunately, I didn't have time to look into the original issue.

Hey @nightscape. This got mentioned in our implementation as well

I think I've traced the issue down to Databricks using a patched Spark runtime in the 11.x runtimes (and the 12.0 beta runtime), which includes a change from the master branch of Spark that isn't in the 3.3 support branch.

I'm looking into this further at the moment and I'll shout if I find anything

Just to add an update: I've been talking with Databricks and there's a fix coming which will resolve this in the 11.x and 12.x runtimes. It should hopefully arrive in January.

@dazfuller thanks a lot for pushing this forward and keeping us updated here!!
We had a similar issue before, so I guess Databricks breaking compatibility with the open-source Spark version is something we have to keep an eye on...

Hi All, FYI looks like this has all been resolved by Databricks on 12.1 runtime!
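
As a rough guard for notebooks that must run on several runtimes, the sketch below checks the Databricks runtime version before choosing the v2 reader. It assumes the DATABRICKS_RUNTIME_VERSION environment variable holds a string such as "11.3", and it treats 12.1 as the first fixed runtime only because that is what this thread reports; whether the fix also reached 11.x is not confirmed here. The function names are hypothetical.

```python
import os


def parse_runtime_version(raw):
    """Return (major, minor) from a string such as '11.3', or None if unparseable."""
    parts = raw.strip().split(".")
    try:
        return int(parts[0]), int(parts[1])
    except (IndexError, ValueError):
        return None


def v2_reported_fixed(raw, fixed_in=(12, 1)):
    """True if the runtime is at or above the version this thread reports as fixed."""
    version = parse_runtime_version(raw)
    return version is not None and version >= fixed_in


if __name__ == "__main__":
    # Assumption: this variable is set on Databricks clusters and absent elsewhere.
    runtime = os.environ.get("DATABRICKS_RUNTIME_VERSION", "")
    if runtime:
        print(f"Runtime {runtime}: v2 reported fixed = {v2_reported_fixed(runtime)}")
    else:
        print("Not running on Databricks, or the runtime version is unavailable.")
```

When the check returns False, falling back to the v1 format is the workaround suggested by the reports above.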