crealytics / spark-excel

A Spark plugin for reading and writing Excel files


[BUG] Cannot read/ write dataframe after loading file in Databricks 12.1 Runtime 3.3.1 Spark

jmichaelsoliven opened this issue · comments

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

When running the PySpark code below in Databricks 12.1 with the Spark 3.3.1 runtime:

df = spark.read.format("com.crealytics.spark.excel") \
    .option("dataAddress", "'" + param_excel_sheet + "'!" + param_excel_row_start) \
    .option("header", False) \
    .option("treatEmptyValuesAsNulls", True) \
    .option("maxRowsInMemory", 20) \
    .option("inferSchema", "false") \
    .load(param_mountPoint + param_in_adls_raw_path + param_in_file_name)

df.show(truncate = False)

I received the following error:

An error occurred while calling o3150.showString.
: com.github.pjfanning.xlsx.exceptions.ParseException: Error reading XML stream
at com.github.pjfanning.xlsx.impl.StreamingRowIterator.getRow(StreamingRowIterator.java:126)
at com.github.pjfanning.xlsx.impl.StreamingRowIterator.hasNext(StreamingRowIterator.java:627)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)

I also tried writing the dataframe to a Delta table and received the error below:

An error occurred while calling o3063.save.
: com.github.pjfanning.xlsx.exceptions.ParseException: Error reading XML stream
at com.github.pjfanning.xlsx.impl.StreamingRowIterator.getRow(StreamingRowIterator.java:126)
at com.github.pjfanning.xlsx.impl.StreamingRowIterator.hasNext(StreamingRowIterator.java:627)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)

The Excel file has 11 sheets; I'm trying to read data from only one sheet, which has 389,862 rows.
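The stack trace points into com.github.pjfanning.xlsx, the streaming reader that spark-excel only engages when `maxRowsInMemory` is set. A possible workaround (a sketch, not a confirmed fix; the sheet name and file path below are placeholders, not values from this issue) is to drop that option so spark-excel falls back to the non-streaming POI reader, at the cost of holding the whole sheet in memory:

```python
# Placeholder values standing in for the issue's parameters.
param_excel_sheet = "Sheet1"
param_excel_row_start = "A2"
file_path = "/mnt/raw/input.xlsx"  # hypothetical path

# Build the dataAddress string the same way the issue's code does.
data_address = "'" + param_excel_sheet + "'!" + param_excel_row_start

# The read itself needs a live Spark session, so it is shown commented out.
# Note there is no maxRowsInMemory option here, which should avoid the
# streaming reader where the ParseException is raised.
# df = spark.read.format("com.crealytics.spark.excel") \
#     .option("dataAddress", data_address) \
#     .option("header", False) \
#     .option("treatEmptyValuesAsNulls", True) \
#     .option("inferSchema", "false") \
#     .load(file_path)

print(data_address)
```

Whether this is viable depends on the executor having enough memory for a ~390k-row sheet; it trades the streaming failure for higher memory use.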

Expected Behavior

The resulting dataframe should display and write to a Delta table correctly.

Steps To Reproduce

Set the following parameters to your desired values:

param_excel_sheet = the Excel sheet name, e.g. Sheet1
param_excel_row_start = the starting cell, e.g. A2
param_mountPoint + param_in_adls_raw_path + param_in_file_name = the folder path including the filename

Then run the code below.

df = spark.read.format("com.crealytics.spark.excel") \
    .option("dataAddress", "'" + param_excel_sheet + "'!" + param_excel_row_start) \
    .option("header", False) \
    .option("treatEmptyValuesAsNulls", True) \
    .option("maxRowsInMemory", 20) \
    .option("inferSchema", "false") \
    .load(param_mountPoint + param_in_adls_raw_path + param_in_file_name)
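As a cross-check when reproducing, spark-excel 0.18.x also registers a V2 data source under the short name "excel"; running the same read through it can help narrow down whether the failure is specific to the V1 `com.crealytics.spark.excel` path. This is a sketch with placeholder values (sheet name and path are hypothetical), not a confirmed fix:

```python
# Placeholder values standing in for the issue's parameters.
param_excel_sheet = "Sheet1"
param_excel_row_start = "A2"

# Same dataAddress construction, using an f-string for readability.
data_address = f"'{param_excel_sheet}'!{param_excel_row_start}"

# Needs a live Spark session, so shown commented out.
# df = spark.read.format("excel") \
#     .option("dataAddress", data_address) \
#     .option("header", False) \
#     .option("inferSchema", "false") \
#     .load("/mnt/raw/input.xlsx")  # hypothetical path

print(data_address)
```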

Environment

- Spark version: 3.3.1
- Spark-Excel version: 3.3.1_0.18.5
- OS: Windows
- Cluster environment: Standard_DS12_v2

Anything else?

No response

Please check these potential duplicates:

  • [#712] [BUG] Cannot read files into dataframe in Databricks 9.1 LTS Runtime 3.1.2 Spark (70.56%)

  • [#682] [BUG] Cannot read files into dataframe in Databricks 11.3 LTS Runtime 3.3.0 Spark (65%)
    If this issue is a duplicate, please add any additional info to the ticket with the most information and close this one.
