[BUG] Cannot read/write dataframe after loading file in Databricks 12.1 Runtime 3.3.1 Spark
jmichaelsoliven opened this issue · comments
Is there an existing issue for this?
- I have searched the existing issues
Current Behavior
When running the PySpark code below in Databricks Runtime 12.1 (Spark 3.3.1):
```python
df = (
    spark.read.format("com.crealytics.spark.excel")
    .option("dataAddress", "'" + param_excel_sheet + "'!" + param_excel_row_start)
    .option("header", False)
    .option("treatEmptyValuesAsNulls", True)
    .option("maxRowsInMemory", 20)
    .option("inferSchema", "false")
    .load(param_mountPoint + param_in_adls_raw_path + param_in_file_name)
)
df.show(truncate=False)
```
I received the following error:
```
An error occurred while calling o3150.showString.
: com.github.pjfanning.xlsx.exceptions.ParseException: Error reading XML stream
	at com.github.pjfanning.xlsx.impl.StreamingRowIterator.getRow(StreamingRowIterator.java:126)
	at com.github.pjfanning.xlsx.impl.StreamingRowIterator.hasNext(StreamingRowIterator.java:627)
	at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
```
I also tried writing the dataframe to a Delta table and received the error below:
```
An error occurred while calling o3063.save.
: com.github.pjfanning.xlsx.exceptions.ParseException: Error reading XML stream
	at com.github.pjfanning.xlsx.impl.StreamingRowIterator.getRow(StreamingRowIterator.java:126)
	at com.github.pjfanning.xlsx.impl.StreamingRowIterator.hasNext(StreamingRowIterator.java:627)
	at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:513)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
```
The Excel workbook has 11 sheets; I'm trying to read only one sheet, which has 389,862 rows.
Expected Behavior
The resulting dataframe should display and write to a Delta table correctly.
Steps To Reproduce
Set the following parameters to your desired values:
- `param_excel_sheet` = Excel sheet name, e.g. `Sheet1`
- `param_excel_row_start` = starting cell, e.g. `A2`
- `param_mountPoint + param_in_adls_raw_path + param_in_file_name` = folder path including the filename
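For concreteness, a hypothetical parameter setup could look like the sketch below. All values are placeholders I'm inventing for illustration, not the ones from the original report:

```python
# Hypothetical placeholder values -- substitute your own.
param_excel_sheet = "Sheet1"        # sheet name inside the workbook
param_excel_row_start = "A2"        # first cell of the data range
param_mountPoint = "/mnt/raw"       # Databricks mount point (placeholder)
param_in_adls_raw_path = "/excel/"  # folder inside the mount (placeholder)
param_in_file_name = "report.xlsx"  # workbook file name (placeholder)

# dataAddress takes the form 'SheetName'!StartCell, e.g. 'Sheet1'!A2
data_address = "'" + param_excel_sheet + "'!" + param_excel_row_start
full_path = param_mountPoint + param_in_adls_raw_path + param_in_file_name
```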
Then run the code below:
```python
df = (
    spark.read.format("com.crealytics.spark.excel")
    .option("dataAddress", "'" + param_excel_sheet + "'!" + param_excel_row_start)
    .option("header", False)
    .option("treatEmptyValuesAsNulls", True)
    .option("maxRowsInMemory", 20)
    .option("inferSchema", "false")
    .load(param_mountPoint + param_in_adls_raw_path + param_in_file_name)
)
```
Environment
- Spark version: 3.3.1
- Spark-Excel version: 3.3.1_0.18.5
- OS: Windows
- Cluster environment: Standard_DS12_v2
Anything else?
No response
Please check these potential duplicates:
- [#712] [BUG] Cannot read files into dataframe in Databricks 9.1 LTS Runtime 3.1.2 Spark (70.56%)
- [#682] [BUG] Cannot read files into dataframe in Databricks 11.3 LTS Runtime 3.3.0 Spark (65%)

If this issue is a duplicate, please add any additional info to the ticket with the most information and close this one.