crealytics / spark-excel

A Spark plugin for reading and writing Excel files

[BUG] No exception thrown if a schema is provided but there is no workbook/sheet (PDF with .xlsx extension)

dabruehl opened this issue · comments

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

If you store a PDF test.pdf with the extension .xlsx (i.e., rename it to test.xlsx) and read the file with a schema provided, you get an empty DataFrame, even though there is no workbook and no data to read.

Expected Behavior

If there is no workbook/sheet, the reader should throw an exception/error.

This is exactly what happens if you do not provide a schema:

options = {
    "header": "true",
    "encoding": "UTF-8",
    "dataAddress": "workbook:A1",
    "mode": "FAILFAST"
}
df = spark.read.format("excel").options(**options).load("PATH_TO_YOUR_FILE/FILENAME.xlsx")
df.show()

Py4JJavaError: An error occurred while calling o575.load.
: java.io.IOException: Can't open workbook - unsupported file type: PDF
at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:228)
at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
at com.crealytics.spark.excel.v2.ExcelHelper.getWorkbook(ExcelHelper.scala:120)
at com.crealytics.spark.excel.v2.ExcelHelper.getSheetData(ExcelHelper.scala:137)
at com.crealytics.spark.excel.v2.ExcelHelper.parseSheetData(ExcelHelper.scala:160)
at com.crealytics.spark.excel.v2.ExcelTable.infer(ExcelTable.scala:77)
at com.crealytics.spark.excel.v2.ExcelTable.inferSchema(ExcelTable.scala:48)
at org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:71)
at scala.Option.orElse(Option.scala:447)
at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:71)
at org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:65)
at org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:83)
at org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:81)
at org.apache.spark.sql.connector.catalog.Table.columns(Table.java:68)
at org.apache.spark.sql.execution.datasources.v2.FileTable.columns(FileTable.scala:37)
at org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2.inferSchema(FileDataSourceV2.scala:95)
at org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2.inferSchema$(FileDataSourceV2.scala:92)
at com.crealytics.spark.excel.v2.ExcelDataSource.inferSchema(ExcelDataSource.scala:27)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:99)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.loadV2Source(DataSourceV2Utils.scala:237)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:344)
at scala.Option.flatMap(Option.scala:271)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:342)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:397)
at py4j.Gateway.invoke(Gateway.java:306)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:199)
at py4j.ClientServerConnection.run(ClientServerConnection.java:119)
at java.lang.Thread.run(Thread.java:750)
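As a client-side workaround (not part of spark-excel; `looks_like_xlsx` is a hypothetical helper name), such files can be rejected before the read: .xlsx files are ZIP containers (they start with the ZIP `PK` signature), whereas PDF files start with `%PDF`, so a renamed PDF fails a simple container check:

```python
import zipfile

def looks_like_xlsx(path: str) -> bool:
    """Return True if the file is at least a ZIP container.

    Being a ZIP archive is necessary (though not sufficient) for a
    valid .xlsx workbook; a PDF renamed to .xlsx fails this check
    because PDF files start with %PDF, not the ZIP 'PK' signature.
    """
    return zipfile.is_zipfile(path)

# Example guard before handing the path to spark.read:
# if not looks_like_xlsx("PATH_TO_YOUR_FILE/FILENAME.xlsx"):
#     raise ValueError("Not a valid .xlsx container")
```

This only validates the container format; it does not confirm that the archive actually contains a workbook, so it complements rather than replaces the check the reader should perform.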

Steps To Reproduce

Store any PDF with the file extension .xlsx and then run the following code:

from pyspark.sql.types import *

schema = StructType([
    StructField("Code", DoubleType()),
    StructField("Text", StringType())
])
options = {
    "header": "true",
    "encoding": "UTF-8",
    "dataAddress": "workbook:A1",
    "mode": "FAILFAST"
}
df = spark.read.format("excel").schema(schema).options(**options).load("PATH_TO_YOUR_FILE/FILENAME.xlsx")
df.show()

+----+----+
|Code|Text|
+----+----+
+----+----+
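Until the reader validates the workbook when a schema is supplied, another possible workaround (a sketch; `read_excel_strict` is a hypothetical helper, not a spark-excel API) is to trigger schema inference first. The inference path opens the workbook and raises the `IOException` shown above for unsupported file types, so performing a schema-less read before the real one restores fail-fast behavior:

```python
def read_excel_strict(spark, path, schema, options):
    """Read an Excel file with an explicit schema, failing fast on
    invalid workbooks.

    The first read omits the schema, which forces spark-excel to infer
    one; inference opens the workbook and raises for unsupported file
    types (e.g. a PDF renamed to .xlsx). Only if that succeeds do we
    read again with the caller's schema.
    """
    # Schema-less read: forces inference, which opens the workbook.
    spark.read.format("excel").options(**options).load(path)
    # Actual read with the user-supplied schema.
    return spark.read.format("excel").schema(schema).options(**options).load(path)
```

The obvious cost is that the file is opened twice, so this is only a stopgap; the proper fix is for the reader to validate the workbook even when schema inference is skipped.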

Environment

- Spark version: 3.5.0
- Spark-Excel version: 0.20.2
- OS: Databricks Runtime 14.3 LTS
- Cluster environment

Anything else?

No response