crealytics / spark-excel

A Spark plugin for reading and writing Excel files

[BUG] Cannot extract data from files with notes/comments

alejandro-jb22 opened this issue

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Data extraction fails if the Excel file contains comments or notes. It does not matter whether the selected sheet contains the note: the read breaks if a comment/note is present anywhere in the file.

This is an example of a note:

image

This is the error message we get:

<command-2593316248496272> in <module>
     14 
     15 
---> 16 df = spark.read.format("com.crealytics.spark.excel")\
     17     .option("header", True)\
     18     .option("inferSchema", True)\

/databricks/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
    156         self.options(**options)
    157         if isinstance(path, str):
--> 158             return self._df(self._jreader.load(path))
    159         elif path is not None:
    160             if type(path) != list:

/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1302 
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306 

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    115     def deco(*a, **kw):
    116         try:
--> 117             return f(*a, **kw)
    118         except py4j.protocol.Py4JJavaError as e:
    119             converted = convert_exception(e.java_exception)

/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o604.load.
: java.lang.NoClassDefFoundError: shadeio/poi/schemas/vmldrawing/XmlDocument
	at shadeio.poi.xssf.usermodel.XSSFVMLDrawing.read(XSSFVMLDrawing.java:135)
	at shadeio.poi.xssf.usermodel.XSSFVMLDrawing.<init>(XSSFVMLDrawing.java:123)
	at shadeio.poi.ooxml.POIXMLFactory.createDocumentPart(POIXMLFactory.java:61)
	at shadeio.poi.ooxml.POIXMLDocumentPart.read(POIXMLDocumentPart.java:661)
	at shadeio.poi.ooxml.POIXMLDocumentPart.read(POIXMLDocumentPart.java:678)
	at shadeio.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:165)
	at shadeio.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:259)
	at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:118)
	at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:98)
	at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:36)
	at shadeio.poi.ss.usermodel.WorkbookFactory.lambda$create$2(WorkbookFactory.java:224)
	at shadeio.poi.ss.usermodel.WorkbookFactory.wp(WorkbookFactory.java:329)
	at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:224)
	at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
	at com.crealytics.spark.excel.DefaultWorkbookReader.$anonfun$openWorkbook$1(WorkbookReader.scala:55)
	at scala.Option.fold(Option.scala:251)
	at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:55)
	at com.crealytics.spark.excel.WorkbookReader.withWorkbook(WorkbookReader.scala:16)
	at com.crealytics.spark.excel.WorkbookReader.withWorkbook$(WorkbookReader.scala:15)
	at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:50)
	at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:32)
	at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:32)
	at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:104)
	at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:103)
	at com.crealytics.spark.excel.ExcelRelation.$anonfun$inferSchema$1(ExcelRelation.scala:172)
	at scala.Option.getOrElse(Option.scala:189)
	at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:171)
	at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:36)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:36)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:13)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:385)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:356)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:323)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:323)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:236)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
	at py4j.Gateway.invoke(Gateway.java:295)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:251)
	at java.lang.Thread.run(Thread.java:748)

Expected Behavior

Data should be extracted and notes/comments ignored.

Steps To Reproduce

Notes bug.xlsx

df = spark.read.format("com.crealytics.spark.excel")\
    .option("header", True)\
    .option("inferSchema", True)\
    .option("dataAddress", "'Sheet1'!B2:D4")\
    .load("Notes bug.xlsx")
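
Since the failure depends on a note being present anywhere in the workbook, it can help to check a file before loading it. The following is a minimal sketch, assuming the usual Excel part names (`xl/comments*.xml` for notes and `xl/drawings/vmlDrawing*.vml` for the legacy drawings that `XSSFVMLDrawing` parses in the trace above). The helper name `has_notes_or_comments` is hypothetical, and files written by other tools may name these parts differently.

```python
import io
import zipfile


def has_notes_or_comments(xlsx):
    """Return True if the .xlsx package appears to contain notes/comments.

    `xlsx` is a path or a binary file-like object. The check relies on the
    conventional part names Excel uses (an assumption, not a guarantee).
    """
    with zipfile.ZipFile(xlsx) as package:
        names = package.namelist()
    return any(
        name.startswith("xl/comments")
        or name.startswith("xl/drawings/vmlDrawing")
        for name in names
    )
```

For example, `has_notes_or_comments("Notes bug.xlsx")` should return `True` for a file like the attached one, assuming it follows the naming above.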

Environment

- Spark version: 3.2.1
- Spark-Excel version: 2.12:3.2.1_0.17.1
- Python version: 3
- Databricks 10.3 on Azure

Anything else?

No response

#620 is an attempt to fix issues like this

It may not work with this data source, but try setting .option("maxRowsInMemory", 100). This may cause the code to use a different code path that does not read the comments.
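
A sketch of the suggested workaround applied to the reproduction above. The wrapper name `read_excel` and its parameters are hypothetical; only the format name and the option keys come from this thread. Whether `maxRowsInMemory` actually avoids the comment-parsing path is the suggestion being tested here, not a guarantee.

```python
def read_excel(spark, path, data_address, max_rows_in_memory=None):
    """Read an Excel range with spark-excel, optionally setting maxRowsInMemory.

    `spark` is an existing SparkSession with the spark-excel package installed.
    """
    options = {
        "header": True,
        "inferSchema": True,
        "dataAddress": data_address,
    }
    # Only set the option when requested, so the default code path stays
    # available for comparison.
    if max_rows_in_memory is not None:
        options["maxRowsInMemory"] = max_rows_in_memory
    return (
        spark.read.format("com.crealytics.spark.excel")
        .options(**options)
        .load(path)
    )
```

With the file from the report, the call would be `read_excel(spark, "Notes bug.xlsx", "'Sheet1'!B2:D4", max_rows_in_memory=100)`.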

Tested and seems to be working:

image

It works as a workaround for now, thanks 👍

@alejandro-jb22 could you try 0.18.0-beta2 without maxRowsInMemory?