[BUG] Cannot extract data from files with notes/comments
alejandro-jb22 opened this issue
Alejandro Box commented
Is there an existing issue for this?
- I have searched the existing issues
Current Behavior
Data extraction fails if the Excel file contains comments or notes. It does not matter whether the selected sheet contains the note: the read breaks if a comment/note is present anywhere in the file.
This is an example of a note:
This is the error message we get:
<command-2593316248496272> in <module>
14
15
---> 16 df = spark.read.format("com.crealytics.spark.excel")\
17 .option("header", True)\
18 .option("inferSchema", True)\
/databricks/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
156 self.options(**options)
157 if isinstance(path, str):
--> 158 return self._df(self._jreader.load(path))
159 elif path is not None:
160 if type(path) != list:
/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
1302
1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
1305 answer, self.gateway_client, self.target_id, self.name)
1306
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
115 def deco(*a, **kw):
116 try:
--> 117 return f(*a, **kw)
118 except py4j.protocol.Py4JJavaError as e:
119 converted = convert_exception(e.java_exception)
/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o604.load.
: java.lang.NoClassDefFoundError: shadeio/poi/schemas/vmldrawing/XmlDocument
at shadeio.poi.xssf.usermodel.XSSFVMLDrawing.read(XSSFVMLDrawing.java:135)
at shadeio.poi.xssf.usermodel.XSSFVMLDrawing.<init>(XSSFVMLDrawing.java:123)
at shadeio.poi.ooxml.POIXMLFactory.createDocumentPart(POIXMLFactory.java:61)
at shadeio.poi.ooxml.POIXMLDocumentPart.read(POIXMLDocumentPart.java:661)
at shadeio.poi.ooxml.POIXMLDocumentPart.read(POIXMLDocumentPart.java:678)
at shadeio.poi.ooxml.POIXMLDocument.load(POIXMLDocument.java:165)
at shadeio.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:259)
at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:118)
at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:98)
at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.create(XSSFWorkbookFactory.java:36)
at shadeio.poi.ss.usermodel.WorkbookFactory.lambda$create$2(WorkbookFactory.java:224)
at shadeio.poi.ss.usermodel.WorkbookFactory.wp(WorkbookFactory.java:329)
at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:224)
at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:185)
at com.crealytics.spark.excel.DefaultWorkbookReader.$anonfun$openWorkbook$1(WorkbookReader.scala:55)
at scala.Option.fold(Option.scala:251)
at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:55)
at com.crealytics.spark.excel.WorkbookReader.withWorkbook(WorkbookReader.scala:16)
at com.crealytics.spark.excel.WorkbookReader.withWorkbook$(WorkbookReader.scala:15)
at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:50)
at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:32)
at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:32)
at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:104)
at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:103)
at com.crealytics.spark.excel.ExcelRelation.$anonfun$inferSchema$1(ExcelRelation.scala:172)
at scala.Option.getOrElse(Option.scala:189)
at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:171)
at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:36)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:36)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:13)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:385)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:356)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:323)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:323)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:236)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
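The trace fails inside XSSFVMLDrawing.read, i.e. while POI loads a VML drawing part of the workbook. To confirm that a given file actually carries such parts before sending it to Spark, something like the sketch below may help. It relies on an assumption about the usual OOXML package layout (xl/comments*.xml, xl/threadedComments/ and xl/drawings/vmlDrawing*.vml); `find_note_parts` is a hypothetical helper name, not part of spark-excel or POI. Note that VML drawings can also come from other features (e.g. form controls), so a hit is a hint rather than proof of a note.

```python
import re
import zipfile

# Assumed part-name patterns for notes/comments in a standard .xlsx package.
_NOTE_PART = re.compile(
    r"^xl/(comments\d*\.xml|threadedComments/.+\.xml|drawings/vmlDrawing\d*\.vml)$"
)


def find_note_parts(xlsx):
    """Return the sorted names of comment/note-related parts in an .xlsx file.

    `xlsx` may be a filesystem path or a binary file-like object,
    since an .xlsx file is an ordinary ZIP archive.
    """
    with zipfile.ZipFile(xlsx) as archive:
        return sorted(name for name in archive.namelist() if _NOTE_PART.match(name))
```

Run against a local copy of the workbook, e.g. `find_note_parts("Notes bug.xlsx")`; an empty list would suggest the file has no comment or VML parts under the assumed layout.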
Expected Behavior
Data should be extracted and notes/comments ignored.
Steps To Reproduce
df = spark.read.format("com.crealytics.spark.excel")\
    .option("header", True)\
    .option("inferSchema", True)\
    .option("dataAddress", "'Sheet1'!B2:D4")\
    .load("Notes bug.xlsx")
Environment
- Spark version: 3.2.1
- Spark-Excel version: 2.12:3.2.1_0.17.1
- Python version: 3
- Databricks 10.3 on Azure
Anything else?
No response
PJ Fanning commented
#620 is an attempt to fix issues like this
PJ Fanning commented
It may not work with this data source, but try setting .option("maxRowsInMemory", 100). This may cause the code to use a different code path that does not read the comments.
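Applied to the reproduction steps from the report, the suggestion might look like the sketch below. The wrapper name `read_excel_streaming` and its parameters are made up for illustration; only the format name and the option keys come from this thread. Whether this actually avoids the failure is not confirmed here.

```python
def read_excel_streaming(spark, path, data_address, max_rows_in_memory=100):
    """Read an Excel range with maxRowsInMemory set (hypothetical helper).

    `spark` is an existing SparkSession with spark-excel on the classpath.
    """
    return (
        spark.read.format("com.crealytics.spark.excel")
        .option("header", True)
        .option("inferSchema", True)
        .option("dataAddress", data_address)
        # Suggested workaround: may switch the reader to a different code
        # path that does not load the comment parts of the workbook.
        .option("maxRowsInMemory", max_rows_in_memory)
        .load(path)
    )
```

Usage with the values from the report would be `df = read_excel_streaming(spark, "Notes bug.xlsx", "'Sheet1'!B2:D4")`.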
Alejandro Box commented
Martin Mauch commented
@alejandro-jb22 could you try 0.18.0-beta2 without maxRowsInMemory?