crealytics / spark-excel

A Spark plugin for reading and writing Excel files

[BUG] Error reading XML stream when using streaming reader and V2

christianknoepfle opened this issue · comments

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Loading a larger xlsx (actually the one provided in #322) fails with

  com.github.pjfanning.xlsx.exceptions.ParseException: Error reading XML stream
    at com.github.pjfanning.xlsx.impl.StreamingRowIterator.getRow(StreamingRowIterator.java:126)
    at com.github.pjfanning.xlsx.impl.StreamingRowIterator.hasNext(StreamingRowIterator.java:626)
    at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)

when using

  spark.read.format("excel")
    .option("path", "src/test/resources/v2readwritetest/large_excel/largefile-wide-single-sheet.xlsx")
    .option("header", value = false)
    .option("maxRowsInMemory", "200")
    .option("inferSchema", false)
    .load()

Expected Behavior

The file loads without an error :)

Steps To Reproduce

The unit test is here:

The file is here:

Environment

- Spark version: 3.2.2
- Spark-Excel version: 0.18.0
- OS: Windows / Ubuntu (WSL2)
- Cluster environment

Anything else?

First I thought this was an issue with the streaming reader itself, but I checked the file with the latest streaming reader code on GitHub and it worked fine.

After some debugging I found the issue.

ExcelPartitionReaderFactory.readFile calls ExcelHelper.getRows(), where the workbook gets closed too early. If I simply remove the "finally workbook.close()", the file loads fine, but then the workbook stays open, so that is not the right solution either. I also tried scala.util.Using, but that didn't help (see the sketch below).
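To illustrate why Using falls short (a sketch only, assuming Scala 2.13's scala.util.Using and the getWorkbook/DataLocator helpers shown further down; getRowsWithUsing is a made-up name): the resource is released as soon as the block returns, while readFrom hands back a lazy iterator that is only consumed afterwards.

  import scala.util.Using

  def getRowsWithUsing(conf: Configuration, uri: URI): Iterator[Vector[Cell]] =
    Using.resource(getWorkbook(conf, uri)) { workbook =>
      // the iterator escapes this scope, but Using closes the workbook
      // right here - the same early-close problem as try/finally
      DataLocator(options).readFrom(workbook)
    }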

I think the best solution would be for ExcelPartitionReaderFactory.buildReader to return a wrapped PartitionReaderWithPartitionValues that takes care of the workbook.close().
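Roughly something like this (a sketch only; the wrapper name and the closeWorkbook callback are made up, while PartitionReader is the standard Spark DSv2 interface):

  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.connector.read.PartitionReader

  // hypothetical wrapper: delegates row access to the underlying reader and
  // closes the workbook together with it when Spark closes the partition
  class WorkbookClosingReader(
      underlying: PartitionReader[InternalRow],
      closeWorkbook: () => Unit)
    extends PartitionReader[InternalRow] {

    override def next(): Boolean = underlying.next()
    override def get(): InternalRow = underlying.get()
    override def close(): Unit =
      try underlying.close()
      finally closeWorkbook()
  }

Since Spark calls close() on the partition reader once the task is done with it, the workbook would live exactly as long as the partition is being read.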

I will try a few things and then create a PR to open a discussion on how to fix this. If you have any ideas on how to fix it, please let me know.

Maybe the change should be from:

  def getRows(conf: Configuration, uri: URI): Iterator[Vector[Cell]] = {
    val workbook = getWorkbook(conf, uri)
    val excelReader = DataLocator(options)
    try { excelReader.readFrom(workbook) }
    finally workbook.close()
  }

to

  import scala.util.control.NonFatal

  def getRows(conf: Configuration, uri: URI): CloseableIterator[Vector[Cell]] = {
    val workbook = getWorkbook(conf, uri)
    val excelReader = DataLocator(options)
    // on success, wrap the iterator so the caller can close the workbook later
    try { new CloseableIterator(excelReader.readFrom(workbook), workbook) }
    catch {
      case NonFatal(t) =>
        // close eagerly only if reading fails before the iterator is handed out
        workbook.close()
        throw t
    }
  }

So CloseableIterator would be a custom class that wraps the existing Iterator but keeps a reference to the workbook instance and closes that workbook at the end. That 'at the end' bit is possibly tricky: either the code using the iterator calls 'close' when it has finished iterating, or the custom iterator works out the best time to autoclose itself. I think the former approach would probably be easier.
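A minimal sketch of what that wrapper could look like (the constructor shape is an assumption, chosen to match the proposed getRows change above):

  import java.io.Closeable
  import org.apache.poi.ss.usermodel.Workbook

  // sketch: keeps the workbook alive while rows are consumed; the caller
  // closes it explicitly once iteration is finished
  class CloseableIterator[T](underlying: Iterator[T], workbook: Workbook)
      extends Iterator[T] with Closeable {
    override def hasNext: Boolean = underlying.hasNext
    override def next(): T = underlying.next()
    override def close(): Unit = workbook.close()
  }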

I think fixing this issue properly would need a major refactor, which is going to be really hard.

I'm not 100% sure that not closing the workbooks is the end of the world. Only the streaming workbooks need to have their closing delayed. One hack, but ultimately one with a lower code-rewriting effort, would be to create a Scala object and just register the streaming workbooks that are created with it, as sketched below. This object could have a function to close all the registered workbooks, which the user can call when they are finished using the Excel readers.
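A rough sketch of that registry object (all names are made up, nothing here is existing spark-excel API):

  import org.apache.poi.ss.usermodel.Workbook
  import scala.collection.mutable
  import scala.util.Try

  // sketch: streaming workbooks register themselves on creation and are
  // closed in one sweep when the user is done with the Excel readers
  object StreamingWorkbookRegistry {
    private val workbooks = mutable.ListBuffer.empty[Workbook]

    def register(wb: Workbook): Unit = synchronized { workbooks += wb }

    def closeAll(): Unit = synchronized {
      workbooks.foreach(wb => Try(wb.close())) // best-effort close
      workbooks.clear()
    }
  }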