crealytics / spark-excel

A Spark plugin for reading and writing Excel files

[BUG] Data is not being read when using the streaming approach.

kvirund opened this issue · comments

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

When I read from an Excel file using the streaming Excel reader (with the maxRowsInMemory option set), no data is read from the file. This happens for Excel files whose dimension record contains an open-ended data address, for example A1.

It looks like the problem is in this method:

  private def rowIndices(sheet: Sheet): Range =
    (math.max(dataAddress.getFirstCell.getRow, sheet.getFirstRowNum) to
      math.min(dataAddress.getLastCell.getRow, sheet.getLastRowNum))

When the dimension record doesn't specify the bottom-right end, sheet.getLastRowNum has a default value of 0, which cuts off all the rows in the sheet.
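To illustrate, here is a standalone sketch with the row arithmetic inlined (the real method reads these values from a POI Sheet and an AreaReference, which are not reproduced here). When the streaming sheet reports getLastRowNum == 0, the min() clamps the range down to the first row:

```scala
// Standalone sketch of the row-range arithmetic in rowIndices.
// The row indices are passed in directly so the clamping effect is easy to see.
object RowRangeDemo {
  def rowIndices(addressFirstRow: Int, addressLastRow: Int,
                 sheetFirstRowNum: Int, sheetLastRowNum: Int): Range =
    math.max(addressFirstRow, sheetFirstRowNum) to
      math.min(addressLastRow, sheetLastRowNum)

  def main(args: Array[String]): Unit = {
    // Normal file: dimension "A1:C1000", so getLastRowNum == 999.
    println(rowIndices(0, 999, 0, 999).size) // 1000 rows

    // Open-ended dimension "A1": the streaming sheet reports
    // getLastRowNum == 0, and min(999, 0) clamps the range to row 0.
    println(rowIndices(0, 999, 0, 0).size) // 1 row -- the data is lost
  }
}
```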

I would suggest fixing it like this:

  private def rowIndices(sheet: Sheet): Range =
    (math.max(dataAddress.getFirstCell.getRow, sheet.getFirstRowNum) to
      dataAddress.getLastCell.getRow)

because it is not always possible to determine the last row number from the dimension field.
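A more conservative variant might keep the clamp for ordinary sheets and only fall back to the data address when the sheet cannot report a last row. This is just a sketch against a stubbed Sheet interface (POI is not assumed here), and the zero check is a hypothetical way to detect a streaming sheet that never learned its true extent:

```scala
object GuardedFix {
  // Stand-in for POI's Sheet; only the two methods used here are stubbed.
  trait SheetLike {
    def getFirstRowNum: Int
    def getLastRowNum: Int
  }

  def rowIndices(addressFirstRow: Int, addressLastRow: Int, sheet: SheetLike): Range = {
    val first = math.max(addressFirstRow, sheet.getFirstRowNum)
    val last =
      if (sheet.getLastRowNum == 0) addressLastRow // open-ended dimension: trust the data address
      else math.min(addressLastRow, sheet.getLastRowNum)
    first to last
  }
}
```

The trade-off is that dropping min() unconditionally would also make normal files scan up to the data address's last row even when the sheet ends earlier, whereas this guard preserves the current behavior for sheets that do report their extent.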

Expected Behavior

All data is supposed to be read.

Steps To Reproduce

Do a simple read of an Excel file generated with the streaming-excel-reader library, filled with random data.

The file it generates contains the value A1 in its dimension record.
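For reference, the read that triggers this looks roughly like the following. This is a hypothetical reproduction, not a runnable test: it requires Spark and spark-excel on the classpath, and the file path is a placeholder.

```scala
// Hypothetical reproduction against spark-excel's V2 "excel" data source.
val df = spark.read
  .format("excel")
  .option("header", "true")
  .option("maxRowsInMemory", "20") // forces the streaming reader path
  .load("/path/to/generated-file.xlsx")

// When the bug triggers, the count is far lower than the file's actual row count.
df.count()
```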

Environment

- Spark version: 3.1.3
- Spark-Excel version: 0.18.5
- OS: MacOS

Anything else?

No response

Hi, can you provide a small Excel file that demonstrates this behavior?

Sure. This one should work. It is reproducible with any file generated by the Streaming Excel Workbook library when the file is then read with the maxRowsInMemory parameter (i.e., through the same library); Spark-Excel just doesn't account for that library's behavior of leaving getLastRowNum at zero.

excel-files_incremental-size_excel-1MB.xlsx