crealytics / spark-excel

A Spark plugin for reading and writing Excel files

[BUG] Data is not being read when using the streaming approach.

kvirund opened this issue · comments

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

When I read from an Excel file using the streaming Excel reader (with the maxRowsInMemory option set), no data is read from the file. This happens for Excel files whose dimension record contains an open-ended data address, for example A1.

It looks like the problem is in this method:

  private def rowIndices(sheet: Sheet): Range =
    (math.max(dataAddress.getFirstCell.getRow, sheet.getFirstRowNum) to
      math.min(dataAddress.getLastCell.getRow, sheet.getLastRowNum))

When the dimension record doesn't specify the bottom-right end, sheet.getLastRowNum has a default value of 0, which cuts off all the rows in the sheet.
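To illustrate, here is a standalone sketch with the row arithmetic inlined (the real method reads these values from a POI Sheet and an AreaReference, which are not reproduced here). When the streaming sheet reports getLastRowNum == 0, the min() clamps the range down to the first row:

```scala
// Standalone sketch of the row-range arithmetic in rowIndices.
// The row indices are passed in directly so the clamping effect is easy to see.
object RowRangeDemo {
  def rowIndices(addressFirstRow: Int, addressLastRow: Int,
                 sheetFirstRowNum: Int, sheetLastRowNum: Int): Range =
    math.max(addressFirstRow, sheetFirstRowNum) to
      math.min(addressLastRow, sheetLastRowNum)

  def main(args: Array[String]): Unit = {
    // Normal file: dimension "A1:C1000", so getLastRowNum == 999.
    println(rowIndices(0, 999, 0, 999).size) // 1000 rows

    // Open-ended dimension "A1": the streaming sheet reports
    // getLastRowNum == 0, and min(999, 0) clamps the range to row 0.
    println(rowIndices(0, 999, 0, 0).size) // 1 row -- the data is lost
  }
}
```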

I would suggest fixing it like this:

  private def rowIndices(sheet: Sheet): Range =
    (math.max(dataAddress.getFirstCell.getRow, sheet.getFirstRowNum) to
      dataAddress.getLastCell.getRow)

because it is not always possible to determine the last row number from the dimension field.
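A more conservative variant might keep the clamp for ordinary sheets and only fall back to the data address when the sheet cannot report a last row. This is just a sketch against a stubbed Sheet interface (POI is not assumed here), and the zero check is a hypothetical way to detect a streaming sheet that never learned its true extent:

```scala
object GuardedFix {
  // Stand-in for POI's Sheet; only the two methods used here are stubbed.
  trait SheetLike {
    def getFirstRowNum: Int
    def getLastRowNum: Int
  }

  def rowIndices(addressFirstRow: Int, addressLastRow: Int, sheet: SheetLike): Range = {
    val first = math.max(addressFirstRow, sheet.getFirstRowNum)
    val last =
      if (sheet.getLastRowNum == 0) addressLastRow // open-ended dimension: trust the data address
      else math.min(addressLastRow, sheet.getLastRowNum)
    first to last
  }
}
```

The trade-off is that dropping min() unconditionally would also make normal files scan up to the data address's last row even when the sheet ends earlier, whereas this guard preserves the current behavior for sheets that do report their extent.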

Expected Behavior

All data is supposed to be read.

Steps To Reproduce

Do a simple read of an Excel file generated with the streaming-excel-reader library, filled with random data.

The file it generates contains the value A1 in its dimension record.
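For reference, the read that triggers this looks roughly like the following. This is a hypothetical reproduction, not a runnable test: it requires Spark and spark-excel on the classpath, and the file path is a placeholder.

```scala
// Hypothetical reproduction against spark-excel's V2 "excel" data source.
val df = spark.read
  .format("excel")
  .option("header", "true")
  .option("maxRowsInMemory", "20") // forces the streaming reader path
  .load("/path/to/generated-file.xlsx")

// When the bug triggers, the count is far lower than the file's actual row count.
df.count()
```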

Environment

- Spark version: 3.1.3
- Spark-Excel version: 0.18.5
- OS: MacOS

Anything else?

No response

Hi, can you provide a small Excel file that demonstrates this behavior?

Sure. This one should work. It is reproducible with any file generated by the Streaming Excel Workbook library when the file is then read with the maxRowsInMemory parameter (i.e., through the same library); Spark-Excel just doesn't account for that library's behavior of leaving getLastRowNum at zero.

excel-files_incremental-size_excel-1MB.xlsx