[BUG] Trailing columns that are empty in the first line are not read from .xlsx
mand35 opened this issue · comments
Is there an existing issue for this?
- I have searched the existing issues
Current Behavior
Reading an .xlsx file whose first read line (or first considered line, if the first line is completely empty) contains empty columns cuts off the remaining columns after the last non-empty value in that line, even when a specific range is given and headers are not being read.
+------+------+------+------+------+
|_c0 |_c1 |_c2 |_c3 |_c4 |
+------+------+------+------+------+
|null |null |cat_a |null |cat_b |
|name_a|name_b|name_c|name_d|name_e|
|1_1 |2_1 |3_1 |4_1 |5_1 |
|1_2 |2_2 |3_2 |4_2 |5_2 |
|1_3 |2_3 |3_3 |4_3 |5_3 |
|1_4 |2_4 |3_4 |4_4 |5_4 |
|1_5 |2_5 |3_5 |4_5 |5_5 |
|1_6 |2_6 |3_6 |4_6 |5_6 |
|1_7 |2_7 |3_7 |4_7 |5_7 |
+------+------+------+------+------+
Background: I want to process the first two lines to get a consolidated header. Reading only the data block starting from line 4 works as expected.
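For context, the header consolidation I am after looks roughly like this (a minimal plain-Scala sketch, independent of Spark; the object and method names are made up for illustration, and `categories`/`names` stand in for the first two rows as read from the sheet):

```scala
object ConsolidateHeader {
  // Combine a sparse category row with a dense name row into one header.
  // Empty category cells inherit the last category seen to their left
  // (forward fill), then each category is joined with the name below it.
  def consolidate(categories: Seq[Option[String]], names: Seq[String]): Seq[String] = {
    // Forward-fill the category row.
    val filled = categories
      .scanLeft(Option.empty[String])((last, cur) => cur.orElse(last))
      .tail
    filled.zip(names).map {
      case (Some(cat), name) => s"${cat}_$name"
      case (None, name)      => name
    }
  }

  def main(args: Array[String]): Unit = {
    // The two header rows from the example table above.
    val categories = Seq(None, None, Some("cat_a"), None, Some("cat_b"))
    val names      = Seq("name_a", "name_b", "name_c", "name_d", "name_e")
    println(consolidate(categories, names).mkString(","))
    // prints: name_a,name_b,cat_a_name_c,cat_a_name_d,cat_b_name_e
  }
}
```

This only works if all five columns actually arrive, which is exactly what the bug breaks when the trailing cells of the first row are empty.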
This is similar to #631 and #366
Expected Behavior
Reading the full specified range of data.
Steps To Reproduce
val df = spark.read
.format("excel")
.option("dataAddress", "'Sheet1'!A2:G10")
.option("header", "false") // Required
.load("test.xlsx")
df.show(false)
Environment
- Spark version: 3.3.2
- Spark-Excel version: 0.18.6-beta1
- OS: Windows 10E with IntelliJ
- Cluster environment
Anything else?
No response
Same issue here.
The problem seems to be caused by the filters at
and
The implementation of getLastCellNum() from Apache POI ignores the last cells of a row if they are empty, which is why those columns are removed from the row iterator even though dataAddress explicitly includes them. One possible fix is to not filter out empty cells when they fall within the specified dataAddress; another option is to take the number of columns in the schema into account.
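To illustrate the effect in isolation (a plain-Scala sketch, not spark-excel or POI code; `truncateTrailingEmpty` and `padToWidth` are hypothetical names): taking a row's width from the last non-empty cell, as `getLastCellNum()` effectively does, drops trailing empty cells, and the proposed fix amounts to padding each row back out to the width implied by the dataAddress range:

```scala
object PadRows {
  // Mimics the current behavior: a row's width is derived from the
  // last non-empty cell, so trailing empty cells disappear.
  def truncateTrailingEmpty(row: Seq[String]): Seq[String] =
    row.reverse.dropWhile(_.isEmpty).reverse

  // Sketch of the proposed fix: pad every row out to the width of the
  // requested dataAddress range, so trailing columns survive.
  def padToWidth(row: Seq[String], width: Int): Seq[String] =
    row.padTo(width, "")

  def main(args: Array[String]): Unit = {
    val requestedWidth = 5                 // e.g. a five-column dataAddress
    val raw = Seq("", "", "cat_a", "", "") // first row ends in empty cells
    val seen = truncateTrailingEmpty(raw)  // what the reader currently yields
    println(seen.length)                              // 3 -- last columns lost
    println(padToWidth(seen, requestedWidth).length)  // 5 -- full range restored
  }
}
```

In spark-excel itself, the width would come from the parsed dataAddress (or from the schema, as suggested above) rather than from a constant.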
Hey @cegaspar, thanks for the analysis!
Would you mind giving a PR a try?
Hey @nightscape.
Sure, no problem.