[BUG] Trailing columns that are empty in the first line are not read from .xlsx
mand35 opened this issue · comments
Is there an existing issue for this?
- I have searched the existing issues
Current Behavior
Reading an .xlsx file whose first read line (or first considered line, if the first line is completely empty) contains empty columns cuts off the remaining columns after the last non-empty value in that line, even when a specific range is given and headers are not being read.
+------+------+------+------+------+
|_c0 |_c1 |_c2 |_c3 |_c4 |
+------+------+------+------+------+
|null |null |cat_a |null |cat_b |
|name_a|name_b|name_c|name_d|name_e|
|1_1 |2_1 |3_1 |4_1 |5_1 |
|1_2 |2_2 |3_2 |4_2 |5_2 |
|1_3 |2_3 |3_3 |4_3 |5_3 |
|1_4 |2_4 |3_4 |4_4 |5_4 |
|1_5 |2_5 |3_5 |4_5 |5_5 |
|1_6 |2_6 |3_6 |4_6 |5_6 |
|1_7 |2_7 |3_7 |4_7 |5_7 |
+------+------+------+------+------+
Background: I want to process the first two lines to get a consolidated header. Reading only the data block starting from line 4 works as expected.
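For context, the header consolidation I am after looks roughly like this (a minimal plain-Scala sketch, independent of Spark; the object and method names are made up for illustration, and `categories`/`names` stand in for the first two rows as read from the sheet):

```scala
object ConsolidateHeader {
  // Combine a sparse category row with a dense name row into one header.
  // Empty category cells inherit the last category seen to their left
  // (forward fill), then each category is joined with the name below it.
  def consolidate(categories: Seq[Option[String]], names: Seq[String]): Seq[String] = {
    // Forward-fill the category row.
    val filled = categories
      .scanLeft(Option.empty[String])((last, cur) => cur.orElse(last))
      .tail
    filled.zip(names).map {
      case (Some(cat), name) => s"${cat}_$name"
      case (None, name)      => name
    }
  }

  def main(args: Array[String]): Unit = {
    // The two header rows from the example table above.
    val categories = Seq(None, None, Some("cat_a"), None, Some("cat_b"))
    val names      = Seq("name_a", "name_b", "name_c", "name_d", "name_e")
    println(consolidate(categories, names).mkString(","))
    // prints: name_a,name_b,cat_a_name_c,cat_a_name_d,cat_b_name_e
  }
}
```

This only works if all five columns actually arrive, which is exactly what the bug breaks when the trailing cells of the first row are empty.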
This is similar to #631 and #366
Expected Behavior
Reading the full specified range of data.
Steps To Reproduce
val df = spark.read
.format("excel")
.option("dataAddress", "'Sheet1'!A2:G10")
.option("header", "false") // Required
.load("test.xlsx")
df.show(false)
Environment
- Spark version: 3.3.2
- Spark-Excel version: 0.18.6-beta1
- OS: Windows 10E with IntelliJ
- Cluster environment
Anything else?
No response
Same issue here.
The problem seems to be caused by the filters at
and
The implementation of getLastCellNum() from Apache POI ignores the last cells of a row if they are empty, which is why those columns are removed from the row iterator even though dataAddress explicitly includes them. One possible fix is to not filter out empty cells when they fall within the specified dataAddress; another option is to take the number of columns in the schema into account.
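To illustrate the effect in isolation (a plain-Scala sketch, not spark-excel or POI code; `truncateTrailingEmpty` and `padToWidth` are hypothetical names): taking a row's width from the last non-empty cell, as `getLastCellNum()` effectively does, drops trailing empty cells, and the proposed fix amounts to padding each row back out to the width implied by the dataAddress range:

```scala
object PadRows {
  // Mimics the current behavior: a row's width is derived from the
  // last non-empty cell, so trailing empty cells disappear.
  def truncateTrailingEmpty(row: Seq[String]): Seq[String] =
    row.reverse.dropWhile(_.isEmpty).reverse

  // Sketch of the proposed fix: pad every row out to the width of the
  // requested dataAddress range, so trailing columns survive.
  def padToWidth(row: Seq[String], width: Int): Seq[String] =
    row.padTo(width, "")

  def main(args: Array[String]): Unit = {
    val requestedWidth = 5                 // e.g. a five-column dataAddress
    val raw = Seq("", "", "cat_a", "", "") // first row ends in empty cells
    val seen = truncateTrailingEmpty(raw)  // what the reader currently yields
    println(seen.length)                              // 3 -- last columns lost
    println(padToWidth(seen, requestedWidth).length)  // 5 -- full range restored
  }
}
```

In spark-excel itself, the width would come from the parsed dataAddress (or from the schema, as suggested above) rather than from a constant.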
Hey @cegaspar, thanks for the analysis!
Would you mind giving a PR a try?
Hey @nightscape.
Sure, no problem.