martijn / xsv

High performance, lightweight .xlsx parser for Ruby that provides nothing a CSV parser wouldn't

Home Page:https://storck.io/posts/announcing-xsv-1-0-0/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Wrongly parsed hash

voxik opened this issue · comments

The hash parsing the F2022-10.xlsx is wrong:

$ pry
[1] pry(main)> require 'xsv'
=> true
[2] pry(main)> x = Xsv.open 'F2022-10.xlsx', :parse_headers => true
=> #<Xsv::Workbook:300 sheets=1 trim_empty_rows=false>
[3] pry(main)> s = x.sheets.first
=> #<Xsv::Sheet:320 mode=hash>
[4] pry(main)> s[0]
=> {"Interní číslo"=>"2022009",
 nil=>61,
 "Typ faktury"=>"FAKTURA",
 "Název firmy nebo jméno osoby"=>"Foo bar",
 "IČO"=>1234567,
 "DIČ"=>"CZ1234567",
 "Ulice"=>nil,
 "PSČ"=>"Velehradská 1",
 "Město"=>"686 03",
 "Datum vystavení"=>"Staré Město",
 "Datum splatnosti"=>nil,
 "Měna"=>#<Date: 2022-08-01 ((2459793j,0s,0n),+0s,2299161j)>,
 "Pořadí"=>#<Date: 2022-08-15 ((2459807j,0s,0n),+0s,2299161j)>,
 "Název"=>nil,
 "Množství"=>"EUR",
 "Cena za MJ"=>nil}

The correct output should be:

--- wrong
+++ correct
@@ -7,18 +7,17 @@
 => #<Xsv::Sheet:320 mode=hash>
 [4] pry(main)> s[0]
 => {"Interní číslo"=>"2022009",
- nil=>61,
  "Typ faktury"=>"FAKTURA",
  "Název firmy nebo jméno osoby"=>"Foo bar",
  "IČO"=>1234567,
  "DIČ"=>"CZ1234567",
- "Ulice"=>nil,
- "PSČ"=>"Velehradská 1",
- "Město"=>"686 03",
- "Datum vystavení"=>"Staré Město",
- "Datum splatnosti"=>nil,
- "Měna"=>#<Date: 2022-08-01 ((2459793j,0s,0n),+0s,2299161j)>,
- "Pořadí"=>#<Date: 2022-08-15 ((2459807j,0s,0n),+0s,2299161j)>,
- "Název"=>nil,
- "Množství"=>"EUR",
- "Cena za MJ"=>nil}
+ "Ulice"=>"Velehradská 1",
+ "PSČ"=>"686 03",
+ "Město"=>"Staré Město",
+ "Datum vystavení"=>#<Date: 2022-08-01 ((2459793j,0s,0n),+0s,2299161j)>,
+ "Datum splatnosti"=>#<Date: 2022-08-15 ((2459807j,0s,0n),+0s,2299161j)>,
+ "Měna"=>"EUR",
+ "Pořadí"=>1,
+ "Název"=>"Item 1,
+ "Množství"=>2,
+ "Cena za MJ"=>61}

I think that the parser is confused by the empty columns.

Of course this might fall into "and does not deal with most formatting or more advanced functionality." category ...

The problem in this file is that there are multiple cells in the header that are empty (B1, G1, K1, N1, P1). Hash keys in Ruby are unique, so there can be only one 'nil' key and that ends up with the value from P2. This is the unpredicted result the README mentiond:

Be aware that hash mode will lead to unpredictable results if the worksheet has multiple columns with the same header.

As it stands, your best option is to use array mode (parse_headers: false) to read this file.

If anybody can come up with a pull request to ignore the columns with empty header cells in hash mode, I'd be open to merging that. I don't have time to work on that myself right now, unfortunately.

Right, thx for confirming my understanding.

Not sure if I'll have motivation to dig into this myself ATM, because, as you suggested, I have already went with array mode any parsing headers by hand for the moment (and I could probably live also with just indices).

Nevertheless, if I would have time, what would be your preferred solution?

  1. Document this as expected behavior, because any other solution will likely impact the performance which seems to be key goal of this project.
  2. Just detect the there are nils in headers and fail.
  3. Ignore the columns with nil headers.
  4. Something else?

I think it's reasonable to throw an error in cases where there are duplicate headers in hash mode. I could imagine wanting to be alerted of this rather than proceeding with potentially unintended behavior.

My two cents.

Right, I guess it's fair to raise an exception in this situation since the resulting data is useless as illustrated by this issue. I'll leave this issue open until we implement at least that.

Alternatively I'd opt for option 3 out of @voxik 's suggestions above: by implementing a col_skip array in SheetRowHandler we could skip the columns with empty headers. Any data in those colums would be lost. Or we replace the empty headers with a generated header name "Column B", "Column G" so data can be salvaged. That is probably easier to implement as well (in Sheet#parse_headers).