Repeated column names erase each other for xParse

Question

mcnuttandrew opened this issue 4 years ago · comments

There is a small ambiguity in the way that tsvParse and csvParse handle files whose columns have non-unique names. For instance, if you have a TSV like

Example A	Example B	Example A
1	5	0
2	5	0
3	5	0
4	5	0

and you run that through tsvParse, you get

[
  { 'Example A': '0', 'Example B': '5' },
  { 'Example A': '0', 'Example B': '5' },
  { 'Example A': '0', 'Example B': '5' },
  { 'Example A': '0', 'Example B': '5' },
  columns: [ 'Example A', 'Example B', 'Example A' ]
]

The problem, of course, is that the data from the first Example A column is blown away during the parse. I'm not sure what the right solution might be: maybe adding a note in the docs that column names need to be unique? Or maybe appending an incrementing index to the duplicated columns ('Example A-1' or something). Having recently been bitten by this, I can say it is a real hair-pulling issue to find and resolve, so any help that might be offered to other people in a similar situation would no doubt be welcome.
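To make the suggestion concrete, here is a minimal sketch of the "append an incrementing index" idea. Both `toRow` and `dedupeColumns` are hypothetical helpers written for illustration: `toRow` is only a simplified stand-in for how a row object keyed by column name loses data, and neither is d3-dsv's actual code or necessarily what the eventual fix does.

```javascript
// Simplified stand-in for building one row object keyed by column name.
// A repeated name overwrites the value stored by the earlier column.
function toRow(columns, values) {
  const row = {};
  columns.forEach((name, i) => {
    row[name] = values[i];
  });
  return row;
}

// Hypothetical helper: make every name unique by appending "-1", "-2", ...
// to names that have already been used.
function dedupeColumns(names) {
  const used = new Set();
  return names.map((name) => {
    let candidate = name;
    let suffix = 0;
    while (used.has(candidate)) {
      suffix += 1;
      candidate = `${name}-${suffix}`;
    }
    used.add(candidate);
    return candidate;
  });
}

const columns = ["Example A", "Example B", "Example A"];
const values = ["1", "5", "0"];

// "Example A" ends up holding "0"; the first column's "1" is lost.
const lossy = toRow(columns, values);

// With unique names, all three values survive.
const lossless = toRow(dedupeColumns(columns), values);

console.log(lossy);
console.log(lossless);
```

The loop keeps incrementing the suffix until the candidate is free, so a file that already contains a column literally named 'Example A-1' would not collide with a generated name.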

Philippe Rivière · Answer 1 · Mon Jun 08 2020 04:08:29 GMT+0800 (China Standard Time)

I think it's a good idea, and a possible implementation is given in #73

Note, however, that it would be a breaking change (people who already have code running against this type of data expect it to continue working).

Andrew McNutt · Answer 2 · Mon Jun 08 2020 04:15:27 GMT+0800 (China Standard Time)

I like your solution, but I don't know if it's worth issuing a breaking change. I think just including a note in the documentation would probably get most people through the hurdle of identifying this error.

Philippe Rivière · Answer 3 · Mon Jun 08 2020 04:28:22 GMT+0800 (China Standard Time)

I don't know… The thing is that, when the data has this shape (and when you don't control it), it's currently quite difficult to manipulate: you have to load it as text, then fiddle with the first line, then dsv.parse… I've had to do this literally last week.
(Plus, we're going to issue a major version soon, so having a breaking change is not that problematic.)
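For anyone stuck on the current behavior, the "load it as text, then fiddle with the first line" workaround might look roughly like the sketch below. `renameDuplicateHeaders` is a hypothetical helper, and it assumes the header row contains no quoted fields that embed the delimiter or a newline; it is not the approach taken in #73.

```javascript
// Hypothetical workaround: rewrite only the header line of a delimited text
// so that repeated column names get a "-1", "-2", ... suffix.
// Assumption: header fields are not quoted and contain no delimiter/newline.
function renameDuplicateHeaders(text, delimiter) {
  const newlineAt = text.indexOf("\n");
  let header = newlineAt === -1 ? text : text.slice(0, newlineAt);
  const rest = newlineAt === -1 ? "" : text.slice(newlineAt);

  // Preserve a Windows-style "\r" at the end of the header line, if any.
  const carriageReturn = header.endsWith("\r") ? "\r" : "";
  if (carriageReturn) header = header.slice(0, -1);

  const used = new Set();
  const names = header.split(delimiter).map((name) => {
    let candidate = name;
    let suffix = 0;
    while (used.has(candidate)) {
      suffix += 1;
      candidate = `${name}-${suffix}`;
    }
    used.add(candidate);
    return candidate;
  });

  return names.join(delimiter) + carriageReturn + rest;
}

const raw = "Example A\tExample B\tExample A\n1\t5\t0\n2\t5\t0\n";
const fixed = renameDuplicateHeaders(raw, "\t");

// The fixed text can then be handed to tsvParse as usual.
console.log(fixed);
```

Only the first line is touched, so the data rows reach the parser unchanged.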

Andrew McNutt · Answer 4 · Mon Jun 08 2020 04:32:28 GMT+0800 (China Standard Time)

Oh, I didn't know a major version was coming! This seems like a great approach then.

Philippe Rivière · Answer 5 · Mon Jun 08 2020 23:02:27 GMT+0800 (China Standard Time)

I ♻️ my code into a notebook (and added "empty names" as well)
https://observablehq.com/@fil/csv-duplicate-names

Philippe Rivière · Answer 6 · Wed Jun 10 2020 00:31:14 GMT+0800 (China Standard Time)

Fixed in 8ab1ab8; thank you!