d3 / d3-dsv

A parser and formatter for delimiter-separated values, such as CSV and TSV.

Home Page: https://d3js.org/d3-dsv

Repeated column names erase each other for xParse

mcnuttandrew opened this issue

There is a small ambiguity in the way that tsvParse and csvParse handle files whose column names are not unique. For instance, if you have a TSV like

Example A	Example B	Example A
1	5	0
2	5	0
3	5	0
4	5	0

And you run that through tsvParse then you get

[
  { 'Example A': '0', 'Example B': '5' },
  { 'Example A': '0', 'Example B': '5' },
  { 'Example A': '0', 'Example B': '5' },
  { 'Example A': '0', 'Example B': '5' },
  columns: [ 'Example A', 'Example B', 'Example A' ]
]
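
The overwrite follows from each row being turned into an object keyed by column name. Here is a simplified sketch in plain JavaScript of that effect. It is not d3-dsv's actual source, and `rowToObject` is a hypothetical helper used only for illustration:

```javascript
// Simplified illustration, not d3-dsv's actual implementation.
// rowToObject is a hypothetical helper that keys a row's values by header name.
function rowToObject(columns, values) {
  const row = {};
  columns.forEach((name, i) => {
    // A repeated name assigns to the same key, so the later value wins.
    row[name] = values[i];
  });
  return row;
}

const columns = ["Example A", "Example B", "Example A"];
const firstRow = rowToObject(columns, ["1", "5", "0"]);
console.log(firstRow); // { 'Example A': '0', 'Example B': '5' }
```

The value '1' from the first Example A column is gone because the third column wrote to the same key.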

The problem, of course, is that the data from the first Example A column is blown away during the parse. I'm not sure what the right solution might be: maybe adding a note to the docs that column names need to be unique? Or maybe appending an incrementing index to the duplicated columns ('Example A-1' or something)? Having recently been bitten by this, I can say it is a real hair-pulling issue to find and resolve, so any help that might be offered to other people in a similar situation would no doubt be welcome.
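
To make the incrementing-index idea concrete, here is a minimal sketch. The helper name `uniquifyColumns` and the exact '-1', '-2' suffix format are assumptions taken from the suggestion above; this is not an existing d3-dsv function and may differ from any implementation proposed for the library:

```javascript
// Hypothetical helper: renames repeated column names by appending "-1", "-2", ...
// The suffix format is an assumption, not a d3-dsv feature.
function uniquifyColumns(names) {
  const used = new Set();   // names already emitted
  const counts = new Map(); // last suffix tried for each original name
  return names.map((name) => {
    let candidate = name;
    let n = counts.get(name) || 0;
    // Keep incrementing until the candidate does not clash with an emitted name.
    while (used.has(candidate)) {
      n += 1;
      candidate = `${name}-${n}`;
    }
    counts.set(name, n);
    used.add(candidate);
    return candidate;
  });
}

console.log(uniquifyColumns(["Example A", "Example B", "Example A"]));
// [ 'Example A', 'Example B', 'Example A-1' ]
```

The first occurrence keeps its name, so existing code that reads 'Example A' would now see the first column rather than the last, which is why this kind of change is not backward compatible.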

I think it's a good idea, and a possible implementation is given in #73

Note, however, that it would be a breaking change (people who already run code against this type of data expect it to keep working).

I like your solution, but I don't know if it's worth issuing a breaking change. I think a note in the documentation would probably get most people through the hurdle of identifying this error.

I don't know… The thing is that, when the data has this shape (and when you don't control it), it's currently quite difficult to manipulate: you have to load it as text, then fiddle with the first line, then dsv.parse… I had to do this just last week.
(Plus, we're going to issue a major version soon, so having a breaking change is not that problematic.)
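
For readers stuck with such data today, the "load as text, fiddle with the first line" workaround could look roughly like this. It is a sketch under assumptions: `renameDuplicateHeaders` is a hypothetical helper, and it assumes the header line contains no quoted fields with embedded delimiters or newlines:

```javascript
// Hypothetical helper: rewrites only the header line so repeated names become unique.
// Assumes header fields are unquoted and contain no embedded delimiters or newlines.
// Note: a generated name such as "x-1" could still clash with a real column named "x-1".
function renameDuplicateHeaders(text, delimiter = "\t") {
  const newline = text.indexOf("\n");
  const header = newline === -1 ? text : text.slice(0, newline);
  const rest = newline === -1 ? "" : text.slice(newline);
  const seen = new Map(); // how many times each name has appeared so far
  const names = header.replace(/\r$/, "").split(delimiter).map((name) => {
    const count = seen.get(name) || 0;
    seen.set(name, count + 1);
    return count === 0 ? name : `${name}-${count}`;
  });
  return names.join(delimiter) + rest;
}

const text = "Example A\tExample B\tExample A\n1\t5\t0\n2\t5\t0\n";
const fixed = renameDuplicateHeaders(text);
console.log(JSON.stringify(fixed.split("\n")[0]));
// "Example A\tExample B\tExample A-1"
```

The rewritten text can then be passed to tsvParse (or csvParse with `","` as the delimiter), and each row keeps the values of both columns.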

Oh, I didn't know a major version was coming! This seems like a great approach, then.

I ♻️ my code into a notebook (and added "empty names" as well)
https://observablehq.com/@fil/csv-duplicate-names

Fixed in 8ab1ab8; thank you!