d3 / d3-dsv

A parser and formatter for delimiter-separated values, such as CSV and TSV.

Home Page: https://d3js.org/d3-dsv

Remove byte-order markers from CSV files?

robinhouston opened this issue

I’ve noticed that Excel now saves UTF-8 CSV files with a BOM. (I’m using Microsoft Excel for Mac version 15.33, saving in “CSV UTF-8” format.)

When such a file is parsed with csvParse, the key for the first column begins with a zero-width no-break space (U+FEFF). As a result, d["keyName"] is undefined even though keyName appears when you print out d!
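To make the symptom concrete, here is a minimal sketch in plain JavaScript. It uses a naive header split as a stand-in for csvParse (an assumption for illustration, not d3-dsv's actual parser), and the column names `name` and `value` are hypothetical.

```javascript
// Hypothetical CSV text that starts with a BOM (U+FEFF), as a BOM-prefixed
// file looks after being decoded without BOM stripping.
const text = "\uFEFFname,value\nfoo,1";

// Naive stand-in for a CSV parser: split the header and first row on commas.
const [header, row] = text.split("\n");
const keys = header.split(",");
const values = row.split(",");
const d = Object.fromEntries(keys.map((key, i) => [key, values[i]]));

console.log(d);               // the first key looks like "name" when printed
console.log(d["name"]);       // undefined
console.log(d["\uFEFFname"]); // "foo"
console.log(keys[0].length);  // 5, one more than "name".length
```

Checking the length or the first character code of the first key is a quick way to confirm that a BOM is the culprit.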

I’m not sure whether you think this should be addressed in the parser. If not, I think it should at least be documented.

Can you attach an example file I can use for testing purposes?

Sure! GitHub won’t let me attach a .csv file, so I’ve zipped it.
Workbook1.csv.zip

Interestingly if you use FileReader.readAsText, it automatically strips the BOM bytes for you, per the Encoding specification.

Seems like XMLHttpRequest and Fetch also automatically strip the BOM. Here’s a CORS-accessible URL I tested:

https://rawgit.com/mbostock/3fe6055309cff87cba4103837d914fee/raw/48cec3b15411fe2a9d9f678c5988d03b3988f498/test.csv

So my question is how are you getting a string with the BOM still in it? It seems like the BOM stripping should happen earlier, before it gets to d3-dsv.

Sorry, I should have included a complete repro. I’m getting this in Node, by reading the file with fs.readFile(filename, "utf8", …). It looks as though the Node developers have decided against stripping BOMs by default.
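The difference between the two decoding paths can be sketched in Node without touching the file system. This assumes that fs.readFile with the "utf8" encoding decodes bytes the same way Buffer's toString("utf8") does, and it uses TextDecoder to stand in for the Encoding-specification behavior the browser APIs follow. The sample bytes are made up for illustration.

```javascript
// UTF-8 bytes for a BOM (EF BB BF) followed by "a,b".
const bytes = Buffer.from([0xef, 0xbb, 0xbf, 0x61, 0x2c, 0x62]);

// Buffer decoding keeps the BOM as a leading U+FEFF character.
const viaBuffer = bytes.toString("utf8");

// TextDecoder follows the Encoding specification and drops the BOM by default.
const viaTextDecoder = new TextDecoder("utf-8").decode(bytes);

// Opting out with ignoreBOM: true keeps the BOM in the decoded string.
const viaTextDecoderKeep = new TextDecoder("utf-8", { ignoreBOM: true }).decode(bytes);

console.log(viaBuffer.charCodeAt(0) === 0xfeff);          // true
console.log(viaTextDecoder);                              // "a,b"
console.log(viaTextDecoderKeep.charCodeAt(0) === 0xfeff); // true
```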

It’s okay if I should handle this in the app: I just thought I should flag it.

Okay. I’m going to close this issue. If you want to submit a pull request with an edit to the README suggesting that Node users use strip-bom, that would be 💯.

Great, will do!

Gah, this just got me too. Could we consider adding it directly to d3-dsv? I think the code is considerably shorter than the comment in the README. Plus, I wasted a good ten minutes. Thanks, Excel!
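For a sense of how little code is involved, here is a minimal sketch of an app-side workaround. The helper name `stripBom` is hypothetical, and this is neither d3-dsv's code nor the strip-bom package's implementation, just one way to drop a single leading U+FEFF before parsing.

```javascript
// Hypothetical helper: remove one leading U+FEFF if present, else return as-is.
function stripBom(text) {
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

// Made-up sample input with and without a BOM.
const withBom = "\uFEFFname,value\nfoo,1";
const cleaned = stripBom(withBom);

console.log(cleaned.startsWith("name"));              // true
console.log(stripBom("name,value") === "name,value"); // true, input unchanged
```

In a Node app this would be applied to the string returned by fs.readFile before passing it to csvParse.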