d3 / d3-dsv

A parser and formatter for delimiter-separated values, such as CSV and TSV.

Home Page: https://d3js.org/d3-dsv

Decimal mark customization

badosa opened this issue

In languages where the decimal mark is ",", some spreadsheets expect CSVs to use ";" as the delimiter and "," as the decimal mark.

dsv2* has an --input-delimiter option and *2dsv has an --output-delimiter option. Please allow the user to specify a decimal mark other than the default "." as well (--input-decimal / --output-decimal).

This library does not perform any number formatting or parsing, and so has no concept of decimal mark (other than the default string coercion behavior of JavaScript). You’ll need to format your numbers prior to generating DSV (say using a localized version of d3-format), and similarly specify your own number parser (perhaps also using d3-format) if you want this functionality.
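To illustrate formatting numbers before they reach the DSV formatter, here is a minimal sketch in plain JavaScript with no dependencies. Both `localizeNumbers` and `formatRows` are hypothetical helpers; `formatRows` is only a simplified stand-in for what a semicolon-delimited formatter would produce, not this library's implementation, and the plain string replacement is an assumption that would be swapped for a real localized formatter such as d3-format.

```javascript
// Hypothetical helper: turn number fields into strings that use the given decimal mark.
// Non-number fields are passed through unchanged.
function localizeNumbers(row, decimal = ",") {
  const out = {};
  for (const [key, value] of Object.entries(row)) {
    out[key] = typeof value === "number" ? String(value).replace(".", decimal) : value;
  }
  return out;
}

// Simplified stand-in for a DSV formatter (an assumption, not d3-dsv itself):
// quotes any field that contains the delimiter, a double quote, or a line break.
function formatRows(rows, delimiter = ";") {
  const columns = Object.keys(rows[0]);
  const escape = (v) => {
    const s = v == null ? "" : String(v);
    return /["\n\r]/.test(s) || s.includes(delimiter) ? `"${s.replace(/"/g, '""')}"` : s;
  };
  return [columns, ...rows.map((r) => columns.map((c) => r[c]))]
    .map((fields) => fields.map(escape).join(delimiter))
    .join("\n");
}

// Numbers are localized first; the formatter then only ever sees strings.
const rows = [{name: "fish", value: 1.23}];
const csv = formatRows(rows.map((r) => localizeNumbers(r)), ";");
console.log(csv);
```

With ";" as the delimiter, the localized value needs no quoting, because the comma is no longer a special character in the output.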

This library does not perform any number formatting or parsing

I can see this... For i18n's sake, it seems to me that it should, though. The solution you suggest seems a very complex way to get, from json2dsv, a usable CSV in German, French, Spanish, Italian, Swedish, Norwegian, Danish, Dutch, Czech...

I'm aware that number handling is not mentioned in RFC 4180 (which is an Informational RFC) as it only refers to text-based fields. Because of this (everything is text), and as much as I dislike it, localized versions of CSV-consuming software usually require localized CSVs (they apply a localized conversion of strings to numbers). That makes CSV language-dependent, while JSON is not. Conversion tools between the two should, IMHO, take this into account.

But I understand your reasons (RFC 4180). I'm just arguing that the CLI of d3-dsv would benefit from considering CSV in (international) practice.

Decoupling the handling of delimiter-separated text fields from handling of numbers greatly simplifies the code (both the internal implementation and the interface). Requiring this library to know which fields are numbers would require the DSV format to store metadata, and there is no broadly-accepted convention for doing this. So while I appreciate your desire to make this process easier, I do not see a good way to make it easier than it already is.

Thank you, @mbostock, for taking the time to answer this.

I gather that the command-line interface is just a by-product of d3-dsv that might not deserve too much effort (apparently it didn't even deserve the inclusion of an important feature like columns that, on the other hand, is supported by dsv.format(rows[, columns])).

Requiring this library to know which fields are numbers would require the DSV format to store metadata

I understand the difficulty of adding this feature to dsv2json, but as you can guess from my last comment my focus is mainly on json2dsv. When the input is an array of objects, dsv.format(rows[, columns]) could take into account the type of each property in the first element of the array if --output-decimal is specified, and do the proper replacements ([dsv.format(rows[, columns][,decimalChar])]). This is very limited functionality and not bulletproof (for example, the same property may have different types in different elements, such as null used for a missing value in the first element of the array), but it seems a quite useful addition to the CLI in real-life scenarios.
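The idea described above could be sketched roughly as follows. This is not an existing d3-dsv API: `inferNumberColumns` and `applyDecimalChar` are hypothetical names, and the sample rows are made up purely to show the behavior, including the null limitation already mentioned.

```javascript
// Hypothetical sketch of the proposed behavior: infer number columns from the first row only.
function inferNumberColumns(rows) {
  return Object.keys(rows[0]).filter((key) => typeof rows[0][key] === "number");
}

// Replace the decimal mark only in the columns inferred from the first row.
function applyDecimalChar(rows, decimalChar) {
  const numeric = inferNumberColumns(rows);
  return rows.map((row) => {
    const copy = {...row};
    for (const key of numeric) {
      if (typeof copy[key] === "number") {
        copy[key] = String(copy[key]).replace(".", decimalChar);
      }
    }
    return copy;
  });
}

// Made-up sample rows: every value in the "value" column is a number, so all are converted.
const converted = applyDecimalChar([{name: "a", value: 1.5}, {name: "b", value: 2.25}], ",");

// A null in the first row hides the column, so later numbers keep the "." mark.
const missed = applyDecimalChar([{name: "a", value: null}, {name: "b", value: 2.25}], ",");

console.log(converted, missed);
```

The second call shows why first-row inference is fragile: the column is never recognized as numeric, so its values would later be coerced with the default "." decimal mark.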

But, again, I understand the CLI (where this feature makes more sense) is not D3.org's priority, so this is probably an unnecessary and ugly addition to the d3-dsv module. That's why I ended up adding similar functionality to my jsonstat-conv. This module converts JSON-stat (a format that has the metadata needed to detect number fields) to several flavors of JSON (and CSV) that can be used as input to json2csv: the latter will receive number fields already converted into strings with the requested decimal mark.

Field delimiter: comma; decimal mark: dot

curl "http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/tesem120?precision=1" | jsonstat2arrobj -b geo -d sex,age,unit -t | json2csv > unr.csv

Field delimiter: semicolon; decimal mark: comma

curl "http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/tesem120?precision=1" | jsonstat2arrobj -b geo -d sex,age,unit -k -t | json2csv -w ";" > unr.csv

By the way, this works perfectly on a Mac but on Windows json2csv returns:

Error: ENOENT: no such file or directory, stat 'C:\dev\stdin'
    at Error (native)

Could this have to do with the use of "/dev/stdin" in dash.js?

It’s not a question of CLI vs. API. The issue is that the API deals with string input and string output exclusively. Anything that is not a string is coerced to a string using JavaScript’s default behavior (which is not localizable, as far as I am aware).

So again, if you want to control the formatting of numbers to strings, you must format them before passing them to dsvFormat (or *2dsv). And if you want to control the parsing of strings into numbers, you must parse them after receiving them from dsvParse (or dsv2*).
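For the parsing direction, a minimal sketch in plain JavaScript might look like this. `parseLocalizedNumber` is a hypothetical helper, and the `parsed` array merely imitates the all-strings rows a DSV parser returns; it assumes values carry no thousands separators.

```javascript
// Hypothetical helper: convert a string that uses the given decimal mark into a number.
// Assumes there are no thousands separators in the input.
function parseLocalizedNumber(text, decimal = ",") {
  return Number(text.trim().replace(decimal, "."));
}

// Rows as a DSV parser would return them: every field is a string.
const parsed = [{name: "fish", value: "1,23"}];

// The caller names the numeric columns explicitly, since DSV carries no type metadata.
const typed = parsed.map((d) => ({...d, value: parseLocalizedNumber(d.value)}));

console.log(typed);
```

Naming the numeric columns by hand keeps the type decision with the caller, which is the only place that knowledge exists.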

If you want to do this on the command-line, I recommend using ndjson-cli. For example, given the following CSV input:

name,value
fish,1.23

You can reformat the number column to a different locale like so:

csv2json -n < in.csv \
  | ndjson-map -r d3=d3-format '(d.value = d3.formatLocale({decimal: ",", thousands: " ", grouping: [3]}).format("")(+d.value), d)' \
  | json2csv -n \
  > out.csv

Which results in:

name,value
fish,"1,23"

You’ll need to npm install -g ndjson-cli d3-dsv d3-format to get the above to work.

The Windows error is unrelated to this issue; my guess is that it’s a problem with the rw library.

FYI, I’ve also released rw@1.3.3 as my fourth attempt to get rw working on Windows. If you uninstall and reinstall you should get the newer version, and hopefully that will make it work on Windows.

Thank you for the tip on rw@1.3.3. I don't have a Windows machine at hand right now, but I'll try it when I do.

Thank you also for the sample code: I've used and love ndjson-cli but never tried d3-format.

Yes, rw@1.3.3 works on Windows.