TSV Parses ""'s in Columns Incorrectly
agrow opened this issue · comments
Greetings!
TL;DR: A field that begins with a quoted item but includes other text afterwards, such as:
"Hello" world
is parsed as
"Hello"<tab> world
which is two entries rather than one.
My data includes plain text and is exported to the TSV file correctly (verified by viewing the whitespace characters in Word). However, when it is imported via d3.tsv, an entry such as the one above is split into two, shifting all the other data in that row over by one column.
I do not have time to make an isolated test right now, but here are some screenshots of the incorrect parse (and one photo including an adjacent correct parse).
Per RFC 4180:
If fields are not enclosed with double quotes, then double quotes may not appear inside the fields.
If you want the parsed value of the field to be "Hello" world, the serialized text for that field should be """Hello"" world".
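As a rough illustration of the quoting rule described above, here is a minimal sketch of how an exporter could serialize a single field. `quoteField` is a hypothetical helper written for this comment, not part of d3 or any other library, and it assumes the field value is already a string.

```javascript
// Hypothetical helper, not part of d3: serialize one field value using
// RFC 4180 quoting, generalized to an arbitrary delimiter (tab by default).
function quoteField(value, delimiter = "\t") {
  const needsQuoting =
    value.includes('"') ||
    value.includes(delimiter) ||
    value.includes("\n") ||
    value.includes("\r");
  if (!needsQuoting) return value;
  // Double every embedded quote, then wrap the whole field in quotes.
  return '"' + value.replace(/"/g, '""') + '"';
}

// Build one TSV row from raw values.
const row = ['"Hello" world', "plain text"]
  .map((v) => quoteField(v))
  .join("\t");
console.log(row); // """Hello"" world"<tab>plain text
```

A row written this way should come back from a conforming parser as one field per original value.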
That link is for CSV, not TSV.
Also, Hello "world" parses correctly. Double quotes "may not appear" only when they begin the field. In a TSV.
This repo generalizes CSV as defined in RFC 4180 to support other delimiters besides comma, including tab (\t). But otherwise it adheres to that specification.
The fact that Hello "world" parses as expected does not mean that the input is correctly formatted; the correct input format should be "Hello ""world""" in that case. This library doesn’t validate the input, and so its behavior is undefined if you give it invalid input.
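Since the parser does not validate, one option is to check fields before handing a file to it. The following is only a sketch under stated assumptions: `isWellFormedField` is a hypothetical function, not something this library provides, and it inspects a single raw (still serialized) field that has already been separated from its neighbors, so it does not look for stray delimiters or newlines.

```javascript
// Hypothetical checker, not part of this library: returns true when a raw
// (still serialized) field follows the RFC 4180 quoting rules.
function isWellFormedField(raw) {
  if (!raw.startsWith('"')) {
    // Unquoted fields may not contain double quotes at all.
    return !raw.includes('"');
  }
  // Quoted fields need both an opening and a closing quote...
  if (raw.length < 2 || !raw.endsWith('"')) return false;
  // ...and every quote between them must be doubled.
  const inner = raw.slice(1, -1);
  return !inner.replace(/""/g, "").includes('"');
}

console.log(isWellFormedField('"Hello" world'));     // false
console.log(isWellFormedField('Hello "world"'));     // false
console.log(isWellFormedField('"""Hello"" world"')); // true
console.log(isWellFormedField('"Hello ""world"""')); // true
```

Both inputs discussed in this thread fail the check, including the one that happens to parse as expected.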