tspence / csharp-csv-reader

A lightweight, high performance, zero dependency, streaming CSV reading library for CSharp.

Home Page: http://tedspence.com

Handling embedded (but escaped) text delimiter

arvindshmicrosoft opened this issue

Hello!
The case below uses a text delimiter (") that also appears inside the text itself, so the data source escapes each embedded delimiter with a backslash. Unfortunately, the parser mishandles this and reports 7 fields for line 2 when there should be 8.

c1|c2|c3|c4|c5|c6|c7|c8
77|"somestr"|1|"otherstr"|"she said 'walk' rule says 'should not exceed' at last\""|"xyz"|"pqr"|""
88|"jasd"|1|"hmm"|"normal text"|"c6"|"c7"|"c8"

Any possibility to handle this kind of case correctly?
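
For anyone who wants to reproduce the miscount without the library, here is a rough sketch that uses Python's standard csv module purely as a stand-in. It is not csharp-csv-reader and may differ from it in other edge cases, but it follows the same doubling convention and arrives at the same count of 7 for line 2.

```python
import csv

# Line 2 from the report, with the embedded quote escaped by a backslash.
line = r'''77|"somestr"|1|"otherstr"|"she said 'walk' rule says 'should not exceed' at last\""|"xyz"|"pqr"|""'''

# Stand-in parser: pipe delimiter, double quote as qualifier, and the
# default "doubled quote" convention. No escape character is configured,
# so the backslash is treated as ordinary text.
fields = next(csv.reader([line], delimiter="|", quotechar='"'))

print(len(fields))  # 7
print(fields[4])    # the fifth field has swallowed the "xyz" column
```

The backslash is kept as a literal character, the two quotes after it are read as one doubled quote, and the field keeps running until the next pipe that follows an unpaired quote, which merges columns 5 and 6.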

Unfortunately, the CSV spec (if there is one) doesn't include escaping text with backslashes. In the CSV specs that I have observed, text qualifiers such as double quotes are escaped by doubling them up.

The best description I've heard of CSV's encoding policy is this:

  • Fields are delimited by the comma character.
  • If the text of a field includes the comma character, that field is enclosed by a text qualifier such as double quote characters.
  • If the text of a field enclosed by a text qualifier includes that same text qualifier, that text qualifier should be doubled up.
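
As a rough illustration of those three rules, here is a minimal encoder sketch in Python. The names `encode_field` and `encode_row` are made up for this example and are not part of any library. One assumption goes beyond the list above: a field that contains the qualifier is also enclosed, even if it contains no delimiter.

```python
def encode_field(text, delimiter=",", qualifier='"'):
    """Encode one field using the doubling convention (illustrative only)."""
    # Assumption: enclose when the field holds the delimiter OR the qualifier.
    if delimiter in text or qualifier in text:
        # Rule 3: double every embedded qualifier, then enclose the field.
        return qualifier + text.replace(qualifier, qualifier * 2) + qualifier
    return text


def encode_row(values, delimiter=",", qualifier='"'):
    """Join encoded fields with the delimiter (rule 1)."""
    return delimiter.join(encode_field(v, delimiter, qualifier) for v in values)


print(encode_row(["a", "b,c", 'say "hi"']))  # a,"b,c","say ""hi"""
```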

This means the correct encoding for line two would be:

77|"somestr"|1|"otherstr"|"she said 'walk' rule says 'should not exceed' at last"""|"xyz"|"pqr"|
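
A quick way to check that encoding is to run it through any parser that follows the doubling convention. The sketch below uses Python's standard csv module as a stand-in (not this library), configured with a pipe delimiter and a double-quote qualifier.

```python
import csv

# The re-encoded line 2: the embedded quote is doubled instead of
# backslash-escaped, and the empty last field is left unquoted.
line = '''77|"somestr"|1|"otherstr"|"she said 'walk' rule says 'should not exceed' at last"""|"xyz"|"pqr"|'''

fields = next(csv.reader([line], delimiter="|", quotechar='"'))

print(len(fields))  # 8
print(fields[4])    # she said 'walk' rule says 'should not exceed' at last"
```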

What's worse, because CSV was never fully formalized, different programs solve these edge cases differently. Because of this, if you are looking for a reliable encoding system, you're going to hit lots of problems.

The good news is there are solutions. What would work best?

  1. If you are encoding and decoding objects using my library, my library will encode and decode them consistently.
  2. I notice that your text doesn't actually include embedded pipe symbols. If you don't need embedded pipes, you don't really need text qualifiers either - could you simply avoid parsing double quotes? e.g. pass in a single quote as the text delimiter?
  3. JSON encoding would also work correctly with backslashes.
  4. It's certainly possible to extend my library to support backslashes, but that's not a standard CSV behavior and it would have to be explicitly called out.
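
To illustrate option 2, here is a sketch of what "don't parse the qualifier at all" could look like, again using Python's csv module as a stand-in rather than this library. The `unwrap` helper is a hypothetical post-processing step added for the example; it assumes fields never contain the pipe character.

```python
import csv

line = r'''77|"somestr"|1|"otherstr"|"she said 'walk' rule says 'should not exceed' at last\""|"xyz"|"pqr"|""'''

# QUOTE_NONE tells the reader to treat quote characters as ordinary text,
# so the line is split on pipes only.
raw = next(csv.reader([line], delimiter="|", quoting=csv.QUOTE_NONE))


def unwrap(field):
    """Hypothetical cleanup: drop the outer quotes, then undo the \\" escapes."""
    if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
        field = field[1:-1]
    return field.replace('\\"', '"')


fields = [unwrap(f) for f in raw]

print(len(fields))  # 8
print(fields[4])    # she said 'walk' rule says 'should not exceed' at last"
```

This only works because the data has no embedded pipes; once a field can contain the delimiter, a real qualifier or escape mechanism is needed again.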

I did recently encapsulate all CSV settings in this object for ease of maintenance: https://github.com/tspence/csharp-csv-reader/blob/master/src/CSVSettings.cs

I suppose if you wanted to extend the library, we could consider adding an "EscapeCharacter" setting that is null by default but can be set to the backslash character for your use case?
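
To make the idea concrete, here is a small sketch of how an optional escape character might behave. It is written in Python for brevity and is not the library's code; `split_line` and its `escape_char` parameter are hypothetical names, with `None` standing in for the proposed null default.

```python
def split_line(line, delimiter="|", qualifier='"', escape_char=None):
    """Split one line into fields; escape_char=None keeps standard behavior."""
    fields, buf = [], []
    in_quotes = False
    i = 0
    while i < len(line):
        c = line[i]
        if in_quotes:
            # Opt-in: an escape character makes the next character literal.
            if escape_char is not None and c == escape_char and i + 1 < len(line):
                buf.append(line[i + 1])
                i += 2
                continue
            if c == qualifier:
                # Standard convention: a doubled qualifier is one literal.
                if i + 1 < len(line) and line[i + 1] == qualifier:
                    buf.append(qualifier)
                    i += 2
                    continue
                in_quotes = False  # closing qualifier
            else:
                buf.append(c)
        elif c == qualifier and not buf:
            in_quotes = True  # opening qualifier at the start of a field
        elif c == delimiter:
            fields.append("".join(buf))
            buf = []
        else:
            buf.append(c)
        i += 1
    fields.append("".join(buf))
    return fields


line = r'''77|"somestr"|1|"otherstr"|"she said 'walk' rule says 'should not exceed' at last\""|"xyz"|"pqr"|""'''

print(len(split_line(line)))                    # 7, default behavior
print(len(split_line(line, escape_char="\\")))  # 8, with the opt-in escape
```

Keeping the escape character off by default means existing callers see no change, and only data sources that are known to use backslash escapes need to opt in.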

Thanks for looking at this. I later found IETF RFC 4180, which indeed specifies doubling those embedded quotes. I'm closing this issue.