Formatted header breaks email readability when received on mail user agent
pacellig opened this issue
Release or branch I am using: v0.8.1
Parsing formatted headers using the library leads to an email that is no longer correctly parsable by mail user agents. A sample header is attached here for readability with the original formatting.
This was introduced in this commit, in the readHeader function in header.go, lines 107-113.
What is obtained after parsing the incoming email:
0.00 BSF_BESS_OUTBOUND META: BESS Outbound 0.00 HTML_MESSAGE BODY: HTML included in message 0.01 SUBJ_ALL_CAPS META: Subject is all capitals 0.20 BSF_SC0_SAXXX META: Custom Rule BSF_SC0_SAXXX X-Bess-Outbound-Spam-Report: Code version 3.2, rules version 1.2.3.4 [from scanhost.whatever.wherever.com] Rule breakdown below pts rule name description ---- ---------------------- --------------------------------
This breaks both parsing and display in the mail user agent.
Further elaboration on this issue: the test case implemented in handler_test.go#513 is wrong, as it does not respect the long header fields definition of RFC2822.
{ input: "X-Not-Continuation: line1=foo;\n X-Next-Header: bar\n", hname: "X-Not-Continuation", want: "line1=foo;", correct: true, },
The correct result in this case would be:
{ input: "X-Not-Continuation: line1=foo;\n X-Next-Header: bar\n", hname: "X-Not-Continuation", want: "line1=foo; X-Next-Header: bar", correct: true, },
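For comparison, here is a minimal sketch of what a strict RFC-style unfold produces for that input, using Go's standard net/textproto package rather than enmime. The helper name unfoldWithStdlib is made up for illustration, and a blank line is appended to the input to terminate the header block.

```go
package main

import (
	"bufio"
	"fmt"
	"net/textproto"
	"strings"
)

// unfoldWithStdlib (hypothetical helper) parses a raw header block with
// net/textproto, which joins lines starting with a space or tab onto the
// previous header, and returns the value stored under name.
func unfoldWithStdlib(raw, name string) (string, error) {
	r := textproto.NewReader(bufio.NewReader(strings.NewReader(raw)))
	h, err := r.ReadMIMEHeader()
	if err != nil {
		return "", err
	}
	return h.Get(name), nil
}

func main() {
	// Input from the test case, plus the blank line that ends a header block.
	raw := "X-Not-Continuation: line1=foo;\n X-Next-Header: bar\n\n"
	v, err := unfoldWithStdlib(raw, "X-Not-Continuation")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%q\n", v) // "line1=foo; X-Next-Header: bar"
}
```

The indented second line is folded into the first header's value, so no separate X-Next-Header field exists after parsing.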
I think we need to do better than just reverting #149 - as it appears it is preventing a full-stop error, which is worse than a header with misformatted whitespace in my opinion. Perhaps I am misunderstanding and your scenario also results in an enmime error?
I agree, but what is still unclear to me is why we care about the equals sign or colon in the first place. RFC-wise, basically everything is allowed in a folded header, except for CRLF.
Would it be possible to get a sample from the #149 case? If so, I would be happy to help and cover this case.
In case we don't hear back from requaos, here are my thoughts... please treat as brainstorming and challenge if you disagree. :)
Single stray indent
x-header-one: value
 x-header-two: value
x-header-three: value
should be treated as three separate headers, no continuation. I believe this is what #149 was intended to handle.
Run of indented "headers"
x-header-one: value
 x-continue: value
 x-continue: value
x-header-two: value
It seems unlikely that a bunch of actual headers would all be indented, so this should be treated as a continuation.
Normal continuation
x-header-one: value
 non-header-like continuation
Treat as continuation.
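The three cases above could be sketched roughly as follows. This is only an illustration of the brainstormed rules, not enmime's code: groupHeaders, isIndented, and the headerLike pattern are hypothetical names, and treating "a run" as two or more consecutive indented lines is an assumption.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// headerLike is an assumed approximation of "looks like a header": one or
// more printable ASCII characters other than ':' followed by a colon.
var headerLike = regexp.MustCompile(`^[!-9;-~]+:`)

func isIndented(s string) bool {
	return strings.HasPrefix(s, " ") || strings.HasPrefix(s, "\t")
}

// groupHeaders returns one string per logical header.
func groupHeaders(lines []string) []string {
	var out []string
	for i := 0; i < len(lines); {
		if !isIndented(lines[i]) || len(out) == 0 {
			out = append(out, strings.TrimSpace(lines[i]))
			i++
			continue
		}
		// Collect the run of consecutive indented lines.
		j := i
		for j < len(lines) && isIndented(lines[j]) {
			j++
		}
		run := lines[i:j]
		if len(run) == 1 && headerLike.MatchString(strings.TrimSpace(run[0])) {
			// Single stray indent: treat as its own header.
			out = append(out, strings.TrimSpace(run[0]))
		} else {
			// Run of indented lines, or non-header-like text: continuation.
			for _, l := range run {
				out[len(out)-1] += " " + strings.TrimSpace(l)
			}
		}
		i = j
	}
	return out
}

func main() {
	stray := []string{"x-header-one: value", " x-header-two: value", "x-header-three: value"}
	run := []string{"x-header-one: value", " x-continue: value", " x-continue: value", "x-header-two: value"}
	normal := []string{"x-header-one: value", " non-header-like continuation"}
	for _, in := range [][]string{stray, run, normal} {
		fmt.Printf("%q\n", groupHeaders(in))
	}
}
```

With these inputs the first example yields three headers, the second yields two, and the third yields one.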
The only challenge I see is: how can we distinguish between these two cases?
Even though I see the point and idea behind it, my opinion is that sticking to RFCs is the best way to avoid unexpected problems when using the library (but this is a personal opinion).
Also, checking the proposed headers against mxtoolbox, you can see that [1] and [2] are treated the same; that is, both are parsed as folded headers.
[1] https://mxtoolbox.com/Public/Tools/EmailHeaders.aspx?huid=5e86010a-d94b-4477-b101-c1a48d6d26e6
[2] https://mxtoolbox.com/Public/Tools/EmailHeaders.aspx?huid=29031bb3-2075-4ae7-a9d0-b6351326fe70
The only challenge I see is: how can we distinguish between these two cases?
For these two specific cases, it looks like we can distinguish between them by looking for a continuation that, other than its folding whitespace, "looks like" a new header. I.e., the line is a header prefixed only with a space. I think that will allow the original test case to pass, while also catching your case properly as folding whitespace.
I'm including your problematic header here for easier viewing (rather than it being linked out to a file that has to be downloaded, then viewed):
X-BESS-Outbound-Spam-Report: Code version 3.2, rules version 1.2.3.4 [from
 scanhost.whatever.wherever.com]
 Rule breakdown below
 pts rule name description
 ---- ---------------------- --------------------------------
 0.00 HTML_MESSAGE BODY: HTML included in message
 0.00 BSF_BESS_OUTBOUND META: BESS Outbound
 0.20 BSF_SC0_SAXXX META: Custom Rule BSF_SC0_SAXXX
 0.01 SUBJ_ALL_CAPS META: Subject is all capitals
Since none of those folded lines begin with something that could be mistaken for a header, that logic should work fine for the original case, and this case.
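To make that check concrete, here is a rough sketch of a "looks like a header" predicate. looksLikeHeader is a hypothetical name, and the field-name rule used (printable ASCII 33-126, no colon) is an assumption borrowed from the RFC 5322 definition, not taken from the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// looksLikeHeader reports whether line, ignoring leading folding whitespace,
// starts with a plausible field name followed by a colon. A field name is
// assumed to be one or more printable ASCII characters (33-126) except ':'.
func looksLikeHeader(line string) bool {
	s := strings.TrimLeft(line, " \t")
	i := strings.IndexByte(s, ':')
	if i <= 0 {
		return false
	}
	for _, c := range []byte(s[:i]) {
		if c < 33 || c > 126 {
			return false
		}
	}
	return true
}

func main() {
	lines := []string{
		" X-Next-Header: bar",
		" scanhost.whatever.wherever.com]",
		" Rule breakdown below",
		" 0.00 HTML_MESSAGE BODY: HTML included in message",
	}
	for _, l := range lines {
		fmt.Printf("%-52q %v\n", l, looksLikeHeader(l))
	}
}
```

Only the first line is reported as header-like. The spam report lines either contain no colon, or have spaces before the colon, so they stay folded.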
Even though I see the point and idea behind it, my opinion is that sticking to RFCs is the best way to avoid unexpected problems when using the library (but this is a personal opinion).
The stdlib actually follows the RFCs pretty well. This package is more flexible and handles things we've seen in the real world (which often does a bad job at following RFCs).
I used the stdlib at first, but later changed to this as I encountered email I needed to parse that didn't quite follow the RFCs.
Thanks for the clarification, I see the reasons behind it. I updated the PR with something that should behave well in all the mentioned cases; please let me know your thoughts on that.