Formatted header breaks email readability when received on mail user agent

Question

pacellig opened this issue 4 years ago

Release or branch I am using: v0.8.1

Parsing formatted headers using the library leads to an email that is no longer correctly parsable by mail user agents. A sample header is attached here for readability with the original formatting.

This behavior was introduced in this commit, in the readHeader function of header.go, lines 107-113.

What is obtained after parsing the incoming email:
0.00 BSF_BESS_OUTBOUND META: BESS Outbound 0.00 HTML_MESSAGE BODY: HTML included in message 0.01 SUBJ_ALL_CAPS META: Subject is all capitals 0.20 BSF_SC0_SAXXX META: Custom Rule BSF_SC0_SAXXX X-Bess-Outbound-Spam-Report: Code version 3.2, rules version 1.2.3.4 [from scanhost.whatever.wherever.com] Rule breakdown below pts rule name description ---- ---------------------- --------------------------------

This breaks both parsing and display in mail user agents.

Giuseppe Pacelli · Answer 1 · Thu Sep 10 2020 15:22:10 GMT+0800 (China Standard Time)

Further elaboration on this issue: the test case implemented in handler_test.go#513 is wrong, as it does not respect the definition of long header fields in RFC 2822 (section 2.2.3).

{
	input:   "X-Not-Continuation: line1=foo;\n X-Next-Header: bar\n",
	hname:   "X-Not-Continuation",
	want:    "line1=foo;",
	correct: true,
},

The correct result in this case would be:
{
	input:   "X-Not-Continuation: line1=foo;\n X-Next-Header: bar\n",
	hname:   "X-Not-Continuation",
	want:    "line1=foo; X-Next-Header: bar",
	correct: true,
},
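To illustrate why the second expectation follows from the RFC, here is a minimal sketch of the unfolding rule (a line break immediately followed by a space or tab is removed). The function name unfold is hypothetical and is not part of enmime; accepting bare LF as well as CRLF is an assumption made to match the test input.

```go
package main

import (
	"fmt"
	"strings"
)

// isWSP reports whether c is a space or a horizontal tab.
func isWSP(c byte) bool {
	return c == ' ' || c == '\t'
}

// unfold is a hypothetical helper, not enmime's implementation. It removes
// every line break (CRLF or bare LF) that is immediately followed by a space
// or tab, and keeps the whitespace itself.
func unfold(raw string) string {
	var b strings.Builder
	for i := 0; i < len(raw); i++ {
		c := raw[i]
		if c == '\r' && i+2 < len(raw) && raw[i+1] == '\n' && isWSP(raw[i+2]) {
			i++ // skip the LF as well; the loop increment moves to the WSP
			continue
		}
		if c == '\n' && i+1 < len(raw) && isWSP(raw[i+1]) {
			continue // drop the bare LF, keep the following WSP
		}
		b.WriteByte(c)
	}
	return b.String()
}

func main() {
	in := "X-Not-Continuation: line1=foo;\n X-Next-Header: bar\n"
	fmt.Printf("%q\n", unfold(in))
}
```

Under this rule the whole input is a single field named X-Not-Continuation, whose value is "line1=foo; X-Next-Header: bar".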

James Hillyerd · Answer 2 · Thu Sep 10 2020 23:03:45 GMT+0800 (China Standard Time)

I think we need to do better than just reverting #149, as it appears to be preventing a full-stop error, which in my opinion is worse than a header with misformatted whitespace. Perhaps I am misunderstanding and your scenario also results in an enmime error?

James Hillyerd · Answer 3 · Thu Sep 10 2020 23:11:24 GMT+0800 (China Standard Time)

cc @requaos in case he has ideas.

Looking at the code in #149 it cares about where the colon is in relation to equal signs, but I think it will accept a header with spaces in its name, instead of considering that a continuation. That seems problematic, as I'd expect it to error out later.

Giuseppe Pacelli · Answer 4 · Thu Sep 10 2020 23:36:24 GMT+0800 (China Standard Time)

I agree, but what is still unclear to me is why we care about the equal sign or colon in the first place. RFC-wise, basically everything is allowed in a folded header, except for CRLF.
Would it be possible to get a sample from the #149 case? If so, I would be happy to help and cover this case.

James Hillyerd · Answer 5 · Sat Sep 12 2020 23:39:50 GMT+0800 (China Standard Time)

In case we don't hear back from requaos, here are my thoughts. Please treat them as brainstorming and challenge them if you disagree. :)

Single stray indent

x-header-one: value
  x-header-two: value
x-header-three: value

should be treated as three separate headers, no continuation. I believe this is what #149 was intended to handle.

Run of indented "headers"

x-header-one: value
  x-continue: value
  x-continue: value
x-header-two: value

It seems unlikely that a bunch of actual headers would all be indented, so this should be treated as a continuation.

Normal continuation

x-header-one: value
  non-header-like continuation

Treat as continuation.
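The three cases above could be sketched roughly as follows. This is only one reading of the brainstorm, not the code in #149 or in the PR: splitHeaders, indented, and headerLike are hypothetical names, and "header-like" is assumed to mean printable ASCII without spaces or colons, followed by a colon.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// headerLike matches a line that starts with a field name and a colon.
// Assumption: a field name is printable ASCII excluding space and ':'.
var headerLike = regexp.MustCompile(`^[!-9;-~]+:`)

// indented reports whether a line starts with a space or a tab.
func indented(line string) bool {
	return strings.HasPrefix(line, " ") || strings.HasPrefix(line, "\t")
}

// splitHeaders is a hypothetical helper. A single indented header-like line
// becomes its own header; any other run of indented lines is folded into the
// preceding header.
func splitHeaders(lines []string) []string {
	var out []string
	for i := 0; i < len(lines); {
		if len(out) == 0 || !indented(lines[i]) {
			out = append(out, strings.TrimLeft(lines[i], " \t"))
			i++
			continue
		}
		// Collect the run of consecutive indented lines starting at i.
		j := i
		for j < len(lines) && indented(lines[j]) {
			j++
		}
		run := lines[i:j]
		first := strings.TrimLeft(run[0], " \t")
		if len(run) == 1 && headerLike.MatchString(first) {
			out = append(out, first) // single stray indent: separate header
		} else {
			for _, l := range run { // otherwise: continuation
				out[len(out)-1] += " " + strings.TrimLeft(l, " \t")
			}
		}
		i = j
	}
	return out
}

func main() {
	stray := []string{"x-header-one: value", "  x-header-two: value", "x-header-three: value"}
	run := []string{"x-header-one: value", "  x-continue: value", "  x-continue: value", "x-header-two: value"}
	normal := []string{"x-header-one: value", "  non-header-like continuation"}
	for _, c := range [][]string{stray, run, normal} {
		fmt.Printf("%q\n", splitHeaders(c))
	}
}
```

With these inputs the sketch yields three headers for the stray indent, two for the run of indented lines, and one for the normal continuation.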

Giuseppe Pacelli · Answer 6 · Sun Sep 13 2020 18:10:59 GMT+0800 (China Standard Time)

The only challenge I see is: how can we distinguish between these two cases?
Even though I see the point and idea behind it, my opinion is that sticking to RFCs is the best way to avoid unexpected problems when using the library (but this is a personal opinion).
Also, checking the proposed headers against mxtoolbox, you can see that [1] and [2] are treated the same: from its perspective, both are folded headers.

[1] https://mxtoolbox.com/Public/Tools/EmailHeaders.aspx?huid=5e86010a-d94b-4477-b101-c1a48d6d26e6
[2] https://mxtoolbox.com/Public/Tools/EmailHeaders.aspx?huid=29031bb3-2075-4ae7-a9d0-b6351326fe70

Daniel Cormier · Answer 7 · Mon Sep 14 2020 06:45:16 GMT+0800 (China Standard Time)

"The only challenge I see is: how can we distinguish between these two cases?"

For these two specific cases, it looks like we can distinguish between them by looking for a continuation that, other than its folding whitespace, "looks like" a new header. I.e., the line is a header prefixed only with a space. I think that will allow the original test case to pass, while also catching your case properly as folding whitespace.

I'm including your problematic header here for easier viewing (rather than it being linked out to a file that has to be downloaded, then viewed):

X-BESS-Outbound-Spam-Report: Code version 3.2, rules version 1.2.3.4 [from 
	scanhost.whatever.wherever.com]
	Rule breakdown below
	 pts rule name              description
	---- ---------------------- --------------------------------
	0.00 HTML_MESSAGE           BODY: HTML included in message
	0.00 BSF_BESS_OUTBOUND      META: BESS Outbound 
	0.20 BSF_SC0_SAXXX          META: Custom Rule BSF_SC0_SAXXX 
	0.01 SUBJ_ALL_CAPS          META: Subject is all capitals

Since none of those folded lines begin with something that could be mistaken for a header, that logic should work fine for the original case, and this case.
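As a rough sketch of that check: the predicate below flags a line that consists of folding whitespace followed by something shaped like "name:". The name looksLikeIndentedHeader and the exact pattern are assumptions for illustration, not what the PR implements.

```go
package main

import (
	"fmt"
	"regexp"
)

// indentedHeader matches leading spaces or tabs followed by a field name
// (assumed here to be printable ASCII without space or ':') and a colon.
var indentedHeader = regexp.MustCompile(`^[ \t]+[!-9;-~]+:`)

// looksLikeIndentedHeader is a hypothetical predicate: it reports whether a
// folded line, apart from its folding whitespace, looks like a new header.
func looksLikeIndentedHeader(line string) bool {
	return indentedHeader.MatchString(line)
}

func main() {
	lines := []string{
		" X-Next-Header: bar",
		"\tscanhost.whatever.wherever.com]",
		"\t pts rule name              description",
		"\t---- ---------------------- --------------------------------",
		"\t0.00 HTML_MESSAGE           BODY: HTML included in message",
	}
	for _, l := range lines {
		fmt.Printf("%-5t %q\n", looksLikeIndentedHeader(l), l)
	}
}
```

Only the first line is flagged. The "BODY:" and "META:" tokens in the spam report do not trigger the check because whitespace appears before the colon-terminated word is reached.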

"Even though I see the point and idea behind it, my opinion is that sticking to RFCs is the best way to avoid unexpected problems when using the library (but this is a personal opinion)."

The stdlib actually follows the RFCs pretty well. This package is more flexible and handles things we've seen in the real world (which often does a bad job at following RFCs).

I used the stdlib at first, but later changed to this as I encountered email I needed to parse that didn't quite follow the RFCs.
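For comparison, this is a small sketch of how the standard library's net/mail package can be asked for the header from the disputed test case. The helper name headerValue is made up for this example; a blank line and a body are appended to the input so that it forms a complete message.

```go
package main

import (
	"fmt"
	"net/mail"
	"strings"
)

// headerValue is a hypothetical helper that parses raw with the standard
// library and returns the value of the named header field.
func headerValue(raw, name string) (string, error) {
	msg, err := mail.ReadMessage(strings.NewReader(raw))
	if err != nil {
		return "", err
	}
	return msg.Header.Get(name), nil
}

func main() {
	raw := "X-Not-Continuation: line1=foo;\n X-Next-Header: bar\n\nbody\n"
	v, err := headerValue(raw, "X-Not-Continuation")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%q\n", v)
}
```

The standard library treats the indented line as a continuation, so X-Next-Header does not show up as a field of its own.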

Giuseppe Pacelli · Answer 8 · Tue Sep 15 2020 14:19:19 GMT+0800 (China Standard Time)

Thanks for the clarification, I see the reasons behind it. I updated the PR with something that should behave well in all the mentioned cases. Please let me know your thoughts on that.