"Found the next address, delimited by a comma"

Question

"Found the next address, delimited by a comma"

Kermer opened this issue 7 years ago · comments

MailAddressCollection.Parse fails to parse address when it contains comma in the display name. It splits single address into multiple ones which leads to having some invalid address(es) (first part of display name) and valid address with incorrect displayname.

In the example above it's From address, so when I use MailMessage.From I get the first, invalid, address from the 1st screenshot.

Sebastian Holc · Answer 1 · Wed Jan 03 2018 20:19:18 GMT+0800 (China Standard Time)

I can add that the raw "From" header looks like this

Jeffrey Stedfast · Answer 2 · Wed Jan 03 2018 20:22:59 GMT+0800 (China Standard Time)

Seems like the parser is parsing mail addresses after decoding the header which is causing this sort of problem (unquoted commas in address headers are meant to split addresses).

The parser should be fixed to parse the headers and then decode the names.

Jeffrey Stedfast · Answer 3 · Thu Jan 04 2018 17:53:39 GMT+0800 (China Standard Time)

The patch is wrong :-\

Don't decode the address until after you have completely parsed it and determined which part is the name and which part is the address and then you only want to decode the name portion.

Also: yikes, that address parser code is very... brute force. It should really be token-based like the parser I wrote in MimeKit.

Main loop for parsing address lists: https://github.com/jstedfast/MimeKit/blob/master/MimeKit/InternetAddressList.cs#L546

Logic for parsing an individual address (used by above loop): https://github.com/jstedfast/MimeKit/blob/master/MimeKit/InternetAddress.cs#L555

You'll also note that MimeKit's parser handles unquoted commas in the name portion of the address as well (see https://github.com/jstedfast/MimeKit/blob/master/MimeKit/InternetAddress.cs#L623).

Jeffrey Stedfast · Answer 4 · Thu Jan 04 2018 17:56:00 GMT+0800 (China Standard Time)

Keep in mind that the rfc2047 encoded-word token, once decoded, might contain <>'s or @'s or any other character that will screw up the logic in the current MailAddressCollection.Parse() method.

Jeffrey Stedfast · Answer 5 · Thu Jan 04 2018 18:04:35 GMT+0800 (China Standard Time)

For anyone that wants to try a different approach to my parser in MimeKit, you could always try parsing the address list in reverse. This approach probably helps simplify the parser logic a bit because parsing forward makes it difficult to know what the tokens belong to (is it the name token? or is it the local-part of an addr-spec? hard to know until I consume a few more tokens...).

For example, consider the following BNF grammar:

address         =       mailbox / group
mailbox         =       name-addr / addr-spec
name-addr       =       [display-name] angle-addr
angle-addr      =       [CFWS] "<" addr-spec ">" [CFWS] / obs-angle-addr
group           =       display-name ":" [mailbox-list / CFWS] ";"
                        [CFWS]
display-name    =       phrase
word            =       atom / quoted-string
phrase          =       1*word / obs-phrase
addr-spec       =       local-part "@" domain
local-part      =       dot-atom / quoted-string / obs-local-part
domain          =       dot-atom / domain-literal / obs-domain
obs-local-part  =       word *("." word)

Now consider the following email address: "Joe Example" <joe@example.com>

The first token you read will be "Joe Example" and you might think that that token indicates that it is the display name, but it doesn't. All you know is that you've got a quoted-string token. A quoted-string can be part of a phrase or it can be (a part of) the local-part of the address itself. You must read at least 1 more token before you'll be able to figure out what it actually is (obs-local-part makes things slightly more difficult). In this case, you'll get a '<' which indicates the start of an angle-addr, allowing you to assume that the quoted-string you just got is indeed the display-name.

If, however, you parse the address in reverse, things become a little simpler because you know immediately what to expect the next token to be a part of.

It's been an approach that I've been meaning to try, but since my current parser in MimeKit works well, I haven't been able to summon the enthusiasm to rewrite it (as the old saying goes, "if it ain't broke, don't fix it").

Jeffrey Stedfast · Answer 6 · Thu Jan 04 2018 18:14:30 GMT+0800 (China Standard Time)

It may be a good idea to test against all of the address examples in my unit tests as well: https://github.com/jstedfast/MimeKit/blob/master/UnitTests/InternetAddressListTests.cs

For example, the patch here will likely break for addresses like this:

To: "Worthington, William" <william@the-worthingtons.com>

You can't just assume that all commas split addresses.

Sebastian Holc · Answer 7 · Thu Jan 04 2018 18:47:06 GMT+0800 (China Standard Time)

With the newest commit (forget the first one) from the PR, but sadly tested only using Mozilla Thunderbird client, I tested out both encoded (=?iso-8859-2? [...] ?= <mail@ex.com>) and unencoded (" [...] " <mail@ex.com>) addresses.

I gotta agree that the Parse method is a bit messy, but after that fix it seems to work fine for me.
e.g.

To: "Worthington, William" <william@the-worthingtons.com>
Was and is working fine, as it's not encoded, so symbols between quotes are not threated as special symbols
To: =?utf-8?Q?Worthington=2C_William?= <william@the-worthingtons.com>
(decoded: Worthington, William <william@the-worthingtons.com>)
When it doesn't have quotes around it (which is usually the case in encoded names) - earlier it would be split into Worthington and William <william@the-worthingtons.com (thanks to that comma).
If it would be From instead of To you would get invalid address on MailMessage.From
With this fix it's now decoded to "Worthington, William" <william@the-worthingtons.com> and after that we basically go back to point 1.

I'm saw some emails encoded as =?iso-8859-2?B? and they seemed to be working fine as well.

Jeffrey Stedfast · Answer 8 · Thu Jan 04 2018 20:31:00 GMT+0800 (China Standard Time)

Okay, cool.

Sebastian Holc · Answer 9 · Wed Jan 10 2018 17:32:16 GMT+0800 (China Standard Time)

Ok. While this fixed the address issue, it introduced a new bug as apparently Functions.DecodeMailHeader wasn't the best place to put it in 😢. Namely now other headers get quoted and another interesting thing is that the subject headers (so probably other headers as well) are split into parts of max. 32 chars.
So instead of getting subject like
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean et sagittis nisl.
it becomes
"Lorem ipsum dolor sit amet, cons""ectetur adipiscing elit. Aenean ""et sagittis nisl."

Sebastian Holc · Answer 10 · Wed Jan 10 2018 17:50:10 GMT+0800 (China Standard Time)

Above gets patched with e6cd1fc , hopefully without any further surprises.