bertjohnson / OpaqueMail

.NET email library and proxy supporting IMAP, POP3, and SMTP with S/MIME and PGP.

Home Page:https://opaquemail.org

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

"Found the next address, delimited by a comma"

Kermer opened this issue · comments

MailAddressCollection.Parse fails to parse address when it contains comma in the display name. It splits single address into multiple ones which leads to having some invalid address(es) (first part of display name) and valid address with incorrect displayname.
image
image
In the example above it's From address, so when I use MailMessage.From I get the first, invalid, address from the 1st screenshot.

I can add that the raw "From" header looks like this
image

Seems like the parser is parsing mail addresses after decoding the header which is causing this sort of problem (unquoted commas in address headers are meant to split addresses).

The parser should be fixed to parse the headers and then decode the names.

The patch is wrong :-\

Don't decode the address until after you have completely parsed it and determined which part is the name and which part is the address and then you only want to decode the name portion.

Also: yikes, that address parser code is very... brute force. It should really be token-based like the parser I wrote in MimeKit.

Main loop for parsing address lists: https://github.com/jstedfast/MimeKit/blob/master/MimeKit/InternetAddressList.cs#L546

Logic for parsing an individual address (used by above loop): https://github.com/jstedfast/MimeKit/blob/master/MimeKit/InternetAddress.cs#L555

You'll also note that MimeKit's parser handles unquoted commas in the name portion of the address as well (see https://github.com/jstedfast/MimeKit/blob/master/MimeKit/InternetAddress.cs#L623).

Keep in mind that the rfc2047 encoded-word token, once decoded, might contain <>'s or @'s or any other character that will screw up the logic in the current MailAddressCollection.Parse() method.

For anyone that wants to try a different approach to my parser in MimeKit, you could always try parsing the address list in reverse. This approach probably helps simplify the parser logic a bit because parsing forward makes it difficult to know what the tokens belong to (is it the name token? or is it the local-part of an addr-spec? hard to know until I consume a few more tokens...).

For example, consider the following BNF grammar:

address         =       mailbox / group
mailbox         =       name-addr / addr-spec
name-addr       =       [display-name] angle-addr
angle-addr      =       [CFWS] "<" addr-spec ">" [CFWS] / obs-angle-addr
group           =       display-name ":" [mailbox-list / CFWS] ";"
                        [CFWS]
display-name    =       phrase
word            =       atom / quoted-string
phrase          =       1*word / obs-phrase
addr-spec       =       local-part "@" domain
local-part      =       dot-atom / quoted-string / obs-local-part
domain          =       dot-atom / domain-literal / obs-domain
obs-local-part  =       word *("." word)

Now consider the following email address: "Joe Example" <joe@example.com>

The first token you read will be "Joe Example" and you might think that that token indicates that it is the display name, but it doesn't. All you know is that you've got a quoted-string token. A quoted-string can be part of a phrase or it can be (a part of) the local-part of the address itself. You must read at least 1 more token before you'll be able to figure out what it actually is (obs-local-part makes things slightly more difficult). In this case, you'll get a '<' which indicates the start of an angle-addr, allowing you to assume that the quoted-string you just got is indeed the display-name.

If, however, you parse the address in reverse, things become a little simpler because you know immediately what to expect the next token to be a part of.

It's been an approach that I've been meaning to try, but since my current parser in MimeKit works well, I haven't been able to summon the enthusiasm to rewrite it (as the old saying goes, "if it ain't broke, don't fix it").

It may be a good idea to test against all of the address examples in my unit tests as well: https://github.com/jstedfast/MimeKit/blob/master/UnitTests/InternetAddressListTests.cs

For example, the patch here will likely break for addresses like this:

To: "Worthington, William" <william@the-worthingtons.com>

You can't just assume that all commas split addresses.

With the newest commit (forget the first one) from the PR, but sadly tested only using Mozilla Thunderbird client, I tested out both encoded (=?iso-8859-2? [...] ?= <mail@ex.com>) and unencoded (" [...] " <mail@ex.com>) addresses.

I gotta agree that the Parse method is a bit messy, but after that fix it seems to work fine for me.
e.g.

  1. To: "Worthington, William" <william@the-worthingtons.com>
    Was and is working fine, as it's not encoded, so symbols between quotes are not threated as special symbols
  2. To: =?utf-8?Q?Worthington=2C_William?= <william@the-worthingtons.com>
    (decoded: Worthington, William <william@the-worthingtons.com>)
    When it doesn't have quotes around it (which is usually the case in encoded names) - earlier it would be split into Worthington and William <william@the-worthingtons.com (thanks to that comma).
    If it would be From instead of To you would get invalid address on MailMessage.From
    With this fix it's now decoded to "Worthington, William" <william@the-worthingtons.com> and after that we basically go back to point 1.

I'm saw some emails encoded as =?iso-8859-2?B? and they seemed to be working fine as well.

Okay, cool.

Ok. While this fixed the address issue, it introduced a new bug as apparently Functions.DecodeMailHeader wasn't the best place to put it in 😢. Namely now other headers get quoted and another interesting thing is that the subject headers (so probably other headers as well) are split into parts of max. 32 chars.
So instead of getting subject like
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean et sagittis nisl.
it becomes
"Lorem ipsum dolor sit amet, cons""ectetur adipiscing elit. Aenean ""et sagittis nisl."

Above gets patched with e6cd1fc , hopefully without any further surprises.