Unhandled exception when cleaning message with unicode/emoji in (From:) headers.

Question

Leftium opened this issue 10 months ago

John-Kim Murphy commented 10 months ago

Full steps to reproduce the issue:

1. Back up email that includes a message which ~~is not saved in UTF8 format~~ has unicode/emoji in its From: header.
2. Restore the email using --cleanup.

Expected outcome: GYB gracefully handles unicode/emoji in headers, by either:

- Detecting/reading non-UTF8 messages with the appropriate encoding.
- Skipping the message.

Actual outcome: GYB exits with unhandled exception:

Traceback (most recent call last):
  File "gyb.py", line 2767, in <module>
  File "gyb.py", line 2239, in main
  File "gyb.py", line 1947, in message_hygiene
  File "gyb.py", line 1891, in cleanup_from
  File "email\utils.py", line 215, in parseaddr
  File "email\_parseaddr.py", line 517, in __init__
  File "email\_parseaddr.py", line 260, in getaddrlist
TypeError: object of type 'Header' has no len()
[31420] Failed to execute script 'gyb' due to unhandled exception!

Work-around:

~~Convert offending .eml file to UTF8 format.~~ Doesn't always work...
Rename .eml file so GYB skips this message.

Suggested alternative fix: always convert non UTF8 files to UTF8 when saving backup.

Notes:

The offending email is restored without error if --cleanup is not used. (Did not confirm if text was mangled after restore.)
The .eml file was generated by gyb --action backup.
Vim tries to open the file with latin1 encoding, but the text is mangled.
Notepad.exe tries to open the file with UTF8 encoding, but the text is mangled.
The Gmail 'Download Original' file does not work: text is still mangled.
Instead, the Gmail 'View Original' text had to be manually copied and saved as a text file (with encoding UTF-16 LE).
(Creating a blank UTF8 file and pasting Gmail 'View Original' text seems to work, too)
The problematic text is in Korean.
I was able to create a minimal repro of this issue in the python REPL:

Python 3.11.4 (tags/v3.11.4:d2340ef, Jun  7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import email
>>> f = open('2021/8/14/17b437668e8b5c17.eml', 'rb')
>>> bytes = f.read()
>>> m = email.message_from_bytes(bytes)
>>> m['to']
'J***********y<j***@l*****m.com>'
>>> m['from']
<email.header.Header object at 0x000002B33DED8410>
>>> len(m['from'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type 'Header' has no len()

Mangled text:

From: "(ì£¼)í•œì›°ì�´ì‡¼í•‘"<help@daisomall.co.kr>
To: J***********y<j***@l*****m.com>
Subject: [´ÙÀÌ¼Ò¸ô] °³ÀÎÁ¤º¸ À¯È¿±â°£Á¦¿¡ µû¸¥ ÈÞ¸é°èÁ¤ ÀüÈ¯ ¾È³»µå¸³´Ï´Ù.

Proper text:

From: "(주)한웰이쇼핑" <help@daisomall.co.kr>
To: "J***********y" <j***@l*****m.com>
Subject: [다이소몰] 개인정보 유효기간제에 따른 휴면계정 전환 안내드립니다.
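
For anyone who wants to reproduce this without the original .eml file, here is a minimal sketch. It assumes the trigger is raw non-ASCII bytes in the From: header (not RFC 2047 encoded), parsed with the default compat32 policy. The message text and the To: address are made up for illustration.

```python
import email
import email.header

# Hypothetical in-memory message: the From: display name is raw UTF-8,
# not an RFC 2047 encoded-word.
raw = (
    'From: "(주)한웰이쇼핑" <help@daisomall.co.kr>\r\n'
    'To: someone@example.com\r\n'
    'Subject: test\r\n'
    '\r\n'
    'body\r\n'
).encode('utf-8')

msg = email.message_from_bytes(raw)  # default policy is compat32
to_header = msg['to']
from_header = msg['from']

print(type(to_header).__name__)    # the pure-ASCII header
print(type(from_header).__name__)  # the header with raw non-ASCII bytes

# Header objects define no __len__, which is what the traceback trips over.
try:
    len(from_header)
    len_failed = False
except TypeError as error:
    len_failed = True
    print(f'TypeError: {error}')
```

The ASCII-only To: header comes back as a plain string, while the From: header comes back as an email.header.Header object, matching the REPL session above.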

John-Kim Murphy · Answer 1 · Tue Aug 15 2023 05:00:32 GMT+0800 (China Standard Time)

update: This issue isn't limited to non-UTF8 files.

Some UTF8 encoded files also throw this exception. For example, if the From header has emoji:

From:🔥Keto_Rapid_Diet🔥 <xafnsbqsmgniwdztev@twhzbt.drivefact.org>

There were also more emails from the Korean address (From: "(주)한웰이쇼핑" <help@daisomall.co.kr>) that failed to restore even after converting the .eml file to UTF8 and ensuring there were no mangled characters.

The best work-around seems to be to rename these .eml files so gyb skips them.

John-Kim Murphy · Answer 2 · Tue Aug 15 2023 05:48:10 GMT+0800 (China Standard Time)

I modified my gyb.py to catch these exceptions, printing the problem message info and continuing with the remaining messages:

  if options.cleanup:
      try:
          full_message = message_hygiene(full_message)
      except TypeError as error:
          print(
              f'WARNING! error cleaning message {message_num} ({message_filename})')
          print(f'  {error}')
          print(f'  this message will be skipped.')
          continue

Compare to original code.
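
The same skip-and-continue idea as a standalone sketch, so it can be tried outside gyb.py. The names restore_with_cleanup and strict_hygiene are hypothetical; strict_hygiene is only a stand-in that fails the way the traceback does (calling len() on the From header), not GYB's real message_hygiene.

```python
import email


def strict_hygiene(raw):
    """Hypothetical stand-in for message_hygiene: fails on Header objects."""
    msg = email.message_from_bytes(raw)
    len(msg['from'])  # raises TypeError when the header is a Header object
    return raw


def restore_with_cleanup(messages, hygiene):
    """Clean each (filename, raw_bytes) pair, skipping any that raise TypeError."""
    cleaned, skipped = [], []
    for filename, raw in messages:
        try:
            cleaned.append((filename, hygiene(raw)))
        except TypeError as error:
            print(f'WARNING! error cleaning message ({filename})')
            print(f'  {error}')
            print('  this message will be skipped.')
            skipped.append(filename)
            continue
    return cleaned, skipped


# Made-up sample messages: one ASCII-only, one with raw UTF-8 in From:.
messages = [
    ('ok.eml', b'From: Alice <alice@example.com>\r\n\r\nhi\r\n'),
    ('bad.eml', 'From: "한웰" <help@example.com>\r\n\r\nhi\r\n'.encode('utf-8')),
]
cleaned, skipped = restore_with_cleanup(messages, strict_hygiene)
print([name for name, _ in cleaned], skipped)
```

Catching only TypeError keeps other failures visible, at the cost of not restoring the skipped messages at all.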

John-Kim Murphy · Answer 3 · Tue Aug 15 2023 07:12:25 GMT+0800 (China Standard Time)

Got the fix from StackOverflow: pass policy=email.policy.SMTPUTF8 to email.message_from_bytes.

I confirmed the Korean was restored without mangling, but the emoji ended up mangled. Perhaps because the emoji From name was not wrapped in quotes? Not a big deal, since the emoji was from a spam email.

def message_hygiene(msg):
    '''Ensure Message-Id, Date and From headers are valid. Replace if not.'''
    omsg = email.message_from_bytes(msg, policy=email.policy.SMTPUTF8)
    orig_id = omsg['message-id']
    orig_date = omsg['date']
    orig_from = omsg['from']
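
A quick way to check the effect of the policy change outside GYB, sketched with a made-up in-memory message (only the Korean From line comes from this issue). It covers the quoted Korean display name only; it makes no claim about the unquoted emoji case.

```python
import email
import email.policy
import email.utils

# Hypothetical message with raw UTF-8 bytes in the From: header.
raw = (
    'From: "(주)한웰이쇼핑" <help@daisomall.co.kr>\r\n'
    'To: someone@example.com\r\n'
    'Subject: test\r\n'
    '\r\n'
    'body\r\n'
).encode('utf-8')

msg = email.message_from_bytes(raw, policy=email.policy.SMTPUTF8)
from_value = msg['from']

# With this policy the header value is a str subclass, so parseaddr accepts it.
name, address = email.utils.parseaddr(str(from_value))
print(isinstance(from_value, str))
print(address)
```

Note that email.policy has to be imported explicitly; a bare import email is not guaranteed to make email.policy available.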