Unhandled exception when cleaning message with unicode/emoji in (From:) headers.
Leftium opened this issue
Full steps to reproduce the issue:
- Back up email that includes a message with unicode/emoji in the From: header (originally reported as a message that "is not saved in UTF8 format").
- Restore email using --cleanup.
Expected outcome: GYB gracefully handles unicode/emoji in headers, either:
- Detecting/reading non UTF8 messages with appropriate encoding.
- Skipping message.
Actual outcome: GYB exits with unhandled exception:
Traceback (most recent call last):166783)
File "gyb.py", line 2767, in <module>
File "gyb.py", line 2239, in main
File "gyb.py", line 1947, in message_hygiene
File "gyb.py", line 1891, in cleanup_from
File "email\utils.py", line 215, in parseaddr
File "email\_parseaddr.py", line 517, in __init__
File "email\_parseaddr.py", line 260, in getaddrlist
TypeError: object of type 'Header' has no len()
[31420] Failed to execute script 'gyb' due to unhandled exception!
Work-around:
- Convert offending .eml file to UTF8 format. (Doesn't always work...)
- Rename .eml file so GYB skips this message.
Suggested alternative fix: always convert non UTF8 files to UTF8 when saving backup.
Notes:
- The offending email is restored without error if --cleanup is not used. (Did not confirm if text was mangled after restore.)
- The .eml file was generated by gyb --action backup.
- Vim tries to open the file with latin1 encoding, but the text is mangled.
- Notepad.exe tries to open the file with UTF8 encoding, but the text is mangled.
- The Gmail 'Download Original' file does not work: text is still mangled.
- Instead, the Gmail 'View Original' text had to be manually copied and saved as a text file (with encoding UTF-16 LE).
- (Creating a blank UTF8 file and pasting Gmail 'View Original' text seems to work, too)
- The problematic text is in Korean.
- I was able to create a minimal repro of this issue in the python REPL:
Python 3.11.4 (tags/v3.11.4:d2340ef, Jun 7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import email
>>> f = open('2021/8/14/17b437668e8b5c17.eml', 'rb')
>>> bytes = f.read()
>>> m = email.message_from_bytes(bytes)
>>> m['to']
'J***********y<j***@l*****m.com>'
>>> m['from']
<email.header.Header object at 0x000002B33DED8410>
>>> len(m['from'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: object of type 'Header' has no len()
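The same behavior can be reproduced without the original .eml file. The sketch below is only an illustration: the message bytes, the sender name, and the example.com addresses are made up, and it assumes the From: header contains raw (not RFC 2047 encoded) UTF-8 bytes. With the default policy, a header holding non-ASCII bytes comes back as an email.header.Header object instead of a str, which is what parseaddr later trips over.

```python
import email
from email.header import Header

# Made-up message: the From: display name is raw UTF-8, not an encoded-word.
raw = (
    'From: "테스트" <help@example.com>\r\n'
    'To: someone@example.com\r\n'
    'Subject: test\r\n'
    '\r\n'
    'body\r\n'
).encode('utf-8')

# No policy argument, so the legacy compat32 policy is used.
msg = email.message_from_bytes(raw)

to_value = msg['to']      # pure ASCII header: plain str
from_value = msg['from']  # header with non-ASCII bytes: Header object

print(type(to_value).__name__)
print(type(from_value).__name__)

len_error = None
try:
    len(from_value)
except TypeError as error:
    # Same failure that parseaddr hits inside cleanup_from.
    len_error = str(error)
    print(len_error)
```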
Mangled text:
From: "(주)한웰�쇼핑"<help@daisomall.co.kr>
To: J***********y<j***@l*****m.com>
Subject: [´ÙÀ̼Ҹô] °³ÀÎÁ¤º¸ À¯È¿±â°£Á¦¿¡ µû¸¥ ÈÞ¸é°èÁ¤ Àüȯ ¾È³»µå¸³´Ï´Ù.
Proper text:
From: "(주)한웰이쇼핑" <help@daisomall.co.kr>
To: "J***********y" <j***@l*****m.com>
Subject: [다이소몰] 개인정보 유효기간제에 따른 휴면계정 전환 안내드립니다.
update: This issue isn't limited to non-UTF8 files.
Some UTF8 encoded files also throw this exception. For example, if the From
header has emoji:
From:🔥Keto_Rapid_Diet🔥 <xafnsbqsmgniwdztev@twhzbt.drivefact.org>
There were also more emails from the Korean address (From: "(주)한웰이쇼핑" <help@daisomall.co.kr>) that failed to restore even after converting the .eml file to UTF8 and ensuring there were no mangled characters.
The best work-around seems to be to rename these .eml files so gyb skips them.
I modified my gyb.py to catch these exceptions, printing the problem message info and continuing with the remaining messages:
    if options.cleanup:
        try:
            full_message = message_hygiene(full_message)
        except TypeError as error:
            print(
                f'WARNING! error cleaning message {message_num} ({message_filename})')
            print(f' {error}')
            print(f' this message will be skipped.')
            continue
Compare to original code.
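An alternative to skipping the message might be to coerce the header value to str before it reaches parseaddr. This is only a sketch: parseaddr_safe is a hypothetical helper, not something in gyb.py, and the sample message is made up. The trade-off is that str() on such a Header appears to replace the undecodable bytes with replacement characters, so the display name is lost while the address part still parses.

```python
import email
from email.utils import parseaddr


def parseaddr_safe(value):
    """Hypothetical helper: accept str, Header, or None and never raise TypeError."""
    if value is None:
        return ('', '')
    # str() turns a Header object into text; plain str values pass through unchanged.
    return parseaddr(str(value))


# Made-up message with raw UTF-8 bytes in the From: display name.
raw = (
    'From: "테스트" <help@example.com>\r\n'
    'Subject: test\r\n'
    '\r\n'
    'body\r\n'
).encode('utf-8')

msg = email.message_from_bytes(raw)  # default policy, so msg['from'] is a Header
name, addr = parseaddr_safe(msg['from'])
missing = parseaddr_safe(msg['reply-to'])  # absent header returns None

print(repr(name), addr)
```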
Got the fix on StackOverflow: policy=email.policy.SMTPUTF8
I confirmed the Korean text was restored without mangling, but the emoji ended up mangled. Perhaps because the emoji From name was not wrapped in quotes? Not a big deal since the emoji was from a spam email.
    def message_hygiene(msg):
        '''Ensure Message-Id, Date and From headers are valid. Replace if not.'''
        omsg = email.message_from_bytes(msg, policy=email.policy.SMTPUTF8)
        orig_id = omsg['message-id']
        orig_date = omsg['date']
        orig_from = omsg['from']
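To show what the policy change does in isolation, here is a small sketch using a made-up message (the sender name and example.com address are placeholders, and the header is assumed to contain raw UTF-8 bytes). With policy=email.policy.SMTPUTF8 the From: value comes back as a str subclass rather than a Header object, so parseaddr accepts it. Headers in other encodings, or the unquoted emoji case, may behave differently; this does not cover them.

```python
import email
import email.policy  # imported explicitly; "import email" alone may not load it
from email.utils import parseaddr

# Made-up message with raw UTF-8 bytes in the From: display name.
raw = (
    'From: "테스트" <help@example.com>\r\n'
    'Subject: test\r\n'
    '\r\n'
    'body\r\n'
).encode('utf-8')

msg = email.message_from_bytes(raw, policy=email.policy.SMTPUTF8)

from_value = msg['from']           # str subclass under this policy
from_length = len(from_value)      # no TypeError this time
name, addr = parseaddr(str(from_value))

print(type(from_value).__name__, from_length)
print(repr(name), addr)
```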