GAM-team / got-your-back

Got Your Back (GYB) is a command line tool for backing up your Gmail messages to your computer using Gmail's API over HTTPS.

Home Page: https://github.com/GAM-team/got-your-back/wiki

Unhandled exception when cleaning message with unicode/emoji in (From:) headers.

Leftium opened this issue

Full steps to reproduce the issue:

  1. Back up email that includes a message not saved in UTF8 format with unicode/emoji in the From: header.
  2. Restore email using --cleanup.

Expected outcome: GYB gracefully handles unicode/emoji in headers, either:

  • Detecting/reading non UTF8 messages with appropriate encoding.
  • Skipping message.

Actual outcome: GYB exits with unhandled exception:

Traceback (most recent call last):
  File "gyb.py", line 2767, in <module>
  File "gyb.py", line 2239, in main
  File "gyb.py", line 1947, in message_hygiene
  File "gyb.py", line 1891, in cleanup_from
  File "email\utils.py", line 215, in parseaddr
  File "email\_parseaddr.py", line 517, in __init__
  File "email\_parseaddr.py", line 260, in getaddrlist
TypeError: object of type 'Header' has no len()
[31420] Failed to execute script 'gyb' due to unhandled exception!

Work-around:

  • Convert offending .eml file to UTF8 format. Doesn't always work...
  • Rename .eml file so GYB skips this message.

Suggested alternative fix: always convert non UTF8 files to UTF8 when saving backup.

Notes:

  • The offending email is restored without error if --cleanup is not used. (Did not confirm if text was mangled after restore.)
  • The .eml file was generated by gyb --action backup.
  • Vim tries to open the file with latin1 encoding, but the text is mangled.
  • Notepad.exe tries to open the file with UTF8 encoding, but the text is mangled.
  • The Gmail 'Download Original' file does not work: text is still mangled.
  • Instead, the Gmail 'View Original' text had to be manually copied and saved as a text file (with encoding UTF-16 LE).
  • (Creating a blank UTF8 file and pasting Gmail 'View Original' text seems to work, too)
  • The problematic text is in Korean.
  • I was able to create a minimal repro of this issue in the python REPL:
Python 3.11.4 (tags/v3.11.4:d2340ef, Jun  7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import email
>>> f = open('2021/8/14/17b437668e8b5c17.eml', 'rb')
>>> bytes = f.read()
>>> m = email.message_from_bytes(bytes)
>>> m['to']
'J***********y<j***@l*****m.com>'
>>> m['from']
<email.header.Header object at 0x000002B33DED8410>
>>> len(m['from'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type 'Header' has no len()
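
The same behavior can be sketched without the original .eml file. This is a minimal example under one assumption: that the stored message has raw, unencoded UTF-8 bytes in its From: header. The message bytes below are made up for illustration and are not taken from the actual backup.

```python
import email
from email.header import Header

# Hypothetical raw message: the From: display name is raw UTF-8 bytes,
# not RFC 2047 encoded-words. This is an assumption about the real file.
raw = (
    'From: "(주)한웰이쇼핑" <help@daisomall.co.kr>\r\n'
    'To: someone@example.com\r\n'
    'Subject: test\r\n'
    '\r\n'
    'body\r\n'
).encode('utf-8')

# No policy argument, so the legacy compat32 policy is used.
msg = email.message_from_bytes(raw)

to_header = msg['to']      # ASCII-only header
from_header = msg['from']  # header containing non-ASCII bytes

print(type(to_header).__name__)
print(type(from_header).__name__)

# Header objects do not implement __len__, which is what trips up
# code that expects a plain string.
try:
    len(from_header)
except TypeError as error:
    print(error)
```

The ASCII-only To: header comes back as a plain str, while the From: header comes back as an email.header.Header, matching the REPL session above.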

Mangled text:

From: "(주)한웰�쇼핑"<help@daisomall.co.kr>
To: J***********y<j***@l*****m.com>
Subject: [´ÙÀ̼Ҹô] °³ÀÎÁ¤º¸ À¯È¿±â°£Á¦¿¡ µû¸¥ ÈÞ¸é°èÁ¤ Àüȯ ¾È³»µå¸³´Ï´Ù.

Proper text:

From: "(주)한웰이쇼핑" <help@daisomall.co.kr>
To: "J***********y" <j***@l*****m.com>
Subject: [다이소몰] 개인정보 유효기간제에 따른 휴면계정 전환 안내드립니다.

update: This issue isn't limited to non-UTF8 files.

Some UTF8 encoded files also throw this exception. For example, if the From header has emoji:

From:🔥Keto_Rapid_Diet🔥 <xafnsbqsmgniwdztev@twhzbt.drivefact.org>

There were also more emails from the Korean address (From: "(주)한웰이쇼핑" <help@daisomall.co.kr>) that failed to restore even after converting the .eml file to UTF8 and ensuring there were no mangled characters.

The best work-around seems to be to rename these .eml files so gyb skips them.

I modified my gyb.py to catch these exceptions, printing the problem message info and continuing with the remaining messages:

  if options.cleanup:
      try:
          full_message = message_hygiene(full_message)
      except TypeError as error:
          print(
              f'WARNING! error cleaning message {message_num} ({message_filename})')
          print(f'  {error}')
          print('  this message will be skipped.')
          continue

Compare to original code.

Got the fix on StackOverflow: policy=email.policy.SMTPUTF8

I confirmed Korean was restored without mangling, but the emoji ended up being mangled. Perhaps this is because the emoji From name was not wrapped in quotes? Not a big deal, since the emoji was from a spam email.

def message_hygiene(msg):
    '''Ensure Message-Id, Date and From headers are valid. Replace if not.'''
    omsg = email.message_from_bytes(msg, policy=email.policy.SMTPUTF8)
    orig_id = omsg['message-id']
    orig_date = omsg['date']
    orig_from = omsg['from']
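
A minimal sketch of why the policy change avoids the TypeError, again using a made-up message rather than the real .eml: when a message is parsed with policy=email.policy.SMTPUTF8, header lookups return str subclasses instead of Header objects, so email.utils.parseaddr gets a string it can take the length of. The raw bytes and variable names here are illustrative assumptions, not code from gyb.py.

```python
import email
import email.policy
from email.utils import parseaddr

# Hypothetical raw message with unencoded UTF-8 in the From: header.
raw = (
    'From: "(주)한웰이쇼핑" <help@daisomall.co.kr>\r\n'
    'To: someone@example.com\r\n'
    'Subject: test\r\n'
    '\r\n'
    'body\r\n'
).encode('utf-8')

# Same call shape as in message_hygiene above, with the policy argument.
omsg = email.message_from_bytes(raw, policy=email.policy.SMTPUTF8)
orig_from = omsg['from']

# Under this policy the header value is a str subclass, so len() works.
print(isinstance(orig_from, str))
print(len(orig_from) > 0)

# parseaddr now receives a string and can split out the address part.
name, address = parseaddr(orig_from)
print(address)
```

This only demonstrates the header type returned on lookup. It does not show how the message is serialized again for restore, which is where the emoji mangling mentioned above may come from.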