Hashing zips with utf-8 filenames causes an error

ninjabear opened this issue 8 years ago

def get_zip_hash(obj):
    """Return a consistent hash of the content of a zip file ``obj``."""
    digest = hashlib.sha1()
    zfile = zipfile.ZipFile(obj, 'r')
    for path in sorted(zfile.namelist()):
        digest.update(six.text_type(path).encode('utf-8'))
        digest.update(zfile.read(path))
    return digest.hexdigest()

This function, taken from utils.py, raises an error when it encounters non-ASCII characters in a filename. Unfortunately these characters appear in javac output occasionally (in my case, when using shapeless or scalaz).

Python's handling of character encoding is pretty hairy, but the basic problem is:

>>> x = '☹'
>>> print x
☹
>>> x.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
>>> type(x)
<type 'str'>

In Python 2 you can't call .encode('utf-8') on a byte string (str) that contains non-ASCII bytes: Python first implicitly decodes it as ASCII, and that decode is what fails. The value has to be a unicode string already, like this:

>>> y = u'☹'
>>> print y
☹
>>> y.encode('utf-8')
'\xe2\x98\xb9'
>>> type(y)
<type 'unicode'>
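
For what it's worth, the same failure can be reproduced explicitly. This is a minimal sketch, assuming Python 3, where the ASCII decode that Python 2 performs implicitly has to be written out by hand. The variable names are only for illustration:

```python
# The UTF-8 bytes of the character used in the sessions above.
raw = b'\xe2\x98\xb9'

# Python 2's str.encode('utf-8') behaves roughly like this:
# decode as ASCII first, then encode. The decode step is what blows up.
error = None
try:
    raw.decode('ascii').encode('utf-8')
except UnicodeDecodeError as exc:
    error = str(exc)

# Decoding with the encoding the bytes are actually in works,
# and encoding the result gives the original bytes back.
text = raw.decode('utf-8')
round_trip = text.encode('utf-8')
```

Decoding as UTF-8 only helps when the bytes really are UTF-8, of course.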

However, since zip filenames aren't necessarily UTF-8 (jar files are, but other zips might not be), the fix should ideally be encoding-agnostic.

I will submit a PR soon!

Alexander Dean · Answer 1 · Fri Sep 16 2016 01:36:22 GMT+0800 (China Standard Time)

Cheers @ninjabear !