Broken filenames in zip archives with 1-byte non-latin charset

Question

unxed opened this issue 2 months ago

$ cat ./Desktop.zip | bsdtar -t
\215\256\242\240\357 \257\240\257\252\240/
\215\256\242\353\251 \342\245\252\341\342\256\242\353\251 \244\256\252\343\254\245\255\342.txt

The expected result is what unzip shows:

$ unzip -l ./Desktop.zip
Archive:  ./Desktop.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2016-09-28 18:41   Новая папка/
        4  2016-09-28 18:40   Новый текстовый документ.txt
---------                     -------
        4                     2 files

The built-in .zip archiver in older versions of Windows encoded filenames in new archives with the DOS (OEM) or Windows (ANSI) code page that matched the current regional settings. Lots of such archives still exist.
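To make the listing above concrete, here is a minimal Python sketch that turns the octal escapes printed by bsdtar back into bytes and decodes them. The helper name `unescape_octal` is made up for illustration, and CP866 (the Russian OEM code page) is an assumption about this particular archive, not something the archive itself declares.

```python
import re

def unescape_octal(listing: str) -> bytes:
    """Turn \\ooo octal escapes (as printed in the listing) back into raw bytes."""
    def repl(match):
        # Each escape is exactly three octal digits, e.g. \215 -> 0x8D.
        return bytes([int(match.group(1), 8)])
    return re.sub(rb"\\([0-7]{3})", repl, listing.encode("ascii"))

escaped = r"\215\256\242\240\357 \257\240\257\252\240/"
raw = unescape_octal(escaped)

# Assumption: the names were written with the Russian OEM code page (CP866).
print(raw.decode("cp866"))
```

Decoding the same bytes with a different code page would give different, most likely meaningless, text, which is why the code page has to be chosen by some rule.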

The correct behavior is to determine the relevant OEM or ANSI code page from the system locale and use it. You can look at this PR for a reference implementation:

p7zip-project/p7zip#232
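A rough sketch of the idea in Python, not taken from that PR: map the language part of the locale name to a pair of legacy code pages and decode with one of them. The table `LEGACY_CODEPAGES`, the functions `legacy_codepages` and `decode_name`, and the `prefer_oem` switch are all hypothetical names for illustration, and the table is deliberately tiny.

```python
# Hypothetical, deliberately incomplete table:
# language code -> (OEM codec, ANSI codec) as commonly used by Windows.
LEGACY_CODEPAGES = {
    "ru": ("cp866", "cp1251"),
    "pl": ("cp852", "cp1250"),
    "el": ("cp737", "cp1253"),
    "de": ("cp850", "cp1252"),
}
# Fallback when the language is not in the table (assumption: US defaults).
DEFAULT_CODEPAGES = ("cp437", "cp1252")

def legacy_codepages(locale_name: str):
    """Return (oem, ansi) codec names for a locale such as 'ru_RU.UTF-8'."""
    lang = locale_name.split("_")[0].split(".")[0].lower()
    return LEGACY_CODEPAGES.get(lang, DEFAULT_CODEPAGES)

def decode_name(raw: bytes, locale_name: str, prefer_oem: bool = True) -> str:
    """Decode a legacy zip filename using the code page guessed from the locale."""
    oem, ansi = legacy_codepages(locale_name)
    return raw.decode(oem if prefer_oem else ansi, errors="replace")

print(decode_name(b"\x8d\xae\xa2\xa0\xef", "ru_RU.UTF-8"))
```

Whether OEM or ANSI is the right choice for a given entry is a separate question that this sketch leaves to the caller.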

Joerg Sonnenberger · Answer 1 · Thu May 23 2024 04:36:34 GMT+0800 (China Standard Time)

bsdtar uses isprint(3) to decide which characters are safe to print to the terminal; all others are escaped. So unless your locale is a matching single-byte locale rather than UTF-8, which most Unix systems use by default nowadays, this is perfectly sensible behavior.
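For illustration, the escaping described here can be approximated in a few lines of Python. This is not bsdtar's code: the function name `escape_unprintable` is invented, and the sketch assumes that only printable ASCII counts as printable, which is roughly what a byte-wise isprint(3) check amounts to when no matching single-byte locale is active.

```python
def escape_unprintable(raw: bytes) -> str:
    """Keep printable ASCII as-is and write every other byte as a \\ooo octal escape."""
    out = []
    for byte in raw:
        if 0x20 <= byte <= 0x7E:
            out.append(chr(byte))
        else:
            out.append("\\%03o" % byte)
    return "".join(out)

# Assumption: the archive stores the name in CP866.
stored = "Новая папка/".encode("cp866")
print(escape_unprintable(stored))
```

Fed the CP866 bytes of the first directory name, this yields the same escaped string as the first line of the bsdtar listing in the question.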

unxed · Answer 2 · Thu May 23 2024 06:08:47 GMT+0800 (China Standard Time)

Unzipping such an archive with bsdtar on Mint 21.3 produces invalid UTF-8 sequences in file names.

Joerg Sonnenberger · Answer 3 · Thu May 23 2024 07:04:42 GMT+0800 (China Standard Time)

Sure, the binary filename is passed through unchanged because bsdtar doesn't know what to make of it. Remember, nothing in POSIX says that filenames are UTF-8, and the same applies to many binary file formats.
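Both points can be seen in a small Python check: bytes written under a legacy code page are perfectly good filename bytes, but they are generally not valid UTF-8 until someone transcodes them. The helper `is_valid_utf8` is a hypothetical name, and CP866 is again only an assumption about the archive from the question.

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Assumption: the name as stored in the archive, in CP866.
stored = "Новая папка".encode("cp866")
# The same name after transcoding to UTF-8, which requires knowing the source charset.
transcoded = stored.decode("cp866").encode("utf-8")

print(is_valid_utf8(stored))      # raw pass-through
print(is_valid_utf8(transcoded))  # after transcoding
```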