Broken filenames in zip archives with 1-byte non-latin charset
unxed opened this issue
Sample archive:
Desktop.zip
$ cat ./Desktop.zip | bsdtar -t
\215\256\242\240\357 \257\240\257\252\240/
\215\256\242\353\251 \342\245\252\341\342\256\242\353\251 \244\256\252\343\254\245\255\342.txt
The expected result is what unzip prints:
$ unzip -l ./Desktop.zip
Archive:  ./Desktop.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2016-09-28 18:41   Новая папка/
        4  2016-09-28 18:40   Новый текстовый документ.txt
---------                     -------
        4                     2 files
The built-in .zip archiver in older versions of Windows encoded file names in new archives using the DOS (OEM) or Windows (ANSI) code page that corresponds to the current regional settings. Many such archives still exist.
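To illustrate, the escaped bytes in the bsdtar listing above decode cleanly as CP866, the OEM code page for Russian regional settings. The following is a minimal Python sketch, not code from bsdtar or unzip; the helper names `unescape_octal` and `decode_oem` are made up for this example, and it only handles three-digit octal escapes (not an escaped backslash).

```python
import re

def unescape_octal(listing: str) -> bytes:
    """Turn \\ooo octal escapes (as printed by bsdtar -t) back into raw bytes."""
    raw = listing.encode("latin-1")
    return re.sub(
        rb"\\([0-3][0-7]{2})",
        lambda m: bytes([int(m.group(1), 8)]),
        raw,
    )

def decode_oem(listing: str, codepage: str = "cp866") -> str:
    """Decode the recovered bytes with an assumed OEM code page (CP866 here)."""
    return unescape_octal(listing).decode(codepage)

# The two names exactly as bsdtar printed them in this issue.
names = [
    r"\215\256\242\240\357 \257\240\257\252\240/",
    r"\215\256\242\353\251 \342\245\252\341\342\256\242\353\251 "
    r"\244\256\252\343\254\245\255\342.txt",
]

decoded = [decode_oem(n) for n in names]
for name in decoded:
    print(name)
```

The output matches the names that unzip shows, which suggests the archive was written with the OEM code page rather than UTF-8.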
The correct behavior is to determine the relevant OEM or ANSI code page from the system locale and decode file names with it. You can look at this PR for a reference implementation:
bsdtar uses isprint(3) to decide which characters are safe to print to the terminal. All others are escaped. So unless your locale is an actual matching single-byte locale, and not UTF-8 as most Unix systems use by default nowadays, this is perfectly sensible behavior.
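The escaping described here can be modelled in a few lines. This is a simplified Python sketch, not bsdtar's actual code: the function name `escape_unprintable` is hypothetical, and it assumes that under a UTF-8 locale every byte above 0x7F fails the printable check, so only printable ASCII passes through unchanged.

```python
def escape_unprintable(name: bytes) -> str:
    """Keep printable ASCII, render every other byte as a \\ooo octal escape.

    Assumption: models a UTF-8 locale, where no single byte >= 0x80 is
    considered printable.
    """
    out = []
    for b in name:
        if 0x20 <= b <= 0x7E:
            out.append(chr(b))
        else:
            out.append(f"\\{b:03o}")
    return "".join(out)

# The directory name from this issue, encoded the way the archive stores it
# (CP866 is assumed here).
stored = "Новая папка/".encode("cp866")
listing = escape_unprintable(stored)
print(listing)
```

Under these assumptions the result is the same escaped string that `bsdtar -t` printed for the directory entry.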
Extracting such an archive with bsdtar on Mint 21.3 produces file names that contain invalid UTF-8 sequences.
Sure, the binary filename is passed through unchanged, because bsdtar doesn't know what to make of it. Remember, nothing in POSIX says that filenames are UTF-8, and the same applies to many binary file formats.
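A short Python sketch of why the passed-through name looks broken on a UTF-8 system: the stored bytes (assumed here to be CP866) are simply not valid UTF-8, yet they are a perfectly legal byte-string filename. The variable names are illustrative only.

```python
# Bytes as stored in the archive, assuming the name was written in CP866.
raw = "Новая папка/".encode("cp866")

# Strict UTF-8 decoding fails: 0x8D is a continuation byte and cannot
# start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

# Python's "surrogateescape" error handler keeps undecodable bytes intact,
# so the original byte string survives a round trip unchanged.
as_text = raw.decode("utf-8", errors="surrogateescape")
round_tripped = as_text.encode("utf-8", errors="surrogateescape")

print(is_valid_utf8, round_tripped == raw)
```

Passing the bytes through preserves them exactly, so the name can still be recovered later by decoding with the right code page.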