libarchive / libarchive

Multi-format archive and compression library

Home Page: http://www.libarchive.org

Broken filenames in zip archives with 1-byte non-latin charset

unxed opened this issue

Sample archive:
Desktop.zip

$ cat ./Desktop.zip | bsdtar -t
\215\256\242\240\357 \257\240\257\252\240/
\215\256\242\353\251 \342\245\252\341\342\256\242\353\251 \244\256\252\343\254\245\255\342.txt

The expected result is what unzip shows:

$ unzip -l ./Desktop.zip
Archive:  ./Desktop.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2016-09-28 18:41   Новая папка/
        4  2016-09-28 18:40   Новый текстовый документ.txt
---------                     -------
        4                     2 files

The built-in .zip archiver in older versions of Windows encoded file names in new archives with the DOS (OEM) or Windows (ANSI) code page corresponding to the current regional settings. Many such archives still exist.

The correct behavior is to determine the relevant OEM or ANSI code page from the system locale and use it to decode the names. See this PR for a reference implementation:

p7zip-project/p7zip#232
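
As a rough illustration of the idea (not the code from that PR, and not anything libarchive does), the sketch below maps a locale name to a legacy OEM/ANSI code page pair and decodes a raw zip file name with it. The table `LEGACY_CODEPAGES` and the helpers `legacy_codepages` and `decode_zip_name` are hypothetical names made up for this example, and the table holds only a handful of entries.

```python
# Hypothetical lookup table: language code -> (OEM codec, ANSI codec).
# Only a few illustrative entries; a real implementation needs a full list.
LEGACY_CODEPAGES = {
    "ru": ("cp866", "cp1251"),
    "uk": ("cp866", "cp1251"),
    "pl": ("cp852", "cp1250"),
    "de": ("cp850", "cp1252"),
    "en": ("cp437", "cp1252"),
}

# Assumed fallback when the language is not in the table.
DEFAULT_CODEPAGES = ("cp437", "cp1252")


def legacy_codepages(locale_name):
    """Return (OEM, ANSI) codec names for a locale such as 'ru_RU.UTF-8'."""
    language = locale_name.split("_")[0].split(".")[0].lower()
    return LEGACY_CODEPAGES.get(language, DEFAULT_CODEPAGES)


def decode_zip_name(raw, locale_name):
    """Decode a legacy (non-UTF-8) zip file name using the locale's code pages."""
    oem, ansi = legacy_codepages(locale_name)
    for codec in (oem, ansi):
        try:
            return raw.decode(codec)
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte, so this never fails.
    return raw.decode("latin-1")


if __name__ == "__main__":
    # Raw bytes of the directory name from the sample listing above.
    raw_name = b"\x8d\xae\xa2\xa0\xef \xaf\xa0\xaf\xaa\xa0/"
    print(decode_zip_name(raw_name, "ru_RU.UTF-8"))
```

Trying the OEM code page first is a simplification: a single-byte code page rarely fails to decode, so telling OEM and ANSI names apart reliably needs extra heuristics or archive metadata.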

bsdtar uses isprint(3) to decide which characters are safe to print to the terminal; all others are escaped. So unless your locale is an actual matching single-byte locale, rather than UTF-8 as most Unix systems use by default nowadays, this is perfectly sensible behavior.
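
The escaping described here can be approximated in a few lines. This is a simplified sketch, not bsdtar's actual code: it assumes that only printable ASCII bytes pass through and every other byte becomes a three-digit octal escape, and `escape_unprintable` is a hypothetical name.

```python
def escape_unprintable(raw):
    """Render bytes as text, octal-escaping anything outside printable ASCII."""
    out = []
    for byte in raw:
        if 0x20 <= byte <= 0x7E:
            # Printable ASCII is shown as-is.
            out.append(chr(byte))
        else:
            # Everything else becomes a backslash plus three octal digits.
            out.append("\\{:03o}".format(byte))
    return "".join(out)


if __name__ == "__main__":
    raw_name = b"\x8d\xae\xa2\xa0\xef \xaf\xa0\xaf\xaa\xa0/"
    print(escape_unprintable(raw_name))
```

Applied to the raw bytes of the first entry, this reproduces the escaped listing shown in the report, which suggests the names are stored in the archive as single-byte code page data rather than UTF-8.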

Unzipping such an archive with bsdtar on Mint 21.3 produces invalid UTF-8 sequences in file names.

Sure, the binary filename is passed through unchanged because bsdtar doesn't know what to make of it. Remember, nothing in POSIX says that filenames are UTF-8, and the same applies to many binary file formats.
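
To make the "filenames are just bytes" point concrete, here is a minimal sketch showing that the stored name is not valid UTF-8 and how a tool can still carry it around losslessly. The helper names are hypothetical; the `surrogateescape` error handler is a standard Python codec feature, used here only to illustrate a byte-preserving pass-through.

```python
def is_valid_utf8(raw):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def name_to_text(raw):
    """Decode as UTF-8, smuggling undecodable bytes through as lone surrogates."""
    return raw.decode("utf-8", "surrogateescape")


def text_to_name(text):
    """Inverse of name_to_text: recover the exact original bytes."""
    return text.encode("utf-8", "surrogateescape")


if __name__ == "__main__":
    raw_name = b"\x8d\xae\xa2\xa0\xef \xaf\xa0\xaf\xaa\xa0/"
    print(is_valid_utf8(raw_name))
    print(text_to_name(name_to_text(raw_name)) == raw_name)
```

Passing the bytes through keeps the name intact at the byte level, but a file manager or terminal that assumes UTF-8 will display it as garbage, which matches the behavior reported above.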