thejoshwolfe / yauzl

yet another unzip library for node

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

option to emit raw string buffers instead of decoded strings

dweinstein opened this issue · comments

I'm using version 2.6.0 FWIW, node 6.

± node debug bin.js foo.zip
< Debugger listening on [::]:5858
connecting to 127.0.0.1:5858 ... ok
break in bin.js:2
  1
> 2 'use strict'
  3 const extractExec = require('./')
  4 const fs = require('fs')
c
break in index.js:46
 44     // TODO: what if we get multiple plists?
 45     const plist = plists[0]
>46     debugger
 47     getExecStream(fd, plist.CFBundleExecutable, (err, entry, exec) => {
 48       debugger
c
break in index.js:19
 17     zip.on('entry', function onentry (entry) {
 18       if ((/XXXThing.*app\/XXXThing-.*/i).test(entry.fileName)) {
>19         debugger;
 20       }
 21       if (!isOurExec(entry, execname)) { return }
repl
Press Ctrl + C to leave debug repl
> entry.fileName
'Payload/XXXThing-╬▓.app/XXXThing-╬▓'
> execname
'XXXThing-β'

as you can see the execname is right but the entry.fileName is not right utf-8 AFAICT.

Here's a reduced testcase:

± unzip -l test.zip
Archive:  test.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  10-03-2016 15:16   ç/
        6  10-03-2016 15:16   ç/hello
---------                     -------
        6                     2 files
± node testcase.js
ç/
ç/hello
Error: not found

Testcase code:

'use strict'

const fromFd = require('yauzl').fromFd
const once = require('once')
const fs = require('fs')

// ± unzip -l test.zip
// Archive:  test.zip //   Length      Date    Time    Name
// ---------  ---------- -----   ----
//         0  10-03-2016 15:16   ç/
//         6  10-03-2016 15:16   ç/hello
// ---------                     -------
//         6                     2 files
//
function issue42 (fd, cb) {
  cb = once(cb)
  fromFd(fd, (err, zip) => {
    if (err) return cb(err)
    zip.on('entry', function onentry (entry) {
      if ((/ç\/hello/).test(entry.fileName)) {
        console.log(entry.fileName)
        cb(entry.fileName)
      }
      console.log(entry.fileName)
    })
    zip.on('end', () => {
      if (!cb.called) {
        cb(new Error('not found'))
      }
    })
  })
}

const fd = fs.openSync(__dirname + '/test.zip', 'r')

issue42(fd, (err, res) => {
  console.log(err, res)
})

test zip:

test.zip

Interesting bug report. The behavior you're seeing from Info-Zip is actually non-standard behavior. yauzl is behaving "correctly" with respect to the zipfile specification.

There are multiple ways for a zipfile to indicate that the filenames are encoded in utf-8, and your zipfile does none of them. According to the spec, if no charset is specified, then cp437 is to be used, which is what yauzl is doing.

I'm not sure why Info-Zip's unzip is making an assumption about the filename being UTF-8. I've read the man page and even spent some time searching the source for the reason for that behavior. The closest I came is an excerpt from the zip man page, which may or may not be relevant:

Though the zip standard requires storing paths in an archive using a specific character set, in practice zips have stored paths in archives in whatever the local character set is.

So the question remains, what should yauzl do in this situation? Should the spec be considered correct, or should "in practice" behavior of popular tools be considered correct? It's a tough call, but I'm leaning toward the spec.

If you'd like to fix your zipfile, try setting general purpose bit 11 in all the entries. That is what yazl does to indicate the filename is to be decoded using utf8. If you're creating the zipfile at a higher level than that, then i suggest using a different library/utility for creating zipfiles, because the one you're using is non-conformant. If you didn't make the zipfile at all, but you got it from a user, then i suggest you forward this paragraph to your user.

I haven't seen general purpose bit 11 mishandled like this in any existing zipfile utility i've tested this with. I can't say for sure, but i believe i've tested this issue with Info-Zip's zip, Windows Compressed Folder, Mac's Archive Utility, and 7-Zip. I'm not as familiar with Java's ZipFile class, python's zipfile module, or WinRAR.

So I don't know how this zipfile came to exist with the filename encoding messed up, but I really don't think I should follow in Info-Zip's nonstandard footsteps on this matter. Following the spec is one of yauzl's design principles, and cp437 support is a feature.

FWIW the zip was created with zip on a mac.

@thejoshwolfe would you consider using something like https://gist.github.com/dweinstein/3125bed0a478e2b0acfccfae91c90fd5#file-guess-encoding-js which is a port to javascript of libzip's _zip_guess_encoding ? So far testing has been ok.

I have a branch you can try out here https://github.com/dweinstein/yauzl/tree/guess-encoding -- all tests are passing for me at the moment.

I would consider adding an option to the open() API (and related API's) that forces all strings in a zipfile to be interpreted as UTF-8. I wouldn't want to do any guessing in this library, because that seems too high-level, error prone, non-standard, etc., but if users simply tell yauzl to deviate from the spec in this regard, that seems like a compromise everyone can be happy with.

Realistically, I'd bet that it's safe to always pass in that flag, if it existed. The only time it would cause a problem is if a zipfile was created with cp437 and actually used the non-ascii part of cp437. It could happen, but it'd probably be a very old zipfile if it ever did.

Does that sound like a viable solution to your problem?

Better proposal: add an option to open() that leaves all strings undecoded as Buffer objects instead of strings. Then you can use any kind of encoding guesser or assume UTF-8 as you wish. I think this is the right solution to this issue.

that sounds pretty reasonable. Having the buffer along with the flags surfaced will definitely allow another library to do the guessing...

published in version 2.7.0