Touffy / client-zip

A client-side streaming ZIP generator

Re-evaluate slice-by-8 or slice-by-16 CRC32

101arrowz opened this issue · comments

Hello! I've been working on a ZIP implementation in JavaScript myself, fflate. While trying to optimize CRC32 I devised a very fast, pure JS implementation using slice-by-16 CRC32. From my testing this is consistently faster than your current WASM implementation by 20-30%. I noticed you mentioned in the README that slice-by-16 didn't yield meaningful improvements, but this seems to suggest otherwise. You're free to use the JS code if you'd like, but maybe a WASM CRC32 implementation with Slice-by-16 could be faster? I'd be happy to help with implementing my CRC32 code or developing a fast WASM CRC32 implementation if you're interested. Thanks for maintaining this library!
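For reference, the core of the technique can be sketched in plain JavaScript. This is a minimal illustration of slice-by-16, not fflate's actual code: sixteen 256-entry tables are built once, then the hot loop does sixteen independent table lookups per 16-byte block, and only the first four bytes need the running CRC XORed in.

```javascript
// Minimal slice-by-16 CRC32 sketch (reflected polynomial 0xEDB88320).
// Not fflate's actual code — just an illustration of the technique.
const T = new Int32Array(16 * 256);
for (let n = 0; n < 256; n++) {
  let c = n;
  for (let k = 0; k < 8; k++) c = (c >>> 1) ^ ((c & 1) ? 0xEDB88320 : 0);
  T[n] = c;
}
// Derived tables: T[k][n] = (T[k-1][n] >>> 8) ^ T[0][T[k-1][n] & 0xff]
for (let i = 256; i < 16 * 256; i++) {
  const prev = T[i - 256];
  T[i] = (prev >>> 8) ^ T[prev & 0xff];
}

function crc32(data) {
  let c = -1, i = 0;
  // 16 bytes per iteration; only the first 4 mix in the running CRC
  for (; i + 16 <= data.length; i += 16) {
    c = T[0xf00 | ((data[i]     ^  c         ) & 0xff)]
      ^ T[0xe00 | ((data[i + 1] ^ (c >>> 8)  ) & 0xff)]
      ^ T[0xd00 | ((data[i + 2] ^ (c >>> 16) ) & 0xff)]
      ^ T[0xc00 | ( data[i + 3] ^ (c >>> 24)          )]
      ^ T[0xb00 | data[i + 4]]  ^ T[0xa00 | data[i + 5]]
      ^ T[0x900 | data[i + 6]]  ^ T[0x800 | data[i + 7]]
      ^ T[0x700 | data[i + 8]]  ^ T[0x600 | data[i + 9]]
      ^ T[0x500 | data[i + 10]] ^ T[0x400 | data[i + 11]]
      ^ T[0x300 | data[i + 12]] ^ T[0x200 | data[i + 13]]
      ^ T[0x100 | data[i + 14]] ^ T[        data[i + 15]];
  }
  // Classic Sarwate loop for the tail
  for (; i < data.length; i++) c = (c >>> 8) ^ T[(c ^ data[i]) & 0xff];
  return ~c >>> 0;
}
```

The standard check value crc32("123456789") = 0xCBF43926 holds for this construction. The speedup over the one-byte-at-a-time Sarwate loop comes from the sixteen independent lookups, which modern CPUs can pipeline.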

fflate looks interesting, but I tried your streaming demo with a 3.7 GB file and it finished in

Finished in 117343.200ms: Length 1828438463

The service worker also errored in the console with

Uncaught (in promise) TypeError: Failed to execute 'fetch' on 'WorkerGlobalScope': 'only-if-cached' can be set only with 'same-origin' mode
at sw.ts:27

No file was streamed, but I am sure it works as described.

By contrast, client-zip starts streaming immediately and works on files larger than 4 GB.

I am happy to do more testing.

Interesting. After the first 4 XORs, you don't cycle the bytes anymore. You actually get the correct result that way? If that works, I can use it to decrease the number of ops per iteration in wasm as well. BTW, if you want to compare with the SIMD-enabled slice-by-4 I'm working on, use this as the wasm source in my lib (you can see the source here):

AGFzbQEAAAABCgJgAABgAn9/AXwDAwIAAQUDAQACBw0DAW0CAAF0AAABYwABCusCApgBAQN/A0AgASEAQQAhAgNAIABBAXYgAEEBcUGghuLtfmxzIQAgAkEBaiICQQhHDQALIAFBAnQgADYCACABQQFqIgFBgAJHDQALQQAhAQNAQQAhAgNAIAIgAXIoAgAiAEH/AXFBAnQoAgAgAEEIdnMhACACQYAIaiICIAFyIAA2AgAgAkGAGEcNAAsgAUEEaiIBQYAIRw0ACwvOAQICfwF7IAFBf3MhAUGAgAQhAkGAgAQgAGoiAEECdUECdCIDQYSABE8EQANAIAEgAigCAHP9Ef0MA////wL///8B////AP////0OQQL9qwH9DAAAAAAABAAAAAgAAAAMAAD9UCIE/RsAKAIAIAT9GwEoAgBzIAT9GwIoAgBzIAT9GwMoAgBzIQEgAkEEaiICIANHDQALCyACIABJBEADQCABQf8BcSACLQAAc0ECdCgCACABQQh2cyEBIAJBAWoiAiAASQ0ACwsgAUF/c7gL
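For anyone who wants to try that: a base64 wasm payload like the one above is decoded to bytes and handed to WebAssembly.instantiate. Here is a sketch that works in both Node (via Buffer) and browsers (via atob); to stay self-contained it instantiates the 8-byte empty-module header rather than the real payload, which appears to export a memory m, a table initialiser t, and the CRC function c.

```javascript
// Sketch: decode a base64-encoded wasm payload and instantiate it.
// Works in Node (Buffer) and browsers (atob).
function base64ToBytes(b64) {
  if (typeof Buffer !== "undefined") return new Uint8Array(Buffer.from(b64, "base64"));
  const bin = atob(b64);
  const bytes = new Uint8Array(bin.length);
  for (let i = 0; i < bin.length; i++) bytes[i] = bin.charCodeAt(i);
  return bytes;
}

// "AGFzbQEAAAA=" is just the empty-module header ("\0asm" + version 1),
// standing in here for the real base64 payload above.
const bytes = base64ToBytes("AGFzbQEAAAA=");
const { instance } = await WebAssembly.instantiate(bytes);
// With the real payload, you would then call its exports (t, then c).
```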

Also, keep in mind that my initial slice-by-16 attempt didn't improve on the Sarwate implementation because, at the time, I wasn't using any SIMD instructions. I never said slice-by-16 would not help someday.

I'd love to have your feedback on #21 by the way. I'm still looking for a good way to bundle the optional SIMD implementation together with the original (needed for browser compatibility).

@ricky11 That's mainly because I accumulate into a blob instead of using streamsaver or a hand-rolled streaming downloader via a service worker, which client-zip does use in its demo (bad idea in hindsight). The streaming demo outright does not download the compressed file. Try editing the codebox to import streamsaver and pipe the output stream to a file: you'll see that fflate does stream the output incrementally and quickly. As for files larger than 4GB, I felt there was no way you'd want to compress something that large in JavaScript, so you might as well use something specifically designed for massive files without compression, like client-zip. Therefore, Zip64 compression support is unplanned (but Zip64 decompression is already available).
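The suggested edit amounts to piping fflate's output ReadableStream into a writable sink instead of buffering it into a blob. A sketch of that flow, with a stub sink standing in for streamSaver.createWriteStream("out.zip") so it runs anywhere Web Streams exist (Node 18+, all modern browsers); the chunks are placeholders, not real fflate output:

```javascript
// Placeholder chunks standing in for fflate's incremental zip output
const zipChunks = [new Uint8Array([0x50, 0x4b, 0x03, 0x04]), new Uint8Array(12)];
const zipStream = new ReadableStream({
  start(controller) {
    for (const chunk of zipChunks) controller.enqueue(chunk);
    controller.close();
  },
});

// In a real page this sink would be streamSaver.createWriteStream("out.zip"),
// which streams straight to disk through a service worker.
let written = 0;
const sink = new WritableStream({
  write(chunk) { written += chunk.length; },
});

await zipStream.pipeTo(sink);
console.log(`streamed ${written} bytes incrementally`); // streamed 16 bytes incrementally
```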

Also, the console error seems like a Chrome bug.

I'll look into the SIMD implementation soon, but as for #21, I have two suggestions:

  1. Use WebAssembly.validate(new Uint8Array([0, 97, 115, 109, 1, 0, 0, 0, 1, 5, 1, 96, 0, 1, 123, 3, 2, 1, 0, 10, 10, 1, 8, 0, 65, 0, 253, 15, 253, 98, 11])) to check for SIMD support (it returns a boolean); this snippet is from Google's wasm-feature-detect.
  2. Don't try to micro-optimize. Just embed both the SIMD and non-SIMD versions in full into the source; it probably yields a lower bundle size after gzipping anyway (thanks to LZ backreferences), and it's far more maintainable.

@Touffy Nice, the WASM implementation is already faster than JS slice-by-16 when SIMD is enabled! However, it's usually faster by only about 2% in my tests, and it's actually slower than the JavaScript implementation on smaller files. Honestly, I would use the JS implementation over that SIMD WASM: the bundle size is smaller for JS, the performance is nearly identical in the average case, and JS wins when SIMD isn't available or when the files are very small.

I've seen that the CRC32 implementation in hash-wasm is actually faster than the JS implementation by a decent amount in some engines and never slower, so maybe try that?

By contrast, client-zip starts streaming immediately and works on files larger than 4 GB.

@ricky11 Did you maybe input cross-origin URLs without CORS? Last I checked, the demo works fine as long as it's allowed to fetch the target URLs (if it's not, you'll see a CORS error in the console and nothing at all in the ZIP file, and the download will finish instantly).

@101arrowz as far as I can see, small files are not a big concern. Unless you're zipping a very large GitHub repo with tens of thousands of files, you're unlikely to hit processing times above a few hundred milliseconds (network time is another matter) on a semi-modern computer, and a 5% speedup isn't noticeable then. Where the processing time goes above one second is in the hundred-MB range at least, and those archives are more likely to contain larger files such as photos. And I do think it's a legitimate use-case for client-zip, even above 4GB. I only need CRC32 to run as fast as the network, and 80 MB/s is achievable with wasm. If I get that, then I can generate the ZIP as fast as the browser receives data, which means the user gets their ZIP as fast as if I'd generated it on a server, and I don't need to pay for the processing power.

I may be wrong, but until someone brings me a real-world use case where small-file performance could save at least 100 ms, I'm OK with keeping my loose async loop and polymorphic input type (which is probably responsible for the difference you see, compared with your synchronous and more direct JavaScript loop) for the sake of readability, and to avoid reimplementing streams as you do in fflate.

You're right about embedding the two versions as base64. I was actually thinking of embedding just the small one (without SIMD) so that the lib can be called synchronously, and then, after feature testing, fetching the larger binary from a separate file (not base64-encoded; an actual .wasm file) and replacing the function references once it's instantiated. That runs the risk of the developer failing to put the SIMD-enabled .wasm file in the right place, but at least if the fetch fails, the library keeps working (without SIMD).
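That plan can be sketched as follows. Everything here is hypothetical (the baseline stub, the export name c, the file name), not client-zip's actual code; the point is that a failed feature test or a failed fetch leaves the synchronous baseline in place.

```javascript
// SIMD feature-test payload from Google's wasm-feature-detect
const simdTest = new Uint8Array([
  0, 97, 115, 109, 1, 0, 0, 0, 1, 5, 1, 96, 0, 1, 123,
  3, 2, 1, 0, 10, 10, 1, 8, 0, 65, 0, 253, 15, 253, 98, 11,
]);

// Baseline (non-SIMD) implementation, embedded and synchronously usable.
// This stub stands in for the real function.
let crc32 = (data) => 0;

async function maybeUpgradeToSimd(wasmUrl) {
  if (!WebAssembly.validate(simdTest)) return; // engine has no SIMD
  try {
    const res = await fetch(wasmUrl);
    const { instance } = await WebAssembly.instantiate(await res.arrayBuffer());
    // Swap the function reference; "c" is a hypothetical export name
    crc32 = (data) => instance.exports.c(data.byteOffset, data.length);
  } catch {
    // missing .wasm file or network error: keep the working baseline
  }
}
// maybeUpgradeToSimd("./crc32-simd.wasm");
```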

About hash-wasm: I found it while researching CRC32 in wasm a few months back, but their SIMD version of CRC32 is over three times bigger than mine. I'll need to run proper performance tests to determine whether it's even worth looking into.

@101arrowz I've run some new benchmarks with large(-ish) datasets that show a tiny advantage to fflate on small file performance, as you suspected, while client-zip has a tiny advantage on large file performance even without SIMD. fflate also has a substantial advantage in Chrome (or rather, Chrome sucks at running client-zip). I suspect that Chrome has some issues with memory management when buffering a Stream into a Blob (both client-zip and Conflux are used that way in my tests, whereas I create a new Blob from an array of parts when using fflate). Maybe using a BYOB Stream would fix the issue (I'm still waiting for better support).

I haven't published the benchmarks with SIMD yet, in part because I'm waiting to finish the branch and in part because they don't show any improvement at all in Chrome (where SIMD is supposed to be supported), which is really weird.

Actually I didn't even implement the fast CRC32 algorithm into fflate because CRC32 computation was by far the least expensive part of ZIP generation in fflate, so the difference is almost certainly the result of bad performance in the streams API and/or overhead due to async functions. I suggested it here since CRC32 is basically the only expensive part of ZIP generation in client-zip.

BTW could you publish your benchmark scripts in the repository? It's nice for reference. Also, for bundle size you should consider comparing the size after tree-shaking the other libraries to make it more fair, while mentioning that client-zip can be used without tree-shaking (e.g. from a CDN) and still be tiny. fflate is about 5.2kB minified and 2.5kB gzipped for uncompressed ZIP compression, albeit with an uncomfortable API if you're using modern JS.

All right. Here is the script.

The dataset with 12k files was an earlier version of this Dominions 5 modding starter pack. The bunch of photos in the other dataset I cannot publish, but I'm sure you can make your own equivalent set of about 37 photos and a video totalling 310 MB.

To run the test, you need to set up an HTTP server (I recommend nginx) to serve those files as efficiently as possible, and possibly edit lines 25 and 27 to point to the right URL. You also need to place a .txt file in the same directory as the index.html, named for each dataset in the array on line 25 (so, for example, I had a file named "Dom5ModdingStarterPack.txt"), containing the path to each file of the dataset (one path per line). I generated those by piping ls -R to a file and then doing a bit of manual search-and-replace, though Deno has a very nice fs.walk utility too.
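The ls -R plus search-and-replace step can also be done with a short Node script; this is a sketch, and the dataset and output names are just examples matching the ones above:

```javascript
import { readdirSync, statSync, writeFileSync } from "node:fs";
import { join } from "node:path";

// Recursively collect every file path under a directory, to produce
// the one-path-per-line .txt file list the benchmark page reads.
function walk(dir, out = []) {
  for (const name of readdirSync(dir)) {
    const p = join(dir, name);
    if (statSync(p).isDirectory()) walk(p, out);
    else out.push(p);
  }
  return out;
}

// Example usage (directory and file names are illustrative):
// writeFileSync("Dom5ModdingStarterPack.txt",
//               walk("Dom5ModdingStarterPack").join("\n") + "\n");
```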

Finally, load the index in your browser (over HTTP, not from the filesystem) and click the buttons. Times are printed in the console.

I'm sure the results would be vastly different for the 12k files dataset if I loaded the files from the filesystem using an <input type=file>.

And I know fflate is tree-shakeable; I don't mean to make it look bad on purpose in the bundle size comparison, and I think it may be misleading to use the tree-shaken size without an explanation that, frankly, would be a waste of space (I think fflate is impressively small even without shaking, really).

As you mention, fflate's API is not (yet?) optimised for this use case, so it's probably best not to present it as just an alternative to client-zip. But if you have a more concise way of using fflate than I did in my benchmark, please show me.

The script LGTM, but fflate uses more memory because you don't stream in chunks from a ReadableStream. This probably doesn't matter much for performance, though.

I didn't mean to insinuate that I was unhappy with the bundle size comparison, I just noticed you mentioned that the comparison was "unfair" in the README and wanted to offer a solution. You can leave it as is.

In any case, the CRC32 implementation in client-zip is really well done, and if slice-by-N doesn't help, it's probably the best one possible. Thanks for maintaining this project; it's really nice to have a modern, maintained, and fast alternative to JSZip.

@101arrowz I had to re-run some performance tests to check if WebAssembly (without slice-by-8 and SIMD) was still worth it (it really isn't!), and then run them again after I noticed that I'd duplicated fflate's test script for Zip.js. You'll find those fresh tests in the README (not in the npm package; that one still has incorrect results until I publish a new version).

For large files, fflate is still, by far, the fastest library in Chrome, even beating native performance, while client-zip has improved its lead in Safari.

For many small files, fflate is only a little faster than client-zip in either browser; piping the source data efficiently is more important in this scenario, so that's what I'll be working on next (in a separate library that should be able to feed any ZIP library by providing a buffered stream of Responses instead of the one-at-a-time async generator in my current code).