Zarc is a new archive file format.
Think like tar or zip, not gzip or xz.
Warning: Zarc is a toy: it has received no review, has only a single implementation, and is missing many important features. Do not use it for production data.
Zarc provides some interesting features, like:
- always-on strong hashing and integrity verification;
- full support for extended attributes (xattrs);
- high resolution timestamps;
- user-provided metadata at both archive and file level;
- basic deduplication via content-addressing;
- minimal uncompressed overhead;
- appending files is reasonably cheap;
- capable of handling archives larger than memory, or even archives containing more file metadata than would fit in memory (allowed by spec but not yet implemented).
Here's a specification of the format.
This repository contains a Rust library crate implementing the format, and a Rust CLI tool. You can install it using a recent stable Rust:
$ cargo install --git https://github.com/passcod/zarc zarc-cli
That installs the `zarc` CLI tool.
As we rely on an unreleased version of deku, this isn't yet published on crates.io.
Alternatively, download binaries: https://public.axodotdev.host/releases/github/passcod/zarc
(Some of the commands shown here don't exist yet.)
Get started by packing a few files:
$ zarc pack --output myfirst.zarc a.file and folder
$ ls -lh myfirst.zarc
-rw-r--r-- 1 you you 16K Dec 30 01:34 myfirst.zarc
$ file myfirst.zarc
myfirst.zarc: Zstandard compressed data (v0.8+), Dictionary ID: None
# or, with our custom magic:
$ file -m zarc.magic myfirst.zarc
myfirst.zarc: Zarc archive file version 1
$ zstd --test myfirst.zarc
myfirst.zarc : 70392 bytes
Zarc creates files that are valid Zstd streams.
However, decompressing such a file with `zstd` will not yield your files back, as `zstd` skips over the file and tree metadata.
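This trick works because Zarc embeds its own header and trailer as Zstandard *skippable frames*, which stock zstd tooling accepts but ignores. A minimal sketch of building such a frame (the payload bytes here are illustrative placeholders, not the real Zarc header layout):

```rust
/// Build a Zstandard skippable frame: a 4-byte little-endian magic in the
/// range 0x184D2A50..=0x184D2A5F (the low nibble is free for applications;
/// Zarc uses 0x0 for its header and 0xE for its EOF trailer), a 4-byte
/// little-endian payload length, then the payload. Decoders skip it whole.
fn skippable_frame(nibble: u8, payload: &[u8]) -> Vec<u8> {
    assert!(nibble <= 0xF, "only the low nibble of the magic is free");
    let magic = 0x184D2A50u32 | u32::from(nibble);
    let mut frame = Vec::with_capacity(8 + payload.len());
    frame.extend_from_slice(&magic.to_le_bytes());
    frame.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    frame.extend_from_slice(payload);
    frame
}

fn main() {
    // Nibble 0x0: the magic bytes on disk are [50, 2a, 4d, 18],
    // matching the header frame in the `zarc debug` output below.
    let header = skippable_frame(0x0, &[0u8; 4]);
    assert_eq!(&header[..4], &[0x50, 0x2A, 0x4D, 0x18]);
    // Nibble 0xE: [5e, 2a, 4d, 18], as in the EOF trailer frame.
    let trailer = skippable_frame(0xE, &[0u8; 8]);
    assert_eq!(&trailer[..4], &[0x5E, 0x2A, 0x4D, 0x18]);
    println!("header frame bytes: {:02x?}", &header[..8]);
}
```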
Instead, look inside with Zarc:
$ zarc list-files myfirst.zarc
a.file
and/another.one
folder/thirdfile.here
folder/subfolder/a.file
folder/other/example.file
If you want to see everything a Zarc contains, use the debug tool:
$ zarc debug myfirst.zarc
frame: 0
magic: [50, 2a, 4d, 18] (skippable frame)
nibble: 0x0
length: 4 (0x00000004)
zarc: header (file format v1)
frame: 1
magic: [28, b5, 2f, fd] (zstandard frame)
descriptor: 10001001 (0x89)
single segment: true
has checksum: false
unused bit: false
reserved bit: false
fcs size flag: 0 (0b00)
actual size: 1 bytes
did size flag: 0 (0b00)
actual size: 0 bytes
uncompressed size: 137 bytes
...snip...
frame: 8
magic: [28, b5, 2f, fd] (zstandard frame)
descriptor: 11010111 (0xD7)
single segment: true
has checksum: true
unused bit: false
reserved bit: false
fcs size flag: 1 (0b01)
actual size: 2 bytes
did size flag: 0 (0b00)
actual size: 0 bytes
uncompressed size: 55313 bytes
checksum: 0x55C7DC15
block: 0 (Compressed)
size: 3083 bytes (0xC0B)
zarc: directory (directory format v1) (4823 bytes)
hash algorithm: Blake3
directory digest: valid ✅
files: 5
file 0: ZWPZswtyW69gw+VyEGyE2h3ClqK05Y6uJ545LFu3srM=
path: (4 components)
folder
subfolder
a.file
readonly: false
posix mode: 00100644 (rw-r--r--)
posix user: id=1000
posix group: id=1000
timestamps:
inserted: 2023-12-29 11:19:05.747182826 UTC
created: 2023-12-29 04:14:52.160502712 UTC
modified: 2023-12-29 07:22:13.457676519 UTC
accessed: 2023-12-29 07:22:13.787676534 UTC
...snip...
frames: 4
frame 0: ZWPZswtyW69gw+VyEGyE2h3ClqK05Y6uJ545LFu3srM=
offset: 151 bytes
uncompressed size: 390 bytes
frame 1: pN1pVhJbe0vXIgf8VP7TvqquOJZTSUVYW7QEm0XdVdk=
offset: 439 bytes
uncompressed size: 13830 bytes
frame 2: Thzfvpr+lCZCiXOxwuwtZr3mPXLf2tt1oVTSX/g3dpw=
offset: 4528 bytes
uncompressed size: 431 bytes
...snip...
frame: 9
magic: [5e, 2a, 4d, 18] (skippable frame)
nibble: 0xE
length: 8 (0x00000008)
zarc: eof trailer
directory offset: 3233 bytes from end
`zarc debug` prints all the information it can, including low-level details from the underlying Zstandard streams. You can use it against non-Zarc Zstandard files, too. Try the `-d` (print data sections), `-D` (uncompress and print Zstandard frames), and `-n 3` (stop after N frames) options!
Then, to unpack:
$ zarc unpack myfirst.zarc
unpacked 5 files
Internally, a Zarc is a content-addressed store with a directory of file metadata. If you have two copies of some identical file, Zarc stores the metadata for each copy, and one copy of the content.
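That layout can be sketched in a few lines: the store keys content by digest, while the directory is a list of per-path entries pointing at digests, so identical content is stored once no matter how many paths reference it. (Real Zarc uses BLAKE3; std's `DefaultHasher` stands in here purely so the sketch needs no external crates.)

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in digest; real Zarc uses a cryptographic hash (BLAKE3).
fn digest(content: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    content.hash(&mut h);
    h.finish()
}

#[derive(Default)]
struct Archive {
    /// Content-addressed store: one blob per unique digest.
    store: HashMap<u64, Vec<u8>>,
    /// Directory: per-path metadata pointing into the store.
    directory: Vec<(String, u64)>,
}

impl Archive {
    fn pack(&mut self, path: &str, content: &[u8]) {
        let d = digest(content);
        // Identical content hashes to the same key, so it is stored once;
        // the directory still records metadata for every path.
        self.store.entry(d).or_insert_with(|| content.to_vec());
        self.directory.push((path.to_string(), d));
    }
}

fn main() {
    let mut zarc = Archive::default();
    zarc.pack("a.file", b"same bytes");
    zarc.pack("folder/copy.file", b"same bytes");
    // Two directory entries, one stored blob.
    assert_eq!(zarc.directory.len(), 2);
    assert_eq!(zarc.store.len(), 1);
}
```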
A major issue with tar and tar-based formats is that you can't list the files in an archive, nor extract a single file, without reading (and decompressing) the entire archive. Zarc's directory can be read without reading or decompressing the rest of the file, so listing files and metadata is always fast. Zarc also stores offsets to file contents within the directory, so individual files can be unpacked efficiently.
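This fast lookup works because the last frame in the file is a small, fixed-size trailer recording how far from the end the directory starts (visible in the `zarc debug` output above). A sketch of locating the directory with a single seek from the end; treat the 8-byte little-endian offset encoding as an assumption for illustration, not the normative spec:

```rust
use std::io::{Cursor, Read, Seek, SeekFrom};

/// Locate the directory in a seekable Zarc-like stream by reading the
/// 16-byte EOF trailer: skippable-frame magic (low nibble 0xE), a payload
/// length of 8, then the directory's offset from the end of the file.
fn directory_position<R: Read + Seek>(mut r: R) -> std::io::Result<u64> {
    let end = r.seek(SeekFrom::End(0))?;
    r.seek(SeekFrom::End(-16))?;
    let mut trailer = [0u8; 16];
    r.read_exact(&mut trailer)?;
    assert_eq!(&trailer[..4], &[0x5E, 0x2A, 0x4D, 0x18], "not a trailer");
    assert_eq!(u32::from_le_bytes(trailer[4..8].try_into().unwrap()), 8);
    let from_end = u64::from_le_bytes(trailer[8..16].try_into().unwrap());
    Ok(end - from_end)
}

fn main() -> std::io::Result<()> {
    // Synthetic file: 100 bytes of "archive", then a trailer saying the
    // directory starts 40 bytes from the end (offset 76 of 116 total).
    let mut file = vec![0u8; 100];
    file.extend_from_slice(&[0x5E, 0x2A, 0x4D, 0x18]); // magic, nibble 0xE
    file.extend_from_slice(&8u32.to_le_bytes()); // payload length
    file.extend_from_slice(&40u64.to_le_bytes()); // offset from end
    assert_eq!(directory_position(Cursor::new(file))?, 76);
    Ok(())
}
```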
Zarc computes the cryptographic checksum of every file it packs, and verifies data when it unpacks. It also stores and verifies the integrity of its directory using that same hash function.
You can verify integrity cheaply by comparing the digest of the directory alone, instead of hashing the entire file. For ease of use, external digest verification is built into the tool:
$ zarc pack --output file.zarc folder/
digest: puKGv1aG1ANEq7wBxnrJbJ2OPcpBizcG+/sBM89G9fQ=
$ zarc unpack --verify puKGv1aG1ANEq7wBxnrJbJ2OPcpBizcG+/sBM89G9fQ= file.zarc
unpacked 32 files
$ time zarc unpack --verify qgsB/WyzVCcTH+DWnpUKnFTY22d7hpHewAyBvyv1SB8= file.zarc
Error: × integrity failure: zarc file digest is puKGv1aG1ANEq7wBxnrJbJ2OPcpBizcG+/sBM89G9fQ=
Command exited with non-zero status 1
0.00user 0.00system 0:00.00elapsed 50%CPU (0avgtext+0avgdata 4536maxresident)k
0inputs+0outputs (0major+199minor)pagefaults 0swaps
Content integrity is per-file; if a Zarc is corrupted but its directory is still readable:
- you can see exactly which files are affected, and
- you can safely unpack intact files.
(not yet implemented)
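The per-file check described above could look like this sketch: recompute each stored blob's digest and report the paths whose recorded digest no longer matches, leaving the intact files safe to unpack (again using std's `DefaultHasher` as a stand-in for BLAKE3):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in digest; real Zarc uses a cryptographic hash (BLAKE3).
fn digest(content: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    content.hash(&mut h);
    h.finish()
}

/// Given directory entries of (path, recorded digest, stored content),
/// return the paths whose content no longer matches its recorded digest.
fn corrupted_paths(entries: &[(&str, u64, Vec<u8>)]) -> Vec<String> {
    entries
        .iter()
        .filter(|(_, recorded, content)| digest(content) != *recorded)
        .map(|(path, _, _)| path.to_string())
        .collect()
}

fn main() {
    let good = b"intact bytes".to_vec();
    let mut bad = b"original bytes".to_vec();
    let recorded_bad = digest(&bad);
    bad[0] ^= 0xFF; // simulate on-disk corruption of one blob
    let entries = [
        ("a.file", digest(&good), good),
        ("folder/b.file", recorded_bad, bad),
    ];
    // Only the corrupted path is reported; "a.file" unpacks safely.
    assert_eq!(corrupted_paths(&entries), vec!["folder/b.file"]);
}
```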
Paths are stored split into components, not as literal strings. On Windows a path looks like `crates\cli\src\pack.rs`, and on Unix a path looks like `crates/cli/src/pack.rs`. Instead of performing path translation, Zarc stores an array of components: `["crates", "cli", "src", "pack.rs"]`, so paths are interpreted precisely and identically on all platforms. Of course, some paths aren't Unicode; Zarc recognises that and stores non-UTF-8 components marked as bytestrings instead of text.
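Splitting into components is straightforward with the standard library. This sketch drops root and prefix components and keeps non-UTF-8 names as raw bytes, roughly where real Zarc would mark a bytestring; the `StoredComponent` type is invented for illustration:

```rust
use std::path::{Component, Path};

/// What a stored component might look like: text when the name is valid
/// UTF-8, raw bytes otherwise (real Zarc marks these as bytestrings).
#[derive(Debug, PartialEq)]
enum StoredComponent {
    Text(String),
    Bytes(Vec<u8>),
}

fn split_path(path: &Path) -> Vec<StoredComponent> {
    path.components()
        .filter_map(|c| match c {
            // Keep only normal name components; roots, prefixes, and
            // `.`/`..` don't belong in an archived path.
            Component::Normal(name) => Some(match name.to_str() {
                Some(s) => StoredComponent::Text(s.to_string()),
                None => StoredComponent::Bytes(name.as_encoded_bytes().to_vec()),
            }),
            _ => None,
        })
        .collect()
}

fn main() {
    let parts = split_path(Path::new("crates/cli/src/pack.rs"));
    assert_eq!(
        parts,
        vec![
            StoredComponent::Text("crates".into()),
            StoredComponent::Text("cli".into()),
            StoredComponent::Text("src".into()),
            StoredComponent::Text("pack.rs".into()),
        ]
    );
}
```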
File and directory (and symlink, etc.) attributes and extended attributes are stored and restored where possible. You'd think this wouldn't be a feature, but hooo boy are many other formats inconsistent on this.
If you want to store custom metadata, there's dedicated support:
(not yet implemented)
$ zarc pack \
-u Created-By "Félix Saparelli" \
-u Rust-Version "$(rustc -Vv)" \
--output meta.zarc filelist
(not yet implemented)
$ zarc pack \
-U one.file Created-By "Félix Saparelli" \
-U 'crates/*/glob' Rust-Version "$(rustc -Vv)" \
--output meta.zarc filelist
(not yet implemented)
Adding more files to a Zarc is done without recreating the entire archive:
$ zarc pack --append --output myfirst.zarc more.files and/folders
If new content duplicates the existing, it won't store new copies. If new files are added that have the same path as existing ones, both the new and old metadata are kept. By default, Zarc will unpack the last version of a path, but you can change that.
Appending to a Zarc keeps the metadata of prior versions for provenance. Zarc stores the insertion date of each file as well as the creation date of the archive itself, so you can tell whether a file was appended, and when it was created or modified.
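The default "last version wins" behaviour can be sketched as a fold over the directory in insertion order; the path and digest types here are simplified for illustration:

```rust
use std::collections::HashMap;

/// Directory entries in insertion order: appended entries come last.
/// By default, unpacking keeps the most recent entry per path; earlier
/// versions stay in the archive for provenance.
fn to_unpack<'a>(directory: &'a [(&'a str, u64)]) -> HashMap<&'a str, u64> {
    let mut latest = HashMap::new();
    for &(path, digest) in directory {
        latest.insert(path, digest); // later entries overwrite earlier ones
    }
    latest
}

fn main() {
    let directory = [
        ("a.file", 0xAAAA), // original pack
        ("b.file", 0xBBBB),
        ("a.file", 0xCCCC), // appended later under the same path
    ];
    let unpack = to_unpack(&directory);
    assert_eq!(unpack["a.file"], 0xCCCC); // newest version wins
    assert_eq!(unpack["b.file"], 0xBBBB);
    assert_eq!(unpack.len(), 2);
}
```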
Tar is considered quite complicated to parse and hard to extend, and implementations are frequently incompatible with each other in subtle ways. A minor goal of Zarc is to specify a format that is relatively simple to parse, work with, and extend.
- Compression is per unique file, so it won't achieve compression gains across similar-but-not-identical files.
In early testing, Zarc is 2–4 times slower at packing than tar+zstd, but yields comparable (±10%) archive sizes. It's 3–10 times faster than zip on Linux, and yields archives that are consistently 10–30% smaller.
A Node.js project's `node_modules` is typically many small and medium files:
$ tree node_modules | wc -l
172572
$ dust -sbn0 node_modules
907M ┌── node_modules
$ find node_modules -type f -printf '%s\n' | datamash \
max 1 min 1 mean 1 median 1
20905472 0 6134.9564061426 822 # in bytes
$ find node_modules -type l | wc -l
812 # symlinks
$ hyperfine --warmup 2 \
--prepare 'rm node_modules.tar.zst || true' \
'tar -caf node_modules.tar.zst node_modules' \
--prepare 'rm node_modules.zip || true' \
'zip -qr --symlinks node_modules.zip node_modules' \
--prepare 'rm node_modules.zarc || true' \
'zarc pack --output node_modules.zarc node_modules'
Benchmark 1: tar -caf node_modules.tar.zst node_modules
Time (mean ± σ): 7.273 s ± 0.636 s [User: 8.587 s, System: 3.395 s]
Range (min … max): 5.806 s … 8.150 s 10 runs
Benchmark 2: zip -qr --symlinks node_modules.zip node_modules
Time (mean ± σ): 47.042 s ± 2.102 s [User: 40.272 s, System: 6.038 s]
Range (min … max): 44.504 s … 49.788 s 10 runs
Benchmark 3: zarc pack --output node_modules.zarc node_modules
Time (mean ± σ): 11.093 s ± 0.180 s [User: 8.375 s, System: 2.552 s]
Range (min … max): 10.873 s … 11.362 s 10 runs
Summary
'tar -caf node_modules.tar.zst node_modules' ran
1.53 ± 0.14 times faster than 'zarc pack --output node_modules.zarc node_modules'
6.47 ± 0.64 times faster than 'zip -qr --symlinks node_modules.zip node_modules'
$ dust -sbn0 node_modules.tar.zst
189M ┌── node_modules.tar.zst
$ dust -sbn0 node_modules.zip
301M ┌── node_modules.zip
$ dust -sbn0 node_modules.zarc
209M ┌── node_modules.zarc
The same workload, but following/dereferencing symlinks:
$ hyperfine --warmup 2 \
--prepare 'rm node_modules.tar.zst || true' \
'tar -chaf node_modules.tar.zst node_modules' \
--prepare 'rm node_modules.zip || true' \
'zip -qr node_modules.zip node_modules' \
--prepare 'rm node_modules.zarc || true' \
'zarc pack -L --output node_modules.zarc node_modules'
Benchmark 1: tar -chaf node_modules.tar.zst node_modules
Time (mean ± σ): 11.399 s ± 0.899 s [User: 13.156 s, System: 4.591 s]
Range (min … max): 10.369 s … 13.036 s 10 runs
Benchmark 2: zip -qr node_modules.zip node_modules
Time (mean ± σ): 89.879 s ± 3.751 s [User: 79.802 s, System: 8.216 s]
Range (min … max): 84.980 s … 95.516 s 10 runs
Benchmark 3: zarc pack -L --output node_modules.zarc node_modules
Time (mean ± σ): 16.526 s ± 0.380 s [User: 12.961 s, System: 3.340 s]
Range (min … max): 16.146 s … 17.515 s 10 runs
Summary
'tar -chaf node_modules.tar.zst node_modules' ran
1.45 ± 0.12 times faster than 'zarc pack -L --output node_modules.zarc node_modules'
7.88 ± 0.70 times faster than 'zip -qr node_modules.zip node_modules'
$ dust -sbn0 node_modules.tar.zst
431M ┌── node_modules.tar.zst
$ dust -sbn0 node_modules.zip
595M ┌── node_modules.zip
$ dust -sbn0 node_modules.zarc
429M ┌── node_modules.zarc
My personal collection of ebooks: few files, but relatively heavy and tough to compress more.
$ tree ~/Documents/Ebooks | wc -l
54
$ dust -sbn0 ~/Documents/Ebooks
573M ┌── Ebooks
$ find ~/Documents/Ebooks -type f -printf '%s\n' | datamash \
max 1 min 1 mean 1 median 1
247604768 15116 12028762.56 711323 # in bytes
$ find ~/Documents/Ebooks -type l | wc -l
0 # symlinks
$ hyperfine --warmup 2 \
--prepare 'rm ebooks.tar.zst || true' \
'tar -caf ebooks.tar.zst ~/Documents/Ebooks' \
--prepare 'rm ebooks.zip || true' \
'zip -qr ebooks.zip ~/Documents/Ebooks' \
--prepare 'rm ebooks.zarc || true' \
'zarc pack -L --output ebooks.zarc ~/Documents/Ebooks'
Benchmark 1: tar -caf ebooks.tar.zst ~/Documents/Ebooks
Time (mean ± σ): 2.133 s ± 0.168 s [User: 2.421 s, System: 1.269 s]
Range (min … max): 1.951 s … 2.502 s 10 runs
Benchmark 2: zip -qr ebooks.zip ~/Documents/Ebooks
Time (mean ± σ): 23.859 s ± 1.274 s [User: 22.202 s, System: 1.198 s]
Range (min … max): 21.384 s … 25.397 s 10 runs
Benchmark 3: zarc pack -L --output ebooks.zarc ~/Documents/Ebooks
Time (mean ± σ): 2.014 s ± 0.239 s [User: 1.282 s, System: 0.671 s]
Range (min … max): 1.835 s … 2.576 s 10 runs
Summary
'zarc pack -L --output ebooks.zarc ~/Documents/Ebooks' ran
1.06 ± 0.15 times faster than 'tar -caf ebooks.tar.zst ~/Documents/Ebooks'
11.85 ± 1.54 times faster than 'zip -qr ebooks.zip ~/Documents/Ebooks'
$ dust -sbn0 ebooks.tar.zst
476M ┌── ebooks.tar.zst
$ dust -sbn0 ebooks.zip
477M ┌── ebooks.zip
$ dust -sbn0 ebooks.zarc
478M ┌── ebooks.zarc
$ hyperfine --shell=none --warmup 1 \
'tar tf ebooks.tar.zst' \
'unzip -l ebooks.zip' \
'zarc list-files ebooks.zarc'
Benchmark 1: tar tf ebooks.tar.zst
Time (mean ± σ): 397.0 ms ± 21.5 ms [User: 408.4 ms, System: 629.5 ms]
Range (min … max): 361.1 ms … 429.6 ms 10 runs
Benchmark 2: unzip -l ebooks.zip
Time (mean ± σ): 2.6 ms ± 0.3 ms [User: 1.2 ms, System: 1.2 ms]
Range (min … max): 2.1 ms … 5.1 ms 1018 runs
Benchmark 3: zarc list-files ebooks.zarc
Time (mean ± σ): 2.3 ms ± 0.5 ms [User: 1.3 ms, System: 0.8 ms]
Range (min … max): 1.8 ms … 13.3 ms 1164 runs
Summary
'zarc list-files ebooks.zarc' ran
1.13 ± 0.26 times faster than 'unzip -l ebooks.zip'
173.58 ± 36.29 times faster than 'tar tf ebooks.tar.zst'
- Encryption. Proper secrecy requires hiding file contents, file metadata, file lengths, and more. These impose significant design constraints that Zarc is not interested in entertaining. Use full-file encryption on top, e.g. with age.
- Compatibility with tar or zip. Zarc is a new format; it is not, and will never be, compatible with zip and tar tooling.
- Splitting. Zarc assumes a single continuous (but not necessarily contiguous on disk) file as its substrate. If you need to split it (why?), do that separately.
- `zarc pack`
  - `--append`
  - `-U` and `-u` flags to set user metadata
  - `--follow-symlinks`
  - `--follow[-and-store]-external-symlinks`
  - `--level` to set compression level
  - `--zstd` to set Zstd parameters
  - Pack Linux attributes
  - Pack Linux xattrs
  - Pack Linux ACLs
  - Pack SELinux attributes
  - Pack mac attributes
  - Pack mac xattrs
  - Pack Windows attributes
  - Pack Windows alternate data stream extended attributes
  - Override user/group
  - User/group mappings
- `zarc debug`
- `zarc unpack`
  - Unpack symlinks
  - Unpack Linux attributes
  - Unpack Linux xattrs
  - Unpack Linux ACLs
  - Unpack SELinux attributes
  - Unpack mac attributes
  - Unpack mac xattrs
  - Unpack Windows attributes
  - Unpack Windows alternate data stream extended attributes
  - Override user/group
  - User/group mappings
- `zarc list-files`
  - `--stat` — with mode, ownership, size, creation.or(modified) date
  - `--json` — all the info
- Streaming packing
- Streaming unpacking
- Profile and optimise
- Pure-Rust zstd?
- Seekable files, by adding a blockmap (a map of file offsets to blocks)?
- Dictionary hash, to provide trust that a dictionary used on decode is the same as the one used on encode
- Bao hashing for streaming verification?