Quadratic complexity when reading large entries
nigeltao opened this issue
Make a .tar archive containing a single large file, archivemount it and copy out the single element in the archive:
$ truncate --size=256M zeroes
$ tar cf zeroes-256mib.tar zeroes
$ archivemount zeroes-256mib.tar ~/mnt
$ dd if=~/mnt/zeroes of=/dev/null status=progress
267633152 bytes (268 MB, 255 MiB) copied, 77 s, 3.5 MB/s
524288+0 records in
524288+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 77.5176 s, 3.5 MB/s
Notice that the "MB/s" dd throughput slows down (in the interactive output, not copy/pasted above) as time goes on. It looks like there's some sort of super-linear algorithm involved, and the whole thing takes more than a minute. In comparison, a straight tar extraction takes less than a second.
$ time tar xf zeroes-256mib.tar
real 0m0.216s
Sprinkling some logging in the _ar_read function in archivemount.c shows that the dd leads to multiple _ar_read calls. In the steady state, it reads size = 128 KiB each time, with the offset argument incrementing on each call.
Inside that _ar_read function, if we take the if (node->modified) false branch, then for each call:
- we call archive_read_new, even though it's the same archive every time.
- we iterate from the start of that archive until we find the archive_entry for the file. Again, this is once per call, even if it's conceptually the same archive_entry object used in previous calls.
- we copy out offset bytes from the start of the entry's contents to a 'trash' buffer, before finally copying out the payload that _ar_read is actually interested in.
The total number of bytes produced by archive_read_data calls is therefore quadratic in the decompressed size of the archive entry. This is slow enough for .tar files but probably worse for .tar.gz files. The whole thing is reminiscent of the Shlemiel the Painter story.
There may be some complications if we're re-writing the archive, but when mounting read-only, we should be able to get much better performance.
https://github.com/google/fuse-archive is an alternative implementation with linear (not quadratic) complexity.