plougher / squashfs-tools

tools to create and extract Squashfs filesystems


Crash while add folder to existing image

screwer opened this issue · comments

  1. OS is Ubuntu 20.04 with all updates installed.
  2. I can reproduce the following crash reliably, i.e. it doesn't feel like a race.
  3. It seems the destination image is not changed during the processing or the crash (maybe the crash occurs in some preparation step). "Seems" because I've only checked once, so no guarantees.
  4. The destination image is valid and can be mounted just fine. All the content looks correct.
  5. The destination image was produced iteratively, through a series of folder additions. After some time it fails on a random folder. I've reproduced it a few times from scratch, but the process is very time consuming: it takes about a week on 72 CPUs.
  6. The destination image is huge (~50GB), and so are the source folders (~10GB each). They all have small differences between them.

I've got quite a few crash reports. Here are the two latest: crash_info.zip

Debug build with a bit more detailed crash information.
It seems "read_block" receives an invalid pointer as the "block" buffer address.


Thread 1 "mksquashfs" received signal SIGSEGV, Segmentation fault.
0x00007ffff7e10466 in lzo1x_decompress_safe () from /lib/x86_64-linux-gnu/liblzo2.so.2
(gdb) bt
#0 0x00007ffff7e10466 in lzo1x_decompress_safe () from /lib/x86_64-linux-gnu/liblzo2.so.2
#1 0x000055555559066e in lzo_uncompress (dest=0x7ffb7d91e010, src=0x7fffffffbbc0, size=3221, outsize=8192, error=0x7fffffffc888) at lzo_wrapper.c:384
#2 0x0000555555574d48 in compressor_uncompress (comp=0x5555555aa1a0 <lzo_comp_ops>, dest=0x7ffb7d91e010, src=0x7fffffffbbc0, size=3221, block_size=8192, error=0x7fffffffc888) at compressor.h:66
#3 0x0000555555574f68 in read_block (fd=3, start=50959549582, next=0x7fffffffc970, expected=0, block=0x7ffb7d91e010) at read_fs.c:79
#4 0x000055555557522a in scan_inode_table (fd=3, start=50959549582, end=51372219070, root_inode_start=51372217571, root_inode_offset=2678, sBlk=0x55555584a940 , dir_inode=0x7fffffffcc10, root_inode_block=0x7fffffffcbd
root_inode_size=0x7fffffffcd34, uncompressed_file=0x5555555aa388 <total_bytes>, uncompressed_directory=0x5555555aa3a8 <total_directory_bytes>, file_count=0x5555555aa360 <file_count>, sym_count=0x5555555aa364 <sym_count>,
dev_count=0x5555555aa368 <dev_count>, dir_count=0x5555555aa36c <dir_count>, fifo_count=0x5555555aa370 <fifo_count>, sock_count=0x5555555aa374 <sock_count>, id_table=0x555555bfed00) at read_fs.c:159
#5 0x00005555555776ad in read_filesystem (root_name=0x0, fd=3, sBlk=0x55555584a940 , cinode_table=0x5555555aa3c0 <inode_table>, data_cache=0x5555555aa3e0 <data_cache>, cdirectory_table=0x5555555aa390 <directory_table>
directory_data_cache=0x5555555aa3b0 <directory_data_cache>, last_directory_block=0x7fffffffcd28, inode_dir_offset=0x7fffffffcd2c, inode_dir_file_size=0x7fffffffcd30, root_inode_size=0x7fffffffcd34,
inode_dir_start_block=0x7fffffffcd38, file_count=0x5555555aa360 <file_count>, sym_count=0x5555555aa364 <sym_count>, dev_count=0x5555555aa368 <dev_count>, dir_count=0x5555555aa36c <dir_count>,
fifo_count=0x5555555aa370 <fifo_count>, sock_count=0x5555555aa374 <sock_count>, uncompressed_file=0x5555555aa388 <total_bytes>, uncompressed_inode=0x5555555aa3d8 <total_inode_bytes>,
uncompressed_directory=0x5555555aa3a8 <total_directory_bytes>, inode_dir_inode_number=0x7fffffffcd3c, inode_dir_parent_inode=0x7fffffffcd40, push_directory_entry=0x555555567aa3 <add_old_root_entry>,
fragment_table=0x5555555aa430 <fragment_table>, inode_lookup_table=0x5555555aa3f8 <inode_lookup_table>) at read_fs.c:1008
#6 0x0000555555574311 in main (argc=9, argv=0x7fffffffe078) at mksquashfs.c:8128
(gdb) display/i $pc
1: x/i $pc
=> 0x7ffff7e10466 <lzo1x_decompress_safe+118>: mov %r8b,(%rdx,%rax,1)
(gdb) i r
rax 0x0 0
rbx 0x8 8
rcx 0x7fffffffbb60 140737488337760
rdx 0x7ffb7d91e010 140718120230928
rsi 0x8 8
rdi 0x7fffffffbbc0 140737488337856
rbp 0x2000 0x2000
rsp 0x7fffffffbb08 0x7fffffffbb08
r8 0x12 18
r9 0x7fffffffc855 140737488341077
r10 0x7ffb7d920010 140718120239120
r11 0x7fffffffbbc1 140737488337857
r12 0xc95 3221
r13 0x0 0
r14 0xc95 3221
r15 0x0 0
rip 0x7ffff7e10466 0x7ffff7e10466 <lzo1x_decompress_safe+118>
eflags 0x10202 [ IF RF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
(gdb) x/64xb $rdx
0x7ffb7d91e010: Cannot access memory at address 0x7ffb7d91e010
(gdb)

Thanks for that detailed information, as it has given me something concrete to work with.

Looking at the Ubuntu packages website, the latest squashfs-tools version for Ubuntu 20.04 appears to be squashfs-tools (1:4.4-1ubuntu0.3). I have downloaded the source code for that, and it is a vanilla Squashfs-tools 4.4 release with some additional patches to fix some CVEs and some minor issues, none of which touch the code in question here.

You could verify that by running "mksquashfs -version" and reporting the result :-)

I have eye-balled the code in question, and it doesn't appear to have any bugs ... It is conservatively coded and very straightforward.

Additionally, I have checked the source control history, and this code is essentially unchanged since the earliest source control commit on 2005-11-18, almost 17 years ago. I also know the code dates from around 2003.

This is by definition not new code, and any bugs in it will have long since shown up. But nothing has been reported for almost 20 years.

In fact you're the only person to have previously reported issues with similar code in Unsquashfs, back in 2020.

#100

There you also report low performance and crashes.

As a result I optimised the code, and introduced a guesstimate for the overall size of the inode table, to reduce the total number of reallocs to around 2-8, rather than the millions previously needed on very large filesystems.

See commit 88bcbfa

I also optimised the code in question here, doing exactly the same thing, to reduce the total number of reallocs to around 2-8.

See commit 1c6f9df

So this issue is already fixed, and it is present in Squashfs-tools 4.5 and later.
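The approach described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual code from those commits; all names (`table_buf`, `ensure_space`) are made up. The idea is to grow the buffer by a whole up-front estimate of the final table size, so even a poor estimate costs only a handful of reallocs instead of one per metadata block:

```c
#include <stdlib.h>

/* Illustrative sketch of estimate-based buffer growth (names hypothetical,
 * not the real read_fs.c code). */
struct table_buf {
	char *data;
	size_t size;        /* bytes used */
	size_t alloc_size;  /* bytes allocated */
};

static int ensure_space(struct table_buf *t, size_t needed, size_t estimate)
{
	if (t->size + needed <= t->alloc_size)
		return 0;   /* common case after the first grow: no realloc */

	/* Grow by the whole estimate at once, so even a bad guess costs
	 * only a handful of reallocs rather than one per metadata block. */
	size_t new_alloc = t->alloc_size + (estimate > needed ? estimate : needed);
	char *p = realloc(t->data, new_alloc);
	if (p == NULL)
		return -1;

	t->data = p;
	t->alloc_size = new_alloc;
	return 0;
}
```

Using `size_t` for the running totals also sidesteps any 32-bit wraparound once the accumulated sizes pass 4GB.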

I can only speculate that the millions of reallocs in the earlier code are triggering an underlying bug which very occasionally results in an invalid pointer being returned.

But, as I said, this issue is already fixed in later versions of squashfs-tools. Whether distributions want to pick this fix up, either as a cherry-pick or by updating to a later version, is entirely their decision.

Looking at the Ubuntu packages website, the latest squashfs-tools version for Ubuntu 20.04 appears to be squashfs-tools (1:4.4-1ubuntu0.3)

It doesn't matter which source is packaged in Ubuntu, because I'm using tools built from the actual GitHub source.
Checked right now: my master is on commit 7647722

I'm willing to run any additional tests you want. Reproducing the issue is pretty stable, but it requires huge files with some private data that I can't share.

There you also report low performance and crashes.

You may have forgotten: that time I provided my own PR which fixed the issue, and you rewrote it your own way. And yes, the performance degradation, caused by heap fragmentation from too-frequent memory allocation in too-small chunks, was fixed.

It's just an integer overflow in the scan_inode_table function.

Environment:
start = 49940229060
end = 51372219070
alloc_size = 1431994368
bytes = 2863988736
size = 1015808

It's easy to see we did three reallocs:
Initially size = 0.
After the first, "size" was increased by "alloc_size" and became 0x555A'8000.
After the second, "size" was again increased by "alloc_size" and became 0xAAB5'0000.
After the third addition, "size" became 0x1'000F'8000, which overflows the 32-bit width, so size wraps to 0xF'8000 (1015808).
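The wraparound can be reproduced in isolation. A minimal sketch (not the actual read_fs.c code) tracking the running size in 32 bits, with the alloc_size value from the environment dump above:

```c
#include <stdint.h>
#include <stdio.h>

/* Demonstrates the 32-bit wraparound: three additions of alloc_size
 * exceed 2^32, so a 32-bit "size" silently truncates. Sketch only. */
static uint32_t simulate_three_reallocs(uint32_t alloc_size)
{
	uint32_t size = 0;

	for (int i = 1; i <= 3; i++) {
		size += alloc_size;  /* wraps modulo 2^32 on the third add */
		printf("after realloc %d: size = 0x%X (%u)\n",
		       i, (unsigned)size, (unsigned)size);
	}
	return size;
}
```

Calling `simulate_three_reallocs(1431994368)` prints 0x555A8000, 0xAAB50000 and finally 0xF8000 (1015808), matching the values above; tracking these running sizes in a 64-bit type such as `long long` or `size_t` avoids the wrap.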

That's not the filesystem you gave earlier, where start=50959549582, end=51372219070

e.g. from the backtrace

#4 0x000055555557522a in scan_inode_table (fd=3, start=50959549582, end=51372219070, root_inode_start=51372217571, root_inode_offset=2678, sBlk=0x55555584a940 , dir_inode=0x7fffffffcc10, root_inode_block=0x7fffffffcbd
root_inode_size=0x7fffffffcd34, uncompressed_file=0x5555555aa388 <total_bytes>, uncompressed_directory=0x5555555aa3a8 <total_directory_bytes>, file_count=0x5555555aa360 <file_count>, sym_count=0x5555555aa364 <sym_count>,
dev_count=0x5555555aa368 <dev_count>, dir_count=0x5555555aa36c <dir_count>, fifo_count=0x5555555aa370 <fifo_count>, sock_count=0x5555555aa374 <sock_count>, id_table=0x555555bfed00) at read_fs.c:159

It doesn't matter; it's easily fixed.

That's not the filesystem you gave earlier

I didn't have just a single reproducing image. Last time I chose the smallest one, to have a faster backup/restore in case it got damaged.