plougher / squashfs-tools

tools to create and extract Squashfs filesystems

mksquashfs crashes during Fedora installer image build

AdamWill opened this issue

I'm afraid I don't have full details on this yet (it's very late and I'm filing this in a hurry), but downstream comments indicate a release is imminent, so I thought I'd better get a report in now, just in case it can stop the release going out with a potentially bad bug.

Fedora downstream testing of a new squashfs-tools build based on current git HEAD - 36abab0 - shows a consistent crash when mksquashfs runs as part of the Fedora installer image build process. See https://bugzilla.redhat.com/show_bug.cgi?id=2178510 . Unfortunately the test didn't produce a coredump because it exceeded the default resource limits; in the morning (if nobody else has got one by then) I'll either reproduce the crash manually or tweak the test to disable the resource limits so we get a coredump.
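(If it helps, lifting the core dump limit is just ulimit -c unlimited in the shell that runs mksquashfs; the sketch below does the equivalent with setrlimit() from C, as a hypothetical wrapper, not anything that's actually in our test tooling.)

    /* Hypothetical wrapper (not part of the Fedora test tooling): lift the
     * core dump size limit, then exec the crashing command, so a coredump
     * can be written even if the environment has RLIMIT_CORE set to 0. */
    #include <stdio.h>
    #include <sys/resource.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
            return 1;
        }

        struct rlimit rl = { RLIM_INFINITY, RLIM_INFINITY };
        if (setrlimit(RLIMIT_CORE, &rl) == -1)
            perror("setrlimit");

        execvp(argv[1], &argv[1]);
        perror("execvp");        /* only reached if exec failed */
        return 1;
    }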

OK, sorry about that.

Do you know the last version that didn't produce the crash? That would help narrow down what commits caused the problem.

Alternatively, if you could give me a reproducer, or test on an earlier version, that would be helpful.

Hmm, the errors reported, -11 and 4, don't appear to mean much. Interpreted as system call errors (errno) they're EAGAIN and EINTR. Perhaps they'd be more meaningful if I knew how to interpret them.

A web browse shows there is an earlier recent Squashfs-tools build based off git: squashfs-tools-4.6-0.2.20230323git7cf6cee.fc39. Has that been known to work? Very little has been done since then except documentation (tedious but necessary) and fixing a couple of annoying minor memory leaks that showed up in the latest testing.

It is very late here (I was about to go to bed), and I can't do anything more now. If there have been no developments by the time I get up, I'll see if I can install the latest Rawhide.

Interpreting the error codes as signals makes more sense, although there's no explicit indication in the log that a signal was delivered. 4 is SIGILL and 11 is SIGSEGV.

Yes, it's a signal 11 crash (negative return codes indicate signals, by convention).
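(For anyone reading along: that convention comes from how the wait-status macros report a child killed by a signal; the toy program below is just an illustration, not the Fedora build code, and prints the signal both ways.)

    /* Illustration only (not the Fedora build tooling): a child killed by
     * SIGSEGV shows up via WIFSIGNALED/WTERMSIG, and wrappers commonly
     * report that as the negated signal number, i.e. -11. */
    #include <signal.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();

        if (pid == 0) {
            raise(SIGSEGV);   /* child: simulate the crash */
            return 0;
        }

        int status;
        waitpid(pid, &status, 0);

        if (WIFSIGNALED(status))
            printf("killed by signal %d (reported as %d)\n",
                   WTERMSIG(status), -WTERMSIG(status));
        else if (WIFEXITED(status))
            printf("exited with code %d\n", WEXITSTATUS(status));

        return 0;
    }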

The most recent build before this one was actually even newer than that: squashfs-tools-4.6-0.5.20230312gitaaf011a.fc39 (i.e. aaf011a). That one passed tests, so the cause should be one of the six commits since then.

I will try and get a coredump today, or failing that, I'll try and bisect it.

That makes 83b2f3a look like the suspect. I'll try reverting it.

Could the problem be that j isn't actually available outside the for loop? i comes from outside the for loop and so is obviously still available once we've finished the loop, but j is declared as part of the loop. I'm no C expert, but Google tells me that when you declare a variable in a for loop like this, it's not available outside it: https://stackoverflow.com/questions/7880658/what-is-the-scope-of-a-while-and-for-loop

OTOH, I don't understand why, before the change, we were doing xattr_list[i - 1].vnext = NULL; rather than xattr_list[i].vnext = NULL;

edit: duh, yes I do. When e.g. i is 5, on the last iteration of the loop where we actually do the work, j is 4, and we set xattr_list[3].vnext to xattr_list[4]. So once we're finished, we need to set xattr_list[4].vnext - not xattr_list[5].vnext - to NULL. Array indexes start at 0; i is a count from 1. Duh.
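To make that concrete, here's a minimal standalone sketch of the pattern as I understand it (the struct and setup are made up for illustration; it's not the actual xattr.c code):

    /* Simplified sketch, not the actual xattr.c code: i is a count of
     * entries, so valid indexes run from 0 to i - 1. */
    #include <stdio.h>
    #include <stdlib.h>

    struct entry {
        struct entry *vnext;
    };

    int main(void)
    {
        int i = 5;
        struct entry *xattr_list = calloc(i, sizeof(struct entry));

        /* Chain the entries together; on the last iteration j is i - 1,
         * so the last link written is xattr_list[i - 2].vnext. */
        for (int j = 1; j < i; j++)
            xattr_list[j - 1].vnext = &xattr_list[j];

        /* The final entry is xattr_list[i - 1], so that is the vnext that
         * must be NULL-terminated.  xattr_list[j].vnext here would not
         * compile (j is scoped to the loop), and with j declared outside
         * the loop it would be xattr_list[i].vnext - one past the end. */
        xattr_list[i - 1].vnext = NULL;

        for (struct entry *e = xattr_list; e != NULL; e = e->vnext)
            printf("entry %ld\n", (long)(e - xattr_list));

        free(xattr_list);
        return 0;
    }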

You actually did the exact opposite change - from xattr_list[j].vnext = NULL; to xattr_list[i - 1].vnext = NULL; - in c5db34e in September, to "fix out of bounds access", which sounds exactly like this bug. Maybe reverting that in this commit was simply inadvertent? You were working from an old version of this file or something?

#231 fixes this in a local test.

> You actually did the exact opposite change - from xattr_list[j].vnext = NULL; to xattr_list[i - 1].vnext = NULL; - in c5db34e in September, to "fix out of bounds access", which sounds exactly like this bug. Maybe reverting that in this commit was simply inadvertent? You were working from an old version of this file or something?

Probably something like that. When fixing small memory leaks I like to look at previous versions to understand when and why the leak occurred.

Usual problems: the release is already very late, and I keep finding more things to do.

It proves the old proverb "more haste, less speed" very true.