htop-dev / htop

htop - an interactive process viewer

Home Page: https://htop.dev/

htop crashed

matthiaskrgr opened this issue · comments

I found this; unfortunately I did not see it crash live. I think htop had been running for a couple of days:

at 12:19:38 ❯ htop


FATAL PROGRAM ERROR DETECTED
============================
Please check at https://htop.dev/issues whether this issue has already been reported.
If no similar issue has been reported before, please create a new issue with the following information:
  - Your htop version: '3.3.0'
  - Your OS and kernel version (uname -a)
  - Your distribution and release (lsb_release -a)
  - Likely steps to reproduce (How did it happen?)
  - Backtrace of the issue (see below)

Error information:
------------------
A signal 7 (Bus error) was received.

Setting information:
--------------------
htop_version=3.3.0;config_reader_min_version=3;fields=0 48 17 18 38 39 40 2 46 47 49 1;hide_kernel_threads=1;hide_userland_threads=0;hide_running_in_container=0;shadow_other_users=0;show_thread_names=0;show_program_path=1;highlight_base_name=0;highlight_deleted_exe=0;shadow_distribution_path_prefix=0;highlight_megabytes=1;highlight_threads=0;highlight_changes=0;highlight_changes_delay_secs=5;find_comm_in_cmdline=1;strip_exe_from_cmdline=1;show_merged_command=0;header_margin=0;screen_tabs=1;detailed_cpu_time=0;cpu_count_from_one=1;show_cpu_usage=1;show_cpu_frequency=1;show_cpu_temperature=1;degree_fahrenheit=0;update_process_names=0;account_guest_in_cpu_meter=0;color_scheme=0;enable_mouse=1;delay=2;hide_function_bar=0;header_layout=two_50_50;column_meters_0=AllCPUs Memory Swap;column_meter_modes_0=1 1 1;column_meters_1=Tasks LoadAverage Uptime Battery NetworkIO DiskIO CPU;column_meter_modes_1=2 2 2 1 1 1 1;tree_view=0;sort_key=46;tree_sort_key=0;sort_direction=-1;tree_sort_direction=1;tree_view_always_by_pid=1;all_branches_collapsed=1;screen:Main=PID USER PRIORITY NICE M_VIRT M_RESIDENT M_SHARE STATE PERCENT_CPU PERCENT_MEM TIME Command;.sort_key=PERCENT_CPU;.tree_sort_key=PID;.tree_view_always_by_pid=1;.tree_view=0;.sort_direction=-1;.tree_sort_direction=1;.all_branches_collapsed=1;screen:I/O=PID USER IO_PRIORITY IO_RATE IO_READ_RATE IO_WRITE_RATE PERCENT_SWAP_DELAY PERCENT_IO_DELAY Command;.sort_key=IO_RATE;.tree_sort_key=PID;.tree_view_always_by_pid=0;.tree_view=0;.sort_direction=-1;.tree_sort_direction=1;.all_branches_collapsed=0;

Backtrace information:
----------------------
htop(+0x140ec)[0x56549d84f0ec]
htop(CRT_handleSIGSEGV+0xf2)[0x56549d856bc2]
/usr/lib/libc.so.6(+0x40770)[0x7f36c734c770]
/usr/lib/libc.so.6(_IO_file_doallocate+0x0)[0x7f36c7385e90]
/usr/lib/libc.so.6(_IO_doallocbuf+0x49)[0x7f36c7394c89]
/usr/lib/libc.so.6(_IO_file_underflow+0x225)[0x7f36c7392a65]
/usr/lib/libc.so.6(_IO_default_uflow+0x2f)[0x7f36c7394d2f]
/usr/lib/libc.so.6(+0x63c60)[0x7f36c736fc60]
/usr/lib/libc.so.6(__isoc23_fscanf+0x9d)[0x7f36c736404d]
htop(+0x43874)[0x56549d87e874]
htop(Machine_scanTables+0x9e)[0x56549d85a89e]
htop(ScreenManager_run+0x671)[0x56549d86e731]
htop(CommandLine_run+0x87d)[0x56549d85750d]
/usr/lib/libc.so.6(+0x29cd0)[0x7f36c7335cd0]
/usr/lib/libc.so.6(__libc_start_main+0x8a)[0x7f36c7335d8a]
htop(_start+0x25)[0x56549d84d065]

To make the above information more practical to work with, please also provide a disassembly of your htop binary. This can usually be done by running the following command:

   objdump -d -S -w `which htop` > ~/htop.objdump

Please include the generated file in your report.
Running this program with debug symbols or inside a debugger may provide further insights.

Thank you for helping to improve htop!

[1]    156173 bus error  htop
~ took 2d45m4s

objdump: Warning: source file /usr/include/bits/unistd.h is more recent than object file
objdump: Warning: source file /usr/include/bits/stdio2.h is more recent than object file
objdump: Warning: source file /usr/include/bits/string_fortified.h is more recent than object file
objdump: Warning: source file /usr/include/bits/stdlib.h is more recent than object file
objdump: Warning: source file /usr/include/stdlib.h is more recent than object file
objdump: Warning: source file /usr/include/bits/fcntl2.h is more recent than object file
objdump: Warning: source file /usr/include/wchar.h is more recent than object file
objdump: Warning: source file /usr/include/sys/sysmacros.h is more recent than object file
htop.objdump.zip

Hmm, maybe it was OOMkilled?

[240333.375663] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-2.scope,task=rustc,pid=2467259,uid=1000
[240333.375691] Out of memory: Killed process 2467259 (rustc) total-vm:2351628kB, anon-rss:1788928kB, file-rss:256kB, shmem-rss:0kB, UID:1000 pgtables:4240kB oom_score_adj:0
[240334.464703] htop invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[240334.464715] CPU: 9 PID: 173079 Comm: htop Not tainted 6.6.16-2-MANJARO #1 1233f3fcf521bb3b79a62ab2032e7c268e70bb52
[240334.464721] Hardware name: LENOVO 21A00071GE/21A00071GE, BIOS R1MET44W (1.14 ) 01/21/2022
[240334.464724] Call Trace:
[240334.464727]  <TASK>
[240334.464733]  dump_stack_lvl+0x47/0x60
[240334.464742]  dump_header+0x4a/0x240
[240334.464750]  oom_kill_process+0xf9/0x190
[240334.464756]  out_of_memory+0x246/0x590
[240334.464762]  __alloc_pages_slowpath.constprop.0+0xaf6/0xdb0
[240334.464775]  __alloc_pages+0x32d/0x350
[240334.464783]  folio_alloc+0x1b/0x50
[240334.464789]  __filemap_get_folio+0x13e/0x2e0
[240334.464797]  filemap_fault+0x153/0xb60
[240334.464807]  __do_fault+0x33/0x130
[240334.464813]  do_fault+0x29b/0x450
[240334.464819]  __handle_mm_fault+0x796/0xd90
[240334.464831]  handle_mm_fault+0x17f/0x360
[240334.464836]  do_user_addr_fault+0x1ed/0x660
[240334.464844]  exc_page_fault+0x7f/0x180
[240334.464851]  asm_exc_page_fault+0x26/0x30
[240334.464856] RIP: 0033:0x7f334255369d
[240334.464899] Code: Unable to access opcode bytes at 0x7f3342553673.
[240334.464901] RSP: 002b:00007fff77f05500 EFLAGS: 00010246
[240334.464906] RAX: 0000000000000000 RBX: 00007f33426d53c0 RCX: 000000000000002e
[240334.464910] RDX: 00007f3342695b1a RSI: 00007f33426d53c0 RDI: 00007f3342695b1a
[240334.464912] RBP: 0000000000000000 R08: 0000000000000046 R09: 0000000000000000
[240334.464915] R10: 0000000000000000 R11: 00005586fc1df6fe R12: 00007f3342697813
[240334.464917] R13: 00007fff77f056a0 R14: 00007f3342695b1a R15: 00007fff77f05690
[240334.464927]  </TASK>
[240334.464929] Mem-Info:
[240334.464932] active_anon:6210609 inactive_anon:534797 isolated_anon:0
                 active_file:221 inactive_file:1100 isolated_file:0
                 unevictable:8 dirty:442 writeback:0
                 slab_reclaimable:133088 slab_unreclaimable:105270
                 mapped:776 shmem:171470 pagetables:17369
                 sec_pagetables:0 bounce:0
                 kernel_misc_reclaimable:0
                 free:44445 free_pcp:157 free_cma:0

I probably ran a lot of memory-intensive processes at that time, but it still makes me wonder why htop would get selected to be OOM-killed.

Your Rust compiler (rustc) was OOM-killed, but htop got a signal 7 from libc; looks like fscanf lost its target midway.
@BenBE wanna dissect the objdump for which function bailed out?

Your Rust compiler (rustc) was OOM-killed, but htop got a signal 7 from libc; looks like fscanf lost its target midway.
@BenBE wanna dissect the objdump for which function bailed out?

Would be nice. It would at least allow us to harden that location.

Your Rust compiler (rustc) was OOM-killed, but htop got a signal 7 from libc; looks like fscanf lost its target midway.

ouff 😓

I was fuzzing rustc, so it makes sense that it OOM'd, but it also looks like the oom-killer went a bit haywire (I excluded rustc from these entries):

Mär 06 13:01:42 p14s kernel: htop invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
Mär 06 13:02:19 p14s kernel: sound_pulseaudi invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
Mär 06 13:02:23 p14s kernel: gdbus invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
Mär 06 13:02:23 p14s kernel: systemd-journal invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=-250
Mär 06 13:02:23 p14s kernel: htop invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
Mär 06 13:02:24 p14s kernel: treereduce-rust invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
Mär 06 13:02:27 p14s kernel: pulseaudio invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=200
Mär 06 13:02:35 p14s kernel: rustup invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
Mär 06 13:02:57 p14s kernel: PTY reader invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
Mär 06 13:05:03 p14s kernel: htop invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
Mär 06 13:05:08 p14s kernel: htop invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
Mär 06 13:11:03 p14s kernel: treereduce-rust invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
Mär 06 13:15:34 p14s kernel: treereduce-rust invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
Mär 06 13:23:46 p14s kernel: treereduce-rust invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
Mär 06 13:24:52 p14s kernel: notify-rs poll  invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
Mär 06 13:24:53 p14s kernel: i3status-rs invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
Mär 06 13:26:15 p14s kernel: treereduce-rust invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
Mär 06 13:34:35 p14s kernel: htop invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
Mär 06 13:34:39 p14s kernel: treereduce-rust invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
Mär 06 13:35:12 p14s kernel: treereduce-rust invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0

so I guess "xy invoked oom-killer" does not mean "xy tried to reserve 30 GB of memory", but rather "xy tried to access additional memory, which wasn't available, so the oom-killer killed something else"?

That crash is kinda interesting in a way … Looking at the dump, I could narrow the crash site down to the following portion of the code:

```c
static bool LinuxProcessTable_readStatmFile(LinuxProcess* process, openat_arg_t procFd, const LinuxMachine* host) {
   FILE* statmfile = fopenat(procFd, "statm", "r");
   if (!statmfile)
      return false;

   long int dummy, dummy2;

   // \/------------- Crash happens inside this call
   int r = fscanf(statmfile, "%ld %ld %ld %ld %ld %ld %ld",
                  &process->super.m_virt,
                  &process->super.m_resident,
                  &process->m_share,
                  &process->m_trs,
                  &dummy, /* unused since Linux 2.6; always 0 */
                  &process->m_drs,
                  &dummy2); /* unused since Linux 2.6; always 0 */
   fclose(statmfile);

   …
}
```

What's interesting about this is that at this point htop had already established that the file exists AND had a valid handle on it (if this wasn't the case, the fopenat would have failed and the check for statmfile would have returned false to our caller).

(The debug symbols in the object dump simplified things quite a bit: it only took subtracting two offsets and applying that to another address given in your backtrace.)

Now for the really thrilling part: following the trace, we notice the calls inside fscanf, AKA __isoc23_fscanf. Given that the file had not read any data yet, we end up in _IO_file_underflow, which proceeds to allocate the internal buffers for reading. So far, normal. What is unusual, though, is the entry _IO_file_doallocate+0x0, which indicates a return address pointing at that function's start address. The fun thing is, this can't happen with a normal call instruction, as that would leave the return address somewhere inside the function. That stack frame was thus placed there by other means, e.g. some tail-call unwinding generated by the compiler when libc was built.

The next call in is somewhere else entirely inside libc, referring to some syscall; I'd suspect malloc wasn't called here, as that should leave the stack in a somewhat different state. But mmap may be a candidate.

We now have two options to proceed:

  1. If you have an objdump of the libc at that time with debug symbols we can follow up that mystery location and determine exactly, which instruction caused the SIGBUS.
  2. Harden htop regarding this routine and replace the fscanf with xReadFile+sscanf instead.

Depending on the actual cause inside libc, the second option may not be a final fix, but it would at least help us get more control over things, provided read itself does not crash …

If you have an objdump of the libc at that time with debug symbols

how can I check this? 😅

If you haven't done a system update of your libc since the crash happened, chances are just running objdump with the same arguments as for htop will do the trick. Given that we have several libc symbols in the backtrace, we should be able to adjust for any differences if necessary.

libc.objdump.zip
hope this helps :)

just for fun I built htop with ASan+UBSan and tried to execute that binary under prlimit --as=... to artificially limit the memory available to it (to simulate OOM), but it looks like ASan binaries do not really like prlimit :/

Once we are past this interesting investigation … htop should probably handle SIGBUS (and the commonly used SIGUSR) to mean "restart anew". There's no need for htop to abort on that; we just need to make sure the data structures are reset, as in this case a read came up short and made libc bail out.

@fasterit I'm still looking into what actually caused this SIGBUS instead of the expected SIGSEGV in this case. Also, if libc bailed normally, wouldn't you expect SIGABRT instead?

Apart from that, the better and simpler solution for this case is probably to just replace the libc fscanf with sscanf on a (stack) buffer we control. That way there's no black-box buffer voodoo going on behind the scenes.

You get SIGABRT if glibc detects an error (fencing code etc.) and SIGSEGV when you access unallocated memory. But if you do that access unaligned (regardless of whether the address is in allocated memory or not), you get SIGBUS. That one (typically) fires first.

Given that glibc does dynamic memory management for its file API, I did a small patch in #1409 that substitutes the glibc voodoo with xReadfileat, which can operate on static buffers instead.

Should be resolved via 2b7f4a6

thank you 😄