h2o / h2o

H2O - the optimized HTTP/1, HTTP/2, HTTP/3 server

Home Page: https://h2o.examp1e.net

linkedlist unlink segfault

zhemant opened this issue · comments

Hi,

I am facing a strange segfault and I am trying to understand how it is occurring. I would really appreciate any leads on how to proceed with debugging and solving this.

I get a segfault when the timerwheel tries to unlink a node. The backtrace is as follows:

(gdb) bt
#0  0x00007f41ded839f5 in h2o_linklist_unlink (node=0x7f41d007ddd0) at /builds/h2o/include/h2o/linklist.h:110
#1  cascade_one (ctx=0x7f41d00080d0, wheel=2, slot=27) at /builds/h2o/lib/common/timerwheel.c:278
#2  0x00007f41ded83819 in cascade_all (ctx=0x7f41d00080d0, wheel=<optimized out>) at /builds/h2o/lib/common/timerwheel.c:292
#3  h2o_timerwheel_get_expired (ctx=0x7f41d00080d0, now=<optimized out>, expired=<optimized out>) at /builds/h2o/lib/common/timerwheel.c:334
#4  0x00007f41ded79378 in h2o_evloop_run (loop=0x7f41d0008010, max_wait=2147483647) at /builds/h2o/lib/common/socket/evloop.c.h:932
#5  0x00007f41efeede76 in server_loop (_param=0x7f41d80166d0) at ../server.c:679
#6  0x00007f41f24f9609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f41f1d30133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

When I check the values in frame 1, they look like this:

(gdb) f 1
#1  cascade_one (ctx=0x7f41d00080d0, wheel=2, slot=27) at /builds/h2o/lib/common/timerwheel.c:278
278	        h2o_linklist_unlink(&entry->_link);
(gdb) p *entry
$24 = {_link = {next = 0x0, prev = 0x0}, expire_at = 1689450652767, cb = 0x7f41efef02b0 <resp_timeout_cb>}
(gdb) 

As next and prev are NULL, unlinking the node causes the segfault.
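For context, h2o_linklist_unlink is the textbook doubly-linked-list removal (a paraphrase of include/h2o/linklist.h; the exact code in a given checkout may differ):

/* sketch of what h2o_linklist_unlink does */
static void unlink_sketch(h2o_linklist_t *node)
{
    node->next->prev = node->prev; /* with next == NULL, this writes to
                                      ((h2o_linklist_t *)NULL)->prev,
                                      i.e. address 0x8 on a 64-bit build */
    node->prev->next = node->next;
    node->next = node->prev = NULL; /* unlink itself zeroes the pointers */
}

Since a successful unlink zeroes next/prev, an entry in this state is either not linked at all or was already unlinked once. So the crash points to a double-unlink, or to the entry's memory being freed or reused while it was still on the wheel, which matches the "Address 0x8" invalid write that valgrind reports below.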

Valgrind shows something like this:

==3861139== Thread 3:
==3861139== Invalid write of size 8
==3861139==    at 0x185649F5: h2o_linklist_unlink (include/h2o/linklist.h:110)
==3861139==    by 0x185649F5: cascade_one (lib/common/timerwheel.c:278)
==3861139==    by 0x18564870: h2o_timerwheel_get_expired (lib/common/timerwheel.c:321)
==3861139==    by 0x1855A377: h2o_evloop_run (lib/common/socket/evloop.c.h:932)
==3861139==    by 0x184BE225: server_loop (server.c:679)
==3861139==    by 0x4AA4608: start_thread (pthread_create.c:477)
==3861139==    by 0x52CC132: clone (clone.S:95)
==3861139==  Address 0x8 is not stack'd, malloc'd or (recently) free'd
==3861139== 
==3861139== 
==3861139== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==3861139==  Access not within mapped region at address 0x8
==3861139==    at 0x185649F5: h2o_linklist_unlink (include/h2o/linklist.h:110)
==3861139==    by 0x185649F5: cascade_one (lib/common/timerwheel.c:278)
==3861139==    by 0x18564870: h2o_timerwheel_get_expired (lib/common/timerwheel.c:321)
==3861139==    by 0x1855A377: h2o_evloop_run (lib/common/socket/evloop.c.h:932)
==3861139==    by 0x184BE225: server_loop (server.c:679)
==3861139==    by 0x4AA4608: start_thread (pthread_create.c:477)
==3861139==    by 0x52CC132: clone (clone.S:95)

This looks like custom code calling into libh2o. Is this the first valgrind error you see? If so, it might help to increase valgrind's freelist with --freelist-vol=200000000 (one more zero than the default).
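For reference, that corresponds to an invocation along these lines (./server stands in for the actual binary):

# the default --freelist-vol is 20000000; a larger freelist keeps freed
# blocks quarantined longer, so later use-after-free accesses are still
# reported instead of silently landing in reused memory
valgrind --tool=memcheck --freelist-vol=200000000 ./server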

I am using libh2o, integrating it into my C code. I am on an old checkout, ec805ec5d544a1bfe2be105ae7d46b1b61673d2b.

In valgrind I only see this error. There is also a long list of memory leaks, but I think those are unrelated, as all of them are in my codebase outside of libh2o.

However, I am seeing a lot of blocks like this:

==3861139== Invalid read of size 8
==3861139==    at 0x184C086B: h2o_send_error_503 (h2o.h:1796)
==3861139==    by 0x184C0826: process_timeout_req_item (server.c:170)
==3861139==    by 0x184C06A9: resp_timeout_cb (server.c:184)
==3861139==    by 0x1855A2DA: h2o_evloop_run (lib/common/socket/evloop.c.h:938)
==3861139==    by 0x184BE225: server_loop (server.c:679)
==3861139==    by 0x4AA4608: start_thread (pthread_create.c:477)
==3861139==    by 0x52CC132: clone (clone.S:95)
==3861139==  Address 0x262202a8 is 280 bytes inside a block of size 1,176 free'd
==3861139==    at 0x483CA3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==3861139==    by 0x1859F238: handle_rst_stream_frame (lib/http2/connection.c:1202)
==3861139==    by 0x1859E7C8: expect_default (lib/http2/connection.c:1242)
==3861139==    by 0x1859C4AE: parse_input (lib/http2/connection.c:1285)
==3861139==    by 0x1859C4AE: on_read (lib/http2/connection.c:1327)
==3861139==    by 0x1855EC5A: read_on_ready (lib/common/socket/evloop.c.h:366)
==3861139==    by 0x1855EC5A: run_socket (lib/common/socket/evloop.c.h:834)
==3861139==    by 0x1855A269: run_pending (lib/common/socket/evloop.c.h:876)
==3861139==    by 0x1855A269: h2o_evloop_run (lib/common/socket/evloop.c.h:925)
==3861139==    by 0x184BE225: server_loop (server.c:679)
==3861139==    by 0x4AA4608: start_thread (pthread_create.c:477)
==3861139==    by 0x52CC132: clone (clone.S:95)
==3861139==  Block was alloc'd at
==3861139==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==3861139==    by 0x185A54D0: h2o_mem_alloc (include/h2o/memory.h:441)
==3861139==    by 0x185A54D0: h2o_http2_stream_open (lib/http2/stream.c:39)
==3861139==    by 0x1859ED05: handle_headers_frame (lib/http2/connection.c:999)
==3861139==    by 0x1859E7C8: expect_default (lib/http2/connection.c:1242)
==3861139==    by 0x1859C4AE: parse_input (lib/http2/connection.c:1285)
==3861139==    by 0x1859C4AE: on_read (lib/http2/connection.c:1327)
==3861139==    by 0x1855EC5A: read_on_ready (lib/common/socket/evloop.c.h:366)
==3861139==    by 0x1855EC5A: run_socket (lib/common/socket/evloop.c.h:834)
==3861139==    by 0x1855A269: run_pending (lib/common/socket/evloop.c.h:876)
==3861139==    by 0x1855A269: h2o_evloop_run (lib/common/socket/evloop.c.h:925)
==3861139==    by 0x184BE225: server_loop (server.c:679)
==3861139==    by 0x4AA4608: start_thread (pthread_create.c:477)
==3861139==    by 0x52CC132: clone (clone.S:95)

I believe this block arises from an inefficient client side, but I might be wrong. I am using curl, and instead of maintaining a single HTTP/2 connection I open a new connection for each request, which might result in RST_STREAM/GOAWAY frames and cause the above behaviour. I also have logging in process_timeout_req_item to show when a timeout occurs, but it is not usually executed.
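Whatever the client is doing, the valgrind report above reads as a use-after-free: the HTTP/2 stream (and with it the h2o_req_t) is freed in handle_rst_stream_frame, and resp_timeout_cb later dereferences it through h2o_send_error_503. A common way to keep such a timer from firing on a freed request is to tie the timer's lifetime to the request pool. A minimal sketch, assuming the evloop backend; pending_req_t, register_timeout and the callbacks below are hypothetical stand-ins for the real code in server.c:

#include "h2o.h"

typedef struct {
    h2o_req_t *req;
    h2o_timer_t timeout;
} pending_req_t;

static void on_timeout(h2o_timer_t *timer)
{
    pending_req_t *pr = H2O_STRUCT_FROM_MEMBER(pending_req_t, timeout, timer);
    /* the request is still alive here: had it been disposed, on_dispose
     * would already have unlinked this timer */
    h2o_send_error_503(pr->req, "Service Unavailable", "request timed out", 0);
}

static void on_dispose(void *_pr)
{
    pending_req_t *pr = _pr;
    /* the request pool is being torn down (e.g. the stream received a
     * RST_STREAM); make sure the timer cannot fire on freed memory */
    if (h2o_timer_is_linked(&pr->timeout))
        h2o_timer_unlink(&pr->timeout);
}

static void register_timeout(h2o_req_t *req, uint64_t delay_ms)
{
    /* allocating from req->pool ties the state to the request's lifetime;
     * the dispose callback runs when the pool is cleared */
    pending_req_t *pr = h2o_mem_alloc_shared(&req->pool, sizeof(*pr), on_dispose);
    pr->req = req;
    h2o_timer_init(&pr->timeout, on_timeout);
    h2o_timer_link(req->conn->ctx->loop, delay_ms, &pr->timeout);
}

The same dispose hook is also a natural place to cancel any in-flight async work that still holds the h2o_req_t pointer.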

The issue was due to responses going missing in my async response handling: some requests never got a response at all. Besides adding the timeout, I now always send a response, covering both the missing-response and error cases.
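In terms of the sketch above, the completion path would then look something like this (again hypothetical; h2o_send_inline and h2o_send_error_502 are real libh2o calls, the rest is a stand-in):

static void on_backend_reply(pending_req_t *pr, int ok, h2o_iovec_t body)
{
    /* the backend answered before the timeout fired: stop the timer and
     * always send exactly one response, even on error */
    if (h2o_timer_is_linked(&pr->timeout))
        h2o_timer_unlink(&pr->timeout);
    if (!ok) {
        h2o_send_error_502(pr->req, "Bad Gateway", "upstream error", 0);
        return;
    }
    pr->req->res.status = 200;
    pr->req->res.reason = "OK";
    h2o_send_inline(pr->req, body.base, body.len);
}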