h2o / h2o

H2O - the optimized HTTP/1, HTTP/2, HTTP/3 server

Home Page: https://h2o.examp1e.net

linkedlist unlink segfault

zhemant opened this issue · comments

Hi,

I am facing a strange segfault and I am trying to understand how it is occurring. I would really appreciate any leads on how to proceed with debugging and solving this.

I get a segfault when the timerwheel tries to unlink a node. The backtrace is as follows:

(gdb) bt
#0  0x00007f41ded839f5 in h2o_linklist_unlink (node=0x7f41d007ddd0) at /builds/h2o/include/h2o/linklist.h:110
#1  cascade_one (ctx=0x7f41d00080d0, wheel=2, slot=27) at /builds/h2o/lib/common/timerwheel.c:278
#2  0x00007f41ded83819 in cascade_all (ctx=0x7f41d00080d0, wheel=<optimized out>) at /builds/h2o/lib/common/timerwheel.c:292
#3  h2o_timerwheel_get_expired (ctx=0x7f41d00080d0, now=<optimized out>, expired=<optimized out>) at /builds/h2o/lib/common/timerwheel.c:334
#4  0x00007f41ded79378 in h2o_evloop_run (loop=0x7f41d0008010, max_wait=2147483647) at /builds/h2o/lib/common/socket/evloop.c.h:932
#5  0x00007f41efeede76 in server_loop (_param=0x7f41d80166d0) at ../server.c:679
#6  0x00007f41f24f9609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007f41f1d30133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

When I check the values in frame 1, they look like this:

(gdb) f 1
#1  cascade_one (ctx=0x7f41d00080d0, wheel=2, slot=27) at /builds/h2o/lib/common/timerwheel.c:278
278	        h2o_linklist_unlink(&entry->_link);
(gdb) p *entry
$24 = {_link = {next = 0x0, prev = 0x0}, expire_at = 1689450652767, cb = 0x7f41efef02b0 <resp_timeout_cb>}
(gdb) 

As next and prev are NULL, unlinking the node causes the segfault.
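For context, h2o_linklist_unlink is the textbook doubly-linked-list removal (a paraphrase of include/h2o/linklist.h; the exact code in a given checkout may differ):

/* sketch of what h2o_linklist_unlink does */
static void unlink_sketch(h2o_linklist_t *node)
{
    node->next->prev = node->prev; /* with next == NULL, this writes to
                                      ((h2o_linklist_t *)NULL)->prev,
                                      i.e. address 0x8 on a 64-bit build */
    node->prev->next = node->next;
    node->next = node->prev = NULL; /* unlink itself zeroes the pointers */
}

Since a successful unlink zeroes next/prev, an entry in this state is either not linked at all or was already unlinked once. So the crash points to a double-unlink, or to the entry's memory being freed or reused while it was still on the wheel, which matches the "Address 0x8" invalid write that valgrind reports below.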

Valgrind shows something like this:

==3861139== Thread 3:
==3861139== Invalid write of size 8
==3861139==    at 0x185649F5: h2o_linklist_unlink (include/h2o/linklist.h:110)
==3861139==    by 0x185649F5: cascade_one (lib/common/timerwheel.c:278)
==3861139==    by 0x18564870: h2o_timerwheel_get_expired (lib/common/timerwheel.c:321)
==3861139==    by 0x1855A377: h2o_evloop_run (lib/common/socket/evloop.c.h:932)
==3861139==    by 0x184BE225: server_loop (server.c:679)
==3861139==    by 0x4AA4608: start_thread (pthread_create.c:477)
==3861139==    by 0x52CC132: clone (clone.S:95)
==3861139==  Address 0x8 is not stack'd, malloc'd or (recently) free'd
==3861139== 
==3861139== 
==3861139== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==3861139==  Access not within mapped region at address 0x8
==3861139==    at 0x185649F5: h2o_linklist_unlink (include/h2o/linklist.h:110)
==3861139==    by 0x185649F5: cascade_one (lib/common/timerwheel.c:278)
==3861139==    by 0x18564870: h2o_timerwheel_get_expired (lib/common/timerwheel.c:321)
==3861139==    by 0x1855A377: h2o_evloop_run (lib/common/socket/evloop.c.h:932)
==3861139==    by 0x184BE225: server_loop (server.c:679)
==3861139==    by 0x4AA4608: start_thread (pthread_create.c:477)
==3861139==    by 0x52CC132: clone (clone.S:95)

This looks like custom code calling into libh2o. Is this the first valgrind error you see? If so, it might help to increase valgrind's freelist with --freelist-vol=200000000 (one more zero than the default).
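For reference, that corresponds to an invocation along these lines (./server stands in for the actual binary):

# the default --freelist-vol is 20000000; a larger freelist keeps freed
# blocks quarantined longer, so later use-after-free accesses are still
# reported instead of silently landing in reused memory
valgrind --tool=memcheck --freelist-vol=200000000 ./server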

I am using libh2o, integrating it into my C code. I am on an old checkout, ec805ec5d544a1bfe2be105ae7d46b1b61673d2b.

In valgrind I only see this error. There is also a long list of memory leaks, but I think those are unrelated, as all of them are in my codebase outside of libh2o.

However, I am seeing a lot of blocks like this:

==3861139== Invalid read of size 8
==3861139==    at 0x184C086B: h2o_send_error_503 (h2o.h:1796)
==3861139==    by 0x184C0826: process_timeout_req_item (server.c:170)
==3861139==    by 0x184C06A9: resp_timeout_cb (server.c:184)
==3861139==    by 0x1855A2DA: h2o_evloop_run (lib/common/socket/evloop.c.h:938)
==3861139==    by 0x184BE225: server_loop (server.c:679)
==3861139==    by 0x4AA4608: start_thread (pthread_create.c:477)
==3861139==    by 0x52CC132: clone (clone.S:95)
==3861139==  Address 0x262202a8 is 280 bytes inside a block of size 1,176 free'd
==3861139==    at 0x483CA3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==3861139==    by 0x1859F238: handle_rst_stream_frame (lib/http2/connection.c:1202)
==3861139==    by 0x1859E7C8: expect_default (lib/http2/connection.c:1242)
==3861139==    by 0x1859C4AE: parse_input (lib/http2/connection.c:1285)
==3861139==    by 0x1859C4AE: on_read (lib/http2/connection.c:1327)
==3861139==    by 0x1855EC5A: read_on_ready (lib/common/socket/evloop.c.h:366)
==3861139==    by 0x1855EC5A: run_socket (lib/common/socket/evloop.c.h:834)
==3861139==    by 0x1855A269: run_pending (lib/common/socket/evloop.c.h:876)
==3861139==    by 0x1855A269: h2o_evloop_run (lib/common/socket/evloop.c.h:925)
==3861139==    by 0x184BE225: server_loop (server.c:679)
==3861139==    by 0x4AA4608: start_thread (pthread_create.c:477)
==3861139==    by 0x52CC132: clone (clone.S:95)
==3861139==  Block was alloc'd at
==3861139==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==3861139==    by 0x185A54D0: h2o_mem_alloc (include/h2o/memory.h:441)
==3861139==    by 0x185A54D0: h2o_http2_stream_open (lib/http2/stream.c:39)
==3861139==    by 0x1859ED05: handle_headers_frame (lib/http2/connection.c:999)
==3861139==    by 0x1859E7C8: expect_default (lib/http2/connection.c:1242)
==3861139==    by 0x1859C4AE: parse_input (lib/http2/connection.c:1285)
==3861139==    by 0x1859C4AE: on_read (lib/http2/connection.c:1327)
==3861139==    by 0x1855EC5A: read_on_ready (lib/common/socket/evloop.c.h:366)
==3861139==    by 0x1855EC5A: run_socket (lib/common/socket/evloop.c.h:834)
==3861139==    by 0x1855A269: run_pending (lib/common/socket/evloop.c.h:876)
==3861139==    by 0x1855A269: h2o_evloop_run (lib/common/socket/evloop.c.h:925)
==3861139==    by 0x184BE225: server_loop (server.c:679)
==3861139==    by 0x4AA4608: start_thread (pthread_create.c:477)
==3861139==    by 0x52CC132: clone (clone.S:95)

I believe this block arises from an inefficient client side, but I might be wrong. I am using curl, and instead of maintaining a single HTTP/2 connection I open a new connection for each request, which might result in RST_STREAM/GOAWAY frames and cause the above behaviour. I also have logging in process_timeout_req_item to show when a timeout occurs, but it is not usually executed.
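Whatever the client is doing, the valgrind report above reads as a use-after-free: the HTTP/2 stream (and with it the h2o_req_t) is freed in handle_rst_stream_frame, and resp_timeout_cb later dereferences it through h2o_send_error_503. A common way to keep such a timer from firing on a freed request is to tie the timer's lifetime to the request pool. A minimal sketch, assuming the evloop backend; pending_req_t, register_timeout and the callbacks below are hypothetical stand-ins for the real code in server.c:

#include "h2o.h"

typedef struct {
    h2o_req_t *req;
    h2o_timer_t timeout;
} pending_req_t;

static void on_timeout(h2o_timer_t *timer)
{
    pending_req_t *pr = H2O_STRUCT_FROM_MEMBER(pending_req_t, timeout, timer);
    /* the request is still alive here: had it been disposed, on_dispose
     * would already have unlinked this timer */
    h2o_send_error_503(pr->req, "Service Unavailable", "request timed out", 0);
}

static void on_dispose(void *_pr)
{
    pending_req_t *pr = _pr;
    /* the request pool is being torn down (e.g. the stream received a
     * RST_STREAM); make sure the timer cannot fire on freed memory */
    if (h2o_timer_is_linked(&pr->timeout))
        h2o_timer_unlink(&pr->timeout);
}

static void register_timeout(h2o_req_t *req, uint64_t delay_ms)
{
    /* allocating from req->pool ties the state to the request's lifetime;
     * the dispose callback runs when the pool is cleared */
    pending_req_t *pr = h2o_mem_alloc_shared(&req->pool, sizeof(*pr), on_dispose);
    pr->req = req;
    h2o_timer_init(&pr->timeout, on_timeout);
    h2o_timer_link(req->conn->ctx->loop, delay_ms, &pr->timeout);
}

The same dispose hook is also a natural place to cancel any in-flight async work that still holds the h2o_req_t pointer.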

The issue was due to responses going missing in my async response handling: some requests never got a response at all. Besides adding the timeout, I now always send a response, covering both the missing-response and error cases.
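In terms of the sketch above, the completion path would then look something like this (again hypothetical; h2o_send_inline and h2o_send_error_502 are real libh2o calls, the rest is a stand-in):

static void on_backend_reply(pending_req_t *pr, int ok, h2o_iovec_t body)
{
    /* the backend answered before the timeout fired: stop the timer and
     * always send exactly one response, even on error */
    if (h2o_timer_is_linked(&pr->timeout))
        h2o_timer_unlink(&pr->timeout);
    if (!ok) {
        h2o_send_error_502(pr->req, "Bad Gateway", "upstream error", 0);
        return;
    }
    pr->req->res.status = 200;
    pr->req->res.reason = "OK";
    h2o_send_inline(pr->req, body.base, body.len);
}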