joedevivo / chatterbox

HTTP/2 Server for Erlang. Boy, that guy was a real chatterbox waddn't he? I didn't think he was ever going to stop with the story.

Binary leaking in h2_connection processes

jonasrichard opened this issue · comments

I use chatterbox as a dependency of the inaka/apns4erl project. After some hours of running the application, I noticed ever-growing process sizes (process_info(pid, memory)). After some investigation I found this:

process_info(pid(0,1238,0), binary).
{binary,[{139665827264464,60,1},
         {139665831210656,60,1},
         {139664600088128,60,1},
         {139665831140616,60,1},
         {139665831222376,60,1},
         {139664600244240,60,1},
         {139665827188416,60,1},
         {139664600522856,60,1},
         {139664605657712,60,1},
         {139664605954696,60,1},
         {139664605903872,60,1},
         {139664606052304,60,1},
         {139664600433752,60,1},
         {139664605649112,60,1},
         {139664531257216,60,1},
         {139665827110208,60,1},
         {139664605153864,60,1},
         {139665827411976,60,1},
         {139665827397160,60,1},
         {139665827115824,60,1},
         {139664605243344,60,1},
         {139664605133416,60,1},
         {139664605420536,60,1},
         {139664605405256,60,1},
         {139664637840368,60,...},
         {139664638044448,...},
         {...}|...]}
(myhost@my_ip)48> length(element(2, v(47))).
1851

Do you have a clue where this binary leak might be occurring?

If you call for a manual garbage collection of the process, do these go away? If so, the leak is probably fixable through well-placed hibernation calls more than anything else.
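Something like this from the shell would show it, reusing the pid from your output above:

> P = pid(0,1238,0).
> erlang:garbage_collect(P).
> length(element(2, process_info(P, binary))).

If the count drops sharply after the collect, the binaries were only being kept alive by stale references on the process heap.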

I did an erlang:garbage_collect/0 but the binaries didn't go away. I don't know the process structure of chatterbox, but I think there are two processes passing the binaries around, and after some matching one process holds on to the refc binaries. It is hard to pin down when those binaries remain, because at least 50,000 HTTP/2 requests were sent (far more than the number of refc binaries). I think it happens only in edge cases.
I don't use chatterbox directly but through apns4erl:
https://github.com/inaka/apns4erl
Tomorrow I will check whether other processes have something similar. Do you think I can see those binaries in a crash dump? Or will the GC during the crash dump eliminate them?

If you can see them in the crash dump, they were really being held in memory by the process. If they are gone, that's because the VM ran a GC before writing them to the dump.

Ok, calling garbage_collect(Pid) decreased the number of refc binaries from 1932 to 84. So we will probably call GC only for those processes whose refc binary count is above some threshold. Does that sound right (not too odd)?
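For the record, garbage_collect/0 only collects the calling process (the shell in this case), which is presumably why it didn't help earlier, while garbage_collect(Pid) collects the target process. The sweep I have in mind is roughly this (names made up, run by hand or from a periodic timer):

gc_binary_heavy(Threshold) ->
    %% GC every process that currently holds more than Threshold
    %% off-heap (refc) binary references.
    lists:foreach(
      fun(P) ->
              case erlang:process_info(P, binary) of
                  {binary, Bins} when length(Bins) > Threshold ->
                      erlang:garbage_collect(P);
                  _ ->
                      ok
              end
      end,
      erlang:processes()).

It is a workaround rather than a fix, of course; the references will just build up again.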

That can work temporarily. I think the ideal fix in these cases, for now, is to modify the risky processes to hibernate from time to time when they get downtime, so they force this on their own.
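In a gen_statem like h2_connection that could look roughly like this; a sketch with placeholder names, not the actual chatterbox code:

%% Arm an idle timer on every event and hibernate when it fires, so an
%% idle connection compacts its heap and drops stale binary references
%% on its own.
connected(state_timeout, hibernate_now, Data) ->
    {keep_state, Data, [hibernate]};
connected(EventType, Event, Data) ->
    NewData = handle_connected_event(EventType, Event, Data), %% placeholder
    {keep_state, NewData, [{state_timeout, 300000, hibernate_now}]}.

Hibernation does a full-sweep collection and shrinks the heap, so the stale refc references go with it.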

It is also interesting that process_info(memory) and the heap size in the normal process_info output give me different results. And like I said, the reductions were huge. So I decided to reconnect every 10 minutes instead of running the GC.

> process_info(P1, memory).
{memory,201303168}
> process_info(P1).
...
 {total_heap_size,17474138},
 {heap_size,2072833},
 {stack_size,10},
 {reductions,123302750526},
 {garbage_collection,[{max_heap_size,#{error_logger => true,kill => true,size => 0}},
                      {min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,41}]},
...

Looking at this again: if it makes sense, we could add a hibernate_after option to the gen_statem:start_link call of h2_connection so that it hibernates when idle.

Suggestions for what a good timeout would be?
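For reference, the option just goes in the start options; a sketch, where the 5000 is an arbitrary placeholder rather than a suggested value:

start_link(InitArgs) ->
    %% hibernate_after makes the gen_statem hibernate automatically once
    %% it has gone that many milliseconds without receiving a message.
    %% The 5000 here is an arbitrary placeholder, not a recommendation.
    gen_statem:start_link(?MODULE, InitArgs, [{hibernate_after, 5000}]).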

Did anyone find the cause of the leaking memory? I have the same problem; the garbage_collect call mentioned above reclaimed only about a third of the memory.

> lists:reverse(lists:usort([{process_info(P, memory), case process_info(P, registered_name) of {_, N} -> N; _ -> P end} || P <- processes()])).
[{{memory,73238512},<0.16417.364>},
 {{memory,73238512},<0.16368.364>},
 {{memory,73238512},<0.16350.364>},
 {{memory,73238512},<0.16343.364>},
 {{memory,70935368},<0.16419.364>},
 {{memory,70935368},<0.16372.364>},
 {{memory,70935368},<0.16325.364>},
 {{memory,69016080},<0.16440.364>},
 {{memory,69016080},<0.16429.364>},
 {{memory,69016080},<0.16422.364>},
 {{memory,69016080},<0.16405.364>},
 {{memory,69016080},<0.16384.364>},
 {{memory,69016080},<0.16338.364>},
 {{memory,69016080},<0.16337.364>},
 {{memory,69016080},<0.16319.364>},
 {{memory,61032264},<0.16391.364>},
 {{memory,4119652},code_server},
 {{memory,1803720},<0.1733.0>},
 {{memory,1803720},<0.1274.0>},
 {{memory,...},...},
 {{...},...},
 {...}|...]
> process_info(list_to_pid("<0.16417.364>")).
[{current_function,{gen_statem,loop_receive,3}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,0},
 {links,[<0.16300.364>,#Port<0.79272>]},
 {dictionary,[{'$initial_call',{h2_connection,init,1}},
              {'$ancestors',[<0.16300.364>,<0.16257.364>,
                             grpcbox_channel_sup,grpcbox_sup,<0.1063.0>]}]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.1062.0>},
 {total_heap_size,8866796},
 {heap_size,1439468},
 {stack_size,11},
 {reductions,47123903084},
 {garbage_collection,[{max_heap_size,#{error_logger => true,kill => true,size => 0}},
                      {min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,6023}]},
 {suspending,[]}]
> garbage_collect(list_to_pid("<0.16384.364>")).
....
> lists:reverse(lists:usort([{process_info(P, memory), case process_info(P, registered_name) of {_, N} -> N; _ -> P end} || P <- processes()])).
[{{memory,85122328},<0.16391.364>},
 {{memory,55082824},<0.16440.364>},
 {{memory,52779680},<0.16372.364>},
 {{memory,50860392},<0.16429.364>},
 {{memory,50860392},<0.16325.364>},
 {{memory,50860392},<0.16319.364>},
 {{memory,48205672},<0.16422.364>},
 {{memory,48205672},<0.16337.364>},
 {{memory,43983240},<0.16343.364>},
 {{memory,42474712},<0.16384.364>},
 {{memory,42383832},<0.16419.364>},
 {{memory,40171568},<0.16350.364>},
 {{memory,40171568},<0.16338.364>},
 {{memory,38252280},<0.16417.364>},
 {{memory,38252280},<0.16405.364>},
 {{memory,38252280},<0.16368.364>},
 {{memory,4119652},code_server},
 {{memory,1803720},<0.1733.0>},
 {{memory,1803720},<0.1274.0>},
 {{memory,...},...},
 {{...},...},
 {...}|...]

I'm using chatterbox via grpcbox.

Yeah, the post before yours does mention a possible remediation: hibernating when idle. Someone likely just needs to go and implement it.

So we have been tracing and found out that the client responses are only cleaned up from the h2_stream state if the function h2_connection:get_response/2 is called.

For instance, grpcbox_client:unary_handler does not call this function.
For chatterbox, I think there should be an API for closing/removing a stream when not using get_response, and that function could then be called by client libraries such as grpcbox.

What do you think?
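To make it concrete, the API could be as small as something like this; the name and message are entirely hypothetical, nothing like this exists in chatterbox today:

%% Hypothetical sketch: let a client drop a finished stream's buffered
%% response without having to fetch it via get_response/2.
-spec discard_response(pid(), non_neg_integer()) -> ok.
discard_response(ConnPid, StreamId) ->
    %% would tell the connection to remove the stream's queued
    %% headers/body from its state, releasing the refc binaries
    gen_statem:cast(ConnPid, {discard_response, StreamId}).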

The grpcbox client doesn't call it because it is getting the body from messages to the client pid, right? If that is the case, then it sounds more like chatterbox should have one mode where it keeps the response until it is requested and another where it only sends it to some pid.

Yes, there seem to be two flows in h2: one with asynchronous bangs to a notify pid, and one with some synchronous parts. It might be a good idea to split them into two, or at least clarify which is which.

We went with a quick and easy fix, which I will try to test tomorrow: basically just adding h2_connection:get_response(Pid, Stream) after receiving the data in grpcbox_client:unary_handler/6.
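Roughly this, where only the h2_connection:get_response/2 call is the real chatterbox API mentioned above; the surrounding function and variable names are placeholders for whatever grpcbox already has in hand at that point:

%% Once the unary response has arrived via messages, fetch (and thereby
%% drop) the buffered copy the stream keeps around, so its refc
%% binaries can be released.
cleanup_stream(ConnPid, StreamId) ->
    %% the return value is ignored: the body already arrived as
    %% messages, this call just clears the buffered copy in the stream
    _ = h2_connection:get_response(ConnPid, StreamId),
    ok.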

Yeah, pretty sure the notify-a-pid flow was added as a quick hack at the time. It could definitely use some cleaning up.

Looking at this again.

As far as I can tell from the code, the only time frames are added to the queue that becomes the received response body is when the callback is undefined. So I don't get how the response body is built up if that is the case.

I'm adding a hibernate_after option to h2_connection, but I also merged a fix for terminating gRPC streams that may help, not sure: tsloughter#15