joedevivo / chatterbox

HTTP/2 Server for Erlang. Boy, that guy was a real chatterbox waddn't he? I didn't think he was ever going to stop with the story.

Binary leaking in h2_connection processes

jonasrichard opened this issue · comments

I use chatterbox as a dependency of the inaka/apns4erl project. After some hours of running the application, I noticed ever-growing process sizes (process_info(pid, memory)). After some investigation I found this:

process_info(pid(0,1238,0), binary).
{binary,[{139665827264464,60,1},
         {139665831210656,60,1},
         {139664600088128,60,1},
         {139665831140616,60,1},
         {139665831222376,60,1},
         {139664600244240,60,1},
         {139665827188416,60,1},
         {139664600522856,60,1},
         {139664605657712,60,1},
         {139664605954696,60,1},
         {139664605903872,60,1},
         {139664606052304,60,1},
         {139664600433752,60,1},
         {139664605649112,60,1},
         {139664531257216,60,1},
         {139665827110208,60,1},
         {139664605153864,60,1},
         {139665827411976,60,1},
         {139665827397160,60,1},
         {139665827115824,60,1},
         {139664605243344,60,1},
         {139664605133416,60,1},
         {139664605420536,60,1},
         {139664605405256,60,1},
         {139664637840368,60,...},
         {139664638044448,...},
         {...}|...]}
(myhost@my_ip)48> length(element(2, v(47))).
1851

Do you have a clue where this binary leak might be occurring?

If you call for a manual garbage collection of the process, do these go away? If so, the leak is probably fixable through well-placed hibernation calls more than anything else.
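Something like this from the shell would show it, reusing the pid from your output above:

> P = pid(0,1238,0).
> erlang:garbage_collect(P).
> length(element(2, process_info(P, binary))).

If the count drops sharply after the collect, the binaries were only being kept alive by stale references on the process heap.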

I did an erlang:garbage_collect/0 but the binaries didn't go away. I don't know the process structure of chatterbox, but I think there are two processes passing the binaries around, and after some matching one process holds on to the refc binaries. It is hard to pin down when those binaries remain, because at least 50,000 HTTP/2 requests were sent (far more than the number of refc binaries). I think it happens only in edge cases.
I don't use chatterbox directly but through apns4erl:
https://github.com/inaka/apns4erl
Tomorrow I will check whether other processes have something similar. Do you think I can see those binaries in a crash dump? Or will the GC during the crash dump eliminate them?

If you can see them in the crash dump, they were really being held in memory by the process. If they are gone, that's because the VM ran a GC before writing them to the dump.

Ok, calling garbage_collect(Pid) decreased the number of refc binaries from 1932 to 84. So we will probably call GC only for those processes whose refc binary count is above some threshold. Does that sound right (not too odd)?
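For the record, garbage_collect/0 only collects the calling process (the shell in this case), which is presumably why it didn't help earlier, while garbage_collect(Pid) collects the target process. The sweep I have in mind is roughly this (names made up, run by hand or from a periodic timer):

gc_binary_heavy(Threshold) ->
    %% GC every process that currently holds more than Threshold
    %% off-heap (refc) binary references.
    lists:foreach(
      fun(P) ->
              case erlang:process_info(P, binary) of
                  {binary, Bins} when length(Bins) > Threshold ->
                      erlang:garbage_collect(P);
                  _ ->
                      ok
              end
      end,
      erlang:processes()).

It is a workaround rather than a fix, of course; the references will just build up again.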

That can work temporarily. I think the ideal fix in these cases, for now, is to modify the risky processes to hibernate from time to time when they get downtime, so they force this on their own.
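In a gen_statem like h2_connection that could look roughly like this; a sketch with placeholder names, not the actual chatterbox code:

%% Arm an idle timer on every event and hibernate when it fires, so an
%% idle connection compacts its heap and drops stale binary references
%% on its own.
connected(state_timeout, hibernate_now, Data) ->
    {keep_state, Data, [hibernate]};
connected(EventType, Event, Data) ->
    NewData = handle_connected_event(EventType, Event, Data), %% placeholder
    {keep_state, NewData, [{state_timeout, 300000, hibernate_now}]}.

Hibernation does a full-sweep collection and shrinks the heap, so the stale refc references go with it.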

It is also interesting that process_info(memory) and the heap size in the normal process_info output give me different results. And like I said, the reductions were huge. So I decided to reconnect every 10 minutes instead of running the GC.

> process_info(P1, memory).
{memory,201303168}
> process_info(P1).
...
 {total_heap_size,17474138},
 {heap_size,2072833},
 {stack_size,10},
 {reductions,123302750526},
 {garbage_collection,[{max_heap_size,#{error_logger => true,kill => true,size => 0}},
                      {min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,41}]},
...

Looking at this again: if it makes sense, we could add a hibernate_after option to the gen_statem:start_link call of h2_connection so that it hibernates when idle.

Suggestions for what a good timeout would be?
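For reference, the option just goes in the start options; a sketch, where the 5000 is an arbitrary placeholder rather than a suggested value:

start_link(InitArgs) ->
    %% hibernate_after makes the gen_statem hibernate automatically once
    %% it has gone that many milliseconds without receiving a message.
    %% The 5000 here is an arbitrary placeholder, not a recommendation.
    gen_statem:start_link(?MODULE, InitArgs, [{hibernate_after, 5000}]).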

Did anyone find the cause of the leaking memory? I have the same problem; the garbage_collect call mentioned above reclaimed only about a third of the memory.

> lists:reverse(lists:usort([{process_info(P, memory), case process_info(P, registered_name) of {_, N} -> N; _ -> P end} || P <- processes()])).
[{{memory,73238512},<0.16417.364>},
 {{memory,73238512},<0.16368.364>},
 {{memory,73238512},<0.16350.364>},
 {{memory,73238512},<0.16343.364>},
 {{memory,70935368},<0.16419.364>},
 {{memory,70935368},<0.16372.364>},
 {{memory,70935368},<0.16325.364>},
 {{memory,69016080},<0.16440.364>},
 {{memory,69016080},<0.16429.364>},
 {{memory,69016080},<0.16422.364>},
 {{memory,69016080},<0.16405.364>},
 {{memory,69016080},<0.16384.364>},
 {{memory,69016080},<0.16338.364>},
 {{memory,69016080},<0.16337.364>},
 {{memory,69016080},<0.16319.364>},
 {{memory,61032264},<0.16391.364>},
 {{memory,4119652},code_server},
 {{memory,1803720},<0.1733.0>},
 {{memory,1803720},<0.1274.0>},
 {{memory,...},...},
 {{...},...},
 {...}|...]
> process_info(list_to_pid("<0.16417.364>")).
[{current_function,{gen_statem,loop_receive,3}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,0},
 {links,[<0.16300.364>,#Port<0.79272>]},
 {dictionary,[{'$initial_call',{h2_connection,init,1}},
              {'$ancestors',[<0.16300.364>,<0.16257.364>,
                             grpcbox_channel_sup,grpcbox_sup,<0.1063.0>]}]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.1062.0>},
 {total_heap_size,8866796},
 {heap_size,1439468},
 {stack_size,11},
 {reductions,47123903084},
 {garbage_collection,[{max_heap_size,#{error_logger => true,kill => true,size => 0}},
                      {min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,6023}]},
 {suspending,[]}]
> garbage_collect(list_to_pid("<0.16384.364>")).
....
> lists:reverse(lists:usort([{process_info(P, memory), case process_info(P, registered_name) of {_, N} -> N; _ -> P end} || P <- processes()])).
[{{memory,85122328},<0.16391.364>},
 {{memory,55082824},<0.16440.364>},
 {{memory,52779680},<0.16372.364>},
 {{memory,50860392},<0.16429.364>},
 {{memory,50860392},<0.16325.364>},
 {{memory,50860392},<0.16319.364>},
 {{memory,48205672},<0.16422.364>},
 {{memory,48205672},<0.16337.364>},
 {{memory,43983240},<0.16343.364>},
 {{memory,42474712},<0.16384.364>},
 {{memory,42383832},<0.16419.364>},
 {{memory,40171568},<0.16350.364>},
 {{memory,40171568},<0.16338.364>},
 {{memory,38252280},<0.16417.364>},
 {{memory,38252280},<0.16405.364>},
 {{memory,38252280},<0.16368.364>},
 {{memory,4119652},code_server},
 {{memory,1803720},<0.1733.0>},
 {{memory,1803720},<0.1274.0>},
 {{memory,...},...},
 {{...},...},
 {...}|...]

I'm using chatterbox via grpcbox.

Yeah, the post before yours does mention a possible remediation: hibernating when idle. Someone likely just needs to go and implement it.

So we have been tracing and found out that the client responses are only cleaned up from the h2_stream state if the function h2_connection:get_response/2 is called.

For instance, grpcbox_client:unary_handler does not call this function.
For chatterbox, I think there should be an API for closing/removing a stream when not using get_response, and that function could then be called by client libraries such as grpcbox.

What do you think?
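To make it concrete, the API could be as small as something like this; the name and message are entirely hypothetical, nothing like this exists in chatterbox today:

%% Hypothetical sketch: let a client drop a finished stream's buffered
%% response without having to fetch it via get_response/2.
-spec discard_response(pid(), non_neg_integer()) -> ok.
discard_response(ConnPid, StreamId) ->
    %% would tell the connection to remove the stream's queued
    %% headers/body from its state, releasing the refc binaries
    gen_statem:cast(ConnPid, {discard_response, StreamId}).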

The grpcbox client doesn't call it because it is getting the body from messages to the client pid, right? If that is the case, then it sounds more like chatterbox should have one mode where it keeps the response until it is requested and another where it only sends it to some pid.

Yes, there seem to be two flows in h2: one with asynchronous bangs to a notify pid, and one with some synchronous parts. It might be a good idea to split them into two, or at least clarify which is which.

We went with a quick and easy fix, which I will try to test tomorrow: basically just adding h2_connection:get_response(Pid, Stream) after receiving the data in grpcbox_client:unary_handler/6.
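Roughly this, where only the h2_connection:get_response/2 call is the real chatterbox API mentioned above; the surrounding function and variable names are placeholders for whatever grpcbox already has in hand at that point:

%% Once the unary response has arrived via messages, fetch (and thereby
%% drop) the buffered copy the stream keeps around, so its refc
%% binaries can be released.
cleanup_stream(ConnPid, StreamId) ->
    %% the return value is ignored: the body already arrived as
    %% messages, this call just clears the buffered copy in the stream
    _ = h2_connection:get_response(ConnPid, StreamId),
    ok.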

Yeah, pretty sure the notify-a-pid flow was added as a quick hack at the time. It could definitely use some cleaning up.

Looking at this again.

As far as I can tell from the code, the only time frames are added to the queue that becomes the received response body is when the callback is undefined. So I don't get how the response body is built up if that is the case.

I'm adding a hibernate_after option to h2_connection, but I also merged a fix for terminating gRPC streams that may help, not sure: tsloughter#15