jobs_server heap grows indefinitely
RJ opened this issue
After averaging 200 jobs/sec for around 24 hrs, here is process_info for jobs_server. Note the heap_size (several GB):
process_info(whereis(jobs_server)).
[{registered_name,jobs_server},
{current_function,{gen_server,loop,6}},
{initial_call,{proc_lib,init_p,5}},
{status,waiting},
{message_queue_len,0},
{messages,[]},
{links,[<0.122.0>]},
{dictionary,[{'$ancestors',[jobs_app,<0.121.0>]},
{'$initial_call',{jobs_server,init,1}}]},
{trap_exit,false},
{error_handler,error_handler},
{priority,high},
{group_leader,<0.120.0>},
{total_heap_size,631533590},
{heap_size,74732575},
{stack_size,9},
{reductions,1273474509},
{garbage_collection,[{min_bin_vheap_size,46368},
{min_heap_size,233},
{fullsweep_after,10},
{minor_gcs,2}]},
{suspending,[]}]
and after erlang:garbage_collect(whereis(jobs_server)):
process_info(whereis(jobs_server)).
[{registered_name,jobs_server},
{current_function,{gen_server,loop,6}},
{initial_call,{proc_lib,init_p,5}},
{status,waiting},
{message_queue_len,0},
{messages,[]},
{links,[<0.122.0>]},
{dictionary,[{'$ancestors',[jobs_app,<0.121.0>]},
{'$initial_call',{jobs_server,init,1}}]},
{trap_exit,false},
{error_handler,error_handler},
{priority,high},
{group_leader,<0.120.0>},
{total_heap_size,445440815},
{heap_size,445440815},
{stack_size,9},
{reductions,1276364801},
{garbage_collection,[{min_bin_vheap_size,46368},
{min_heap_size,233},
{fullsweep_after,10},
{minor_gcs,0}]},
{suspending,[]}]
The queue in question is defined like this:
{regulators, [
{counter, [{name, process_pool_msg}, {limit, 25}]}
]}
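For reference, jobs are submitted through this queue in the usual way (a sketch, assuming the standard jobs API; do_work/0 stands in for our actual handler):

```erlang
%% jobs:run/2 asks the process_pool_msg queue for a slot,
%% runs the fun, and signals completion when it returns.
Result = jobs:run(process_pool_msg, fun() -> do_work() end).
```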
Any suggestions?
I also noticed related symptoms, including large_heap and long_gc warnings for jobs_server from sysmon.
After a restart, I've kept an eye on the heap_size. It is reduced slightly from time to time (so some GC is happening), but there's a definite steady upward trend.
That's odd. Have you tried sys:get_status(jobs_server)? Also, perhaps, process_info(whereis(jobs_server), monitors), although they didn't reveal anything strange when I tried on our systems.
Try the last commit.
It seems it's the info function that's been growing, due to an oversight on my part (even though I tried to prevent it).
Here are some operations that may help:
1> S = element(2,hd(element(2,lists:nth(3,lists:nth(5,element(4,sys:get_status(jobs_server))))))).
{st, ...,
#Fun<jobs_server.11.132406893>}
2> erts_debug:flat_size(S).
6345232
3> F = element(9,S).
#Fun<jobs_server.11.132406893>
4> erts_debug:flat_size(F).
6344860
I inserted a call to fix the info function from within the code_change function.
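To illustrate the kind of leak involved (a hedged sketch, not the actual jobs_server code): an Erlang fun closes over its environment, so if a new fun is built that captures the previous fun, each generation drags the old one along and the closure's flat_size grows without bound. Rebuilding the fun against a minimal environment, e.g. from code_change, breaks the chain.

```erlang
%% Hypothetical illustration of closure growth.
State = lists:seq(1, 100000),
Info0 = fun(queue_size) -> length(State) end,   % captures State
Info1 = fun(Q) -> Info0(Q) end,                 % captures Info0 (and thus State)
erts_debug:flat_size(Info1).                    % far larger than a fresh fun
```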
OK, I'll look at that. Currently on my side there are quite a few monitors, about 540 for 12 queues...
erts_debug:flat_size(S).
9698095
erts_debug:flat_size(F).
9697494
Regarding the monitors, perhaps you can peek into (S being the same as before):
5> Tab = element(5,S).
53277
6> ets:tab2list(Tab).
[]
There shouldn't be multiple entries for the same pid. If there are, it could be because processes repeatedly start jobs without calling done() after each.
(A problem here is that Erlang provides no O(1) way to find out if we are already monitoring a process).
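For callers, the safe pattern is to pair every ask with exactly one done, even when the job crashes (a sketch assuming the standard jobs ask/done API; do_work/0 is hypothetical):

```erlang
case jobs:ask(process_pool_msg) of
    {ok, JobRef} ->
        try
            do_work()
        after
            jobs:done(JobRef)   % always release the slot, even on exceptions
        end;
    {error, rejected} ->
        {error, overload}
end.
```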
I'm not leaking monitors, and #st looks fine - that last commit did the trick though.
Loaded that and forced a gc, heap is tiny again now.
Thanks :)
Glad to hear it. :)
Will close the issue then. I tagged the new version as 0.2.5.