esl / jobs

Job scheduler for load regulation

jobs_server heap grows indefinitely

RJ opened this issue

After averaging 200 jobs/sec for around 24 hours, here is process_info for jobs_server - note the heap_size (several gigabytes):

process_info(whereis(jobs_server)).
[{registered_name,jobs_server},
 {current_function,{gen_server,loop,6}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,0},
 {messages,[]},
 {links,[<0.122.0>]},
 {dictionary,[{'$ancestors',[jobs_app,<0.121.0>]},
              {'$initial_call',{jobs_server,init,1}}]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,high},
 {group_leader,<0.120.0>},
 {total_heap_size,631533590},
 {heap_size,74732575},
 {stack_size,9},
 {reductions,1273474509},
 {garbage_collection,[{min_bin_vheap_size,46368},
                      {min_heap_size,233},
                      {fullsweep_after,10},
                      {minor_gcs,2}]},
 {suspending,[]}]

and after erlang:garbage_collect(whereis(jobs_server)):

process_info(whereis(jobs_server)).
[{registered_name,jobs_server},
 {current_function,{gen_server,loop,6}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,0},
 {messages,[]},
 {links,[<0.122.0>]},
 {dictionary,[{'$ancestors',[jobs_app,<0.121.0>]},
              {'$initial_call',{jobs_server,init,1}}]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,high},
 {group_leader,<0.120.0>},
 {total_heap_size,445440815},
 {heap_size,445440815},
 {stack_size,9},
 {reductions,1276364801},
 {garbage_collection,[{min_bin_vheap_size,46368},
                      {min_heap_size,233},
                      {fullsweep_after,10},
                      {minor_gcs,0}]},
 {suspending,[]}]
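
(Note: heap_size and total_heap_size are reported in machine words, not bytes, so multiplying by the emulator word size gives the size in bytes - e.g. for the first listing above, assuming a 64-bit VM:)

1> Words = 631533590.
631533590
2> Words * erlang:system_info(wordsize).  % wordsize is 8 on a 64-bit VM
5052268720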

The queue in question is defined like this:

{regulators, [
            {counter, [{name, process_pool_msg}, {limit, 25}]}
        ]}

Any suggestions?
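
(For context: a counter regulator like the one above caps concurrency at 25 simultaneous jobs. A minimal sketch of how such a queue is typically created and used with the jobs API - the worker fun here is purely illustrative:)

%% create a queue guarded by the counter regulator shown above
jobs:add_queue(process_pool_msg,
               [{regulators, [{counter, [{name, process_pool_msg},
                                         {limit, 25}]}]}]),

%% jobs:run/2 wraps ask/done: it waits for the regulator to grant a slot,
%% runs the fun, and releases the slot when the fun returns
jobs:run(process_pool_msg, fun() -> handle_msg() end).  % handle_msg/0 is illustrative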

I've also noticed related symptoms, including large_heap and long_gc warnings for jobs_server from sysmon.

After a restart, I'm keeping an eye on the heap_size - it gets slightly reduced periodically (so some GC is happening), but there's a definite steady upward trend.

That's odd. Have you tried sys:get_status(jobs_server)? Also, perhaps, process_info(whereis(jobs_server), monitors) - although neither revealed anything strange when I tried them on our systems.
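
(For a quick count of those monitors in the shell:)

1> {monitors, Ms} = process_info(whereis(jobs_server), monitors).
2> length(Ms).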

Try the latest commit.
It seems it's the info function that's doing the growing, and that's due to an oversight on my part (even though I tried to prevent it).

Here are some operations that may help:

1> S = element(2,hd(element(2,lists:nth(3,lists:nth(5,element(4,sys:get_status(jobs_server))))))). 
{st, ...,
    #Fun<jobs_server.11.132406893>}
2> erts_debug:flat_size(S).
6345232
3> F = element(9,S).
#Fun<jobs_server.11.132406893>
4> erts_debug:flat_size(F).
6344860
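
(On OTP R16B01 and later, sys:get_state/1 reaches the same state record without the nested element/nth digging:)

1> S = sys:get_state(jobs_server).
2> erts_debug:flat_size(S).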

I inserted a call to fix the info function from within the code_change function.
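
(Roughly, the fix pattern is to rebuild the fun held in the server state during the upgrade, so the new closure no longer drags the old state - and the previous fun nested inside it - along. A hypothetical sketch; the #st field and helper names are illustrative, not the actual jobs_server code:)

code_change(_OldVsn, #st{} = S, _Extra) ->
    %% replace the stored closure with a fresh one that captures nothing large
    {ok, S#st{info_f = fun(Item) -> format_info(Item) end}}.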

OK, I'll look at that. Currently, on my side, there are quite a few monitors - about 540 for 12 queues...

erts_debug:flat_size(S).
9698095

erts_debug:flat_size(F).
9697494

Regarding the monitors, perhaps you can peek into this (S being the same as before):

5> Tab = element(5,S).
53277
6> ets:tab2list(Tab).
[]

There shouldn't be multiple entries for the same pid. If there are, it could be because processes repeatedly start jobs without calling done() after each.

(A problem here is that Erlang provides no O(1) way to find out if we are already monitoring a process).
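
(A common workaround is to remember monitored pids in an ETS set, so a second monitor is only created when the pid isn't already tracked - an illustrative sketch, not the actual jobs implementation:)

%% Tab is an ETS set; entries are {Pid, MonitorRef}
maybe_monitor(Tab, Pid) ->
    case ets:member(Tab, Pid) of
        true  -> ok;                               % already monitored
        false ->
            Ref = erlang:monitor(process, Pid),
            ets:insert(Tab, {Pid, Ref}),
            ok
    end.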

I'm not leaking monitors, and #st looks fine - that last commit did the trick, though.
I loaded it and forced a GC, and the heap is tiny again now.
Thanks :)

Glad to hear it. :)

Will close the issue then. I tagged the new version as 0.2.5.