esl / jobs

Job scheduler for load regulation

jobs_server heap grows indefinitely

RJ opened this issue

After averaging 200 jobs/sec for around 24 hours, here is process_info for jobs_server - note the heap_size (several gigabytes):

process_info(whereis(jobs_server)).
[{registered_name,jobs_server},
 {current_function,{gen_server,loop,6}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,0},
 {messages,[]},
 {links,[<0.122.0>]},
 {dictionary,[{'$ancestors',[jobs_app,<0.121.0>]},
              {'$initial_call',{jobs_server,init,1}}]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,high},
 {group_leader,<0.120.0>},
 {total_heap_size,631533590},
 {heap_size,74732575},
 {stack_size,9},
 {reductions,1273474509},
 {garbage_collection,[{min_bin_vheap_size,46368},
                      {min_heap_size,233},
                      {fullsweep_after,10},
                      {minor_gcs,2}]},
 {suspending,[]}]

and after erlang:garbage_collect(whereis(jobs_server)):

process_info(whereis(jobs_server)).
[{registered_name,jobs_server},
 {current_function,{gen_server,loop,6}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,0},
 {messages,[]},
 {links,[<0.122.0>]},
 {dictionary,[{'$ancestors',[jobs_app,<0.121.0>]},
              {'$initial_call',{jobs_server,init,1}}]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,high},
 {group_leader,<0.120.0>},
 {total_heap_size,445440815},
 {heap_size,445440815},
 {stack_size,9},
 {reductions,1276364801},
 {garbage_collection,[{min_bin_vheap_size,46368},
                      {min_heap_size,233},
                      {fullsweep_after,10},
                      {minor_gcs,0}]},
 {suspending,[]}]
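
(Note: heap_size and total_heap_size are reported in machine words, not bytes, so multiplying by the emulator word size gives the size in bytes - e.g. for the first listing above, assuming a 64-bit VM:)

1> Words = 631533590.
631533590
2> Words * erlang:system_info(wordsize).  % wordsize is 8 on a 64-bit VM
5052268720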

The queue in question is defined like this:

{regulators, [
            {counter, [{name, process_pool_msg}, {limit, 25}]}
        ]}

Any suggestions?
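
(For context: a counter regulator like the one above caps concurrency at 25 simultaneous jobs. A minimal sketch of how such a queue is typically created and used with the jobs API - the worker fun here is purely illustrative:)

%% create a queue guarded by the counter regulator shown above
jobs:add_queue(process_pool_msg,
               [{regulators, [{counter, [{name, process_pool_msg},
                                         {limit, 25}]}]}]),

%% jobs:run/2 wraps ask/done: it waits for the regulator to grant a slot,
%% runs the fun, and releases the slot when the fun returns
jobs:run(process_pool_msg, fun() -> handle_msg() end).  % handle_msg/0 is illustrative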

I've also noticed related symptoms, including large_heap and long_gc warnings for jobs_server from sysmon.

After a restart, I'm keeping an eye on the heap_size - it gets slightly reduced periodically (so some GC is happening), but there's a definite steady upward trend.

That's odd. Have you tried sys:get_status(jobs_server)? Also, perhaps, process_info(whereis(jobs_server), monitors) - although neither revealed anything strange when I tried them on our systems.
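
(For a quick count of those monitors in the shell:)

1> {monitors, Ms} = process_info(whereis(jobs_server), monitors).
2> length(Ms).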

Try the latest commit.
It seems it's the info function that's doing the growing, and that's due to an oversight on my part (even though I tried to prevent it).

Here are some operations that may help:

1> S = element(2,hd(element(2,lists:nth(3,lists:nth(5,element(4,sys:get_status(jobs_server))))))). 
{st, ...,
    #Fun<jobs_server.11.132406893>}
2> erts_debug:flat_size(S).
6345232
3> F = element(9,S).
#Fun<jobs_server.11.132406893>
4> erts_debug:flat_size(F).
6344860
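
(On OTP R16B01 and later, sys:get_state/1 reaches the same state record without the nested element/nth digging:)

1> S = sys:get_state(jobs_server).
2> erts_debug:flat_size(S).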

I inserted a call to fix the info function from within the code_change function.
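
(Roughly, the fix pattern is to rebuild the fun held in the server state during the upgrade, so the new closure no longer drags the old state - and the previous fun nested inside it - along. A hypothetical sketch; the #st field and helper names are illustrative, not the actual jobs_server code:)

code_change(_OldVsn, #st{} = S, _Extra) ->
    %% replace the stored closure with a fresh one that captures nothing large
    {ok, S#st{info_f = fun(Item) -> format_info(Item) end}}.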

OK, I'll look at that. Currently, on my side, there are quite a few monitors - about 540 for 12 queues...

erts_debug:flat_size(S).
9698095

erts_debug:flat_size(F).
9697494

Regarding the monitors, perhaps you can peek into this (S being the same as before):

5> Tab = element(5,S).
53277
6> ets:tab2list(Tab).
[]

There shouldn't be multiple entries for the same pid. If there are, it could be because processes repeatedly start jobs without calling done() after each.

(A problem here is that Erlang provides no O(1) way to find out if we are already monitoring a process).
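
(A common workaround is to remember monitored pids in an ETS set, so a second monitor is only created when the pid isn't already tracked - an illustrative sketch, not the actual jobs implementation:)

%% Tab is an ETS set; entries are {Pid, MonitorRef}
maybe_monitor(Tab, Pid) ->
    case ets:member(Tab, Pid) of
        true  -> ok;                               % already monitored
        false ->
            Ref = erlang:monitor(process, Pid),
            ets:insert(Tab, {Pid, Ref}),
            ok
    end.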

I'm not leaking monitors, and #st looks fine - that last commit did the trick, though.
I loaded it and forced a GC, and the heap is tiny again now.
Thanks :)

Glad to hear it. :)

Will close the issue then. I tagged the new version as 0.2.5.