satori-com / mzbench

MZ Benchmarking

gen_server mzb_time terminated with reason: {timeout,{gen_server,call,[mzb_interconnect,get_director]}}

KrlosWd opened this issue · comments

Hello,

I configured mzbench to run with 13 nodes (13 worker nodes plus the extra node for aggregation), and I'm using vmq_mzbench to benchmark an MQTT server. I run into an error whenever I increase the size of one of my pools above 35000. The error I'm getting in the user log is: gen_server mzb_time terminated with reason: {timeout,{gen_server,call,[mzb_interconnect,get_director]}} in gen_server:call/2 line 204.
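For reference, a scenario of roughly this shape reproduces the problem once the pool size goes above 35000. This is only a minimal sketch, not the actual script from this report: the connect/publish statement names and parameters are assumptions based on the vmq_mzbench examples, and the broker host is a placeholder.

#!benchDL

# Illustrative sketch only. The connect/publish parameters are assumed
# from the vmq_mzbench examples; "mqtt.example.com" is a placeholder.
make_install(git = "https://github.com/erlio/vmq_mzbench")

pool(size = 40000,               # sizes above ~35000 trigger the crash
     worker_type = mqtt_worker):
        connect([t(host, "mqtt.example.com"),
                 t(port, 1883),
                 t(client, fixed_client_id("pool1", worker_id())),
                 t(clean_session, true),
                 t(keepalive_interval, 60),
                 t(proto_version, 4),
                 t(reconnect_timeout, 4)])
        wait(5 sec)
        loop(time = 5 min,
             rate = 1 rps):
            publish("bench/topic", "hello", 1)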

Since the error appears in the user log, I was wondering if this is the right place to ask for help, or whether it means the problem is related to the worker module.

I'm including part of my user log below in case it is helpful.
Thanks in advance for your help!

Best,
Carlos

18:11:32.002 [warning] <0.6.0> lager_error_logger_h dropped 18 messages in the last second that exceeded the limit of 50 messages/sec
18:11:32.002 [info] <0.7.0> Application mqtt_worker started on node 'mzb_worker95_6@127.0.0.1'
18:11:31.515 [warning] <0.6.0> lager_error_logger_h dropped 18 messages in the last second that exceeded the limit of 50 messages/sec
18:11:31.515 [info] <0.7.0> Application mqtt_worker started on node 'mzb_worker95_13@127.0.0.1'
18:11:31.753 [warning] <0.6.0> lager_error_logger_h dropped 18 messages in the last second that exceeded the limit of 50 messages/sec
18:11:31.753 [info] <0.7.0> Application mqtt_worker started on node 'mzb_worker95_10@127.0.0.1'
18:17:32.812 [error] <0.195.0> gen_server mzb_time terminated with reason: {timeout,{gen_server,call,[mzb_interconnect,get_director]}} in gen_server:call/2 line 204
18:17:28.136 [error] <0.195.0> gen_server mzb_time terminated with reason: {timeout,{gen_server,call,[mzb_interconnect,get_director]}} in gen_server:call/2 line 204
18:17:30.424 [error] <0.195.0> CRASH REPORT Process mzb_time with 0 neighbours exited with reason: {timeout,{gen_server,call,[mzb_interconnect,get_director]}} in gen_server:terminate/7 line 826
18:17:30.643 [error] <0.131.0> Supervisor mzb_sup had child time_service started with mzb_time:start_link() at <0.195.0> exit with reason {timeout,{gen_server,call,[mzb_interconnect,get_director]}} in context child_terminated
18:17:40.218 [error] <0.195.0> CRASH REPORT Process mzb_time with 0 neighbours exited with reason: {timeout,{gen_server,call,[mzb_interconnect,get_director]}} in gen_server:terminate/7 line 826
18:17:41.074 [error] <0.131.0> Supervisor mzb_sup had child time_service started with mzb_time:start_link() at <0.195.0> exit with reason {timeout,{gen_server,call,[mzb_interconnect,get_director]}} in context child_terminated
18:17:47.025 [error] <0.6427.0> CRASH REPORT Process <0.6427.0> with 0 neighbours exited with reason: {timeout,{gen_server,call,[mzb_interconnect,get_director]}} in gen_server:init_it/6 line 352
18:17:47.626 [error] <0.131.0> Supervisor mzb_sup had child time_service started with mzb_time:start_link() at <0.195.0> exit with reason {timeout,{gen_server,call,[mzb_interconnect,get_director]}} in context start_error
18:17:53.589 [error] <0.6429.0> CRASH REPORT Process <0.6429.0> with 0 neighbours exited with reason: {timeout,{gen_server,call,[mzb_interconnect,get_director]}} in gen_server:init_it/6 line 352
18:17:53.814 [error] <0.131.0> Supervisor mzb_sup had child time_service started with mzb_time:start_link() at {restarting,<0.195.0>} exit with reason {timeout,{gen_server,call,[mzb_interconnect,get_director]}} in context start_error
18:17:59.323 [error] <0.6431.0> CRASH REPORT Process <0.6431.0> with 0 neighbours exited with reason: {timeout,{gen_server,call,[mzb_interconnect,get_director]}} in gen_server:init_it/6 line 352
18:17:59.376 [error] <0.131.0> Supervisor mzb_sup had child time_service started with mzb_time:start_link() at {restarting,<0.195.0>} exit with reason {timeout,{gen_server,call,[mzb_interconnect,get_director]}} in context start_error
18:18:12.427 [error] <0.194.0> Supervisor mzb_interconnect_clients had child interconnect_client started with mzb_interconnect_client:start_link("192.168.144.16", 4804, director) at <0.222.0> exit with reason normal in context child_terminated
18:18:12.430 [error] <0.237.0> gen_server mzb_director terminated with reason: pool_crashed
18:18:12.444 [error] <0.237.0> CRASH REPORT Process mzb_director with 0 neighbours exited with reason: pool_crashed in gen_server:terminate/7 line 826
18:18:12.446 [error] <0.175.0> Supervisor mzb_bench_sup had child director started with mzb_director:start_link(<0.175.0>, "solo_bench_bdl", [{operation,false,make_install,[[{operation,false,git,["https://github.com/erlio/vmq_mzbench..."],...},...]],...},...], ['mzb_worker95_10@127.0.0.1','mzb_worker95_11@127.0.0.1','mzb_worker95_12@127.0.0.1','mzb_worker95_13@127.0.0.1',...], [{"nodes_num",13},{"bench_script_dir","/tmp/mz/bench-95-1495498255"},{"bench_workers_dir",["~/....",...]},...], #Fun<mzb_bench_sup.0.125040007>) at <0.237.0> exit with reason pool_crashed in context child_terminated
18:18:12.447 [error] <0.175.0> Supervisor mzb_bench_sup had child director started with mzb_director:start_link(<0.175.0>, "solo_bench_bdl", [{operation,false,make_install,[[{operation,false,git,["https://github.com/erlio/vmq_mzbench..."],...},...]],...},...], ['mzb_worker95_10@127.0.0.1','mzb_worker95_11@127.0.0.1','mzb_worker95_12@127.0.0.1','mzb_worker95_13@127.0.0.1',...], [{"nodes_num",13},{"bench_script_dir","/tmp/mz/bench-95-1495498255"},{"bench_workers_dir",["~/....",...]},...], #Fun<mzb_bench_sup.0.125040007>) at <0.237.0> exit with reason reached_max_restart_intensity in context shutdown
18:18:12.447 [error] <0.131.0> Supervisor mzb_sup had child bench_sup started with mzb_bench_sup:start_link() at <0.175.0> exit with reason shutdown in context child_terminated

Hi,
It could be related to the system itself (I mean, our code), so I'd be happy to help. System information sometimes appears in the user log, although normally it shouldn't.

Unfortunately, this message is not informative enough to pinpoint the problem; it could indicate several different issues during node startup. You could log in to that node and check its logs for more information, or you could try to start a pool with 35000 dummy workers (a simple scenario like this one: https://github.com/machinezone/mzbench/blob/master/examples.bdl/loop.bdl) on your cluster to check whether the cluster itself works properly, as sketched below.
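A minimal sketch of such a sanity-check scenario, adapted from the linked loop.bdl example (the pool size mirrors the failing MQTT pool; the duration and rate values are arbitrary assumptions):

#!benchDL

# Dummy-worker pool sized like the failing MQTT pool. If this also
# times out, the problem is in the cluster/interconnect setup rather
# than in the vmq_mzbench worker.
pool(size = 35000,
     worker_type = dummy_worker):
        loop(time = 5 min,
             rate = 1 rps):
            print("FOO")

It could be submitted with something like ./bin/mzbench run dummy.bdl --nodes=13 (the --nodes flag and file name are assumed here; check against your installed CLI). If this runs cleanly on all 13 nodes, the timeout is more likely specific to the MQTT workload; if it crashes the same way, the cluster configuration is the first thing to inspect.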

Thank you

I think the bug that was causing this behavior was fixed recently. Please try again with an updated version and feel free to reopen the issue if the bug is still present.