andrewdmcleod / magpie-layer

testing

units stuck in Waiting for peers

marosg42 opened this issue

It happens quite often that after the charm is installed, some units talk to the master immediately, some stay in "Waiting for peers" for 10-20 minutes, and some don't connect even after hours.
The log on the master unit shows that it knows about all units and their IPs, while the clients are still waiting. Restarting the unit does not help; only removing and redeploying the unit solves it.

I can confirm the behavior. Even restarting the juju agent does nothing. I can see two handlers being queued, but nothing happens.
The only way to recover is to redeploy the unit.

unit-magpie-external-20: 19:48:27 INFO juju.cmd running jujud [2.6.9 gc go1.11.13]
unit-magpie-external-20: 19:48:27 DEBUG juju.cmd   args: []string{"/var/lib/juju/tools/unit-magpie-external-20/jujud", "unit", "--data-dir", "/var/lib/juju", "--unit-name", "magpie-external/20", "--debug"}
unit-magpie-external-20: 19:48:27 DEBUG juju.agent read agent config, format "2.0"
unit-magpie-external-20: 19:48:27 INFO juju.cmd.jujud setting logging config to "<root>=WARNING;unit=DEBUG"
unit-magpie-external-20: 19:48:29 INFO unit.magpie-external/20.juju-log Reactive main running for hook leader-settings-changed
unit-magpie-external-20: 19:48:30 INFO unit.magpie-external/20.juju-log Initializing Leadership Layer (is follower)
unit-magpie-external-20: 19:48:30 DEBUG unit.magpie-external/20.juju-log tracer>
tracer: starting handler dispatch, 22 flags set
tracer: set flag config.default.check_bonds
tracer: set flag config.default.check_iperf
tracer: set flag config.default.check_local_hostname
tracer: set flag config.default.check_port_description
tracer: set flag config.default.dns_server
tracer: set flag config.default.dns_time
tracer: set flag config.default.dns_tries
tracer: set flag config.default.min_speed
tracer: set flag config.default.ping_timeout
tracer: set flag config.default.ping_tries
tracer: set flag config.default.required_mtu
tracer: set flag config.default.supress_status
tracer: set flag config.default.use_lldp
tracer: set flag config.set.check_iperf
tracer: set flag config.set.check_local_hostname
tracer: set flag config.set.dns_time
tracer: set flag config.set.dns_tries
tracer: set flag config.set.ping_timeout
tracer: set flag config.set.ping_tries
tracer: set flag iperf.installed
tracer: set flag iperf.listening
tracer: set flag magpie.joined
unit-magpie-external-20: 19:48:30 DEBUG unit.magpie-external/20.juju-log tracer: hooks phase, 0 handlers queued
unit-magpie-external-20: 19:48:30 DEBUG unit.magpie-external/20.juju-log tracer>
tracer: main dispatch loop, 2 handlers queued
tracer: ++   queue handler reactive/magpie.py:18:install_lldp_pkg
tracer: ++   queue handler reactive/magpie.py:40:check_check_state
unit-magpie-external-20: 19:48:30 INFO unit.magpie-external/20.juju-log Invoking reactive handler: reactive/magpie.py:18:install_lldp_pkg
unit-magpie-external-20: 19:48:30 INFO unit.magpie-external/20.juju-log Invoking reactive handler: reactive/magpie.py:40:check_check_state
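
To compare what the tracer queued against what was actually invoked, a small script along these lines could help. This is only a sketch: the regular expressions are based on the line shapes in the log excerpt above, and the function names (`summarize_handlers`, `never_invoked`) are made up for illustration, not part of juju or charms.reactive.

```python
import re

# Patterns are assumptions derived from the log lines quoted in this issue.
QUEUED = re.compile(r"tracer: \+\+\s+queue handler (\S+)")
INVOKED = re.compile(r"Invoking reactive handler: (\S+)")


def summarize_handlers(log_text):
    """Return (queued, invoked) handler names in order of appearance."""
    queued, invoked = [], []
    for line in log_text.splitlines():
        match = QUEUED.search(line)
        if match:
            queued.append(match.group(1))
            continue
        match = INVOKED.search(line)
        if match:
            invoked.append(match.group(1))
    return queued, invoked


def never_invoked(log_text):
    """Handlers that were queued but have no matching 'Invoking' line."""
    queued, invoked = summarize_handlers(log_text)
    return [name for name in queued if name not in invoked]


# Sample taken from the tail of the log excerpt above.
SAMPLE = """\
tracer: main dispatch loop, 2 handlers queued
tracer: ++   queue handler reactive/magpie.py:18:install_lldp_pkg
tracer: ++   queue handler reactive/magpie.py:40:check_check_state
unit-magpie-external-20: 19:48:30 INFO unit.magpie-external/20.juju-log Invoking reactive handler: reactive/magpie.py:18:install_lldp_pkg
unit-magpie-external-20: 19:48:30 INFO unit.magpie-external/20.juju-log Invoking reactive handler: reactive/magpie.py:40:check_check_state
"""

if __name__ == "__main__":
    queued, invoked = summarize_handlers(SAMPLE)
    print("queued: ", queued)
    print("invoked:", invoked)
    print("never invoked:", never_invoked(SAMPLE))
```

For the excerpt above, both queued handlers also show up as invoked, so the log simply ends after the second handler starts. Running the same check over a longer capture of the unit log might show whether any handler is queued repeatedly without ever being invoked.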

This should be fixed in the latest version in the charm store in my namespace. Can you test and reconfirm? (Or build from master here and test.)