hashicorp / consul-template

Template rendering, notifier, and supervisor for @HashiCorp Consul and Vault data.

Home page: https://www.hashicorp.com/


issue with haproxy reloads

ProbablyRusty opened this issue · comments

Using consul-template (v0.11.0) with haproxy, I am seeing an issue in which multiple haproxy processes stack up over time as consul-template rewrites the haproxy.cfg file and fires off the reload command.

In this scenario, consul-template is running as root.

Here is the config:

consul = "127.0.0.1:8500"

template {
  source = "/etc/haproxy/haproxy.template"
  destination = "/etc/haproxy/haproxy.cfg"
  command = "haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf $( cat /var/run/haproxy.pid )"
}

Manually running the reload command (haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf $( cat /var/run/haproxy.pid )) works fine and does not stack up multiple haproxy processes.

But, for example, after a few consul-template rewrites of haproxy.cfg, here is what I see:

10258 ?        Ss     0:00 haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf 10029
10262 ?        Ss     0:00 haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf 10258
10270 ?        Ss     0:00 haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf 10266
10369 ?        Ss     0:00 haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf 10365
10427 ?        Ss     0:00 haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf 10423
10483 ?        Ss     0:00 haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf 10479

Any thoughts on what might be happening here, and why the behavior differs between consul-template running this reload command and my running the same command manually (outside the purview of consul-template) from a shell?

Hi @consultantRR

It looks like this might be a duplicate of #428. Are you running in a container by chance?

@sethvargo No I am not. I originally saw #428 and was hopeful for an answer, but since it seemed to be container specific, I decided to open a separate issue.

For clarity:

I am running consul-template in Amazon Linux and it is being invoked (as root) as follows:

nohup consul-template -config /etc/haproxy/consul-template.hcl >/dev/null 2>&1 &

@consultantRR

Can you change that to print to a logfile and also run in debug mode and paste the output here after an haproxy restart please?

Okay @sethvargo, I have a debug log covering the time period in which 8 haproxy restarts took place. I had hoped to simplify this log example and show only 1 haproxy restart, but I was not able to reproduce the issue with only a single restart. Due to length, I will paste log lines covering 1 restart at the end of this post. (Looks like business as usual to me - I see no issues.)

In this example, each restart took place about 6-7 seconds after the previous one. Each time, I invoked this restart by taking a node referenced in the template in or out of Consul maintenance mode.

Prior to this example log, one haproxy process was running. After this example log (8 restarts), three haproxy processes were left running permanently.

To be clear, this was the invocation of consul-template for this test:

nohup consul-template -log-level debug -config /etc/haproxy/consul-template.hcl >/var/log/consul-template.log 2>&1 &

Here are the first few lines of the log, showing the config:

nohup: ignoring input
2015/10/21 15:59:14 [DEBUG] (config) loading configs from "/etc/haproxy/consul-template.hcl"
2015/10/21 15:59:14 [DEBUG] (logging) enabling syslog on LOCAL0
2015/10/21 15:59:14 [INFO] consul-template v0.11.0
2015/10/21 15:59:14 [INFO] (runner) creating new runner (dry: false, once: false)
2015/10/21 15:59:14 [DEBUG] (runner) final config (tokens suppressed):

{
  "path": "/etc/haproxy/consul-template.hcl",
  "consul": "127.0.0.1:8500",
  "auth": {
    "enabled": false,
    "username": "",
    "password": ""
  },
  "vault": {
    "renew": true,
    "ssl": {
      "enabled": true,
      "verify": true
    }
  },
  "ssl": {
    "enabled": false,
    "verify": true
  },
  "syslog": {
    "enabled": true,
    "facility": "LOCAL0"
  },
  "max_stale": 1000000000,
  "templates": [
    {
      "source": "/etc/haproxy/haproxy.template",
      "destination": "/etc/haproxy/haproxy.cfg",
      "command": "haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf $( cat /var/run/haproxy.pid )",
      "perms": 420
    }
  ],
  "retry": 5000000000,
  "wait": {
    "min": 0,
    "max": 0
  },
  "pid_file": "",
  "log_level": "debug"
}

2015/10/21 15:59:14 [INFO] (runner) creating consul/api client
2015/10/21 15:59:14 [DEBUG] (runner) setting consul address to 127.0.0.1:8500

Maybe a red herring, but possibly of interest:

During this 8-restart test, I had a separate haproxy node running with the exact same config (and template) as the node I have logged here. Only difference was that consul-template on that node was not logging. Invocation for consul-template on that node was:

nohup consul-template -config /etc/haproxy/consul-template.hcl >/dev/null 2>&1 &

On this node, after the same 8-restart test, 8 haproxy processes were left running (as opposed to 3 haproxy processes on the logged node). In further tests, extra haproxy processes do seem to stack up much more quickly on this node than on the one that debug logging is now enabled on.

I may try to craft a methodology for a simpler, more isolated and controlled test which still shows this behavior. If so, I will post results here.

For now, here is part of the debug log, covering 1 restart:

2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 1 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 1 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] (runner) receiving dependency "service(prod.*redacted*)"
2015/10/21 16:17:55 [DEBUG] (runner) receiving dependency "service(prod.*redacted*)"
2015/10/21 16:17:55 [DEBUG] (runner) receiving dependency "service(prod.*redacted*)"
2015/10/21 16:17:55 [DEBUG] (runner) receiving dependency "service(prod.*redacted*)"
2015/10/21 16:17:55 [INFO] (runner) running
2015/10/21 16:17:55 [DEBUG] (runner) checking template /etc/haproxy/haproxy.template
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] (runner) checking ctemplate &{Source:/etc/haproxy/haproxy.template Destination:/etc/haproxy/haproxy.cfg Command:haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf $( cat /var/run/haproxy.pid ) Perms:-rw-r--r--}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] (runner) wouldRender: true, didRender: true
2015/10/21 16:17:55 [DEBUG] (runner) appending command: haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf $( cat /var/run/haproxy.pid )
2015/10/21 16:17:55 [INFO] (runner) diffing and updating dependencies
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "key(*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "services" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "key(*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) running command: `haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf $( cat /var/run/haproxy.pid )`
2015/10/21 16:17:55 [INFO] (runner) watching 35 dependencies

@sethvargo I just reverted this node to a former consul-template version and performed this same test (8 restarts) with consul-template v0.10.0, and at the end of the test only 1 haproxy process was running.

By comparison, at the end of this new test the other haproxy node (running v0.11.0 with no logging) had 8 haproxy processes running.

Hi @consultantRR

Can you share your actual Consul Template template please?

The full diff between 0.10.0 and 0.11.0 is here: v0.10.0...v0.11.0. I'm going to try to get a reproduction together, but I have been unsuccessful thus far.

Here it is, with only two minor redactions:

global
    daemon
    maxconn 4096
    log 127.0.0.1   local0
    daemon

defaults
    log global
    mode http
    timeout connect 5000ms
    timeout client 60000ms
    timeout server 60000ms
    option http-server-close

listen haproxyadmin
    bind *:8999
    stats enable
    stats auth haproxyadmin:{{key "*redacted*"}}

listen http_health_check 0.0.0.0:8080
    mode health
    option httpchk

frontend http_proxy
    bind *:8888
    acl non_ssl hdr(X-Forwarded-Proto) -i http
    redirect prefix {{key "*redacted*"}} code 301 if non_ssl
{{range services}}{{$services := . }}{{range .Tags}}{{if eq . "microservice"}}
    acl {{$services.Name}} path_reg -i ^\/{{$services.Name}}(\/.*|\?.*)?${{end}}{{end}}{{end}}
{{range services}}{{$services := . }}{{range .Tags}}{{if eq . "microservice"}}
    use_backend {{$services.Name}} if {{$services.Name}}{{end}}{{end}}{{end}}
{{range services}}{{$services := . }}{{range .Tags}}{{if eq . "microservice"}}
backend {{$services.Name}}{{$this_service := $services.Name | regexReplaceAll "(.+)" "prod.$1"}}
    balance roundrobin{{range service $this_service}}
    server {{.Name}} {{.Address}}:{{.Port}} maxconn 8192{{end}}{{end}}{{end}}
    {{end}}

@consultantRR just to be clear - you aren't using the vault integration at all, right?

Not at this time.

The two key references in the template above are Consul KV keys.

Hi @consultantRR

I did some digging today, and I was able to reproduce this issue exactly once, and then it stopped reproducing.

Are you able to reproduce this with something that isn't haproxy? What's interesting to me is that haproxy orphans itself (it's still running even after the command returns and consul template quits), but I wonder if there's a race condition there somehow.

Hi @sethvargo - I haven't reproduced this with something other than haproxy, but also, I can't say that I have tried. ;)

Just to think out loud about what may be happening, here is the reload command again (it should be noted that the reload command in the haproxy init.d script in Amazon Linux is basically the exact same thing as this - in fact this is what I used for months, and only switched it to the explicit command below when beginning to troubleshoot this newfound problem):

haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf $( cat /var/run/haproxy.pid )

So anyway, here is my understanding of how that command works, from TFM:

The '-st' and '-sf' command line options are used to inform previously running
processes that a configuration is being reloaded. They will receive the SIGTTOU
signal to ask them to temporarily stop listening to the ports so that the new
process can grab them. If anything wrong happens, the new process will send
them a SIGTTIN to tell them to re-listen to the ports and continue their normal
work. Otherwise, it will either ask them to finish (-sf) their work then softly
exit, or immediately terminate (-st), breaking existing sessions.
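
To make that handshake concrete, here is a minimal sketch of the soft reload, assuming the same paths used above (illustrative only, not taken from the init script):

# Illustrative soft reload, using the paths from earlier in this thread.
OLD_PID=$(cat /var/run/haproxy.pid)
haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf "$OLD_PID"
# The new process writes its own PID to the pid file, asks $OLD_PID to stop
# listening and finish its in-flight work, then takes over the ports.
kill -0 "$OLD_PID" 2>/dev/null && echo "old haproxy ($OLD_PID) still draining" || echo "old haproxy gone"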

What's interesting is that firing this command manually multiple times from a shell works exactly as expected every time. And what's even more interesting to me is that this behavior doesn't seem to show up with consul-template v0.10.0 - I don't currently have a good idea as to how/why that would work differently in v0.11.0.

It's anecdotal, but I did observe two instances of v0.11.0 side by side, one with debug logging and one with logging sent straight to /dev/null. Both exhibited this behavior, but the one with no logging pretty consistently orphaned more haproxies than the one with debug logging. (Before I switched the first one from no logs to debug logs, it was orphaning processes at roughly the same rate as the second node.) Anyway, if this is a race condition, maybe the extra overhead of logging does actually affect the behavior.

There is almost definitely a discrepancy between v0.10.0 and v0.11.0 though. I just checked back on two nodes in the same environment, same config, same consul dc, same template, one with v0.10.0 and one with v0.11.0, and after several hours, one has a single haproxy process running, and the other has 40.

@consultantRR okay - I spent a lot of time on this today, and I have a way to reproduce it.

I am able to reproduce this 100% of the time when /var/run/haproxy.pid:

  1. Does not exist
  2. Exists but is empty
  3. Exists with a PID that isn't valid

I was able to reproduce this under CT master, 0.11.0, 0.10.0, and 0.9.0.

Because of this, I think the version issue is actually a red herring. I think the reason it "works" on CT v0.10.0 is that you already have a running haproxy process on those nodes, and you're trying to use CT v0.11.0 on a node that doesn't have an haproxy instance already running. I could be totally wrong, but that's my only way to reproduce this issue at the moment: if haproxy isn't running, the PID is invalid, and haproxy does something really strange - it hangs onto the subprocess it spawns, but it doesn't hang the parent process, so CT thinks it has exited.
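
For anyone wanting to rule out those pid-file failure modes, a minimal guarded version of the reload command might look like this (a sketch, assuming the same paths as above; consul-template itself does nothing like this):

PIDFILE=/var/run/haproxy.pid
# Only pass -sf when the pid file is non-empty and names a live process;
# otherwise start haproxy fresh instead of handing it a bogus PID.
if [ -s "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
  haproxy -f /etc/haproxy/haproxy.cfg -q -p "$PIDFILE" -sf "$(cat "$PIDFILE")"
else
  haproxy -f /etc/haproxy/haproxy.cfg -q -p "$PIDFILE"
fi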

Now, when the PID exists, they are both very happy:

Here is CT v0.11.0

root      1973  0.0  1.8  11440  7056 ?        Sl   04:43   0:00 consul-template -config /etc/haproxy/consul-template.hcl
root      2068  0.0  0.3  12300  1128 ?        Ss   04:44   0:00 haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf 2046

and here is CT v0.10.0

root      1974  0.0  1.8   9468  6808 ?        Sl   04:42   0:00 consul-template -config /etc/haproxy/consul-template.hcl
root      2057  0.0  0.3  12300  1128 ?        Ss   04:44   0:00 haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf 2035

Obviously the PIDs are different because these are different VMs, but it's the same exact script and configuration on all the machines.

Each time I change the key in Consul, the PID changes (meaning CT successfully reloads the process).

I'm not really sure where to go from here. I'm out of debugging possibilities, and I'm fairly certain this is a problem with the way haproxy issues its reloads.

Please let me know what you think.


Random sanity check question - is it possible there are multiple consul-template instances running on any of these problem machines?

First of all, thank you @sethvargo, immensely, for the attention to this and the time spent thus far. It is much, much appreciated, and it is a hallmark of HashiCorp, and of your work in particular, Seth.

I am not sure what to think about this either.

I can say that my tests today have included:

Starting both CT 10 and 11 with haproxy already running.
Stopping all haproxy processes and restarting haproxy while CT is running.

In all cases CT 11 has exhibited this behavior (with the template I posted above - haven't tried other templates) and in no cases has CT 10 exhibited this behavior.

In fact, when switching back from CT 10 to CT 11, I see this behavior again on the same node. And vice versa (switching from CT 11 to CT 10 eliminates this behavior in all cases on the same node).

One thing I have noticed:

In all cases (again this is always under CT 11 and never under CT 10) in which the node gets into a state in which >1 haproxy processes are erroneously running, exactly 2 invocations of service haproxy stop are needed to stop all haproxy processes, no matter how many are running.

Meaning, the "remedy" for n number of haproxies (as long as n>1) looks like this:

# ps ax | grep haproxy.cfg | grep -v grep | wc -l
6
# service haproxy stop
Stopping haproxy:                                          [  OK  ]
# service haproxy stop
Stopping haproxy:                                          [  OK  ]
# service haproxy stop
Stopping haproxy:                                          [FAILED]
# service haproxy start
Starting haproxy:                                            [  OK  ]
# ps ax | grep haproxy.cfg | grep -v grep | wc -l
1

And btw, thanks @slackpad for the sanity check - (sanity is always good!) - no, in all cases referenced above, I can confirm only a single instance of CT running.

Hi @consultantRR

Could you try setting max_stale = 0 in your config and see if that makes any difference?

To be sure, it didn't seem to like max_stale = 0, @sethvargo:

* error converting max_stale to string at "/etc/haproxy/consul-template.hcl"

But it was OK with max_stale = "0", and sadly, the behavior was the same. Wound up with 3 haproxies after about 12 reloads.

@consultantRR well I'm officially out of ideas 😦 Maybe @armon or @ryanuber has some?

My 2 cents:

I have seen orphan haproxy instances frequently (and I'm not using consul-template with haproxy yet).
In my case I think it is related to websocket connections kept alive for a long time. If I run netstat -apn | grep PID I see a dozen established connections.

In other words, it would be interesting to see whether the 'orphan' haproxy instances are really orphaned or are just waiting for one side to actually close the connection.

@zonzamas I'm not super familiar with haproxy - is there an easy way for @consultantRR to do that?

I use standard Linux tools.

With PID being the actual PID of an orphan:

netstat -apn | grep PID to get established connections

strace -fp PID to see what the process is actually doing

from netstat -apn I get something like:

tcp        0      0 10.0.1.6:57222          10.0.34.216:5672        ESTABLISHED 18749/haproxy
tcp        0      0 10.0.1.6:5672           10.0.1.60:58411         ESTABLISHED 18749/haproxy
tcp        0      0 10.0.1.6:39279          10.0.32.60:5290         ESTABLISHED 18749/haproxy
tcp        0      0 10.0.1.6:5672           10.0.2.206:53716        ESTABLISHED 18749/haproxy
tcp        0      0 10.0.1.6:35486          10.0.16.60:5290         ESTABLISHED 18749/haproxy
tcp        0      0 127.0.0.1:81            127.0.0.1:56015         ESTABLISHED 18749/haproxy
tcp        0      0 127.0.0.1:81            127.0.0.1:38882         ESTABLISHED 18749/haproxy
tcp        0      0 10.0.1.6:57163          10.0.32.60:5290         ESTABLISHED 18749/haproxy
tcp        0      0 10.0.1.6:5672           10.0.2.206:53740        ESTABLISHED 18749/haproxy
tcp        0      0 10.0.1.6:36761          10.0.16.60:5290         ESTABLISHED 18749/haproxy

I will do some controlled testing on this @zonzamas - but offhand, I don't believe this is the issue as:

(1) the processes stay running for hours and only increase in number, they don't decrease
(2) this simply does not happen on any node (for me) running pre-v11 CT
(3) switching CT versions back and forth on the same node begins or ends this behavior (respectively) immediately after the version switch

Actually, I should clarify: when I say "pre-v11 CT", I specifically mean v10. I have not tested versions prior to v10 in the context of this issue.

I'm still at a loss for what might be causing this. I'm going to release CT 0.11.1 without a fix for this because I do not want to delay further. Sorry!

Thanks @consultantRR and @sethvargo for the work on this.
We've nothing to add other than we observed the exact same behavior after upgrading to consul-template v0.11.1 from v0.10.0; after which we reverted back to v0.10.0. We are running haproxy in a container along with consul-template.

Thank you @sethvargo – I suspected it would take time for us (collectively) to get to the bottom of this one, and am glad it didn't hold up the release of 0.11.1. I have been intending to do some additional testing and reporting here, but have been focussed elsewhere in the past few days. I'll be back on this soon, I hope.

@pmbauer – Interesting, and thanks for the report. That is helpful. Good to know this seems to be reproducible for you in the same ways I have been seeing.

We are also having the same issue since 0.11.0 (0.11.1 is also affected).
It seems haproxy cannot stop the previous PID(s), so the previous HAProxy keeps running.
Even SIGTERM doesn't work; I had to use SIGKILL.

I've also had to go back to v0.10.0 from v0.11.1 because of this issue. While investigating, I read that this issue can be caused by soft reloading haproxy (using -st or -sf) so quickly that the previous process hasn't had time to die. Is it possible that consul-template is calling the configured command in rapid succession and/or ignoring the wait parameter?

I've created Docker containers that install 0.10.0, 0.11.0 and v0.11.1 on top of the official haproxy image for anyone wanting to do a quick test on the differences in behavior.

https://hub.docker.com/r/alangibson/haproxy-consul-template/

Is it possible that the newer version of Consul Template (which is compiled on Go 1.5.1) is just exiting faster and thus making this issue more obvious?

Consul Template doesn't do anything special with haproxy - it simply spawns a subprocess and returns when that subprocess reports it has finished. It's very standard, which makes me think this is a problem with the way haproxy performs reloads, and not a problem with CT itself.
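
In shell terms, the effect is roughly this (illustrative only, not CT's actual code): the reload command returns almost immediately because haproxy daemonizes itself, so from the caller's point of view the command has finished even though the haproxy processes live on.

sh -c 'haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf "$(cat /var/run/haproxy.pid)"'
echo "reload command exited with status $?"   # returns right away; haproxy has daemonized
pgrep -a haproxy                              # old and new haproxy processes continue independently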

@sethvargo I cannot SIGTERM them, so HAProxy cannot do it either - so it is definitely not just faster exiting.

@sethvargo maybe this is a problem with spawning subprocesses in Go 1.5.1.
Can I still compile 0.11.1 against the Go version that was used for 0.10.0? Which version was it?

Also @sethvargo, if that were the case, doesn't it seem like we'd at least have seen a few cases of this behavior on v10, particularly in scenarios with faster hardware? I am not aware we've seen any cases at all prior to v11.

@sielaq it was compiled using go 1.4.3 previously. You can try compiling CT 0.11.1 against the older version of go - I would love to hear what you come up with.

@consultantRR I'm honestly not sure. There's nothing in the codepath changes between 0.10 and v0.11 that would indicate a problem, and I've been unable to reproduce this issue with a bisect to try to narrow down the offending commit.

@sethvargo yep, I can confirm this issue comes with Go 1.5.1.
With Go 1.4.3, CT 0.11.1 (and 0.12-dev) is definitely fine!
I hope this helps you find the root cause.

Oh boy. @sielaq this is super useful, thank you very much for doing that test. The good news is that we know the cause. The bad news is - I have no clue how to fix this besides the obvious of "compile with an old version of go".

I'm going to see if @mitchellh or @armon have any ideas.

/cc @slackpad

For my own sanity, could someone try running the Go 1.5 version of CT (0.11.1) with the environment variable GOMAXPROCS=1 and see if the issue persists please?

@sethvargo, sorry for the late answer (TZ=CET 😄). I have set GOMAXPROCS=1 and got the same results, so from my observation it has no influence.

@sethvargo We should determine if this is caused by a Go 1.5 bug or if it's just a race due to Go 1.5 being faster. The latter is our problem; the former we can obviously work on with the Go team. In the interim, I don't think it'd hurt to recompile 0.11 with Go 1.4 while we fix this.

To do this, though, we'd need a pretty good repro. I think the best way might be to mimic @consultantRR's setup, which is just using Amazon Linux. We can use TF to make a mimic environment and hopefully get the repro. I'm confident, since @consultantRR can repro it 100% of the time, that we'll be able to emulate it.

Great, @mitchellh – if you guys have any trouble with the repro, I am more than happy to pitch in and help you emulate my exact environment as closely as possible.

@sethvargo @mitchellh
You can use the Vagrantfile from our project to reproduce it too - it will spawn a full Docker environment with Consul and haproxy running. It contains much more than is needed, but it is easy to play with:
https://github.com/eBayClassifiedsGroup/PanteraS/blob/master/Vagrantfile
I can help in any case.


Hi everyone, I actually opened the issue with regard to the haproxy service script (sous-chefs/haproxy#114). Without consul-template this is very easy to reproduce. If you look through the script you'll see there is a window where concurrency causes the issue.

service haproxy reload & service haproxy reload

If a node contains multiple services and these all come up at once there will be multiple events which seem to produce multiple concurrent reloads. I imagine this could be resolved if consul-template either waited and collated these into one reload, or staggered the reloads (arbitrary sleep - yuck).
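
One way to approximate the "collate into one reload" idea outside of consul-template is to serialize the command with an exclusive lock, so overlapping invocations queue instead of racing (a sketch; the lock path is arbitrary):

# Wrap the reload in flock so only one reload runs at a time;
# the pid file is read inside the lock, avoiding a stale -sf target.
flock /var/lock/haproxy-reload.lock -c \
  'haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf "$(cat /var/run/haproxy.pid)"'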

Interesting, @kcd83. You think the reason this doesn't seem to reproduce in CT built on older Go versions is that execution is simply slower in just a basic race condition?

@kcd83 nope, I'm using self made haproxy_reload.sh (with iptables switching) and I can run it like:

while true; do ./haproxy_reload.sh; done

and I have no problems at all. (The problem appears only when it is spawned by CT.)

If you read my comments above, I have already shown the reason: processes spawned by CT (built with Go 1.5) are not able to trap SIGTERM - the end.


@sielaq your script runs sequentially, try two or more concurrent executions

@kcd83 I can run it concurrently; iptables -w does the trick.

@sethvargo I have created a dummy C program to reproduce it.
I chose C to be agnostic of the Go compiler.
https://gist.github.com/sielaq/9867be2d1b65ffec658c

What it does:
it starts a new daemon and kills any other PID given on the command line - the same as HAProxy does.

consul-template \
  -consul=10.0.0.1:8500 \
  -template haproxy.cfg.ctmpl:/tmp/haproxy.cfg:'./trap_signal $(pidof trap_signal)'

If you run it with CT (Go 1.4) -> you see /tmp/log.txt containing the received signals.
If you run it with CT (Go 1.5) -> you see nothing new in /tmp/log.txt,
and more and more instances of trap_signal in the process list (which cannot be killed).

If you are not able to reproduce it,
I can create a Docker container with all the pieces.


The iptables -w is a good addition to your script.

The issue I see with any concurrent reload script is that the haproxy -sf $(cat pid) step can do this:

Thread 1: read pid
Thread 2: read pid
Thread 1: haproxy -sf pid
Thread 2: haproxy -sf pid
Thread 1: kills original pid
Thread 2: tries to kill original pid, does not kill new thread 1 pid

I need to confirm whether this stale -sf is what causes haproxy to ignore the SIGTERM. I haven't tested it yet.

@kcd83 read my comment above - you can use this simple app instead of HAProxy to reproduce the problem. You keep trying to find a race condition here, while the problem exists even when CT triggers the haproxy reload at 1-minute intervals.

There is a similar issue:
golang/go#13164
I have created a small app there to reproduce it ...

Just for completeness, I wanted to anecdotally confirm on this thread that I rolled a build of CT v0.11.1 which was compiled on Go 1.4.3 into the infrastructure tested and described all throughout this thread. This has been in place for several days on multiple nodes, and since then, the issue has not occurred once.

Hello, sorry if this is not relevant, but I found this thread while having a similar issue. I am not using consul, just haproxy with haproxy-systemd-wrapper and supervisord in a container. I always have an issue with concurrent reloads creating multiple haproxy processes, but recently started seeing multiple even when guaranteeing the reloads were serialized.

What I saw was the pid file would be either empty or missing for a small time after issuing the signal. If another reload came in during this time I would end up with multiple processes. So what was really happening was when I had a large config file the startup time of haproxy seems to be a bit longer and would take longer to populate the pid file causing the issue. Now my restart script waits for the pid file to be present, non-empty and have a different pid than before the restart before exiting. This seems to solve the problem for me. Here is a snippet I have, HAWRAP_PID is my haproxy-systemd-wrapper pid:

# $HAWRAP_PID is the haproxy-systemd-wrapper pid (set earlier in the restart script)
OLD_PID=`cat /var/run/haproxy.pid`
kill -SIGUSR2 $HAWRAP_PID
NEW_PID=`cat /var/run/haproxy.pid 2>/dev/null || true`
# Wait until the pid file is present, non-empty, and contains a different pid than before the reload
while [ "$NEW_PID" == "" ] || [ "$OLD_PID" == "$NEW_PID" ]; do
  sleep 0.05
  NEW_PID=`cat /var/run/haproxy.pid 2>/dev/null || true`
done

We were also experiencing this issue after we upgraded to 0.11.1, in a reverse proxy container where consul-template is used in combination with haproxy. I can confirm that downgrading to 0.10.0 solved the issue for us (didn't have time to compile 0.11.1 with Go 1.4, sorry).

Hi @consultantRR

How did you get Consul Template to compile against Go 1.4.3? I'm getting errors from our gatedio library because of an API that only exists in go 1.5:

b.b.Cap undefined (type *bytes.Buffer has no field or method Cap)

@sethvargo:

I just spoke with my colleague who actually did the compile and his response was "it just compiled". :)

He did say it was compiled inside the official Golang 1.4.3 Docker image, and that it looks like the gatedio lib was added the day after he successfully compiled it.

If you remain unable to get it to compile, let me know and we'll see if we can get it working.

@consultantRR I just tried the golang 1.4.2 and 1.4.3 to no avail 😦

I'm going to release Consul Template v0.12.0 under Go 1.5.2 and then spend some time next week figuring out if there's anything we can do here.

@sethvargo I have tried under linux with gvm and compiling works fine.

gvm install go1.4.3
gvm use go1.4.3

@sielaq are you running the latest master with the most recent deps?

@sethvargo go get / go build always takes the latest master by default,
so yes, I have run against the latest master.

I don't doubt your super skills, you are much better than me,
but if you have different results, this could be some environment issue.
You can try running on any plain Ubuntu with gvm:

bash < <(curl -s -S -L https://raw.githubusercontent.com/moovweb/gvm/master/binscripts/gvm-installer)
source "$HOME/.gvm/scripts/gvm"
gvm ...

My co-worker @kreisys and I have been seeing this problem too. One thing to note: if you run:

service haproxy reload & service haproxy reload & service haproxy reload & service haproxy reload & service haproxy reload

This will spawn a bunch of reloads of haproxy concurrently in the background. What we see happen is we get a whole bunch of haproxy instances all with the same -sf parameters, so there's a bunch of haproxy instances competing to try to take control of the socket and kill the old haproxy.

Is consul-template trying to run the same command multiple times concurrently? Could we fix this just by preventing this?
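
One rough way to check for that from the outside is to look for multiple haproxy processes that were all told to replace the same old PID (illustrative; this relies on the -sf PID being the last argument, as in the ps output earlier in this thread):

# Any PID printed here was passed as the -sf target to more than one running haproxy.
ps -eo args | grep '[h]aproxy .* -sf ' | awk '{print $NF}' | sort | uniq -d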

I had this issue as well with haproxy 1.5/1.6 and consul-template 0.12. We previously had no issues with consul-template 0.10. Recompiling 0.12 using golang 1.4 does seem to fix the problem. If it's helpful for anyone, here's my linux amd64 binary:

https://dl.dropboxusercontent.com/u/515268/consul-template-0.12-go1.4

Same here, I use consul-template 0.12.0 after recompiling it with go 1.4.3.
Works like a charm and no more problems with haproxy.


A workaround is to use quiescence (not just in everyday conversation).

But seriously, a wait as low as 3 seconds seems sufficient to stop the rapid flip-flopping that generates multiple haproxy instances, which I believe is exactly what this parameter is designed for.

{
  "consul": "localhost:8500",
  "wait": "3s"
}

Some thrash testing is required to prove it is 100%.

@kcd83 I don't think that solves the root problem, see golang/go#13164


@pmbauer OK, I agree, thanks for pointing that out.

wait does not address the root cause; however, changing the configuration of a typical application several times in a fraction of a second is probably not helpful anyway.

btw, as @ianlancetaylor mentioned, it seems the main issue was in golang signal handling:
https://go-review.googlesource.com/#/c/18064/
That was a very sophisticated bug.

Just as an update here - it looks like this has been fixed in Go 1.6 (unreleased) and was a bug in Go itself.

I'm going to keep this issue open until Go 1.6 is officially released.

Thanks @sethvargo, this is great news.

Go 1.6 beta is out; you can also pull and build with its image from Docker Hub to verify whether the bug still persists:
https://hub.docker.com/_/golang/
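
For reference, a build along those lines might look like this (illustrative only; the image tag and dependency steps are assumptions, not a tested recipe):

# Build consul-template inside the official golang image (pick whatever 1.6 tag is available).
docker run --rm \
  -v "$PWD":/go/src/github.com/hashicorp/consul-template \
  -w /go/src/github.com/hashicorp/consul-template \
  golang:1.6 sh -c 'go get -d ./... && go build -o consul-template'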

We encountered a similar problem with Docker/consul-template on AWS EC2 instances, bringing the EC2 instance to 100% CPU. We use version 0.12.1. The debug message is: Reaping child process 0.
In our case we are invoking nginx.

Our workaround is to add -reap=false to the command, but we are not sure what the side effects are. Please advise.

We have the same issue with 0.12.0: Reaping child process 0. -reap=false can fix it, but when disabling reap we are left with a lot of HAProxy zombies.

Edit: Actually, my error message is [ERR] (runner) error running command: wait: no child processes, not Reaping child process 0. Sorry :/

Hi @rlcomte and @geniousphp setting -reap=false should work around the issue, though it may cause zombie processes to be left behind. I split this out into #507 so we can figure out what's going on, since it should be unrelated to this issue. Can you please provide any details about the environment where you see this over there?

I've tried rebuilding consul-template v0.13.0-dev (7c000ce) with go version go1.6beta2 linux/amd64, and I can no longer reproduce the issue of unreaped subprocesses. I am using consul-template to run haproxy in an Alpine container btw.

PS @sethvargo could the waiting-reply label be removed to reflect the issue's current status?

Until there's an official release from HashiCorp, I have a build of 0.12.2 with Go 1.6 (beta2, then rc1, now rc2) released via Circle CI, with an included Dockerfile, etc., for those who would like to build it themselves. Some might find it useful.

I'd have used an older build, but I'd already started relying on 0.11 syntax and features 😅

Hello, the same problem is resolved for me with consul-template 0.12.2 and go 1.6rc1. Thank you @duggan for your dockerfile!

I don't have much to add to the above. But we upgraded from 0.10.x to 0.12.2 today and we ended up with dozens of haproxies.

We tried 0.12.2 on CoreOS 835.12.0 using haproxy 1.6.3 without success. There are no errors in consul-template logs or haproxy logs. After a few hours our network service (Midokura) started to get overloaded with a large number of SYN requests. We rolled back to 0.10.2 and the network service (Midokura) recovered.

Average:        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
Average:           lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:         eth0 111390.07      0.00 124128.80      0.00      0.00      0.00      0.68
Average:         eth1  18888.36      0.00   4970.44      0.00      0.00      0.00      0.68
Average:         eth2   1880.14   7096.23    121.18   9651.90      0.00      0.00      0.00
22:23:30.261584 IP (tos 0x0, ttl 63, id 46818, offset 0, flags [DF], proto TCP (6), length 60)
    172.16.1.173.51858 > 172.16.2.13.webcache: Flags [S], cksum 0x5c09 (incorrect -> 0x2383), seq 614922376, win 29200, options [mss 1460,sackOK,TS val 14044778 ecr 0,nop,wscale 7], length 0
22:23:30.261641 IP (tos 0x0, ttl 63, id 52396, offset 0, flags [DF], proto TCP (6), length 60)
    172.16.1.173.33782 > 172.16.2.12.webcache: Flags [S], cksum 0x5c08 (incorrect -> 0x1013), seq 4101341894, win 29200, options [mss 1460,sackOK,TS val 14044778 ecr 0,nop,wscale 7], length 0
22:23:30.263228 IP (tos 0x0, ttl 63, id 22526, offset 0, flags [DF], proto TCP (6), length 60)
    172.16.1.173.41282 > 172.16.1.199.webcache: Flags [S], cksum 0x5bc3 (incorrect -> 0xe280), seq 1848711573, win 29200, options [mss 1460,sackOK,TS val 14044779 ecr 0,nop,wscale 7], length 0

@brycechesternewman unless you used @duggan 's custom build against unreleased Go 1.6, the issue is still there.

go 1.6 was released today: https://golang.org/dl/#go1.6

@sethvargo can a new release of consul-template be cut please?

0.13.0 is released and compiled with go 1.6!

Our long national nightmare is over. :)

thank you @sethvargo !

I'm running the 0.13.0 release pulled from the following URL and still seeing occurrences of multiple HAProxy processes running after a service reload. Is anybody else still seeing this issue?

https://releases.hashicorp.com/consul-template/0.13.0/consul-template_0.13.0_linux_amd64.zip

Yes, just today we found five extra haproxies with 0.13.0. Clearly this issue persists. And we have already upgraded production to 0.13.0, so I guess we need to scramble to roll back.


@sethvargo Should I open a new issue to investigate the persistence of this bug, or can we reopen this one?

FWIW, 0.13.0 fixes the problem for me. haproxy reload is supposed to leave behind the previous process with the -sf [pid] flag to let the previous process finish out its requests. On a busy system that reloads frequently, it's expected to see some number of previous haproxy processes still running. If you refresh the ps command, you should see the haproxy -sf [pid] rotate PIDs.
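
A quick way to watch that rotation (illustrative): after each reload the -sf target should change and the older process should disappear once it drains; a PID that sticks around for hours and keeps serving a stale config is the behavior described in this issue.

watch -n 1 'pgrep -af "haproxy -f /etc/haproxy/haproxy.cfg"'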

Victor, it mostly fixed the problem for us as well - at least from when 0.13 was released until now. It can't be just that, though, as the old haproxies are still handling requests. Well, at least they are issuing 503s, since their view of the services they are routing to is now invalid. And we have never, ever seen this problem prior to 0.12, so clearly something new is happening.



We've added wait = "1s" to /etc/consul-template/config.d/consul-template.conf to work around this problem for now.

Hi everyone,

Since this issue seems to be getting a lot of noise, I'm going to clarify that this has been fixed in Consul Template 0.13+. Please open any other concerns as separate issues. haproxy itself will leave behind previous processes for a short period of time to finish existing requests, so unless you see a multitude of haproxy processes running each time CT restarts the process, this issue does not apply to you 😄