hashicorp / consul-template

Template rendering, notifier, and supervisor for @HashiCorp Consul and Vault data.

Home page: https://www.hashicorp.com/


issue with haproxy reloads

ProbablyRusty opened this issue · comments

Using consul-template (v0.11.0) with haproxy, I am seeing an issue in which multiple haproxy processes stack up over time as consul-template rewrites the haproxy.cfg file and fires off the reload command.

In this scenario, consul-template is running as root.

Here is the config:

consul = "127.0.0.1:8500"

template {
  source = "/etc/haproxy/haproxy.template"
  destination = "/etc/haproxy/haproxy.cfg"
  command = "haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf $( cat /var/run/haproxy.pid )"
}

Manually running the reload command (haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf $( cat /var/run/haproxy.pid )) works fine and does not stack up multiple haproxy processes.

But, for example, after a few consul-template rewrites of haproxy.cfg, here is what I see:

10258 ?        Ss     0:00 haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf 10029
10262 ?        Ss     0:00 haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf 10258
10270 ?        Ss     0:00 haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf 10266
10369 ?        Ss     0:00 haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf 10365
10427 ?        Ss     0:00 haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf 10423
10483 ?        Ss     0:00 haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf 10479

Any thoughts on what might be happening here, and why the behavior differs between consul-template running this reload command and my running the same command manually (outside the purview of consul-template) from a shell?

Hi @consultantRR

It looks like this might be a duplicate of #428. Are you running in a container by chance?

@sethvargo No I am not. I originally saw #428 and was hopeful for an answer, but since it seemed to be container specific, I decided to open a separate issue.

For clarity:

I am running consul-template in Amazon Linux and it is being invoked (as root) as follows:

nohup consul-template -config /etc/haproxy/consul-template.hcl >/dev/null 2>&1 &

@consultantRR

Can you change that to print to a logfile and also run in debug mode and paste the output here after an haproxy restart please?

Okay @sethvargo, I have a debug log covering the time period in which 8 haproxy restarts took place. I had hoped to simplify this log example and show only 1 haproxy restart, but I was not able to reproduce the issue with only a single restart. Due to length, I will paste log lines covering 1 restart at the end of this post. (Looks like business as usual to me - I see no issues.)

In this example, each restart took place about 6-7 seconds after the previous one. Each time, I invoked this restart by taking a node referenced in the template in or out of Consul maintenance mode.

Prior to this example log, one haproxy process was running. After this example log (8 restarts), three haproxy processes were left running permanently.

To be clear, this was the invocation of consul-template for this test:

nohup consul-template -log-level debug -config /etc/haproxy/consul-template.hcl >/var/log/consul-template.log 2>&1 &

Here are the first few lines of the log, showing the config:

nohup: ignoring input
2015/10/21 15:59:14 [DEBUG] (config) loading configs from "/etc/haproxy/consul-template.hcl"
2015/10/21 15:59:14 [DEBUG] (logging) enabling syslog on LOCAL0
2015/10/21 15:59:14 [INFO] consul-template v0.11.0
2015/10/21 15:59:14 [INFO] (runner) creating new runner (dry: false, once: false)
2015/10/21 15:59:14 [DEBUG] (runner) final config (tokens suppressed):

{
  "path": "/etc/haproxy/consul-template.hcl",
  "consul": "127.0.0.1:8500",
  "auth": {
    "enabled": false,
    "username": "",
    "password": ""
  },
  "vault": {
    "renew": true,
    "ssl": {
      "enabled": true,
      "verify": true
    }
  },
  "ssl": {
    "enabled": false,
    "verify": true
  },
  "syslog": {
    "enabled": true,
    "facility": "LOCAL0"
  },
  "max_stale": 1000000000,
  "templates": [
    {
      "source": "/etc/haproxy/haproxy.template",
      "destination": "/etc/haproxy/haproxy.cfg",
      "command": "haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf $( cat /var/run/haproxy.pid )",
      "perms": 420
    }
  ],
  "retry": 5000000000,
  "wait": {
    "min": 0,
    "max": 0
  },
  "pid_file": "",
  "log_level": "debug"
}

2015/10/21 15:59:14 [INFO] (runner) creating consul/api client
2015/10/21 15:59:14 [DEBUG] (runner) setting consul address to 127.0.0.1:8500

Maybe a red herring, but possibly of interest:

During this 8-restart test, I had a separate haproxy node running with the exact same config (and template) as the node I have logged here. Only difference was that consul-template on that node was not logging. Invocation for consul-template on that node was:

nohup consul-template -config /etc/haproxy/consul-template.hcl >/dev/null 2>&1 &

On this node, after the same 8-restart test, 8 haproxy processes were left running (as opposed to 3 haproxy processes on the logged node). In further tests, extra haproxy processes do seem to stack up much more quickly on this node than on the one that debug logging is now enabled on.

I may try to craft a methodology for a simpler, more isolated and controlled test which still shows this behavior. If so, I will post results here.

For now, here is part of the debug log, covering 1 restart:

2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 1 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 1 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] (runner) receiving dependency "service(prod.*redacted*)"
2015/10/21 16:17:55 [DEBUG] (runner) receiving dependency "service(prod.*redacted*)"
2015/10/21 16:17:55 [DEBUG] (runner) receiving dependency "service(prod.*redacted*)"
2015/10/21 16:17:55 [DEBUG] (runner) receiving dependency "service(prod.*redacted*)"
2015/10/21 16:17:55 [INFO] (runner) running
2015/10/21 16:17:55 [DEBUG] (runner) checking template /etc/haproxy/haproxy.template
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] (runner) checking ctemplate &{Source:/etc/haproxy/haproxy.template Destination:/etc/haproxy/haproxy.cfg Command:haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf $( cat /var/run/haproxy.pid ) Perms:-rw-r--r--}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] (runner) wouldRender: true, didRender: true
2015/10/21 16:17:55 [DEBUG] (runner) appending command: haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf $( cat /var/run/haproxy.pid )
2015/10/21 16:17:55 [INFO] (runner) diffing and updating dependencies
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "key(*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "services" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "key(*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) running command: `haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf $( cat /var/run/haproxy.pid )`
2015/10/21 16:17:55 [INFO] (runner) watching 35 dependencies

@sethvargo I just reverted this node to a former consul-template version and performed this same test (8 restarts) with consul-template v0.10.0, and at the end of the test only 1 haproxy process was running.

By comparison, at the end of this new test the other haproxy node (running v0.11.0 with no logging) had 8 haproxy processes running.

Hi @consultantRR

Can you share your actual Consul Template template please?

The full diff between 0.10.0 and 0.11.0 is here: v0.10.0...v0.11.0. I'm going to try to get a reproduction together, but I have been unsuccessful thus far.

Here it is, with only two minor redactions:

global
    daemon
    maxconn 4096
    log 127.0.0.1   local0
    daemon

defaults
    log global
    mode http
    timeout connect 5000ms
    timeout client 60000ms
    timeout server 60000ms
    option http-server-close

listen haproxyadmin
    bind *:8999
    stats enable
    stats auth haproxyadmin:{{key "*redacted*"}}

listen http_health_check 0.0.0.0:8080
    mode health
    option httpchk

frontend http_proxy
    bind *:8888
    acl non_ssl hdr(X-Forwarded-Proto) -i http
    redirect prefix {{key "*redacted*"}} code 301 if non_ssl
{{range services}}{{$services := . }}{{range .Tags}}{{if eq . "microservice"}}
    acl {{$services.Name}} path_reg -i ^\/{{$services.Name}}(\/.*|\?.*)?${{end}}{{end}}{{end}}
{{range services}}{{$services := . }}{{range .Tags}}{{if eq . "microservice"}}
    use_backend {{$services.Name}} if {{$services.Name}}{{end}}{{end}}{{end}}
{{range services}}{{$services := . }}{{range .Tags}}{{if eq . "microservice"}}
backend {{$services.Name}}{{$this_service := $services.Name | regexReplaceAll "(.+)" "prod.$1"}}
    balance roundrobin{{range service $this_service}}
    server {{.Name}} {{.Address}}:{{.Port}} maxconn 8192{{end}}{{end}}{{end}}
    {{end}}

@consultantRR just to be clear - you aren't using the vault integration at all, right?

Not at this time.

The two key references in the template above are Consul KV keys.

Hi @consultantRR

I did some digging today, and I was able to reproduce this issue exactly once, and then it stopped reproducing.

Are you able to reproduce this with something that isn't haproxy? What's interesting to me is that haproxy orphans itself (it's still running even after the command returns and consul template quits), but I wonder if there's a race condition there somehow.

Hi @sethvargo - I haven't reproduced this with something other than haproxy, but also, I can't say that I have tried. ;)

Just to think out loud about what may be happening, here is the reload command again (it should be noted that the reload command in the haproxy init.d script in Amazon Linux is basically the exact same thing as this - in fact this is what I used for months, and only switched it to the explicit command below when beginning to troubleshoot this newfound problem):

haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf $( cat /var/run/haproxy.pid )

So anyway, here is my understanding of how that command works, from TFM:

The '-st' and '-sf' command line options are used to inform previously running
processes that a configuration is being reloaded. They will receive the SIGTTOU
signal to ask them to temporarily stop listening to the ports so that the new
process can grab them. If anything wrong happens, the new process will send
them a SIGTTIN to tell them to re-listen to the ports and continue their normal
work. Otherwise, it will either ask them to finish (-sf) their work then softly
exit, or immediately terminate (-st), breaking existing sessions.
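
To make that handshake concrete, here is a minimal sketch of the soft reload, assuming the same paths used above (illustrative only, not taken from the init script):

# Illustrative soft reload, using the paths from earlier in this thread.
OLD_PID=$(cat /var/run/haproxy.pid)
haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf "$OLD_PID"
# The new process writes its own PID to the pid file, asks $OLD_PID to stop
# listening and finish its in-flight work, then takes over the ports.
kill -0 "$OLD_PID" 2>/dev/null && echo "old haproxy ($OLD_PID) still draining" || echo "old haproxy gone"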

What's interesting is that firing this command manually multiple times from a shell works exactly as expected every time. And what's even more interesting to me is that this behavior doesn't seem to show up with consul-template v0.10.0 - I don't currently have a good idea as to how/why that would work differently in v0.11.0.

It's anecdotal, but I did observe two instances of v0.11.0 side by side, one with debug logging and one with logging sent straight to /dev/null. Both exhibited this behavior, but the one with no logging pretty consistently orphaned more haproxies than the one with debug logging. (Before I switched the first one from no logs to debug logs, it was orphaning processes at roughly the same rate as the second node.) Anyway, if this is a race condition, maybe the extra overhead of logging does actually affect the behavior.

There is almost definitely a discrepancy between v0.10.0 and v0.11.0 though. I just checked back on two nodes in the same environment, same config, same consul dc, same template, one with v0.10.0 and one with v0.11.0, and after several hours, one has a single haproxy process running, and the other has 40.

@consultantRR okay - I spent a lot of time on this today, and I have a way to reproduce it.

I am able to reproduce this 100% of the time when /var/run/haproxy.pid:

  1. Does not exist
  2. Exists but is empty
  3. Exists with a PID that isn't valid

I was able to reproduce this under CT master, 0.11.0, 0.10.0, and 0.9.0.

Because of this, I think the version issue is actually a red herring. I think the reason it "works" on CT v0.10.0 is that you already have a running haproxy process on those nodes, and you're trying to use CT v0.11.0 on a node that doesn't have an haproxy instance already running. I could be totally wrong, but that's my only way to reproduce this issue at the moment: if haproxy isn't running, the PID is invalid, and haproxy does something really strange - it hangs onto the subprocess it spawns, but it doesn't hang the parent process, so CT thinks it has exited.
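
For anyone wanting to rule out those pid-file failure modes, a minimal guarded version of the reload command might look like this (a sketch, assuming the same paths as above; consul-template itself does nothing like this):

PIDFILE=/var/run/haproxy.pid
# Only pass -sf when the pid file is non-empty and names a live process;
# otherwise start haproxy fresh instead of handing it a bogus PID.
if [ -s "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
  haproxy -f /etc/haproxy/haproxy.cfg -q -p "$PIDFILE" -sf "$(cat "$PIDFILE")"
else
  haproxy -f /etc/haproxy/haproxy.cfg -q -p "$PIDFILE"
fi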

Now, when the PID exists, they are both very happy:

Here is CT v0.11.0

root      1973  0.0  1.8  11440  7056 ?        Sl   04:43   0:00 consul-template -config /etc/haproxy/consul-template.hcl
root      2068  0.0  0.3  12300  1128 ?        Ss   04:44   0:00 haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf 2046

and here is CT v0.10.0

root      1974  0.0  1.8   9468  6808 ?        Sl   04:42   0:00 consul-template -config /etc/haproxy/consul-template.hcl
root      2057  0.0  0.3  12300  1128 ?        Ss   04:44   0:00 haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf 2035

Obviously the PIDs are different because these are different VMs, but it's the same exact script and configuration on all the machines.

Each time I change the key in Consul, the PID changes (meaning CT successfully reloads the process).

I'm not really sure where to go from here. I'm out of debugging possibilities, and I'm fairly certain this is a problem with the way haproxy issues its reloads.

Please let me know what you think.


Random sanity check question - is it possible there are multiple consul-template instances running on any of these problem machines?

First of all, thank you @sethvargo, immensely, for the attention to this and the time spent thus far. It is much, much appreciated, and it is a hallmark of HashiCorp, and of your work in particular, Seth.

I am not sure what to think about this either.

I can say that my tests today have included:

Starting both CT 10 and 11 with haproxy already running.
Stopping all haproxy processes and restarting haproxy while CT is running.

In all cases CT 11 has exhibited this behavior (with the template I posted above - haven't tried other templates) and in no cases has CT 10 exhibited this behavior.

In fact, when switching back from CT 10 to CT 11, I see this behavior again on the same node. And vice versa (switching from CT 11 to CT 10 eliminates this behavior in all cases on the same node).

One thing I have noticed:

In all cases (again this is always under CT 11 and never under CT 10) in which the node gets into a state in which >1 haproxy processes are erroneously running, exactly 2 invocations of service haproxy stop are needed to stop all haproxy processes, no matter how many are running.

Meaning, the "remedy" for n number of haproxies (as long as n>1) looks like this:

# ps ax | grep haproxy.cfg | grep -v grep | wc -l
6
# service haproxy stop
Stopping haproxy:                                          [  OK  ]
# service haproxy stop
Stopping haproxy:                                          [  OK  ]
# service haproxy stop
Stopping haproxy:                                          [FAILED]
# service haproxy start
Starting haproxy:                                            [  OK  ]
# ps ax | grep haproxy.cfg | grep -v grep | wc -l
1

And btw, thanks @slackpad for the sanity check - (sanity is always good!) - no, in all cases referenced above, I can confirm only a single instance of CT running.

Hi @consultantRR

Could you try setting max_stale = 0 in your config and see if that makes any difference?

To be sure, it didn't seem to like max_stale = 0, @sethvargo:

* error converting max_stale to string at "/etc/haproxy/consul-template.hcl"

But it was OK with max_stale = "0", and sadly, the behavior was the same. Wound up with 3 haproxies after about 12 reloads.

@consultantRR well I'm officially out of ideas 😦 Maybe @armon or @ryanuber has some?

My 2 cents:

I have seen orphan haproxy instances frequently (and I'm not using consul-template with haproxy yet).
In my case I think it is related to websocket connections kept alive for a long time. If I run netstat -apn | grep PID I see a dozen established connections.

In other words, it would be interesting to see whether the 'orphan' haproxy instances are really orphaned or are just waiting for one side to actually close the connection.

@zonzamas I'm not super familiar with haproxy - is there an easy way for @consultantRR to do that?

I use standard Linux tools.

With PID being the actual PID of an orphan:

netstat -apn | grep PID to get established connections

strace -fp PID to see what the process is actually doing

from netstat -apn I get something like:

tcp        0      0 10.0.1.6:57222          10.0.34.216:5672        ESTABLISHED 18749/haproxy
tcp        0      0 10.0.1.6:5672           10.0.1.60:58411         ESTABLISHED 18749/haproxy
tcp        0      0 10.0.1.6:39279          10.0.32.60:5290         ESTABLISHED 18749/haproxy
tcp        0      0 10.0.1.6:5672           10.0.2.206:53716        ESTABLISHED 18749/haproxy
tcp        0      0 10.0.1.6:35486          10.0.16.60:5290         ESTABLISHED 18749/haproxy
tcp        0      0 127.0.0.1:81            127.0.0.1:56015         ESTABLISHED 18749/haproxy
tcp        0      0 127.0.0.1:81            127.0.0.1:38882         ESTABLISHED 18749/haproxy
tcp        0      0 10.0.1.6:57163          10.0.32.60:5290         ESTABLISHED 18749/haproxy
tcp        0      0 10.0.1.6:5672           10.0.2.206:53740        ESTABLISHED 18749/haproxy
tcp        0      0 10.0.1.6:36761          10.0.16.60:5290         ESTABLISHED 18749/haproxy

I will do some controlled testing on this @zonzamas - but offhand, I don't believe this is the issue as:

(1) the processes stay running for hours and only increase in number, they don't decrease
(2) this simply does not happen on any node (for me) running pre-v11 CT
(3) switching CT versions back and forth on the same node begins or ends this behavior (respectively) immediately after the version switch

Actually, I should clarify: when I say "pre-v11 CT", I specifically mean v10. I have not tested versions prior to v10 in the context of this issue.

I'm still at a loss for what might be causing this. I'm going to release CT 0.11.1 without a fix for this because I do not want to delay further. Sorry!

Thanks @consultantRR and @sethvargo for the work on this.
We've nothing to add other than we observed the exact same behavior after upgrading to consul-template v0.11.1 from v0.10.0; after which we reverted back to v0.10.0. We are running haproxy in a container along with consul-template.

Thank you @sethvargo – I suspected it would take time for us (collectively) to get to the bottom of this one, and am glad it didn't hold up the release of 0.11.1. I have been intending to do some additional testing and reporting here, but have been focussed elsewhere in the past few days. I'll be back on this soon, I hope.

@pmbauer – Interesting, and thanks for the report. That is helpful. Good to know this seems to be reproducible for you in the same ways I have been seeing.

We are also having the same issue since 0.11.0 (0.11.1 is also affected).
It seems haproxy cannot stop the previous PID(s), so the previous HAProxy keeps running.
Even SIGTERM doesn't work; I had to use SIGKILL.

I've also had to go back to v0.10.0 from v0.11.1 because of this issue. While investigating, I read that this issue can be caused by soft reloading haproxy (using -st or -sf) so quickly that the previous process hasn't had time to die. Is it possible that consul-template is calling the configured command in rapid succession and/or ignoring the wait parameter?

I've created Docker containers that install 0.10.0, 0.11.0 and v0.11.1 on top of the official haproxy image for anyone wanting to do a quick test on the differences in behavior.

https://hub.docker.com/r/alangibson/haproxy-consul-template/

Is it possible that the newer version of Consul Template (which is compiled on Go 1.5.1) is just exiting faster and thus making this issue more obvious?

Consul Template doesn't do anything special with haproxy - it simply spawns a subprocess and returns when that subprocess reports it has finished. It's very standard, which makes me think this is a problem with the way haproxy performs reloads, and not a problem with CT itself.
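
In shell terms, the effect is roughly this (illustrative only, not CT's actual code): the reload command returns almost immediately because haproxy daemonizes itself, so from the caller's point of view the command has finished even though the haproxy processes live on.

sh -c 'haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf "$(cat /var/run/haproxy.pid)"'
echo "reload command exited with status $?"   # returns right away; haproxy has daemonized
pgrep -a haproxy                              # old and new haproxy processes continue independently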

@sethvargo I cannot SIGTERM them, so HAProxy cannot do it either - so it is definitely not just faster exiting.

@sethvargo maybe this is a problem with spawning subprocesses in Go 1.5.1.
Can I still compile 0.11.1 against the Go version that was used for 0.10.0? Which version was it?

Also @sethvargo, if that were the case, doesn't it seem like we'd at least have seen a few cases of this behavior on v10, particularly in scenarios with faster hardware? I am not aware we've seen any cases at all prior to v11.

@sielaq it was compiled using go 1.4.3 previously. You can try compiling CT 0.11.1 against the older version of go - I would love to hear what you come up with.

@consultantRR I'm honestly not sure. There's nothing in the codepath changes between 0.10 and v0.11 that would indicate a problem, and I've been unable to reproduce this issue with a bisect to try to narrow down the offending commit.

@sethvargo yep, I can confirm this issue comes with Go 1.5.1.
With Go 1.4.3, CT 0.11.1 (and 0.12-dev) is definitely fine!
I hope this helps you find the root cause.

Oh boy. @sielaq this is super useful, thank you very much for doing that test. The good news is that we know the cause. The bad news is - I have no clue how to fix this besides the obvious of "compile with an old version of go".

I'm going to see if @mitchellh or @armon have any ideas.

/cc @slackpad

For my own sanity, could someone try running the Go 1.5 version of CT (0.11.1) with the environment variable GOMAXPROCS=1 and see if the issue persists please?

@sethvargo, sorry for the late answer (TZ=CET 😄). I have set GOMAXPROCS=1 and got the same results, so from my observation it has no influence.

@sethvargo We should determine if this is caused by a Go 1.5 bug or if it's just a race due to Go 1.5 being faster. The latter is our problem; the former we can obviously work on with the Go team. In the interim, I don't think it'd hurt to recompile 0.11 with Go 1.4 while we fix this.

To do this, though, we'd need a pretty good repro. I think the best way might be to mimic @consultantRR's setup, which is just using Amazon Linux. We can use TF to make a mimic environment and hopefully get the repro. I'm confident, since @consultantRR can repro it 100% of the time, that we'll be able to emulate it.

Great, @mitchellh – if you guys have any trouble with the repro, I am more than happy to pitch in and help you emulate my exact environment as closely as possible.

@sethvargo @mitchellh
You can use the Vagrantfile from our project to reproduce it too - it will spawn a full Docker environment with Consul and haproxy running. It contains much more than is needed, but it is easy to play with:
https://github.com/eBayClassifiedsGroup/PanteraS/blob/master/Vagrantfile
I can help in any case.


Hi everyone, I actually opened the issue with regard to the haproxy service script (sous-chefs/haproxy#114). Without consul-template this is very easy to reproduce. If you look through the script you'll see there is a window where concurrency causes the issue.

service haproxy reload & service haproxy reload

If a node contains multiple services and these all come up at once there will be multiple events which seem to produce multiple concurrent reloads. I imagine this could be resolved if consul-template either waited and collated these into one reload, or staggered the reloads (arbitrary sleep - yuck).
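
One way to approximate the "collate into one reload" idea outside of consul-template is to serialize the command with an exclusive lock, so overlapping invocations queue instead of racing (a sketch; the lock path is arbitrary):

# Wrap the reload in flock so only one reload runs at a time;
# the pid file is read inside the lock, avoiding a stale -sf target.
flock /var/lock/haproxy-reload.lock -c \
  'haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf "$(cat /var/run/haproxy.pid)"'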

Interesting, @kcd83. You think the reason this doesn't seem to reproduce in CT built on older Go versions is that execution is simply slower in just a basic race condition?

@kcd83 nope, I'm using self made haproxy_reload.sh (with iptables switching) and I can run it like:

while true; do ./haproxy_reload.sh; done

and I have no problems at all. (The problem appears only when it is spawned by CT.)

If you read my comments above, I have already shown the reason: processes spawned by CT (built with Go 1.5) are not able to trap SIGTERM - the end.


@sielaq your script runs sequentially, try two or more concurrent executions

@kcd83 I can run it concurrently; iptables -w does the trick.

@sethvargo I have created a dummy C program to reproduce it.
I chose C to be agnostic of the Go compiler.
https://gist.github.com/sielaq/9867be2d1b65ffec658c

What it does:
it starts a new daemon and kills any other PID given on the command line - the same as HAProxy does.

consul-template \
  -consul=10.0.0.1:8500 \
  -template haproxy.cfg.ctmpl:/tmp/haproxy.cfg:'./trap_signal $(pidof trap_signal)'

If you run it with CT (Go 1.4) -> you see /tmp/log.txt containing the received signals.
If you run it with CT (Go 1.5) -> you see nothing new in /tmp/log.txt,
and more and more instances of trap_signal in the process list (which cannot be killed).

If you are not able to reproduce it,
I can create a Docker container with all the pieces.


The iptables -w is a good addition to your script.

The issue I see with any concurrent reload script is that the haproxy -sf $(cat pid) step can do this:

Thread 1: read pid
Thread 2: read pid
Thread 1: haproxy -sf pid
Thread 2: haproxy -sf pid
Thread 1: kills original pid
Thread 2: tries to kill original pid, does not kill new thread 1 pid

I need to confirm whether this stale -sf is what causes haproxy to ignore the SIGTERM. I haven't tested it yet.

@kcd83 read my comment above - you can use this simple app instead of HAProxy to reproduce the problem. You keep trying to find a race condition here, while the problem exists even when CT triggers the haproxy reload at 1-minute intervals.

There is a similar issue:
golang/go#13164
I have created a small app there to reproduce it ...

Just for completeness, I wanted to anecdotally confirm on this thread that I rolled a build of CT v0.11.1 which was compiled on Go 1.4.3 into the infrastructure tested and described all throughout this thread. This has been in place for several days on multiple nodes, and since then, the issue has not occurred once.

Hello, sorry if this is not relevant, but I found this thread while having a similar issue. I am not using consul, just haproxy with haproxy-systemd-wrapper and supervisord in a container. I always have an issue with concurrent reloads creating multiple haproxy processes, but recently started seeing multiple even when guaranteeing the reloads were serialized.

What I saw was the pid file would be either empty or missing for a small time after issuing the signal. If another reload came in during this time I would end up with multiple processes. So what was really happening was when I had a large config file the startup time of haproxy seems to be a bit longer and would take longer to populate the pid file causing the issue. Now my restart script waits for the pid file to be present, non-empty and have a different pid than before the restart before exiting. This seems to solve the problem for me. Here is a snippet I have, HAWRAP_PID is my haproxy-systemd-wrapper pid:

# $HAWRAP_PID is the haproxy-systemd-wrapper pid (set earlier in the restart script)
OLD_PID=`cat /var/run/haproxy.pid`
kill -SIGUSR2 $HAWRAP_PID
NEW_PID=`cat /var/run/haproxy.pid 2>/dev/null || true`
# Wait until the pid file is present, non-empty, and contains a different pid than before the reload
while [ "$NEW_PID" == "" ] || [ "$OLD_PID" == "$NEW_PID" ]; do
  sleep 0.05
  NEW_PID=`cat /var/run/haproxy.pid 2>/dev/null || true`
done

We were also experiencing this issue after we upgraded to 0.11.1, in a reverse proxy container where consul-template is used in combination with haproxy. I can confirm that downgrading to 0.10.0 solved the issue for us (didn't have time to compile 0.11.1 with Go 1.4, sorry).

Hi @consultantRR

How did you get Consul Template to compile against Go 1.4.3? I'm getting errors from our gatedio library because of an API that only exists in go 1.5:

b.b.Cap undefined (type *bytes.Buffer has no field or method Cap)

@sethvargo:

I just spoke with my colleague who actually did the compile and his response was "it just compiled". :)

He did say it was compiled inside the official Golang 1.4.3 Docker image, and that it looks like the gatedio lib was added the day after he successfully compiled it.

If you remain unable to get it to compile, let me know and we'll see if we can get it working.

@consultantRR I just tried the golang 1.4.2 and 1.4.3 to no avail 😦

I'm going to release Consul Template v0.12.0 under Go 1.5.2 and then spend some time next week figuring out if there's anything we can do here.

@sethvargo I have tried under linux with gvm and compiling works fine.

gvm install go1.4.3
gvm use go1.4.3

@sielaq are you running the latest master with the most recent deps?

@sethvargo go get / go build always takes the latest master by default,
so yes, I have run against the latest master.

I don't doubt your super skills, you are much better than me,
but if you have different results, this could be some environment issue.
You can try running on any plain Ubuntu with gvm:

bash < <(curl -s -S -L https://raw.githubusercontent.com/moovweb/gvm/master/binscripts/gvm-installer)
source "$HOME/.gvm/scripts/gvm"
gvm ...

My co-worker @kreisys and I have been seeing this problem too. One thing to note: if you run:

service haproxy reload & service haproxy reload & service haproxy reload & service haproxy reload & service haproxy reload

This will spawn a bunch of reloads of haproxy concurrently in the background. What we see happen is we get a whole bunch of haproxy instances all with the same -sf parameters, so there's a bunch of haproxy instances competing to try to take control of the socket and kill the old haproxy.

Is consul-template trying to run the same command multiple times concurrently? Could we fix this just by preventing this?
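
One rough way to check for that from the outside is to look for multiple haproxy processes that were all told to replace the same old PID (illustrative; this relies on the -sf PID being the last argument, as in the ps output earlier in this thread):

# Any PID printed here was passed as the -sf target to more than one running haproxy.
ps -eo args | grep '[h]aproxy .* -sf ' | awk '{print $NF}' | sort | uniq -d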

I had this issue as well with haproxy 1.5/1.6 and consul-template 0.12. We previously had no issues with consul-template 0.10. Recompiling 0.12 using golang 1.4 does seem to fix the problem. If it's helpful for anyone, here's my linux amd64 binary:

https://dl.dropboxusercontent.com/u/515268/consul-template-0.12-go1.4

Same here, I use consul-template 0.12.0 after recompiling it with go 1.4.3.
Works like a charm and no more problems with haproxy.


A workaround is to use quiescence (not just in everyday conversation).

But seriously, a wait as low as 3 seconds seems sufficient to stop the rapid flip-flopping that generates multiple haproxy instances, which I believe is exactly what this parameter is designed for.

{
  "consul": "localhost:8500",
  "wait": "3s"
}

Some thrash testing is required to prove it is 100%.

@kcd83 I don't think that solves the root problem, see golang/go#13164


@pmbauer OK, I agree, thanks for pointing that out.

wait does not address the root cause; however, changing the configuration of a typical application several times in a fraction of a second is probably not helpful anyway.

btw, as @ianlancetaylor mentioned, it seems the main issue was in golang signal handling:
https://go-review.googlesource.com/#/c/18064/
That was a very sophisticated bug.

Just as an update here - it looks like this has been fixed in Go 1.6 (unreleased) and was a bug in Go itself.

I'm going to keep this issue open until Go 1.6 is officially released.

Thanks @sethvargo, this is great news.

Go 1.6 beta is out; you can also pull and build with its image from Docker Hub to verify whether the bug still persists:
https://hub.docker.com/_/golang/
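
For reference, a build along those lines might look like this (illustrative only; the image tag and dependency steps are assumptions, not a tested recipe):

# Build consul-template inside the official golang image (pick whatever 1.6 tag is available).
docker run --rm \
  -v "$PWD":/go/src/github.com/hashicorp/consul-template \
  -w /go/src/github.com/hashicorp/consul-template \
  golang:1.6 sh -c 'go get -d ./... && go build -o consul-template'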

We encountered a similar problem with Docker/consul-template on AWS EC2 instances, bringing the EC2 instance to 100% CPU. We use version 0.12.1. The debug message is: Reaping child process 0.
In our case we are invoking nginx.

Our workaround is to add -reap=false to the command, but we are not sure what the side effects are. Please advise.

We have the same issue with 0.12.0: Reaping child process 0. -reap=false can fix it, but when disabling reap we are left with a lot of HAProxy zombies.

Edit: Actually, my error message is [ERR] (runner) error running command: wait: no child processes, not Reaping child process 0. Sorry :/

Hi @rlcomte and @geniousphp setting -reap=false should work around the issue, though it may cause zombie processes to be left behind. I split this out into #507 so we can figure out what's going on, since it should be unrelated to this issue. Can you please provide any details about the environment where you see this over there?

I've tried rebuilding consul-template v0.13.0-dev (7c000ce) with go version go1.6beta2 linux/amd64, and I can no longer reproduce the issue of unreaped subprocesses. I am using consul-template to run haproxy in an Alpine container btw.

PS @sethvargo could the waiting-reply label be removed to reflect the issue's current status?

Until there's an official release from HashiCorp, I have a build of 0.12.2 with Go 1.6 (beta2, then rc1, now rc2) released via Circle CI, with an included Dockerfile, etc., for those who would like to build it themselves. Some might find it useful.

I'd have used an older build, but I'd already started relying on 0.11 syntax and features 😅

Hello, the same problem is resolved for me with consul-template 0.12.2 and go 1.6rc1. Thank you @duggan for your dockerfile!

I don't have much to add to the above. But we upgraded from 0.10.x to 0.12.2 today and we ended up with dozens of haproxies.

We tried 0.12.2 on CoreOS 835.12.0 using haproxy 1.6.3 without success. There are no errors in consul-template logs or haproxy logs. After a few hours our network service (Midokura) started to get overloaded with a large number of SYN requests. We rolled back to 0.10.2 and the network service (Midokura) recovered.

Average:        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
Average:           lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:         eth0 111390.07      0.00 124128.80      0.00      0.00      0.00      0.68
Average:         eth1  18888.36      0.00   4970.44      0.00      0.00      0.00      0.68
Average:         eth2   1880.14   7096.23    121.18   9651.90      0.00      0.00      0.00
22:23:30.261584 IP (tos 0x0, ttl 63, id 46818, offset 0, flags [DF], proto TCP (6), length 60)
    172.16.1.173.51858 > 172.16.2.13.webcache: Flags [S], cksum 0x5c09 (incorrect -> 0x2383), seq 614922376, win 29200, options [mss 1460,sackOK,TS val 14044778 ecr 0,nop,wscale 7], length 0
22:23:30.261641 IP (tos 0x0, ttl 63, id 52396, offset 0, flags [DF], proto TCP (6), length 60)
    172.16.1.173.33782 > 172.16.2.12.webcache: Flags [S], cksum 0x5c08 (incorrect -> 0x1013), seq 4101341894, win 29200, options [mss 1460,sackOK,TS val 14044778 ecr 0,nop,wscale 7], length 0
22:23:30.263228 IP (tos 0x0, ttl 63, id 22526, offset 0, flags [DF], proto TCP (6), length 60)
    172.16.1.173.41282 > 172.16.1.199.webcache: Flags [S], cksum 0x5bc3 (incorrect -> 0xe280), seq 1848711573, win 29200, options [mss 1460,sackOK,TS val 14044779 ecr 0,nop,wscale 7], length 0

@brycechesternewman unless you used @duggan 's custom build against unreleased Go 1.6, the issue is still there.

go 1.6 was released today: https://golang.org/dl/#go1.6

@sethvargo can a new release of consul-template be cut please?

0.13.0 is released and compiled with go 1.6!

Our long national nightmare is over. :)

thank you @sethvargo !

I'm running the 0.13.0 release pulled from the following URL and still seeing occurrences of multiple HAProxy processes running after a service reload. Is anybody else still seeing this issue?

https://releases.hashicorp.com/consul-template/0.13.0/consul-template_0.13.0_linux_amd64.zip

Yes, just today we found five extra haproxies with 0.13.0. Clearly this issue persists. And we have already upgraded production to 0.13.0, so I guess we need to scramble to roll back.


@sethvargo Should I open a new issue to investigate the persistence of this bug, or can we reopen this one?

FWIW, 0.13.0 fixes the problem for me. haproxy reload is supposed to leave behind the previous process with the -sf [pid] flag to let the previous process finish out its requests. On a busy system that reloads frequently, it's expected to see some number of previous haproxy processes still running. If you refresh the ps command, you should see the haproxy -sf [pid] rotate PIDs.
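
A quick way to watch that rotation (illustrative): after each reload the -sf target should change and the older process should disappear once it drains; a PID that sticks around for hours and keeps serving a stale config is the behavior described in this issue.

watch -n 1 'pgrep -af "haproxy -f /etc/haproxy/haproxy.cfg"'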

Victor, it mostly fixed the problem for us as well - at least from when 0.13 was released until now. It can't be just that, though, as the old haproxies are still handling requests. Well, at least they are issuing 503s, since their view of the services they are routing to is now invalid. And we have never, ever seen this problem prior to 0.12, so clearly something new is happening.



We've added wait = "1s" to /etc/consul-template/config.d/consul-template.conf to work around this problem for now.

Hi everyone,

Since this issue seems to be getting a lot of noise, I'm going to clarify that this has been fixed in Consul Template 0.13+. Please open any other concerns as separate issues. haproxy itself will leave behind previous processes for a short period of time to finish existing requests, so unless you see a multitude of haproxy processes running each time CT restarts the process, this issue does not apply to you 😄