cloudfoundry / loggregator-release

Cloud Native Logging

TrafficController fails to resolve CC via consul when the instance is being stopped

keymon opened this issue · comments

Tell us why you are submitting?

  • I found a bug - here are some steps to recreate it.

What?

When stopping an instance, bosh runs monit stop all. The services can stop in any order, so consul-agent might stop before trafficcontroller. If trafficcontroller relies on the local consul-agent to resolve the CC TLS endpoint via DNS, its requests to CC can fail before trafficcontroller itself has stopped.

Detailed context

We are experiencing some errors in our platform during deployments:

    Fetching detailed app information:
        Failed to fetch app stats: Error requesting app stats: cfclient: error (200002): CF-StatsUnavailable (1 failures)

We correlated these errors to the moment VMs are being stopped during deployments, and we found that traffic controller fails with the message:

2018/01/23 15:13:28 Could not get app information: [Get https://cloud-controller-ng.service.cf.internal:9023/internal/v4/log_access/6ba02750-f073-444f-ba73-ee3cf4a02ec6: dial tcp: lookup cloud-controller-ng.service.cf.internal on 10.0.0.2:53: no such host

That is because the local consul-agent was stopped before trafficcontroller, so the DNS lookup for the CC endpoint no longer resolves.

Expected behaviour

The trafficcontroller should stop accepting new connections and drain the existing ones before the consul-agent is stopped.

Proposed solutions

We would like to discuss two alternative solutions for this problem:

  1. Make the trafficcontroller monit job depend on the consul-agent monit job. This might require a new property on the TC job to specify its dependencies.
  2. Add a drain script to TC: it would signal the traffic controller to enter drain mode (i.e. fail a healthcheck endpoint so that the instance is taken out of any load balancer) and then wait for some safe period.

Thoughts?

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/154591053

The labels on this github issue will be updated when the story is started.

@keymon We've merged a fix, can you check to see if this resolves the issue?

Yes, it does, thank you! :)