cloudfoundry / loggregator-release

Cloud Native Logging

CF logs fails when upgrading CF from CF270 to CF-Deployment 1.19

andrew-edgar opened this issue · comments

When doing the upgrade deploy, as soon as the api VMs start to update, cf logs starts to fail.

There is then a long gap before the log-api VMs are finally updated and cf logs works again. In some of our production environments this can be as much as several hours, due to the large number of gorouters that are updated between the api and log-api instance groups.

We see the following errors in the Cloud Controller nginx logs.

api/189f002f-8cc0-4210-acb7-5401be60dcde:/var/vcap/sys/log/nginx_cc# tail -f nginx.access.log | grep 0eecf6e0
api.au-syd.bluemix.net - [30/Jul/2018:00:36:29 +0000] "GET /internal/log_access/0eecf6e0-4483-47d9-bfa6-7de663e756e3 HTTP/1.1" 404 423 "-" "Go-http-client/1.1" 104.97.78.58, 10.63.26.51, 168.1.45.84 vcap_request_id:650d2bb9-67b5-47f2-4e12-eb9ac68e9712::e55ec39e-d517-4c20-9a7e-22e31e569261 userid:38b0cd23-8d5c-4b1f-8b55-017d9ee23a83 clientid:cf response_time:0.008

and in the "old" traffic controller logs ...

==> /var/vcap/sys/log/loggregator_trafficcontroller/trafficcontroller.log <==
2018/07/30 00:32:30 Non 200 response from CC API: 404 for 0eecf6e0-4483-47d9-bfa6-7de663e756e3

The upgrade is from cf-release 270 (Loggregator 92) to cf-deployment 1.19 (Loggregator 102).

We need a process to do this upgrade with no failures in the cf logs command. Will Loggregator 102 work with the old Cloud Controller? That is, if we upgrade Loggregator first, will that keep things compatible?

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/159456723

The labels on this github issue will be updated when the story is started.

After doing a bunch more digging, I found this in Loggregator 92 ...

req, _ := http.NewRequest("GET", apiHost+"/internal/log_access/"+target, nil)

which calls /internal/log_access/<guid>. That endpoint was REMOVED from the Cloud Controller, and therefore we get the errors.
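
For context, here is a minimal sketch of what that check roughly does, reconstructed around the quoted line. Apart from the endpoint path and the "Non 200 response from CC API" message (visible in the traffic controller log above), all names are assumptions for illustration, not the actual loggregator source:

package main

import (
	"fmt"
	"log"
	"net/http"
)

// canAccessLogs asks the Cloud Controller whether the caller may read logs
// for the given app GUID. Any non-200 response is treated as a denial,
// which is why the 404s above surface as cf logs failures.
func canAccessLogs(client *http.Client, apiHost, authToken, target string) bool {
	// The request from Loggregator 92; this path no longer exists on the
	// upgraded Cloud Controller, so it now always returns 404.
	req, _ := http.NewRequest("GET", apiHost+"/internal/log_access/"+target, nil)
	req.Header.Set("Authorization", authToken)

	resp, err := client.Do(req)
	if err != nil {
		log.Printf("Error contacting CC API: %s", err)
		return false
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		log.Printf("Non 200 response from CC API: %d for %s", resp.StatusCode, target)
		return false
	}
	return true
}

func main() {
	allowed := canAccessLogs(http.DefaultClient,
		"https://api.example.com", "bearer <token>",
		"0eecf6e0-4483-47d9-bfa6-7de663e756e3")
	fmt.Println("log access allowed:", allowed)
}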

We are attempting to do our upgrade by:

  1. disabling access control
  2. doing the deploy and checking that there is no log loss.

Is there a better suggestion? Was there any documentation saying that we could not upgrade directly from Loggregator 92 to cf-deployment 1.19?

@andrew-edgar

Your suggested upgrade strategy should work. However, disabling access control also allows anybody to access the firehose (and with it all app logs). If that is not a problem in your environment, you should be all good.

You already got to the root cause, but just to confirm: the underlying issue you are observing is caused by a change of the authorization endpoint on the Cloud Controller. In the older version (compatible with loggregator v92) the endpoint was located at /internal/log_access/. In the current version (supported by loggregator >= v93) it is located at /v4/internal/log_access/.
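
If you want to confirm which of the two paths a given Cloud Controller actually serves before choosing an upgrade order, a rough probe along these lines may help. This is only an illustrative sketch, not loggregator code; it assumes you have an app GUID and a token that is allowed to read that app's logs:

package main

import (
	"fmt"
	"net/http"
)

// probeLogAccessPaths hits both known log-access paths and reports the
// status code for each, so you can tell whether the Cloud Controller you
// are talking to is the pre- or post-change version.
func probeLogAccessPaths(apiHost, authToken, appGUID string) {
	paths := []string{
		"/internal/log_access/",    // expected by loggregator v92 and earlier
		"/v4/internal/log_access/", // expected by loggregator v93 and later
	}
	for _, p := range paths {
		req, _ := http.NewRequest("GET", apiHost+p+appGUID, nil)
		req.Header.Set("Authorization", authToken)

		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			fmt.Printf("%s -> error: %s\n", p, err)
			continue
		}
		resp.Body.Close()
		fmt.Printf("%s -> %d\n", p, resp.StatusCode)
	}
}

func main() {
	probeLogAccessPaths("https://api.example.com", "bearer <token>",
		"0eecf6e0-4483-47d9-bfa6-7de663e756e3")
}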

There is tooling and documentation to support the upgrade from cf-release -> cf-deployment located here. For orchestrating an upgrade with less downtime, the folks who manage cf-deployment may know more.