cloudfoundry / loggregator-release

Cloud Native Logging

A flood of errors in RLP and Adapter

youngm opened this issue · comments

  • I am having trouble getting setup, understanding documentation, or using
    Loggregator in some way - please highlight any resources you are using.

After upgrading to Loggregator 101 (101.7 and 101.8, to be precise) and syslog release 5.1, things seem to be working OK. Logs are draining. But we're slightly concerned by a significant flood of errors in the RLP and Adapter that we don't understand.

Every 10 minutes we get a number of errors like the following. The number of errors increases with the number of syslog drain services bound to apps in the environment. We started a conversation in Slack that ended with this post: https://cloudfoundry.slack.com/archives/C02HCCXV5/p1521050087000867. I thought I would create this GitHub issue since it is difficult to track this type of thing in Slack.

RLP:
Unable to connect to doppler ({doppler ip rotating between all of them}:8082): rpc error: code = Canceled desc = context canceled (~2,434,776 events an hour)
Error while reading from stream ({doppler ip rotating between all of them}:8082): rpc error: code = Canceled desc = context canceled (~4,121,338 events an hour)
Subscribe error: context canceled (~987,692 events an hour)

Adapter:
failed to open stream for binding {every app guid with a syslog drain ~1200 of them}: rpc error: code = Canceled desc = context canceled (~525,713 events an hour)
Subscriber read/write loop has unexpectedly closed: rpc error: code = Internal desc = transport is closing (~424,855 events an hour)

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/156277242

The labels on this github issue will be updated when the story is started.

@youngm those log lines shouldn't be concerning. It's just normal behavior of gRPC for how we are using it. We are scheduling some work to remove these log lines (and convert them to counter metrics).

See also https://www.pivotaltracker.com/story/show/156054106

@ahevenor That's good to hear. So are these normal messages related to tearing down a connection or something? When this is turned into a metric what would this metric signify?

Today we have a deployment with ~1000 drains, and we're seeing several hundred thousand of these an hour on the Adapter and several million an hour on the RLP.

Looks similar to cloudfoundry/loggregator#23
Was the log reduction that was already applied not enough? We have yet to test the newest Loggregator.

In our case the default log rotation of the VM was not able to keep up with the logs and filled up the disk.

@tyyko Yeah, similar issue. We don't have as many dopplers as you so it isn't filling up our disks. It doesn't appear that the fix for #23 is in the release we're using. We're using 101.9. The fix appears to be in the 102.x releases.

Although, looking at the fix, it appears it would only eliminate the "Unable to connect to doppler" error among those we are seeing. We are seeing several other errors that are also apparently just noise. Perhaps the additional errors we're seeing are caused by heavy use of the syslog release?

@youngm that's possible. I'll double check with the syslog-release team as well to see if that might be the case.

@youngm Those logs are during standard operation? Not during a deploy or anything that would cause all connections to doppler to roll? Also, how many dopplers do you have? How many RLPs?

I think that failed to open stream log is part of a new feature in cf-syslog-drain-release v6.1 called max_bindings. You're getting that log in v5.1?

@JohannaSmith Yes, those logs were during standard operation of syslog-drain 5.1 and loggregator 101.8. We had 4 each of RLPs, adapters, and dopplers. We recently upgraded to syslog-drain 6.1 and, given the new connection limits it added, deployed more RLPs, adapters, and dopplers; together those changes have made the logs much better. There are still a few logs in the adapter that appear to be noise, though: cloudfoundry-attic/cf-syslog-drain-release#16

I know that some work has gone into the RLP as well to trim some of these messages down. I look forward to a new loggregator release that includes the reduced RLP logging.