cloudfoundry / loggregator-release

Cloud Native Logging

A flood of errors in RLP and Adapter

youngm opened this issue · comments

  • I am having trouble getting setup, understanding documentation, or using
    Loggregator in some way - please highlight any resources you are using.

After upgrading to Loggregator 101 (101.7 and 101.8, to be precise) and syslog release 5.1, things seem to be working OK. Logs are draining. But we're slightly concerned by a significant flood of errors in the RLP and Adapter that we don't understand.

Every 10 minutes we get a number of errors like the following. The number of errors increases with the number of syslog drain services bound to apps in the environment. We started a conversation in Slack that ended with this post: https://cloudfoundry.slack.com/archives/C02HCCXV5/p1521050087000867. I thought I would create this GitHub issue since it is difficult to track this type of thing in Slack.

RLP:
Unable to connect to doppler ({doppler ip rotating between all of them}:8082): rpc error: code = Canceled desc = context canceled (~2,434,776 events an hour)
Error while reading from stream ({doppler ip rotating between all of them}:8082): rpc error: code = Canceled desc = context canceled (~4,121,338 events an hour)
Subscribe error: context canceled (~987,692 events an hour)

Adapter:
failed to open stream for binding {every app guid with a syslog drain ~1200 of them}: rpc error: code = Canceled desc = context canceled (~525,713 events an hour)
Subscriber read/write loop has unexpectedly closed: rpc error: code = Internal desc = transport is closing (~424,855 events an hour)

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/156277242

The labels on this github issue will be updated when the story is started.

@youngm those log lines shouldn't be concerning. It's just normal behavior of gRPC for how we are using it. We are scheduling some work to remove these log lines (and convert them to counter metrics).

See also https://www.pivotaltracker.com/story/show/156054106

@ahevenor That's good to hear. So are these normal messages related to tearing down a connection or something? When this is turned into a metric what would this metric signify?

Today we have a deployment with ~1000 drains, and we're seeing several hundred thousand of these an hour on the Adapter and several million an hour on the RLP.

Looks similar to cloudfoundry/loggregator#23
Was the log reduction that was already applied not enough? We have yet to test the newest Loggregator.

In our case the default log rotation of the VM was not able to keep up with the logs and filled up the disk.

@tyyko Yeah, similar issue. We don't have as many dopplers as you so it isn't filling up our disks. It doesn't appear that the fix for #23 is in the release we're using. We're using 101.9. The fix appears to be in the 102.x releases.

Although, looking at the fix, it appears it would only eliminate the "Unable to connect to doppler" error among those we are seeing. We are seeing several other errors that are also apparently just noise. Perhaps the additional errors we're seeing are caused by heavy use of the syslog release?

@youngm that's possible. I'll double check with the syslog-release team as well to see if that might be the case.

@youngm Those logs are during standard operation? Not during a deploy or anything that would cause all connections to doppler to roll? Also, how many dopplers do you have? How many RLPs?

I think that failed to open stream log is part of a new feature in cf-syslog-drain-release v6.1 called max_bindings. You're getting that log in v5.1?

@JohannaSmith Yes, those logs were during standard operation of syslog-drain 5.1 and loggregator 101.8. We had 4 each of RLPs, adapters, and dopplers. We recently upgraded to syslog-drain 6.1 and, given the new connection limits it added, deployed more RLPs, adapters, and dopplers; together those changes have made the logs much better. There are still a few logs in the adapter that appear to be noise, though: cloudfoundry-attic/cf-syslog-drain-release#16

I know that some work has gone into the RLP as well to trim some of these messages down. I look forward to a new loggregator release that includes the reduced RLP logging.