cloudfoundry / loggregator-release

Cloud Native Logging

rlp process consumes all memory and swap when binding 5000 syslog drains

jomora opened this issue

Steps to reproduce:

  • cf-deployment version 2.6.0 with loggregator 102.2
  • CF deployment with enough diego-cells and org quota to push 5000 apps and create 5000 services (we used 22 diego-cells, 50 dopplers, 20 log-apis, 20 adapters)
  • Deploy any hello world app (it doesn't even have to produce any logs) 5000 times, e.g.
# -P 50 runs 50 parallel pushes
seq 5000 | xargs -P 50 -I % cf push <app_name>_% > /dev/null
  • You need a syslog drain endpoint reachable from the adapter VMs, e.g. on <SYSLOG_DRAIN_IP>:4434. We were using an ELK stack with multiple syslog drain endpoints, but in the context of this issue we faked a syslog endpoint that discards all data, using e.g.
# sudo apt-get install nmap (ncat is part of nmap)
# accept up to 5010 connections, listening on port 4434
ncat -4 -k --listen -p 4434 -m 5010 > /dev/null

to prove the point that the issue is not with the syslog drain endpoint.

  • Create user-provided services as per the docs:
seq 5000 | xargs -P 200 -I % cf create-user-provided-service test_syslog_% -l syslog://<SYSLOG_DRAIN_IP>:4434
  • Create a 1-to-1 binding of service instances to apps, e.g. bind service_1 to app_1, service_2 to app_2, ...
seq 5000 | xargs -P 200 -I % cf bind-service <app_name>_% test_syslog_%

Observed behaviour

The rlp process on the log-api VMs consumes all memory and swap. Over time, some of the log-api VMs fail and do not recover, i.e. memory and swap usage do not decrease even if you unbind the services.

See the Grafana screenshots in the attached PDF:
rlp-overload_final.pdf

We took pprof dumps of rlp, see:
dumps.tar.gz
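
For context, Go components typically expose these profiles over HTTP via the standard net/http/pprof package. Below is a minimal sketch of that pattern; the address is illustrative and not necessarily how the rlp actually wires it up.

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Exposes /debug/pprof/heap, /debug/pprof/goroutine, /debug/pprof/profile, etc.
	// The listen address is illustrative; a real component would make it configurable.
	log.Println(http.ListenAndServe("localhost:6060", nil))
}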

Also see the related discussion on Slack:

According to that discussion, the scaling should be sufficient to handle the load of 5000 apps/bindings.

Expected behaviour

rlp should not eat up all memory when binding the syslog drain services.

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/159364096

The labels on this github issue will be updated when the story is started.

@jomora thank you for the very detailed report. At first glance at the pprof dumps, it appears there may be a goroutine leak. I am going to use your scaling and drain count to spin up an environment and attempt to reproduce.
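
For anyone following along, a goroutine leak in a streaming server often looks like the hypothetical sketch below: a producer goroutine stays blocked forever on a channel send once its consumer has gone away, so every abandoned stream pins a goroutine and whatever it references. This is a generic illustration, not the actual RLP code.

package main

import (
	"fmt"
	"runtime"
	"time"
)

// startStream is a made-up example of a streaming-handler leak, not loggregator code.
func startStream() {
	ch := make(chan int)
	go func() {
		for i := 0; ; i++ {
			ch <- i // blocks forever once the consumer below has returned
		}
	}()
	// The consumer reads a few values and returns (e.g. the client disconnects)
	// without signalling the producer to stop.
	for i := 0; i < 3; i++ {
		<-ch
	}
}

func main() {
	for i := 0; i < 1000; i++ {
		startStream()
	}
	time.Sleep(100 * time.Millisecond)
	// A goroutine profile taken here would show ~1000 goroutines parked on "chan send".
	fmt.Println("goroutines:", runtime.NumGoroutine())
}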

@jomora so far we have tested a scaled-down version with 5 dopplers, 2 log-apis, and 2 adapters with 500 drains. With this test we hit the RLP's max egress streams, which caused a significant amount of logging on the log-api and adapter VMs.

Would you happen to have doppler, log-api, and adapter logs available for us to look at, to see if this could be a similar issue where our load balancing from adapters to RLPs is not sufficient?

We are in the process of scaling up our deployment for further testing.

@jomora there are a few outcomes from our scaled-up test.

  • Once the adapters started to reach the RLP's max egress streams, the adapters would essentially DoS the RLP, causing the high CPU usage on the log-api VMs. To work around this you can set reverse_log_proxy.egress.max_streams to 1000.
  • In the RLP, when hitting the BatchedReceiver API, we are leaking a channel, which could account for the high memory usage.
  • We can greatly reduce the CPU usage of an idle stream in the Doppler by using a condition variable to put the goroutine to sleep when there is no data to send and wake it up when there is work to be done (see the sketch after this list). Currently we have a tight loop that sleeps for 10ms if there are no envelopes, which wastes CPU on idle streams.
  • Our connection pooling and retry logic in the adapter needs to do a better job of balancing the load and making smarter decisions when the RLP is at capacity.
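
A rough sketch of the condition-variable idea from the third point above, assuming nothing about Doppler's actual internals (the buffer type and method names below are made up):

package main

import "sync"

// envelopeBuffer is a minimal, hypothetical buffer, not Doppler's real data structure.
type envelopeBuffer struct {
	mu    sync.Mutex
	cond  *sync.Cond
	items [][]byte
}

func newEnvelopeBuffer() *envelopeBuffer {
	b := &envelopeBuffer{}
	b.cond = sync.NewCond(&b.mu)
	return b
}

// Push appends an envelope and wakes one sleeping reader.
func (b *envelopeBuffer) Push(env []byte) {
	b.mu.Lock()
	b.items = append(b.items, env)
	b.mu.Unlock()
	b.cond.Signal()
}

// Pop blocks without spinning until an envelope is available, replacing the old
// pattern of looping with a 10ms sleep when the buffer is empty.
func (b *envelopeBuffer) Pop() []byte {
	b.mu.Lock()
	defer b.mu.Unlock()
	for len(b.items) == 0 {
		b.cond.Wait() // releases the lock and sleeps; no CPU is burned while idle
	}
	env := b.items[0]
	b.items = b.items[1:]
	return env
}

func main() {
	b := newEnvelopeBuffer()
	go b.Push([]byte("hello"))
	_ = b.Pop()
}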

The changes we made were just spikes to test out theories and have not yet been committed to the source code.

I am going to do some further testing to confirm my suspicion that the channel leak is the cause of the high memory usage.

@jomora sorry for the delay in communication.

We have been doing a lot of testing around this and found another issue. When we increased the RLP's max egress streams to 1000 and applied some changes to the RLP and Doppler, CPU usage improved greatly. However, when we reverted the RLP's max egress streams back to 500, we saw the CPU for the RLP and Doppler climb back to full usage. We found that this was because, when the RLP reaches its max egress streams, it sends the gRPC error code ResourceExhausted. However, the syslog adapter was not handling this error properly: instead of just trying another gRPC connection, it would invalidate the connection the error was received on. This caused all streams on that client connection to close and attempt to re-open on another connection.
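
The adapter-side handling described above could be sketched roughly as follows, using the standard gRPC status helpers; the function and callback names are illustrative, not the actual cf-syslog-drain code.

package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// handleStreamErr sketches the fix described above (hypothetical names, not the
// real adapter code): ResourceExhausted means "this RLP is at its max egress
// streams", so the single drain should be retried on a different connection
// instead of invalidating the connection the error arrived on, which would tear
// down every stream multiplexed over it.
func handleStreamErr(err error, retryOnOtherConn, invalidateConn func()) {
	if status.Code(err) == codes.ResourceExhausted {
		retryOnOtherConn()
		return
	}
	invalidateConn()
	retryOnOtherConn()
}

func main() {
	err := status.Error(codes.ResourceExhausted, "max egress streams reached")
	handleStreamErr(err,
		func() { fmt.Println("retrying stream on another RLP connection") },
		func() { fmt.Println("invalidating connection") },
	)
}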

We have fixes on the master branch of https://github.com/cloudfoundry/loggregator and are going to do some further validation.

We have delivered some fixes to the develop branches of loggregator-release and cf-syslog-drain-release. Those fixes will be available next time we cut releases.

@jomora any luck with this fix? I am going to accept our tracker story. Please re-open if it doesn't work for you.

@jomora is currently out of office, but we will definitely do a test run using a BOSH dev release of loggregator to see if your fixes help. It may take some time as the setup requires some effort, but we'll report back here.

@jsievers my bad. I'll schedule this for release ASAP (re-opening too).

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/160147908

The labels on this github issue will be updated when the story is started.

Hi, we tried your fixes (loggregator-dev, cf-syslog-drain-dev, and loggregator-agent-dev) and they worked :)
We successfully deployed 5k apps with 5k bindings, and each app was sending 20 logs/second. The system (properly scaled) could handle the load without issues.

What we noticed, though, is that when the system was not properly scaled, 6 out of 20 adapters became unresponsive (the VMs were completely broken), but the BOSH resurrector kicked in, deleted the VMs, and recreated them.
Unfortunately, I couldn't SSH into the VMs to get more details. Still, why did those VMs die instead of simply dropping the extra data they couldn't process?

When the adapters aren't properly scaled, we've noticed that I/O gets saturated and prevents the BOSH director from contacting the VM. That's what prevents you from SSHing into the machine and causes the BOSH resurrector to kick in.