cloudfoundry / loggregator-release

Cloud Native Logging

log-api memory consumption

anyandrea opened this issue · comments

Thanks for your submission!

Here is some guidance for submitting your issue.

Tell us why you are submitting:

  • I found a bug - here are some steps to recreate it.
  • I have an idea for a new feature - please document it as "As a user, I would
    like to..."
  • I am having trouble getting set up, understanding documentation, or using
    Loggregator in some way - please highlight any resources you are using.
  • This is an architecture change that will result in cleaner, more efficient
    code - tell us why you think this is a good idea.

We review issues and PRs on a weekly basis and try to schedule and prioritize
items. If you are wondering about the status of an item, feel free to contact us
on our Slack channel.

Problem description

  • The API gets stuck and log-api uses 100% of its memory (side note: the log-api VMs are sized very large, with 16 GB memory and 8 CPUs each).
  • The CC API seemed to lock up when it had to query app logs via log-api / Doppler.
  • The VM metrics dashboard in Grafana for the log-api VMs indicates they have used up all of their memory. There was also a drastic spike in CPU load exactly when they hit 100% (either the cause or an effect of hitting the memory limit). A long-term graph suggests the log-api VMs are leaking memory, or rather that something bumps up their total memory consumption every day.
  • Trying to reboot the log-api VMs caused them to get stuck and not respect the reboot, so a hard power off/on was necessary.

We assume that the rlp and/or rlp-gateway misbehave somehow.

memory_leak_log_api (attached image)
additional_infos.txt (attached file)

We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private, so you may be unable to view the contents of the story.

The labels on this GitHub issue will be updated when the story is started.

Hi @anyandrea! Do you know what version of Loggregator release you're on? Are you still experiencing this issue?

The latest releases when you submitted this issue were subject to a bug in Go 1.14.1 that caused processes to deadlock with high CPU. There's a chance upgrading the Loggregator Release version could help if you are still experiencing this issue.
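If it helps to rule that out, one way to check the Go toolchain a binary was built with is go version <binary> (supported in Go 1.13+). A rough sketch, with guessed deployment name, package, and binary paths that will likely need adjusting:

    # Locate the rlp-gateway binary on a log-api VM (package/binary names are guesses)
    bosh -d cf ssh log-api/0 -c 'sudo find /var/vcap/packages -type f -name "*rlp*gateway*"'
    # Copy it locally and print the Go version it was built with (needs Go 1.13+ locally)
    bosh -d cf scp log-api/0:/var/vcap/packages/<package>/<binary> ./rlp-gateway
    go version ./rlp-gateway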

Hi @MasslessParticle! Thank you for your response. We're currently using v106.3.9 and still see the issue.
If this were related to the bug you mentioned, would it also have caused the extreme memory usage? Currently we need to restart the log-api VMs every other day to avoid running out of memory and locking up the API.

Interesting. It looks like that version isn't susceptible to the golang bug, so we're good there. Which process is exhibiting high memory usage?

It's worth noting that you might see high CPU on v106.3.9. That version has a bug in Trafficcontroller where it makes hundreds of DNS queries per second, resulting in high CPU.
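If you want to check whether you're hitting that DNS churn, a rough packet count on port 53 on one of the affected VMs should show it. A minimal sketch (run as root; the 10-second window is arbitrary):

    # Count DNS packets seen over ~10 seconds (-l line-buffers output so the count survives timeout)
    timeout 10 tcpdump -l -i any -nn port 53 2>/dev/null | wc -l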

The memory usage is still mysterious, though.

Would it be possible to log into those VMs when they're experiencing this issue and determine which process is using all of this memory? We haven't seen this elsewhere, and haven't been able to reproduce it ourselves.
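If you do get a chance, something along these lines should narrow down the offending process (the deployment name cf and the instance index are placeholders):

    # SSH onto one of the affected VMs (deployment/instance names are placeholders)
    bosh -d cf ssh log-api/0
    # On the VM: list processes sorted by resident memory
    sudo ps aux --sort=-rss | head -n 10
    # Cross-check against the monit-managed jobs
    sudo monit summary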

Hi @pianohacker. It's always the "reverse_log_proxy_gateway" job that continuously allocates more and more memory and does not free it up until we restart it.
We see this on every Cloud Foundry foundation we have, but of course the rate of memory growth depends on usage and perhaps on how many connections are made.
The example graph above shows this from an environment with 15 log-api VMs, each with 16 GB memory and 8 CPUs.
Not sure if this matters, but if we count the rlp-gateway connections to rlp we see around 300 connections on each VM
(lsof | grep rlp-gatew | grep -c 8082) in this environment.
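For what it's worth, the established TCP sockets can also be counted directly (lsof lines can include per-thread duplicates); a small sketch:

    # Count established TCP connections involving port 8082 (RLP) on this VM
    ss -tn state established | grep -c ':8082'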

We are seeing the same behavior, but only in our foundations with higher usage. The symptoms are the same, though: memory usage grows until it is exhausted, then the system won't reboot and a power cycle is required. Also on v106.3.9.

(attached image)

I just rebooted all of our log-api VMs today, and it takes about a week for ours to exhibit the issue, but I will try to grab a heap dump the next time it does.

@MasslessParticle
Attached is a heap dump (~76 KB, heap.pprof.gz) from the rlp-gateway, taken when memory usage on the VM was around 80%.
This VM had been started by a deployment around 40 hours earlier.
After restarting only the "reverse_log_proxy_gateway" job, the VM in our case usually starts at around 15% memory usage.
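In case it's useful, the profile can be inspected locally with the standard pprof tooling, e.g.:

    # Unpack and summarize the heap profile (requires a Go toolchain)
    gunzip heap.pprof.gz
    go tool pprof -top heap.pprof
    # Or explore it interactively in a browser
    go tool pprof -http=:8080 heap.pprof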

This is super helpful, thanks! We're trying to reproduce and root-cause this now.

@MasslessParticle this sounds great! Thank you for the fix, we'll integrate and test it as soon as possible 👍