cloudfoundry / loggregator-release

Cloud Native Logging

log-api memory consumption

anyandrea opened this issue · comments

Thanks for your submission!

Here is some guidance for submitting your issue.

Tell us why you are submitting:

  • I found a bug - here are some steps to recreate it.
  • I have an idea for a new feature - please document it as "As a user, I would
    like to..."
  • I am having trouble getting set up, understanding documentation, or using
    Loggregator in some way - please highlight any resources you are using.
  • This is an architecture change that will result in cleaner, more efficient
    code - tell us why you think this is a good idea.

We review issues and PRs on a weekly basis and try to schedule and prioritize
items. If you are wondering about the status of an item, feel free to contact us
on our Slack channel.

Problem description

  • The API gets stuck and log-api uses 100% of its memory (side note: the log-api VMs are sized very large, with 16 GB memory and 8 CPUs each).
  • The CC API seemed to lock up when it had to query app logs via log-api / Doppler.
  • The VM metrics dashboard in Grafana for the log-api VMs indicates they have used up all of their memory. There was also a drastic spike in CPU load exactly when they hit 100% (either the cause or an effect of hitting the memory limit). A long-term graph suggests the log-api VMs are leaking memory, or rather that something bumps up their total memory consumption every day.
  • Trying to reboot the log-api VMs caused them to get stuck and not respect the reboot, so a hard power off/on was necessary.

We assume that the rlp and/or rlp-gateway misbehave somehow.

memory_leak_log_api (attached image)
additional_infos.txt (attached file)

We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private, so you may be unable to view the contents of the story.

The labels on this GitHub issue will be updated when the story is started.

Hi @anyandrea! Do you know what version of Loggregator release you're on? Are you still experiencing this issue?

The latest releases when you submitted this issue were subject to a bug in Go 1.14.1 that caused processes to deadlock with high CPU. There's a chance upgrading the Loggregator Release version could help if you are still experiencing this issue.
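If it helps to rule that out, one way to check the Go toolchain a binary was built with is go version <binary> (supported in Go 1.13+). A rough sketch, with guessed deployment name, package, and binary paths that will likely need adjusting:

    # Locate the rlp-gateway binary on a log-api VM (package/binary names are guesses)
    bosh -d cf ssh log-api/0 -c 'sudo find /var/vcap/packages -type f -name "*rlp*gateway*"'
    # Copy it locally and print the Go version it was built with (needs Go 1.13+ locally)
    bosh -d cf scp log-api/0:/var/vcap/packages/<package>/<binary> ./rlp-gateway
    go version ./rlp-gateway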

Hi @MasslessParticle! Thank you for your response. We're currently using v106.3.9 and still see the issue.
If this were related to the bug you mentioned, would it also have caused the extreme memory usage? Currently we need to restart the log-api VMs every other day to avoid running out of memory and locking up the API.

Interesting. It looks like that version isn't susceptible to the golang bug, so we're good there. Which process is exhibiting high memory usage?

It's worth noting that you might see high CPU on v106.3.9. That version has a bug in Trafficcontroller where it makes hundreds of DNS queries per second, resulting in high CPU.
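If you want to check whether you're hitting that DNS churn, a rough packet count on port 53 on one of the affected VMs should show it. A minimal sketch (run as root; the 10-second window is arbitrary):

    # Count DNS packets seen over ~10 seconds (-l line-buffers output so the count survives timeout)
    timeout 10 tcpdump -l -i any -nn port 53 2>/dev/null | wc -l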

The memory usage is still mysterious, though.

Would it be possible to log into those VMs when they're experiencing this issue and determine which process is using all of this memory? We haven't seen this elsewhere, and haven't been able to reproduce it ourselves.
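If you do get a chance, something along these lines should narrow down the offending process (the deployment name cf and the instance index are placeholders):

    # SSH onto one of the affected VMs (deployment/instance names are placeholders)
    bosh -d cf ssh log-api/0
    # On the VM: list processes sorted by resident memory
    sudo ps aux --sort=-rss | head -n 10
    # Cross-check against the monit-managed jobs
    sudo monit summary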

Hi @pianohacker. It's always the "reverse_log_proxy_gateway" job that continuously allocates more and more memory and does not free it up until we restart it.
We see this on every Cloud Foundry foundation we have, but of course the rate of memory growth depends on usage and perhaps on how many connections are made.
The example graph above shows this from an environment with 15 log-api VMs, each with 16 GB memory and 8 CPUs.
Not sure if this matters, but if we count the rlp-gateway connections to rlp we see around 300 connections on each VM
(lsof | grep rlp-gatew | grep -c 8082) in this environment.
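For what it's worth, the established TCP sockets can also be counted directly (lsof lines can include per-thread duplicates); a small sketch:

    # Count established TCP connections involving port 8082 (RLP) on this VM
    ss -tn state established | grep -c ':8082'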

We are seeing the same behavior, but only in our foundations with higher usage. The symptoms are the same, though: memory usage grows until it is exhausted, then the system won't reboot and a power cycle is required. Also on v106.3.9.

(attached image)

I just rebooted all of our log-api VMs today, and it takes about a week for ours to exhibit the issue, but I will try to grab a heap dump the next time it does.

@MasslessParticle
Attached is a heap dump (~76 KB, heap.pprof.gz) from the rlp-gateway, taken when memory usage on the VM was around 80%.
This VM had been started by a deployment around 40 hours earlier.
After restarting only the "reverse_log_proxy_gateway" job, the VM in our case usually starts at around 15% memory usage.
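In case it's useful, the profile can be inspected locally with the standard pprof tooling, e.g.:

    # Unpack and summarize the heap profile (requires a Go toolchain)
    gunzip heap.pprof.gz
    go tool pprof -top heap.pprof
    # Or explore it interactively in a browser
    go tool pprof -http=:8080 heap.pprof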

This is super helpful, thanks! We're trying to reproduce and root-cause this now.

@MasslessParticle this sounds great! Thank you for the fix, we'll integrate and test it as soon as possible 👍