pixie-io / pixie

Instant Kubernetes-Native Application Observability

Home Page: https://px.dev


Severe CPU degradation after vizier-pem update

jack-hernandez opened this issue · comments

Hi there, we've encountered severe degradation of all of our production services running in EKS. The main symptom is incredibly high and persistent CPU usage across all processes, which looks as though it may have been caused by an update to the vizier-pem service we deploy as part of New Relic's nri-bundle for Pixie. This component seems to update itself automatically when a new image version is released, which for us happened on 13th December at 00:14 GMT with version 0.14.8. Immediately after this, CPU usage across all of our services increased significantly:

[screenshot: CPU usage across all services climbing immediately after the 13 December vizier-pem update]
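For reference, a quick way to confirm which image tag the PEM pods are actually running is something like the sketch below (using the Python Kubernetes client; the `newrelic` namespace is an assumption on my part and may differ depending on how nri-bundle was installed):

```python
# Sketch: list vizier-pem pods and the image each one is running.
# Assumes the Python Kubernetes client is installed and that the Pixie
# components live in the "newrelic" namespace (adjust if yours differs).
from kubernetes import client, config

config.load_kube_config()  # uses the current kubeconfig context
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod(namespace="newrelic").items:
    if pod.metadata.name.startswith("vizier-pem"):
        for container in pod.spec.containers:
            print(f"{pod.metadata.name}: {container.image}")
```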

After combing through the logs on our EKS nodes, the only thing we noticed was a [pem] <defunct> zombie process running as a child of /app/src/vizier/services/agent/pem/pem on one node. As such, we decided to remove Pixie altogether and found that all of the services within our cluster gradually returned to normal, with CPU levels across all pods and nodes dropping drastically (note that the screenshot below is from a different cluster with more nodes, as we migrated away from the one above):

[screenshot: CPU across all pods and nodes gradually returning to normal after removing Pixie, taken from a different cluster with more nodes]
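For anyone wanting to check their own nodes for the same symptom, a rough Python sketch along these lines (standard library only, run directly on the node; the pem binary path is the one from our nodes above) will list any <defunct> children of the PEM process:

```python
#!/usr/bin/env python3
# Sketch: report zombie (<defunct>) processes whose parent is the Pixie PEM
# agent. Run directly on the node; reads /proc, standard library only.
import os

PEM_PATH = "/app/src/vizier/services/agent/pem/pem"  # path seen on our nodes

def stat_fields(pid):
    """Return (comm, state, ppid) from /proc/<pid>/stat, or None if gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except OSError:
        return None
    # comm is wrapped in parentheses and may contain spaces or brackets.
    rparen = data.rindex(")")
    comm = data[data.index("(") + 1:rparen]
    state, ppid = data[rparen + 2:].split()[:2]
    return comm, state, int(ppid)

def cmdline(pid):
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            return f.read().replace(b"\x00", b" ").decode(errors="replace")
    except OSError:
        return ""

pids = [int(p) for p in os.listdir("/proc") if p.isdigit()]
pem_pids = {p for p in pids if cmdline(p).startswith(PEM_PATH)}

for p in pids:
    info = stat_fields(p)
    if info and info[1] == "Z" and info[2] in pem_pids:
        print(f"zombie pid={p} comm={info[0]!r} parent pem pid={info[2]}")
```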

Is anyone able to advise whether this is a known issue and what exactly might be causing this change in behaviour? We've been running Pixie for almost two years without issue, so it's very concerning that the latest image version update has caused such problems for us.

  • Pixie version: bundled with nri-bundle version 5.0.4
  • K8s cluster version: 1.26
  • Node Kernel version: 5.10.199-190.747.amzn2.x86_64

Hey @jack-hernandez, I don't believe this is a known issue. Could you please provide information on the types of services you're running, as well as PEM logs? Any other guidance on reproducing this issue would really help.

Hi @kpattaswamy, most of our services run PHP 7.4 under Apache, managed with supervisord or crontab, and are built on top of an Ubuntu 20.04 base image. We don't have Pixie running in production anymore, but I captured some PEM logs (attached) from our old cluster, which does still have it running (though that cluster is no longer in use, so I'm not sure how helpful this will be).

To provide some additional context on the symptoms we saw: we noticed an increase in CPU on each individual process running within our application services, particularly when PHP code was being executed. The CPU degradation happens very gradually and continues to climb until everything comes to a halt. The same happened in reverse when we removed Pixie; it took a few hours for everything to eventually settle back down. It's also worth noting that even a rollout restart of pods with high CPU utilisation didn't alleviate the problem: they would come straight back up with high CPU rather than climbing from zero again (they were also not being throttled by any CPU limits, nor processing anything more compute-heavy than we'd expect).
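To give an idea of how that per-process climb can be measured, something like the sketch below (standard library only; the PID argument and the 5-second interval are just placeholders) samples utime+stime from /proc/<pid>/stat and prints the CPU percentage a single process used over each interval:

```python
#!/usr/bin/env python3
# Sketch: track the CPU usage of one process over time by diffing
# utime + stime from /proc/<pid>/stat. Standard library only; the PID
# argument and the 5-second interval are placeholders.
import os
import sys
import time

PID = int(sys.argv[1])
INTERVAL_S = 5
CLK_TCK = os.sysconf("SC_CLK_TCK")  # clock ticks per second

def cpu_ticks(pid):
    """Return utime + stime (in clock ticks) for the given PID."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # Fields after the parenthesised comm start at field 3 (state);
    # utime and stime are fields 14 and 15 of the full line.
    rest = data[data.rindex(")") + 2:].split()
    return int(rest[11]) + int(rest[12])

prev = cpu_ticks(PID)
while True:
    time.sleep(INTERVAL_S)
    cur = cpu_ticks(PID)
    print(f"pid {PID}: {(cur - prev) / CLK_TCK / INTERVAL_S * 100:.1f}% CPU "
          f"over the last {INTERVAL_S}s")
    prev = cur
```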

We carried out other troubleshooting steps, working with AWS and other support partners to investigate our nodes, application changes, metrics, networking (specifically DNS), and our Apache configuration, but the only thing that shows any real correlation with the behaviour we've seen is this change in the vizier-pem image version. We've also performed stress tests with and without Pixie, and there is a clear impact on performance. I'm currently trying to replicate this on a simplified cluster, so I'll send any findings through once done.
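For completeness, the kind of load test we mean is roughly the shape of the sketch below (a simplified stand-in, not our real test harness; the URL, thread count, and duration are placeholders): the same load is run with and without Pixie deployed, and the request latency and node CPU are compared.

```python
#!/usr/bin/env python3
# Sketch: a simple HTTP load generator for comparing request latency with
# and without Pixie deployed. The URL, thread count, and duration below are
# placeholders, not our real test setup.
import statistics
import threading
import time
import urllib.request

URL = "http://my-service.internal/health"  # hypothetical endpoint
THREADS = 8
DURATION_S = 60

latencies = []
lock = threading.Lock()

def worker(stop_at):
    while time.time() < stop_at:
        start = time.time()
        try:
            urllib.request.urlopen(URL, timeout=10).read()
        except OSError:
            continue  # count only successful requests
        with lock:
            latencies.append(time.time() - start)

stop_at = time.time() + DURATION_S
threads = [threading.Thread(target=worker, args=(stop_at,)) for _ in range(THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"requests completed: {len(latencies)}")
print(f"p50 latency: {statistics.median(latencies) * 1000:.1f} ms")
print(f"p95 latency: {statistics.quantiles(latencies, n=20)[18] * 1000:.1f} ms")
```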

nri-bundle__vizier-pem-dhc2z__pem.log
nri-bundle__vizier-pem-rtmn5__pem.log
nri-bundle__vizier-pem-v4bpm__pem.log