DataDog / dd-trace-dotnet

.NET Client Library for Datadog APM

Home Page: https://docs.datadoghq.com/tracing/


Large CPU impact when running continuous profiler for dozens of applications

jbparker opened this issue

Describe the bug
When running the continuous profiler in dozens of Windows Services on a single VM, the machine idles with materially more CPU usage (~+25%) than with the profiler off.

When using the strategy described here to allow CLR profiling for tracing without the Datadog continuous profiler (DD_PROFILING_ENABLED=0 with COR_ENABLE_PROFILING=1, since we are on .NET Framework), the CPU impact is minimal (~+1%).

To Reproduce
Steps to reproduce the behavior:

  1. Enable DD_PROFILING_ENABLED=1 and COR_ENABLE_PROFILING=1 on a few dozen .NET Framework 4.8 applications, each running as a Windows Service (a quick way to confirm each service picked up these variables is sketched after these steps)
  2. Let the applications start up and run past cold start to establish a baseline
  3. Collect CPU percentage metrics from Datadog, Azure Monitor, etc.
  4. Observe that CPU usage is markedly higher than without profiling
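
As noted in step 1, here is a minimal sketch of a startup log line each service could emit to confirm which variables it actually picked up. This is just an illustration on our side, not part of the Datadog setup:

```csharp
using System;

class ProfilerEnvCheck
{
    static void Main()
    {
        // COR_ENABLE_PROFILING loads the CLR profiler for tracing;
        // DD_PROFILING_ENABLED toggles the Datadog continuous profiler.
        string clrProfiling = Environment.GetEnvironmentVariable("COR_ENABLE_PROFILING") ?? "(unset)";
        string ddProfiling = Environment.GetEnvironmentVariable("DD_PROFILING_ENABLED") ?? "(unset)";

        Console.WriteLine($"COR_ENABLE_PROFILING={clrProfiling}, DD_PROFILING_ENABLED={ddProfiling}");
    }
}
```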

Expected behavior
We expect some level of performance hit here, somewhere in the range of 5-10%, given this comment and this note from Reduce overhead when using the profiler:

The different profile types have a fixed CPU and memory overhead, so the more profiled applications, the higher the overhead.

But 25-30% seems excessive. This was especially true on older versions, where we hit the same issue as #3625: CPU usage and open handle counts would stair-step upward until the VM locked up entirely.

Overall, we need behavior that matches the profiler's stated aim, so that we can run it in production without worrying that DD (rather than our own code) is responsible for performance problems:

Low impact in production
Continuous profiler runs in production across all services by leveraging technologies such as JDK Flight Recorder to have minimal impact on your host’s CPU and memory usage.

Screenshots

CPU view from Azure Monitor showing CPU Percentage for the instrumented applications (screenshot: cpu-datadog)

Runtime environment (please complete the following information):

  • Instrumentation mode: automatic via the MSI installer, plus manual instrumentation with the NuGet package
  • Tracer version: 2.35.0 MSI, 2.35.0 NuGet package
  • OS: Windows Server 2019
  • CLR: .NET Framework 4.8
  • 23 Windows Service applications running on a single VM

Additional context
We love the detail that the continuous profiler gives without our having to instrument so much of the code. If we have to plan on enabling it via deployment only when there is a problem in an application, that would be a significant loss and would prevent us from using DD to the extent we'd like, namely replacing virtually all of our existing observability platforms.

Hello @jbparker

This is a known issue and we do not recommend enabling the profiler for many applications on the same host. We have a section, Avoid enabling the profiler machine-wide, in the documentation at https://docs.datadoghq.com/profiler/profiler_troubleshooting/dotnet/?tab=windows; you can read "IIS applications" there as .NET applications in general (sorry that the documentation is not clearer).

That said, this is something we would like to address in the future because, as you can imagine, you are not the first to report it to us.

I have a few questions and maybe a lead to lower the CPU consumption:

  • Which profilers are enabled? Did you have to set specific environment variables to enable any of them (e.g. the Exception profiler)?
  • What kind of profiles are you interested in for your application? CPU, wall time, exceptions...
  • What detail are you referring to?

The main offender in the .NET profiler is the wall-time profiler (every 16 ms, we collect the callstacks of 5 threads). If this profiling data is not that useful to you, you can disable it by setting the environment variable DD_PROFILING_WALLTIME_ENABLED to 0.
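
In case it helps with rolling this out per service rather than machine-wide: since the applications run as Windows Services, one way to apply the variable is the service's Environment registry value, which the Service Control Manager injects into the service process. A minimal sketch, assuming a hypothetical service name and that it is acceptable to overwrite any existing Environment entries:

```csharp
using Microsoft.Win32;

class DisableWalltimeForService
{
    static void Main()
    {
        // Hypothetical service name; repeat for each Windows Service you want to tune.
        const string serviceName = "MyWindowsService";

        // The SCM passes the REG_MULTI_SZ "Environment" value under the service key
        // to the service process as environment variables. Requires admin rights.
        using (var key = Registry.LocalMachine.OpenSubKey(
            $@"SYSTEM\CurrentControlSet\Services\{serviceName}", writable: true))
        {
            // Note: this overwrites any existing entries; merge with the current
            // value first if you already use it for something else.
            key.SetValue(
                "Environment",
                new[]
                {
                    "COR_ENABLE_PROFILING=1",          // keep CLR profiling for the tracer
                    "DD_PROFILING_ENABLED=1",          // keep the continuous profiler on
                    "DD_PROFILING_WALLTIME_ENABLED=0", // turn off wall-time sampling
                },
                RegistryValueKind.MultiString);
        }
        // Restart the service for the new environment to take effect.
    }
}
```

The same value can also be set with reg.exe or PowerShell; the point is only that DD_PROFILING_WALLTIME_ENABLED can be scoped per service.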

Explanation of the observed overhead:
For one application, the profiler consumes roughly 800 ms to 2 s of CPU per minute (depending on the depth of the stacks). In your case, you have 23 applications with the profiler attached, so the profiler's overhead on the machine is about 23 × 0.8 s ≈ 18.4 s of CPU per minute, which works out to roughly 30% CPU overhead.
From what I see on the graph you sent, the applications themselves do not use much CPU, which in the end makes the profiler the main CPU consumer.
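
For concreteness, a back-of-the-envelope version of that estimate, using the lower bound of ~800 ms of profiler CPU per application per minute (illustrative numbers only, not a measurement):

```csharp
using System;

class ProfilerOverheadEstimate
{
    static void Main()
    {
        double perAppCpuSecondsPerMinute = 0.8; // ~800 ms lower bound from above
        int profiledApps = 23;

        double totalCpuSecondsPerMinute = perAppCpuSecondsPerMinute * profiledApps; // 18.4 s
        double shareOfOneCore = totalCpuSecondsPerMinute / 60.0;                    // ≈ 0.31

        Console.WriteLine($"{totalCpuSecondsPerMinute:F1} s of profiler CPU per minute " +
                          $"≈ {shareOfOneCore:P0} of one core");
    }
}
```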

Got it, thanks @gleocadie.

I'll definitely try setting DD_PROFILING_WALLTIME_ENABLED to 0. Wall time is certainly useful information, but as long as we can roughly infer overall impact via the CPU time profiler, we're likely okay. I'll try it and see what we get, to check whether that assumption is right.

Thanks for the details and the breakdown of the overhead. Assuming we see drastically lower CPU consumption with the wall-time profiler disabled, this will be a very decent spot to land.

I'll follow up shortly with any further questions or if we can just close this one out.

@gleocadie we were able to put this in place. Idle CPU usage went from 5% with COR_ENABLE_PROFILING=1 only to ~11% with DD_PROFILING_WALLTIME_ENABLED set to 0:

(screenshot: walltime-disabled)

Not bad at all.

That said, I'm trying to figure out what exactly this leaves in the UI. The method-level detail seems to have disappeared from both "Code Hotspots" and the Profiles flame graph (although I guess expecting the CPU profile alone to drive that was a bit of a stretch).

Is it fair to say that, with wall time disabled, code we write can't get method-level CPU time detail similar to what shows in "Code Hotspots"? If so, I think we may be better off disabling the profiler entirely until it can run a little less "hot".

👋 @jbparker
🤔 I see. So, at the moment the profiler collects a thread's callstack, if there is no span context associated with it you won't see anything; this means that if your application is not doing a lot of work, the profiler can miss the callstack/span context information.

We could change some other settings if you would like to keep wall time/Code Hotspots while lowering the profiler overhead in the meantime. First, remove the environment variable DD_PROFILING_WALLTIME_ENABLED (wall time is enabled by default), then (see the combined sketch after this list):

  • If you are interested only in threads that have a span context (Code Hotspots), you can set the environment variable DD_INTERNAL_PROFILING_WALLTIME_THREADS_THRESHOLD to 0. In this case the wall-time profiler stays enabled and the collection of callstacks for threads with a span context stays enabled, but the collection of callstacks for threads without a span context is disabled.
  • If the overhead is still a bit high, you can change the value of the environment variable DD_INTERNAL_PROFILING_CODEHOTSPOTS_THREADS_THRESHOLD. It is the number of thread callstacks with span context the profiler collects; the default value is 10. The profiler keeps track of the threads with span context and, at collection time, walks the callstacks of X threads, where X = min(nb_threads_with_context, 10). This setting is also impactful, so you can play with it too.
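
To tie the two options together, here is the combined sketch referenced above: the per-service variables for keeping Code Hotspots while trimming overhead. The threshold of 5 is an arbitrary example value, and the variables are applied like any other Datadog environment variable (for Windows Services, e.g. via the per-service Environment registry value sketched earlier in this thread):

```csharp
// Per-service environment entries (restart the service after applying).
string[] profilerEnv =
{
    "COR_ENABLE_PROFILING=1",
    "DD_PROFILING_ENABLED=1",
    // Wall time stays enabled (the default), but only threads carrying a
    // span context are sampled, so Code Hotspots keeps working.
    "DD_INTERNAL_PROFILING_WALLTIME_THREADS_THRESHOLD=0",
    // Cap the number of span-context callstacks collected per pass
    // (default 10); 5 here is just an example value to tune.
    "DD_INTERNAL_PROFILING_CODEHOTSPOTS_THREADS_THRESHOLD=5",
};
```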
  • if the overhead is still a bit high you can change the value of the environment variable DD_INTERNAL_PROFILING_CODEHOTSPOTS_THREADS_THRESHOLD. It's the number of threads callstacks with span context the profiler collects. By default the value is 10. The profiler keeps track of the threads with span context and at the time of collection, it will get the callstack of X threads. X being the min(nb_threads_with_context, 10). This can also be impactful, you can play with this setting too.