microsoft / CLRInstrumentationEngine

The CLR Instrumentation Engine is a cooperation profiler that allows running multiple profiling extensions in the same process.


Azure App Service: Access Violation in large method being instrumented

mellamokb opened this issue · comments

I've been trying to understand an issue that only shows up in our production system under heavy load. Our application is hosted in Azure App Service, which is currently on Instrumentation Engine v1.0.39.

We get an Access Violation on a routine basis, which crashes the service. I enabled automatic crash dumps and captured a few of them. Looking at a dump in WinDbg, I'm seeing an extremely large output from !dumpstack (10,732 lines), which appears to end in a stack overflow. 99% of the entries in the stack are references to MicrosoftInstrumentationEngine_x64. I thought it might be related to #307, but since we're on 1.0.39, that fix should already be included. I may not entirely understand what I'm seeing, as I'm a novice with WinDbg :)

We are using sampling, so I assume the issue appears random partly because it depends on when the profiler happens to instrument the affected method. I'm assuming I should be able to repro it in a development environment by manually instrumenting the affected method from the stack trace?

Thanks!

Do you have a recent copy of Visual Studio? It has better facilities for showing stack overflows in dumps.

Otherwise, if you need to use WinDbg, make sure that the Microsoft Symbol Server is configured so that you can see what the actual failing frame is: https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/symbol-path. We might be able to track down the problem better with that info.
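For reference, a minimal way to set that up from the WinDbg command line looks like this (the local cache directory here is just an example path):

.sympath srv*C:\symbols*https://msdl.microsoft.com/download/symbols
.reload /f
k

After the forced reload, k should show resolved function names for any module whose symbols are available on the public server.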

We are on the latest version of VS 2019. When I open the crash dump in VS 2019, it shows an Access violation exception (0xC0000005). The call stack doesn't look very deep, but this is the top of it. The bottom method is from our code base.

[screenshot]

[screenshot]

I'm afraid that I can't tell very much from that call stack. I can't remember if we have the symbols for this dll uploaded to the public symbol servers or not. Can you try this:

A) Disable "Just My Code"

  1. Select "Tools > Options > Debugging > General"
  2. Disable "Just My Code"

B) Enable Microsoft Symbol Servers

  1. Select "Tools > Options > Debugging > Symbols"
  2. Enable "Microsoft Symbol Servers"
  3. Select "Load all modules, unless excluded"

Try looking at the callstack again. If there are no function names next to the CLRIE dll, can you do the following:

  1. Open the modules Window (Debug > Windows > Modules)
  2. Find MicrosoftInstrumentationEngine_x64.dll
  3. Copy the information in the "Address" column into this chat.

I need to know the address that the module was loaded into in order to decode the frame addresses.
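If it's easier to get this from WinDbg instead, the same base address and size show up in the loaded-module list, e.g.:

lmvm MicrosoftInstrumentationEngine_x64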

Address: 00007FFD58120000-00007FFD58276000

Also, if it helps, here is the top section of the output of the k command in WinDbg:

0:065> k
 # Child-SP          RetAddr               Call Site
00 00000086`b81815c8 00007ffd`6a3e9a11     clr!DontCallDirectlyForceStackOverflow+0x10
01 00000086`b81815d0 00007ffd`6a22fdcf     clr!CLRVectoredExceptionHandler+0xa8
02 00000086`b8181630 00007ffd`76ae6b30     clr!CLRVectoredExceptionHandlerShim+0xa3
03 00000086`b8181660 00007ffd`76ab46bb     ntdll!RtlpCallVectoredHandlers+0x104
04 00000086`b8181700 00007ffd`76b2987a     ntdll!RtlDispatchException+0x6b
05 00000086`b8181e00 00007ffd`581852d8     ntdll!KiUserExceptionDispatch+0x3a
06 00000086`b8182518 00007ffd`58180a13     MicrosoftInstrumentationEngine_x64!GetInstrumentationEngineLogger+0x4cf38
07 00000086`b8182530 00007ffd`5817e25a     MicrosoftInstrumentationEngine_x64!GetInstrumentationEngineLogger+0x48673
08 00000086`b8182550 00007ffd`581b9032     MicrosoftInstrumentationEngine_x64!GetInstrumentationEngineLogger+0x45eba
09 00000086`b8182590 00007ffd`5818abd0     MicrosoftInstrumentationEngine_x64!GetInstrumentationEngineLogger+0x80c92
0a 00000086`b81825e0 00007ffd`58189b15     MicrosoftInstrumentationEngine_x64!GetInstrumentationEngineLogger+0x52830
0b 00000086`b8182610 00007ffd`76b29d53     MicrosoftInstrumentationEngine_x64!GetInstrumentationEngineLogger+0x51775
0c 00000086`b81826f0 00007ffd`58165f22     ntdll!RcConsolidateFrames+0x3
0d 00000086`b81fc390 00007ffd`6a5e8cef     MicrosoftInstrumentationEngine_x64!GetInstrumentationEngineLogger+0x2db82
0e 00000086`b81fc7b0 00007ffd`6a47fd1e     clr!EEToProfInterfaceImpl::GetReJITParameters+0x90
0f 00000086`b81fc810 00007ffd`6a0d2c40     clr!ReJitManager::DoReJitIfNecessaryWorker+0x3ad0be
10 00000086`b81fc910 00007ffd`6a0d1bcc     clr!MethodDesc::DoPrestub+0x8f6
11 00000086`b81fcb30 00007ffd`6a0c4835     clr!PreStubWorker+0x3cc
12 00000086`b81fce70 00007ffd`11adad86     clr!ThePreStub+0x55
13 00000086`b81fcf20 00007ffd`121aaebb     0x00007ffd`11adad86
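In case it helps with decoding: GetInstrumentationEngineLogger is presumably just the nearest exported symbol, so with the module base above each of these frames can be converted to a module-relative offset in WinDbg, for example the first CLRIE address on the stack (00007ffd`581852d8):

? 00007ffd`581852d8 - 00007ffd`58120000

which should evaluate to 0x652d8, the offset of that instruction within MicrosoftInstrumentationEngine_x64.dll.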

Hmm... things don't seem to be matching up. Are you certain that it is version 1.0.39? Can you post a screenshot of the whole line of info from the modules window?

Indeed. I had based it on the folder name being 1.0.39, but looking closely at the timestamp, it looks like it has instead picked up version 1.0.29 from November 2019. Well, that would potentially explain a lot of issues... but then the question is: how did it pick up version 1.0.29 and call it 1.0.39?

[screenshot of the Modules window]

OK, what is going on here is that there is a structured exception occurring (could be an AV) during rejit of a method. That exception is getting caught by the profiler manager, and it is attempting to log an error. That error logging is causing another AV.

There were a couple of fixes around access violations after 1.0.29. Is it possible to update the copy of CLRIE that you are using?

Or, with a little more inspection, it looks like there is a structured exception that is getting caught during rejit, which the engine then attempts to log, but there isn't enough stack space to make the call (the exception is thrown in __chkstk). So the original error may have been a stack overflow. Regardless, it is probably best to update CLRIE if you can.
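For anyone digging into a similar dump: one way to look at the originally recorded exception, rather than the nested one, is to switch to the stored exception context, e.g.:

.ecxr
.exr -1
k

.exr -1 prints the most recent exception record (code, address, parameters), and k then shows the stack as it was at that exception.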

I mean, it is obvious enough that CLRIE should be updated. But I have no idea how it is being linked to our application. We're using a managed service (Azure App Service). I didn't create that SiteExtensions folder, nor did I install the 1.0.39 folder and put the 1.0.29 dll in it. I'm under the impression that this folder and library are wholly managed by the Azure infrastructure. I wasn't even aware that CLRIE existed until 2 days ago, because it is not intentionally linked to our deployed application. The only connection I'm aware of is Application Insights, which I enabled from the Azure Portal a few years ago. We do not manually instrument anything in our application beyond what Application Insights automatically captures.

Are you suggesting I update that dll in the D:\Program Files (x86)\SiteExtensions\ folder and hope it sticks? Or add an explicit reference to CLRIE to our application and hope it uses that instead of some GAC'd version? Is there anyone on the Azure App Service team who can offer insight into this? Isn't it possible that thousands of Azure App Service installs out there are suffering the same problem because some back-end auto-update system in Azure App Service is deploying the wrong version of the dll? I'm really at a loss as to what the correct solution is here.

OK, as a test I created a brand new App Service plan from scratch and deployed a blank test MVC app to it. When I browse the files in Kudu, it has the same bad dll in the 1.0.39 folder. So at this point, it looks like a mistake on the Azure App Service infrastructure side that needs to be corrected ASAP.

I'll see if I can find some contacts in Azure App Services to see what is going on here.

In the meantime, if you have the new application and there is no customer data in it, would it be possible to share a heap dump of the crash with us? You might also have logs in the Windows Event Log that could help us track it down and confirm that it was a previously known issue.

Sorry, I did not have a crash in the test application. I only verified that the wrong dll is deployed. I'm not sure how to force a crash, as it only seems to happen under very high load and memory pressure, and then only rarely and randomly. We are also looking into reducing the memory pressure in case that is part of the problem with the application.

Thanks!

Memory pressure does seem to be the problem. The crash is happening because your application can't allocate additional space for the stack to grow anymore, and that failure is surfacing in __chkstk. Not all dlls have this check compiled into them, but CLRIE does. That might be why the failure shows up in CLRIE more often.
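As a rough way to confirm that in the dump: switch to the faulting thread and compare its stack pointer against the limits recorded in the TEB, e.g. (thread number taken from the prompt in the k output above):

~65s
!teb

If the stack pointer is sitting close to the StackLimit reported by !teb, the thread had essentially no room left for the stack to grow.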

Well, I have been able to address at least one potential cause of high memory pressure. It still concerns me that Azure App Service is using the wrong dll, but if you are raising that with the Azure App Service team, hopefully it can be resolved separately. Closing the issue. Thanks for your time and effort!

Related issue with 1.0.29 dlls found in the 1.0.39 folder: #396