dotnet / aspnetcore

ASP.NET Core is a cross-platform .NET framework for building modern cloud-based web applications on Windows, Mac, or Linux.

Home Page: https://asp.net

ANCM PostStartCheck Failure

AdamRiddick opened this issue

For reference: #41409 - I'm opening a new issue as I can't comment on the other one since it's been locked.

To summarize, we have a .NET Core 3.1 application using the out-of-process hosting model that intermittently hits this issue because of a timeout during startup.
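For readers unfamiliar with the setup being described, a minimal sketch of a `web.config` for out-of-process hosting might look like the following. The assembly name `MyApp.dll` and the log path are hypothetical placeholders, not taken from the reporter's application; `startupTimeLimit` is shown at its documented default of 120 seconds, which is the window the module waits for the process to start listening.

```xml
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <location path="." inheritInChildApplications="false">
    <system.webServer>
      <handlers>
        <add name="aspNetCore" path="*" verb="*"
             modules="AspNetCoreModuleV2" resourceType="Unspecified" />
      </handlers>
      <!-- MyApp.dll is a placeholder assembly name (assumption). -->
      <!-- hostingModel must be set explicitly: in-process is the default in 3.x. -->
      <!-- startupTimeLimit is in seconds; 120 is the documented default. -->
      <aspNetCore processPath="dotnet"
                  arguments=".\MyApp.dll"
                  hostingModel="outofprocess"
                  startupTimeLimit="120"
                  stdoutLogEnabled="false"
                  stdoutLogFile=".\logs\stdout" />
    </system.webServer>
  </location>
</configuration>
```

Raising `startupTimeLimit` only widens the window. It would not explain why a process that failed to start is not relaunched.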

I am given to understand from #22507 that the process shouldn't be left in a broken state and should be restarted, since it is out-of-process. However, this isn't happening, and we have to correct it with a manual restart.

Tagging @jkotalik and @adityamandaleeka from the above two issues.

Triage: based on the description it sounds like RFP (or something else) is preventing the app from restarting in this case even though we believe that shouldn't affect out-of-proc. We should check if out-of-proc recycling is also affected by RFP or if something else is buggy.

Thanks for contacting us.
We're moving this issue to the .NET 7 Planning milestone for future evaluation / consideration. Because it's not immediately obvious that this is a bug in our framework, we would like to keep this around to collect more feedback, which can later help us determine the impact of it. We will re-evaluate this issue, during our next planning meeting(s).
If we later determine, that the issue has no community involvement, or it's very rare and low-impact issue, we will close it - so that the team can focus on more important and high impact issues.
To learn more about what to expect next and how this issue will be handled you can read more about our triage process here.

@adityamandaleeka I see from the bot above that this will be considered in future. However, this is a real, live problem for us in production that seems increasingly common. Is there anything we can do here?

@AdamRiddick you can ignore the bot message for this one... I put it in the milestone so we remember to look into it during the .NET 7 cycle.

Because we don't have logging or other info, we're going to just investigate whether RFP affects the out-of-proc scenario as well (which we don't expect). If it's not RFP, it might be something else in your case that's preventing the app from restarting.

@adityamandaleeka Thanks for the clarification. Can you tell me what RFP is?

I'm happy to arrange a call to discuss if that will assist.

It sounds like there should be a message in your event log if this is the cause, something like:

Application pool 'my-test-application-pool' is being automatically disabled due to a series of failures in the process(es) serving that application pool.

Hi @HaoK I've sifted through the event logs when this has occurred and we don't see any messages relating to rapid fail protection.
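For context on the setting under discussion: Rapid Fail Protection is configured per application pool in IIS's `applicationHost.config`. The fragment below is only a sketch showing the documented default values; the pool name is reused from the sample event log message above and is not the reporter's actual pool.

```xml
<system.applicationHost>
  <applicationPools>
    <!-- Pool name taken from the sample event log message (illustrative only). -->
    <add name="my-test-application-pool">
      <!-- IIS defaults: the pool is disabled after 5 worker process -->
      <!-- failures within a 5 minute interval. -->
      <failure rapidFailProtection="true"
               rapidFailProtectionInterval="00:05:00"
               rapidFailProtectionMaxCrashes="5" />
    </add>
  </applicationPools>
</system.applicationHost>
```

When this protection trips, the pool is stopped and the event log records the "automatically disabled" message quoted above, so the absence of that message suggests a different cause.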

I don't see any weirdness when trying a new app that does the following:

Throws in startup, results in a similar event log entry:

[image: screenshot of the event log entry]

But the app domain is still up; I tried this for many requests.

Adding a sleep for 60 minutes results in an eventual startup timeout after 120 seconds:

[image: screenshot of the event log entry]

@AdamRiddick since you aren't getting hit by Rapid Fail Protection, there's not much we can do unless you can give us a repro that demonstrates the behavior you are seeing, where IIS needs to be restarted. Feel free to open a new issue if you are able to provide a repro for us to investigate further.

@HaoK To clarify, did the process restart after the timeout? My understanding is it should - that's the situation we are in, and it is not restarting every time.

I appreciate the difficulties here. I'll try to reproduce this standalone. Are there any debugging options that can help us understand why it isn't restarting? We've tried the ANCM tracing, but that doesn't appear to tell us (unless I'm missing it...).
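One option that may help here, sketched under assumptions: the ASP.NET Core Module supports an enhanced diagnostic log through `handlerSettings`, alongside the stdout log. The assembly name `MyApp.dll` and the log paths below are hypothetical, and the sketch assumes the `logs` folder exists and is writable by the app pool identity. This may overlap with the ANCM tracing already tried.

```xml
<!-- Replaces the existing aspNetCore element in web.config. -->
<!-- MyApp.dll and the log paths are placeholders (assumptions). -->
<aspNetCore processPath="dotnet"
            arguments=".\MyApp.dll"
            hostingModel="outofprocess"
            stdoutLogEnabled="true"
            stdoutLogFile=".\logs\stdout">
  <handlerSettings>
    <!-- debugFile: where the module writes its own diagnostic log. -->
    <handlerSetting name="debugFile" value=".\logs\aspnetcore-debug.log" />
    <!-- debugLevel: FILE sends output to debugFile, TRACE is the most verbose level. -->
    <handlerSetting name="debugLevel" value="FILE,TRACE" />
  </handlerSettings>
</aspNetCore>
```

Both logs grow without a size limit, so they are best enabled only while reproducing the problem and turned off afterwards.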

@HaoK I've managed to get further with this, and it now appears the application is being restarted when required; it's just happening consistently due to an issue somewhere else. We're still investigating and will come back if we find evidence that it is tied to the ANCM.

Thanks.