microsoft / DurableFunctionsMonitor

A monitoring/debugging UI tool for Azure Durable Functions


Question about activity durations on Gantt charts

scale-tone opened this issue · comments

Btw I have a question for you: I've noticed that the total run time for each activity on the Gantt chart is a sum of "waiting time" and "execution time". Is there a way for you to show those timings separately?

I say that because on the servers I have a 10 minute limit, yet I can see some activities taking 20 min. So I suppose this is the sum of waiting time (scheduled) and actual execution time.

Originally posted by @junalmeida in #22 (comment)

@junalmeida, Gantt charts are generated out of data that DurableClient.GetStatusAsync() returns as execution history. Items in that array have two DateTime fields - Timestamp and ScheduledTime. A line on the Gantt chart represents the difference between these two.

If I read the code and the data correctly, then yes, ScheduledTime is the time when an activity invocation message was sent and Timestamp is the moment when the activity result message was sent.
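To make that concrete, here is a minimal sketch (not DfMon's actual code; the helper name is made up and the field access assumes the history JSON exposes the fields named above) of pulling the history and computing the same interval a Gantt bar represents:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Newtonsoft.Json.Linq;

public static class GanttDurationSketch
{
    // Illustrative helper: prints, per completed activity, the interval a Gantt bar would show.
    public static async Task DumpActivityDurations(IDurableOrchestrationClient client, string instanceId)
    {
        // showHistory: true makes the status object include the execution history array.
        DurableOrchestrationStatus status =
            await client.GetStatusAsync(instanceId, showHistory: true, showHistoryOutput: false);

        foreach (JToken evt in status.History)
        {
            if (evt.Value<string>("EventType") != "TaskCompleted") continue;

            // ScheduledTime - when the activity invocation message was sent;
            // Timestamp - when the activity result message was recorded.
            DateTime scheduled = evt.Value<DateTime>("ScheduledTime");
            DateTime completed = evt.Value<DateTime>("Timestamp");

            Console.WriteLine($"{evt.Value<string>("FunctionName")}: {completed - scheduled}");
        }
    }
}
```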

Yes, there are some other timestamps stored in the XXXHistory table, which we could potentially use to make Gantt charts more meaningful. The problem is that there is no documentation on those, so we can only make assumptions about their exact meaning.

Would it be possible for you to provide a couple of examples of those lengthy activities, as they appear in your XXXHistory table? Some 'TaskScheduled' event (with all its timestamps) + some corresponding 'TaskCompleted' event (also with all its timestamps) - and how this activity appears on the 'Details' page and on the Gantt chart?

This is what I see. Let me know if the query itself looks right.

The eventId 72 (20m37s):
image
image

The eventId 68 (20m42s):
image
image

Thanks for that info, @junalmeida.
As we can see from your timestamps, these activities indeed took ~20 min to execute, with no indication of any "waiting time" or anything similar. So I don't see any way to make the Gantt chart look more informative in this case.

But you mentioned a "10 minute limit" - what is that? Is it the functionTimeout setting that you've set to 10 minutes? Or how exactly are you configuring that (there were in fact lots of older ways to configure function execution timeouts, and most of them do not work anymore)?

This function lives on the basic serverless Consumption plan, which implies a 10 minute execution limit. Anything beyond that throws FunctionTimeoutException.

That's why I don't think that 20 minute report is real; in addition, I can't find any function reporting more than 10 min in Application Insights.

I can try to find this exact instance for you.

@junalmeida , the default timeout for Consumption plan is 5 minutes, not 10. This is why I was asking whether (and how) you're configuring a custom value for that timeout. 10 minutes is the maximum allowed value for that config setting.

When your activity exceeds that timeout (either default or explicitly configured), it indeed is supposed to end up with a FunctionTimeoutException, like this:
image
And it should be shown as failed (in red) on the Gantt chart.

How many instances like this do you have? Is it just one instance or many?

Are you sure this instance was actually run in Azure? Could it somehow happen that it was run on some devbox or any third-party machine (e.g. if your cloud environment and your devbox occasionally share the same Storage and the same TaskHub)?

It is configured for 10 minutes in its host.json file (see the sketch below).
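For illustration only - not the exact file from this app - a 10-minute override in host.json typically looks like this:

```json
{
  "version": "2.0",
  "functionTimeout": "00:10:00"
}
```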

And I can see a handful of activities shown as running for more than 10 minutes on the Gantt chart.

And no, I do not share the dev execution environment with the prod TaskHub.

So far I'm unable to reproduce the behavior you described.

Can you elaborate on what platform/language/Functions version you're using?

Also, can you check that your instance is healthy by itself? Aka that there are no host crashes due to e.g. OutOfMemoryException and no other weird effects?

This project is written in C# on .NET Core 3.1, Azure Functions v3.0.13, Durable Functions v2.5.1.

The instance is healthy, I have no complaints and all jobs seem to be working. I can see no OOMs in App Insights.

Also, checking the performance reports, the worst call I have in the past 24 hours is 7 min, which is within the 10 min limit.

A peek at my host.json:
image

OK, after switching Microsoft.Azure.WebJobs.Extensions.DurableTask from v2.1.1 to v2.5.1 I can confirm that I can reproduce it.
In my case it looks like this:
image

Aka it seems like some of those parallel activities are being queued and picked up for execution only after some others have finished, even though this isn't indicated by their ScheduledTime values.
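For context, the repro is a fan-out orchestration roughly like the sketch below (names and counts are made up; this illustrates the pattern, not the actual repro code):

```csharp
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class FanOutSketch
{
    [FunctionName("FanOutOrchestrator")]
    public static async Task Run([OrchestrationTrigger] IDurableOrchestrationContext context)
    {
        // All activities get their ScheduledTime stamped here, but only a limited
        // number can actually execute at once on a single instance - the rest wait in the queue.
        var tasks = Enumerable.Range(0, 50)
            .Select(i => context.CallActivityAsync("SomeActivity", i));

        await Task.WhenAll(tasks);
    }
}
```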

Will try to play with it a little more. E.g. will try the latest version (since there's a slim hope that this weird behavior was fixed with this commit or any other commit).

In the meantime, since it is definitely not a DfMon bug, can I ask you to raise an issue in azure-functions-durable-extension?

I can confirm that the behavior is the same with the latest v2.6.1.

But I can also confirm that this behavior only takes place when the activity method is synchronous (aka if it holds a thread).

If the method is asynchronous (e.g. marked with the async keyword and implemented accordingly), all its executions are cancelled correctly after 10 minutes, resulting in TimeoutExceptions as expected.

So I suggest that you check that your activity methods are implemented as asynchronous (returning Tasks), since it is a best practice anyway, especially for methods that can take that long to execute.
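To illustrate the difference being described (hypothetical names, a sketch rather than anyone's actual code):

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class ActivityStyleSketch
{
    // Asynchronous activity: the thread is released while awaiting,
    // and in my tests executions like this were cancelled at the timeout as expected.
    [FunctionName("ProcessItemAsync")]
    public static async Task<string> ProcessItemAsync([ActivityTrigger] string item)
    {
        await Task.Delay(1000); // stand-in for awaited I/O work
        return $"processed {item}";
    }

    // Synchronous activity: holds the worker thread for the whole duration.
    // These were the executions that kept running past the 10-minute mark in my repro.
    [FunctionName("ProcessItemBlocking")]
    public static string ProcessItemBlocking([ActivityTrigger] string item)
    {
        System.Threading.Thread.Sleep(1000); // stand-in for blocking work
        return $"processed {item}";
    }
}
```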

I do not have any activity which is not async. All return Task.
IMO this happens when the server is busy enough not to start an activity immediately after it is scheduled. But if the actual execution start time is not recorded, then it is indeed a bug in the durable extension itself. I'm just not sure how to explain that properly, as I don't know all the inner workings of the durable extension.

I do not have any activity which is not async. All return Task

But can it happen that they still hold a thread inside of them (e.g. by doing a Thread.Sleep() or something similar)?

Hm no, I have no code that holds a thread intentionally. And the function you see above is not the first one that I've found "running" for longer than 10 min on the Gantt chart. This is also not consistent: some instances are pretty fast, some are not, which is why I'm inclined to think this happens when the server is busy.

This is another example from a totally different function app and a different TaskHub, also supposed to end within 10 min.

image

App Insights:
image

OK, so your orchestration actually starts lots of those activities. This is most likely the reason.

The default value for the maxConcurrentActivityFunctions setting in Consumption mode is 10, and the auto-scaling mechanism is not responsive enough to those activities, so they get queued on the same single instance.

The solution is to set that maxConcurrentActivityFunctions setting to something more substantial, as sketched below.
Alternatively you could try another pricing tier, e.g. Premium. With your large volume of executions this could potentially even be cheaper.
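For reference, a sketch of where that setting lives in host.json (the value here is just an example):

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "maxConcurrentActivityFunctions": 50
    }
  }
}
```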

It looks like the duration is calculated using timestamp - scheduledTime instead of the actual function start time. This means that the time between the function being scheduled and actually starting is counted. The function start value is something we should expose from the Durable Functions extension to make the calculation more accurate.

@scale-tone ATM we are not considering moving to the Premium plan because, although we indeed fire lots of activities, there are not many orchestrations happening simultaneously and we do not require fast processing, so it is OK to wait a bit. I just feel like the Gantt chart could use more detailed reporting on "scheduled" vs "start time", as @bachuv just mentioned.

We are also not considering increasing maxConcurrentActivityFunctions, as this increases parallelism on the very same instance (instead of spawning more instances), which can then be worse because Consumption plan instances are very basic machines.