microsoft / DurableFunctionsMonitor

A monitoring/debugging UI tool for Azure Durable Functions


Question about activity durations on Gantt charts

scale-tone opened this issue · comments

Btw I have a question for you: I've noticed that the total run time for each activity on the Gantt chart is a sum of "waiting time" and "execution time". Is there a way for you to show those timings separately?

I say that because on the servers I have a 10 minute limit, yet I can see some activities taking 20 min. So I suppose this is the sum of waiting time (scheduled) and actual execution time.

Originally posted by @junalmeida in #22 (comment)

@junalmeida, Gantt charts are generated out of data that DurableClient.GetStatusAsync() returns as execution history. Items in that array have two DateTime fields - Timestamp and ScheduledTime. A line on the Gantt chart represents the difference between these two.

If I read the code and the data correctly, then yes, ScheduledTime is the time when an activity invocation message was sent and Timestamp is the moment when the activity result message was sent.
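To make that concrete, here is a minimal sketch (not DfMon's actual code; the helper name is made up and the field access assumes the history JSON exposes the fields named above) of pulling the history and computing the same interval a Gantt bar represents:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Newtonsoft.Json.Linq;

public static class GanttDurationSketch
{
    // Illustrative helper: prints, per completed activity, the interval a Gantt bar would show.
    public static async Task DumpActivityDurations(IDurableOrchestrationClient client, string instanceId)
    {
        // showHistory: true makes the status object include the execution history array.
        DurableOrchestrationStatus status =
            await client.GetStatusAsync(instanceId, showHistory: true, showHistoryOutput: false);

        foreach (JToken evt in status.History)
        {
            if (evt.Value<string>("EventType") != "TaskCompleted") continue;

            // ScheduledTime - when the activity invocation message was sent;
            // Timestamp - when the activity result message was recorded.
            DateTime scheduled = evt.Value<DateTime>("ScheduledTime");
            DateTime completed = evt.Value<DateTime>("Timestamp");

            Console.WriteLine($"{evt.Value<string>("FunctionName")}: {completed - scheduled}");
        }
    }
}
```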

Yes, there are some other timestamps stored in the XXXHistory table, which we could potentially use to make Gantt charts more meaningful. The problem is that there is no documentation on those, so we can only make assumptions about their exact meaning.

Would it be possible for you to provide a couple of examples of those lengthy activities, as they appear in your XXXHistory table? Some 'TaskScheduled' event (with all its timestamps) + some corresponding 'TaskCompleted' event (also with all its timestamps) - and how this activity appears on the 'Details' page and on the Gantt chart?

This is what I see. Let me know if the query itself looks right.

The eventId 72 (20m37s):
image
image

The eventId 68 (20m42s):
image
image

Thanks for that info, @junalmeida.
As we can see from your timestamps, these activities indeed took ~20 min to execute, with no indication of any "waiting time" or anything similar. So I don't see any way to make the Gantt chart look more informative in this case.

But you mentioned a "10 minute limit" - what is that? Is it the functionTimeout setting that you've set to 10 minutes? Or how exactly are you configuring that (there were in fact lots of older ways to configure function execution timeouts, and most of them do not work anymore)?

This function lives on the basic serverless Consumption plan, which implies a 10 minute execution limit. Anything beyond that throws FunctionTimeoutException.

That's why I don't think that 20 minute report is real; in addition, I can't find any function reporting more than 10 min in Application Insights.

I can try to find this exact instance for you.

@junalmeida , the default timeout for Consumption plan is 5 minutes, not 10. This is why I was asking whether (and how) you're configuring a custom value for that timeout. 10 minutes is the maximum allowed value for that config setting.

When your activity exceeds that timeout (either default or explicitly configured), it indeed is supposed to end up with a FunctionTimeoutException, like this:
image
And it should be shown as failed (in red) on the Gantt chart.

How many instances like this do you have? Is it just one instance or many?

Are you sure this instance was actually run in Azure? Could it somehow happen that it was run on some devbox or any third-party machine (e.g. if your cloud environment and your devbox occasionally share the same Storage and the same TaskHub)?

It is configured for 10 minutes in its host.json file (see the sketch below).
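For illustration only - not the exact file from this app - a 10-minute override in host.json typically looks like this:

```json
{
  "version": "2.0",
  "functionTimeout": "00:10:00"
}
```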

And I can see a handful of activities shown as running for more than 10 minutes on the Gantt chart.

And no, I do not share the dev execution environment with the prod TaskHub.

So far I'm unable to reproduce the behavior you described.

Can you elaborate on what platform/language/Functions version you're using?

Also, can you check that your instance is healthy by itself? Aka that there are no host crashes due to e.g. OutOfMemoryException and no other weird effects?

This project is written in C# on .NET Core 3.1, Azure Functions v3.0.13, Durable Functions v2.5.1.

The instance is healthy, I have no complaints and all jobs seem to be working. I can see no OOMs in App Insights.

Also, checking the performance reports, the worst call I have in the past 24 hours is 7 min, which is within the 10 min limit.

A peek at my host.json:
image

OK, after switching Microsoft.Azure.WebJobs.Extensions.DurableTask from v2.1.1 to v2.5.1 I can confirm that I can reproduce it.
In my case it looks like this:
image

Aka it seems like some of those parallel activities are being queued and picked up for execution only after some others have finished, even though this isn't indicated by their ScheduledTime values.
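For context, the repro is a fan-out orchestration roughly like the sketch below (names and counts are made up; this illustrates the pattern, not the actual repro code):

```csharp
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class FanOutSketch
{
    [FunctionName("FanOutOrchestrator")]
    public static async Task Run([OrchestrationTrigger] IDurableOrchestrationContext context)
    {
        // All activities get their ScheduledTime stamped here, but only a limited
        // number can actually execute at once on a single instance - the rest wait in the queue.
        var tasks = Enumerable.Range(0, 50)
            .Select(i => context.CallActivityAsync("SomeActivity", i));

        await Task.WhenAll(tasks);
    }
}
```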

Will try to play with it a little more. E.g. will try the latest version (since there's a slim hope that this weird behavior was fixed with this commit or any other commit).

In the meantime, since it is definitely not a DfMon bug, can I ask you to raise an issue in azure-functions-durable-extension?

I can confirm that the behavior is the same with the latest v2.6.1.

But I can also confirm that this behavior only takes place when the activity method is synchronous (aka if it holds a thread).

If the method is asynchronous (e.g. marked with the async keyword and implemented accordingly), all its executions are cancelled correctly after 10 minutes, resulting in TimeoutExceptions as expected.

So I suggest that you check that your activity methods are implemented as asynchronous (returning Tasks), since it is a best practice anyway, especially for methods that can take that long to execute.
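To illustrate the difference being described (hypothetical names, a sketch rather than anyone's actual code):

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class ActivityStyleSketch
{
    // Asynchronous activity: the thread is released while awaiting,
    // and in my tests executions like this were cancelled at the timeout as expected.
    [FunctionName("ProcessItemAsync")]
    public static async Task<string> ProcessItemAsync([ActivityTrigger] string item)
    {
        await Task.Delay(1000); // stand-in for awaited I/O work
        return $"processed {item}";
    }

    // Synchronous activity: holds the worker thread for the whole duration.
    // These were the executions that kept running past the 10-minute mark in my repro.
    [FunctionName("ProcessItemBlocking")]
    public static string ProcessItemBlocking([ActivityTrigger] string item)
    {
        System.Threading.Thread.Sleep(1000); // stand-in for blocking work
        return $"processed {item}";
    }
}
```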

I do not have any activity which is not async. All return Task.
IMO this happens when the server is busy enough not to start an activity immediately after it is scheduled. But if the actual execution start time is not recorded, then it is indeed a bug in the durable extension itself. I'm just not sure how to explain that properly, as I don't know all the inner workings of the durable extension.

I do not have any activity which is not async. All return Task

But can it happen that they still hold a thread inside of them (e.g. by doing a Thread.Sleep() or something similar)?

Hm no, I have no code that holds a thread intentionally. And the function you see above is not the first one that I've found "running" for longer than 10 min on the Gantt chart. This is also not consistent: some instances are pretty fast, some are not, which is why I'm inclined to think this happens when the server is busy.

This is another example from a totally different function app and a different TaskHub, also supposed to end within 10 min.

image

App Insights:
image

OK, so your orchestration actually starts lots of those activities. This is most likely the reason.

The default value for the maxConcurrentActivityFunctions setting in Consumption mode is 10, and the auto-scaling mechanism is not responsive enough to those activities, so they get queued on the same single instance.

The solution is to set that maxConcurrentActivityFunctions setting to something more substantial, as sketched below.
Alternatively you could try another pricing tier, e.g. Premium. With your large volume of executions this could potentially even be cheaper.
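For reference, a sketch of where that setting lives in host.json (the value here is just an example):

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "maxConcurrentActivityFunctions": 50
    }
  }
}
```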

It looks like the duration is calculated using timestamp - scheduledTime instead of the actual function start time. This means that the time between the function being scheduled and actually starting is counted. The function start value is something we should expose from the Durable Functions extension to make the calculation more accurate.

@scale-tone ATM we are not considering moving to the Premium plan because, although we indeed fire lots of activities, there are not many orchestrations happening simultaneously and we do not require fast processing, so it is OK to wait a bit. I just feel like the Gantt chart could use more detailed reporting on "scheduled" vs "start time", as @bachuv just mentioned.

We are also not considering increasing maxConcurrentActivityFunctions, as this increases parallelism on the very same instance (instead of spawning more instances), which can then be worse because Consumption plan instances are very basic machines.