dapr / dapr

Dapr is a portable, event-driven, runtime for building distributed applications across cloud and edge.

Home Page: https://dapr.io


Graceful shutdown with multiple components in the runtime fails

elabedo opened this issue · comments

In what area(s)?

/area runtime

What version of Dapr?

1.13.x

Expected Behavior

After receiving SIGTERM, Dapr should continue to accept requests coming from the main application only, and wait until all in-flight routines have completed before shutting down.

Actual Behavior

Suppose an application needs to communicate with multiple components (Redis + Kafka) and invoke other services (Application 2 and Application 3):
[image: architecture diagram]

The success scenario

The picture makes it clearer, I guess, but let's describe it:

  1. The user invokes the application over HTTP through Dapr and waits for the final result
  2. The main application (App1 in the picture) invokes external services (either another Dapr invoke or a call to an external service)
  3. Dapr invokes Application 2, which replies with a response
  4. Same as App 2
  5. Same as App 2
  6. The main application calls Dapr for the state store (Redis in this use case)
  7. Dapr invokes Redis and replies with the results
  8. The main application calls Dapr to produce pub/sub data (Kafka in this use case)
  9. Dapr invokes Kafka and replies with the status
  10. Global reply to the user with success 👍
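For reference, the sidecar calls in the steps above map onto Dapr's HTTP API roughly as sketched below. The app-ids (`app2`, `app3`), the method name `process`, the component names (`statestore`, `pubsub`) and the topic `orders` are illustrative assumptions, not taken from the issue:

```python
# Sketch of App1's call flow through the Dapr sidecar's HTTP API on localhost.
DAPR_PORT = 3500
BASE = f"http://localhost:{DAPR_PORT}/v1.0"

def invoke_url(app_id: str, method: str) -> str:
    # Service invocation: steps 2-5 (App1 -> Dapr -> App2/App3)
    return f"{BASE}/invoke/{app_id}/method/{method}"

def state_url(store: str) -> str:
    # State store call: steps 6-7 (Redis behind the "statestore" component)
    return f"{BASE}/state/{store}"

def publish_url(pubsub: str, topic: str) -> str:
    # Pub/sub publish: steps 8-9 (Kafka behind the "pubsub" component)
    return f"{BASE}/publish/{pubsub}/{topic}"

# Ordered sequence of sidecar calls App1 makes for one user request:
flow = [
    invoke_url("app2", "process"),
    invoke_url("app3", "process"),
    state_url("statestore"),
    publish_url("pubsub", "orders"),
]
```

The failure described below happens when SIGTERM arrives partway through this sequence: the earlier calls succeed, the later ones are rejected.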

The failed scenario

If Dapr receives a SIGTERM in the middle of processing, the next invoke or component call will fail, because Dapr no longer accepts any external requests and tries to finalize only the last routine before shutting down.

For example (this is the issue we hit):

  • Steps (1) to (5) were successful
  • While step (6) was in flight, Dapr received a SIGTERM
  • Steps (6) and (7) were still successful
  • Step (8) then fails because Dapr no longer accepts the request, so the user's whole request automatically fails

We observe that a routine is still unfinished (the user's request is waiting for its response), but Dapr no longer accepts any requests after the SIGTERM. This is an issue because the application has not finished all of its processing with all components.

We have a workaround using --dapr-graceful-shutdown-seconds. The issues with this option are:

  • it relies on a timer we cannot quantify. Example: we can set a long duration, but then it prevents pods from stopping during a scale-down while they wait out the timer
  • we don't know how long external processing can take. Example: if we set dapr-graceful-shutdown-seconds < the external service's processing time, Dapr starts its graceful shutdown and blocks the next component call (i.e. pub/sub / Kafka)
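In Kubernetes, this workaround is set per pod via the sidecar annotation (a minimal sketch; the app-id and the 60-second value are illustrative, and `dapr.io/block-shutdown-duration`, discussed below, is configured the same way):

```yaml
# Pod template annotations for the Dapr sidecar injector
annotations:
  dapr.io/enabled: "true"
  dapr.io/app-id: "app1"
  dapr.io/graceful-shutdown-seconds: "60"
```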

Hi @elabedo, does --dapr-block-shutdown-duration do what you want? This will block until the app reports as unhealthy which you can control.

https://docs.dapr.io/reference/arguments-annotations-overview/

Hi @JoshVanL, I think using --dapr-block-shutdown-duration is more a workaround than the right solution, for the reasons I described above about this kind of option. In practice, we cannot predict what duration to set.

For example: if we set --dapr-block-shutdown-duration=20 and the last request finishes processing earlier (e.g. in 500 ms), then we prevent the pod from scaling down because we have to wait out the remaining 19.5 seconds of the configured duration. Do you confirm that, @JoshVanL?

@elabedo the given duration is an upper bound. Dapr will also stop blocking as soon as the app reports as unhealthy, and your application has the runtime context to decide when to do that, i.e. once all messages have been processed.
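The pattern described here can be sketched as follows: the app flips its health endpoint to unhealthy once a shutdown was requested and all in-flight work has drained, so Dapr stops blocking immediately instead of waiting out the full duration. This is a sketch under assumptions: app health checks must be enabled on the sidecar, and the `/healthz` path and the in-flight counter are illustrative:

```python
# Sketch: report unhealthy to Dapr's app health check once draining is done.
import threading
from http.server import BaseHTTPRequestHandler

shutting_down = threading.Event()  # set by the app's own SIGTERM handler
in_flight = 0                      # incremented/decremented around each request
lock = threading.Lock()

def health_status(draining: bool, pending: int) -> int:
    # Healthy (200) while work remains or no shutdown was requested;
    # unhealthy (503) once a shutdown was requested and nothing is pending,
    # which tells Dapr to stop blocking its own shutdown.
    return 503 if draining and pending == 0 else 200

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            with lock:
                code = health_status(shutting_down.is_set(), in_flight)
            self.send_response(code)
            self.end_headers()
```

With this in place, --dapr-block-shutdown-duration acts purely as the upper bound: the pod is released as soon as the app turns unhealthy.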

Hi @JoshVanL, thank you for your answer. Could you highlight in the documentation of --dapr-block-shutdown-duration that this value is the "upper bound"?

@elabedo it is implied via the statement “from starting until the given duration has elapsed or the application becomes unhealthy”
https://docs.dapr.io/reference/arguments-annotations-overview/

I must admit that I remember the verbiage being a lot more explicit in conveying the “upper bound” behaviour, though I was thinking of the release notes.