dapr / dapr

Dapr is a portable, event-driven, runtime for building distributed applications across cloud and edge.

Home Page: https://dapr.io


Graceful shutdown with multiple components in the runtime fails

elabedo opened this issue · comments

In what area(s)?

/area runtime

What version of Dapr?

1.13.x

Expected Behavior

After receiving SIGTERM, Dapr should continue to accept requests coming from the main application only, and wait until all in-flight routines have completed before shutting down.

Actual Behavior

Suppose an application needs to communicate with multiple components (Redis + Kafka) and invoke other services (Application 2 and Application 3):
[image: architecture diagram]

The success scenario

The picture makes it clearer, I guess, but let's describe it:

  1. The user invokes the application over HTTP through Dapr and waits for the final result
  2. The main application (App1 in the picture) invokes external services (either another Dapr invoke or a call to an external service)
  3. Dapr invokes Application 2, which replies with a response
  4. Same as App 2
  5. Same as App 2
  6. The main application calls Dapr for the state store (Redis in this use case)
  7. Dapr invokes Redis and replies with the results
  8. The main application calls Dapr to produce pub/sub data (Kafka in this use case)
  9. Dapr invokes Kafka and replies with the status
  10. Global reply to the user with success 👍
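For reference, the sidecar calls in the steps above map onto Dapr's HTTP API roughly as sketched below. The app-ids (`app2`, `app3`), the method name `process`, the component names (`statestore`, `pubsub`) and the topic `orders` are illustrative assumptions, not taken from the issue:

```python
# Sketch of App1's call flow through the Dapr sidecar's HTTP API on localhost.
DAPR_PORT = 3500
BASE = f"http://localhost:{DAPR_PORT}/v1.0"

def invoke_url(app_id: str, method: str) -> str:
    # Service invocation: steps 2-5 (App1 -> Dapr -> App2/App3)
    return f"{BASE}/invoke/{app_id}/method/{method}"

def state_url(store: str) -> str:
    # State store call: steps 6-7 (Redis behind the "statestore" component)
    return f"{BASE}/state/{store}"

def publish_url(pubsub: str, topic: str) -> str:
    # Pub/sub publish: steps 8-9 (Kafka behind the "pubsub" component)
    return f"{BASE}/publish/{pubsub}/{topic}"

# Ordered sequence of sidecar calls App1 makes for one user request:
flow = [
    invoke_url("app2", "process"),
    invoke_url("app3", "process"),
    state_url("statestore"),
    publish_url("pubsub", "orders"),
]
```

The failure described below happens when SIGTERM arrives partway through this sequence: the earlier calls succeed, the later ones are rejected.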

The failed scenario

If Dapr receives a SIGTERM in the middle of processing, the next invoke or component call will fail, because Dapr no longer accepts any external requests and tries to finalize only the last routine before shutting down.

For example (this is the issue we hit):

  • Steps (1) to (5) were successful
  • While step (6) was in flight, Dapr received a SIGTERM
  • Steps (6) and (7) were still successful
  • Step (8) then fails because Dapr no longer accepts the request, so the user's whole request automatically fails

We observe that a routine is still unfinished (the user's request is waiting for its response), but Dapr no longer accepts any requests after the SIGTERM. This is an issue because the application has not finished all of its processing with all components.

We have a workaround using --dapr-graceful-shutdown-seconds. The issues with this option are:

  • it relies on a timer we cannot quantify. Example: we can set a long duration, but then it prevents pods from stopping during a scale-down while they wait out the timer
  • we don't know how long external processing can take. Example: if we set dapr-graceful-shutdown-seconds < the external service's processing time, Dapr starts its graceful shutdown and blocks the next component call (i.e. pub/sub / Kafka)
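In Kubernetes, this workaround is set per pod via the sidecar annotation (a minimal sketch; the app-id and the 60-second value are illustrative, and `dapr.io/block-shutdown-duration`, discussed below, is configured the same way):

```yaml
# Pod template annotations for the Dapr sidecar injector
annotations:
  dapr.io/enabled: "true"
  dapr.io/app-id: "app1"
  dapr.io/graceful-shutdown-seconds: "60"
```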

Hi @elabedo, does --dapr-block-shutdown-duration do what you want? This will block until the app reports as unhealthy which you can control.

https://docs.dapr.io/reference/arguments-annotations-overview/

Hi @JoshVanL, I think using --dapr-block-shutdown-duration is more a workaround than the right solution, for the reasons I described above about this kind of option. In practice, we cannot predict what duration to set.

For example: if we set --dapr-block-shutdown-duration=20 and the last request finishes processing earlier (e.g. in 500 ms), then we prevent the pod from scaling down because we have to wait out the remaining 19.5 seconds of the configured duration. Do you confirm that, @JoshVanL?

@elabedo the given duration is an upper bound. Dapr will also stop blocking as soon as the app reports as unhealthy, and your application has the runtime context to decide when to do that, i.e. once all messages have been processed.
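The pattern described here can be sketched as follows: the app flips its health endpoint to unhealthy once a shutdown was requested and all in-flight work has drained, so Dapr stops blocking immediately instead of waiting out the full duration. This is a sketch under assumptions: app health checks must be enabled on the sidecar, and the `/healthz` path and the in-flight counter are illustrative:

```python
# Sketch: report unhealthy to Dapr's app health check once draining is done.
import threading
from http.server import BaseHTTPRequestHandler

shutting_down = threading.Event()  # set by the app's own SIGTERM handler
in_flight = 0                      # incremented/decremented around each request
lock = threading.Lock()

def health_status(draining: bool, pending: int) -> int:
    # Healthy (200) while work remains or no shutdown was requested;
    # unhealthy (503) once a shutdown was requested and nothing is pending,
    # which tells Dapr to stop blocking its own shutdown.
    return 503 if draining and pending == 0 else 200

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            with lock:
                code = health_status(shutting_down.is_set(), in_flight)
            self.send_response(code)
            self.end_headers()
```

With this in place, --dapr-block-shutdown-duration acts purely as the upper bound: the pod is released as soon as the app turns unhealthy.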

Hi @JoshVanL, thank you for your answer. Could you highlight in the documentation of --dapr-block-shutdown-duration that this value is the "upper bound"?

@elabedo it is implied via the statement “from starting until the given duration has elapsed or the application becomes unhealthy”
https://docs.dapr.io/reference/arguments-annotations-overview/

I must admit that I remember the verbiage being a lot more explicit in conveying the “upper bound” behaviour, though I was thinking of the release notes.