[Feature request] Dead letter queue for NATS
embano1 opened this issue
My actions before raising this issue
- Followed the troubleshooting guide
- Read/searched the docs
- Searched past issues
With async invocation there seems to be no way to tell whether the invocation eventually succeeded. Failures could be caused by API issues, functions being deleted or not accepting connections (SIGTERM), event payload issues causing exceptions, or simple application logic bugs within the function.
For async invocation this is usually handled with a dead letter queue (DLQ). I could not find any mention of DLQ support in OpenFaaS/NATS (STAN). How is this dealt with today? Is it a concern at all? Does STAN automatically redrive failed invocations? If so, how many times does it retry before giving up?
Expected Behaviour
Failures during async function invocation should be trackable, if possible using a DLQ where events can be inspected and potentially redriven.
Current Behaviour
Tested async invocation via faas-cli and a connector using connector-sdk where the subscribed function does not exist (anymore). There was no error reported leaving the caller believing that the invocation would eventually succeed (even though 202 technically does not give a guarantee, so introspection capabilities would be generally useful in a 202 setup).
A workaround seems to be to provide callbacks where the error status can be inspected. I am not sure whether this is always possible (e.g. from the CLI) or desired.
For details, see openfaas/faas#1298
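To make the callback workaround concrete, here is a minimal sketch of an async invocation that asks for a callback. It assumes the gateway's `/async-function/<name>` route and the `X-Callback-Url` request header described in the OpenFaaS async documentation. The gateway address, function name and callback URL are placeholders, and `build_async_request` is a hypothetical helper, not part of any OpenFaaS client.

```python
import urllib.request


def build_async_request(
    gateway: str, function: str, payload: bytes, callback_url: str
) -> urllib.request.Request:
    """Build (but do not send) an async invocation that requests a callback."""
    url = f"{gateway.rstrip('/')}/async-function/{function}"
    return urllib.request.Request(
        url,
        data=payload,
        # Assumption: the result of the queued invocation is POSTed here.
        headers={"X-Callback-Url": callback_url},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values; adjust to your own gateway and callback receiver.
    req = build_async_request(
        "http://127.0.0.1:8080",
        "my-function",
        b'{"hello": "world"}',
        "http://callback-receiver.example.com/result",
    )
    with urllib.request.urlopen(req) as resp:
        # The status only says the request was queued, not that it succeeded.
        print(resp.status)
```

The callback receiver then becomes the only place where a failed invocation is visible, which is why this is a workaround rather than a replacement for a DLQ.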
Possible Solution
Implement a DLQ capability. Are there already metrics exposed for failed async function invocations?
Steps to Reproduce (for bugs)
Simply call faas-cli -a (or curl) on a non-existing function.
Context
I sense potential consistency issues (no error is reported even though the function was not executed at all), which lead to hard-to-debug problems. Also, malformed payloads and application logic bugs could be hidden by the current implementation (if my understanding of the issue is correct and complete).
/set title: [Feature request] Dead letter queue for NATS
NATS does not provide a DLQ. I spent some time looking into building a DLQ when building colorisebot, but it's complicated. If the upstream API is failing due to rate-limiting, then retrying N times without an appropriate back-off is counter-productive.
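As a rough illustration of the back-off concern, the sketch below retries a delivery with exponentially growing delays and parks the event in a dead-letter list once the attempts are exhausted. It is not how OpenFaaS or NATS Streaming behaves today. `deliver_with_backoff`, its parameters and the in-memory `dead_letters` list are all hypothetical names chosen for this example.

```python
import time
from typing import Any, Callable, List


def deliver_with_backoff(
    invoke: Callable[[Any], None],
    event: Any,
    dead_letters: List[Any],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to deliver `event`; return True on success, False if dead-lettered."""
    for attempt in range(max_attempts):
        try:
            invoke(event)
            return True
        except Exception:
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between attempts,
            # but do not sleep after the final failed attempt.
            if attempt < max_attempts - 1:
                sleep(base_delay * (2 ** attempt))
    # All attempts failed: keep the event so it can be inspected or redriven.
    dead_letters.append(event)
    return False
```

A real implementation would likely also need jitter, a maximum delay, and a way to distinguish retryable errors (rate limiting) from permanent ones (function deleted), which is part of what makes this complicated.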