System is able to "drop" erring messages without user interference

Question

bsieber-mozilla opened this issue 10 months ago · comments

Yeah, I wonder if part of the answer here is to remove some of the burden from the heartbeat / SRE by making the system either more self-healing or self-repairable by customers. -- #343 (comment)

Along with additional validation features, like config-validation prior to merge, the goal would be for errors to not "halt" processing for any users.

When an error causes a webhook (WH) queue to stop sending messages, the system depends on an admin to drop the erring message.

Dropping the message could be automated. When a Bugzilla request is received and processing results in a failure or error, that error should be caught. Once it is caught, if the number of webhook errors (from the webhook API) is greater than X (0?), then the message could be "dropped". This would align with the current human workflow.

We could take this further by making the action transparent, via a Slack message to the channel with the associated bug ID.
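
A minimal sketch of the drop decision described above. Every name here (`handle_request`, `Outcome`, `queued_errors`, `notify`, `ERROR_THRESHOLD`) is hypothetical and not taken from the actual codebase; `queued_errors` stands in for whatever count the webhook API reports, and `notify` stands in for the Slack message.

```python
from dataclasses import dataclass
from typing import Callable

ERROR_THRESHOLD = 0  # the "X" above; 0 is only the value floated in this issue


@dataclass
class Outcome:
    bug_id: int
    dropped: bool
    error: str = ""


def handle_request(
    bug_id: int,
    process: Callable[[int], None],
    queued_errors: int,
    notify: Callable[[str], None],
    threshold: int = ERROR_THRESHOLD,
) -> Outcome:
    """Process one webhook request; drop it if it fails while the queue is already erroring."""
    try:
        process(bug_id)
    except Exception as exc:  # any processing failure is caught here
        if queued_errors > threshold:
            # Make the drop visible, e.g. by posting to a channel (assumed callback).
            notify(f"Dropped erring webhook message for bug {bug_id}: {exc}")
            return Outcome(bug_id, dropped=True, error=str(exc))
        raise  # at or below the threshold, surface the error as before
    return Outcome(bug_id, dropped=False)
```

Passing `notify` in as a callback keeps the decision logic testable without a real Slack client.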

grahamalama · Answer 1 · Tue Oct 17 2023 00:34:01 GMT+0800 (China Standard Time)

As we discussed on Zoom:

In addition to dropping messages, we could (should?) also try to find a way to drop issues that are blocking the queue through the Bugzilla API. This endpoint doesn't currently exist -- we'd have to ask the Bugzilla devs to help us out.

bsieber · Answer 2 · Tue Nov 28 2023 05:26:28 GMT+0800 (China Standard Time)

As we discussed on Zoom:

In addition to dropping messages, we could (should?) also try to find a way to drop issues that are blocking the queue through the Bugzilla API. This endpoint doesn't currently exist -- we'd have to ask the Bugzilla devs to help us out.

To "drop" a message could be to send a 200 back to Bugzilla when the number of errors reported by the webhook queue API is greater than 0 and processing the request ends in an error.

Perhaps it's a bit odd to say 200/Success for an erring message, but I see this 200 as more of a "we've processed this message in a way that's aligned with how our system should process it". Any side effects of the request not being synced (error messages on the Bugzilla bug) could be handled as a last step in the processing?
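
As a rough illustration of returning 200 for a dropped message, here is a sketch that maps the processing result to a status code. The names `respond` and `report_on_bug` are hypothetical, and the 500 for the non-dropped case is only an assumption about how failures are reported otherwise.

```python
from http import HTTPStatus
from typing import Callable, Dict, Tuple


def respond(
    payload: Dict,
    process: Callable[[Dict], None],
    queued_errors: int,
    report_on_bug: Callable[[Dict, str], None],
) -> Tuple[int, Dict]:
    """Return (status, body) for one webhook request."""
    try:
        process(payload)
    except Exception as exc:
        if queued_errors > 0:
            # Last step: record the side effect of not syncing,
            # e.g. an error message on the bug (assumed callback).
            report_on_bug(payload, str(exc))
            # 200 here means "handled as designed", not "synced successfully".
            return HTTPStatus.OK.value, {"status": "dropped", "error": str(exc)}
        # Assumption: without a drop, the failure is reported to the sender.
        return HTTPStatus.INTERNAL_SERVER_ERROR.value, {"status": "error", "error": str(exc)}
    return HTTPStatus.OK.value, {"status": "synced"}
```

The response body still distinguishes "dropped" from "synced", so the 200 acknowledges receipt without hiding what happened.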

In terms of using the Bugzilla API, it would be nice to have the features mentioned:

drop last/latest request
read error for last/latest request
enable/disable webhook
...
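
Since none of these endpoints exist yet, the following is purely a hypothetical interface for the three named features, with an in-memory stand-in to show how a caller might use it. The class and method names, and the `webhook_id` parameter, are all assumptions and not part of any real Bugzilla API.

```python
from typing import List, Optional, Protocol


class WebhookQueueAdmin(Protocol):
    """Hypothetical operations; no such Bugzilla endpoints exist today."""

    def drop_latest_request(self, webhook_id: int) -> None: ...
    def read_latest_error(self, webhook_id: int) -> Optional[str]: ...
    def set_enabled(self, webhook_id: int, enabled: bool) -> None: ...


class InMemoryWebhookQueue:
    """Stand-in used only to exercise the interface above."""

    def __init__(self, errors: List[str]) -> None:
        self.errors = list(errors)  # oldest first, latest last
        self.enabled = True

    def drop_latest_request(self, webhook_id: int) -> None:
        if self.errors:
            self.errors.pop()

    def read_latest_error(self, webhook_id: int) -> Optional[str]:
        return self.errors[-1] if self.errors else None

    def set_enabled(self, webhook_id: int, enabled: bool) -> None:
        self.enabled = enabled


def unblock(admin: WebhookQueueAdmin, webhook_id: int) -> Optional[str]:
    """Read the latest error, then drop the request that caused it."""
    error = admin.read_latest_error(webhook_id)
    if error is not None:
        admin.drop_latest_request(webhook_id)
    return error
```

Reading the error before dropping lets the caller log or report it, which fits the transparency goal raised earlier in the thread.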

grahamalama · Answer 3 · Tue Nov 28 2023 07:48:41 GMT+0800 (China Standard Time)

To "drop" a message could be to send a 200 back to Bugzilla when the number of errors reported by the webhook queue API is greater than 0 and processing the request ends in an error.

That's true, and it's a good defensive option. It doesn't completely save us, but it would be a good addition alongside the ability to clear the queue through the API.

Perhaps it's a bit odd to say 200/Success for an erring message, but I see this 200 as more of a "we've processed this message in a way that's aligned with how our system should process it".

I agree, and in fact I think this is the recommended approach when using webhooks. A 2xx is more of an ack than anything else.