Losing messages during shutdown

Question

Losing messages during shutdown

joladev opened this issue 2 years ago · comments

Hello! We've been using this library in production for a while now and we noticed a not so ideal behavior during shutdowns. I'll try to describe what we believe is happening. The prepare_for_draining callback happens when Broadway is shutting down, Broadway does a GenServer.cast to the Producer process. This message ends up just sitting in the process mailbox though, as the process is busy waiting for the long polling request to GCP PubSub, with a timeout of infinity. Meanwhile the rest of the topology shuts down. When the producer receives messages from PubSub there is no one left to process them and they are lost (until ack deadline is reached and they reappear in the subscription). If it doesn't receive any messages it just blocks until it hits the graceful timeout deadline and is killed.

A potential approach would be to use a separate process to block on the long polling request and pass a message back to the producer when it finishes. Could be done with Tasks. This would leave the producer free to handle shutdowns and terminate the long polling request.

I would make an attempt at a PR but the repo is in a weird state, it seems main diverged from what's on Hex a long time ago? Not sure where I should branch off of

José Valim · Answer 1 · Tue Oct 11 2022 00:46:24 GMT+0800 (China Standard Time)

We migrated from Tesla, so master/main is the way to go. So first step is to investigate if master/main addresses this issue.

What you propose is a great option. Another option is to set timeout to not infinity and use a call instead of cast.

Johanna Larsson · Answer 2 · Wed Oct 26 2022 21:47:31 GMT+0800 (China Standard Time)

Just wanted to share this. After switching to this ref in production we're no longer losing messages in production (or at least, now there are so few lost that it doesn't register) and we were able to go back down to 1 topology 1 producer, handling the same number of messages per second. It's a huuuge improvement for us