`Message.ack_immediately/1` should return a list of messages which could not be acknowledged successfully
joeLepper opened this issue
Background
We have a use case where a consumer calls `Message.ack_immediately/1` as soon as it pulls a message from a queue to ensure that no other consumers handle it. We would rather not handle a message than double-handle a message, and therefore take pains to ensure that we `ack` as soon as possible.
Recently we have noticed that some messages do get double-handled. Digging into this, we realized that when `Message.ack_immediately/1` gets called it does not ensure that the `ack` was successful, nor does it alert our consumer that the `ack` was unsuccessful. Rather, it ignores any failures or errors in acknowledging receipt of a message.
The cases that we have observed in our logging indicate that the process that `broadway_rabbitmq` is calling to `ack` the `message` is no longer alive.
"Could not ack/reject message: ** (exit) exited in: :gen_server.call(#PID<0.14927.0>, {:call, {:\"basic.ack\", 3, false}, :none, #PID<0.3840.0>}, 60000) -- ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started -- (stdlib 3.11.1) gen_server.erl:223: :gen_server.call/3 -- (amqp 1.6.0) lib/amqp/basic.ex:135: AMQP.Basic.ack/3 -- (broadway_rabbitmq 0.6.2) lib/broadway_rabbitmq/producer.ex:440: anonymous fn/3 in BroadwayRabbitMQ.Producer.ack_messages/3 -- (elixir 1.10.2) lib/enum.ex:783: Enum.\"-each/2-lists^foreach/1-0-\"/2 -- (elixir 1.10.2) lib/enum.ex:783: En"
This log entry originates here
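To make the failure mode concrete, here is a minimal sketch of how a swallowed exit hides a failed ack from the caller. The module `SwallowedAckSketch` and the `:ack` call are hypothetical stand-ins, not the actual `broadway_rabbitmq` code; only `GenServer.call/3`, `Logger` and `Exception.format/2` are real APIs.

```elixir
defmodule SwallowedAckSketch do
  # Hypothetical stand-in for a connector's ack loop (an assumption, not library code).
  require Logger

  # Calls a process that may be dead, logs any failure, and returns :ok regardless.
  def ack_and_swallow(pid) do
    try do
      GenServer.call(pid, :ack, 1_000)
    catch
      kind, reason ->
        Logger.error("Could not ack/reject message: " <> Exception.format(kind, reason))
    end

    :ok
  end
end

# A process that has already exited plays the role of the dead channel process.
dead_pid = spawn(fn -> :ok end)
ref = Process.monitor(dead_pid)

receive do
  {:DOWN, ^ref, :process, _pid, _reason} -> :ok
end

# The caller sees :ok even though nothing was acknowledged.
result = SwallowedAckSketch.ack_and_swallow(dead_pid)
```

The exit is logged, but `result` carries no trace of it, which matches the symptom described above.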
Details
- `Message.ack_immediately/1` calls `Acknowledger.ack_messages/2` but does nothing with its return value: broadway/lib/broadway/message.ex, Line 151 in f1cc5e3
- `Acknowledger.ack_messages/2` returns `nil`, so there's actually no way for `Message.ack_immediately/1` to understand which messages may not have been successfully acknowledged: broadway/lib/broadway/acknowledger.ex, Lines 62 to 67 in f1cc5e3
- Fortunately `Acknowledger.call_ack/2` does return the value that is produced by calling `acknowledger.ack/3` here: broadway/lib/broadway/acknowledger.ex, Lines 78 to 82 in f1cc5e3
- `acknowledger.ack/3` is a call into the underlying connector implementation (we are using rabbit, so I'll provide examples from that lib, but I believe this symptom exists in all of the connectors): https://github.com/dashbitco/broadway_rabbitmq/blob/7a9d618e536e91e9b6a30c3e624504b7181265ea/lib/broadway_rabbitmq/producer.ex#L382-L386
- inside `broadway_rabbitmq`'s `Producer.ack_messages/3`, errors in acknowledging the message are caught and logged, but not passed back up to the caller: https://github.com/dashbitco/broadway_rabbitmq/blob/7a9d618e536e91e9b6a30c3e624504b7181265ea/lib/broadway_rabbitmq/producer.ex#L444-L448
- finally (and this is probably a bug with `broadway_rabbitmq` which will be exposed by undertaking our proposed fix) `Producer.apply_ack_func/3` calls into the `amqp` library, which returns either `:ok` or `{:error, error}`; these errors are also not getting passed back up to the caller
Proposal
Consumers which are calling `Message.ack_immediately/1` are doing so because they need their `message` to definitely be acknowledged before processing it. In our case, if the `message` cannot be successfully acknowledged, we would rather drop it than process it. Therefore, Broadway should return a list of messages which could not be successfully acknowledged.
Sort messages handled by `BroadwayRabbitMQ.Producer.ack_messages/3` into successful and unsuccessful groups
Replace `Enum.each` in `BroadwayRabbitMQ.Producer.ack_messages/3` with `Enum.reduce`, sorting the messages which have been successfully acknowledged from those which were not into a map like the following:
%{
successes: [...messages...],
failures: [...messages...]
}
This will involve both checking the return value from `apply_ack_func` to see if it is `:ok` or `{:error, error}` and counting any messages which end up in the `catch` block as failures.
Return this map from `BroadwayRabbitMQ.Producer.ack_messages/3`.
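The steps above can be sketched as follows. This is only an illustration under stated assumptions: `AckPartitionSketch` and `ack_fun` are hypothetical names, with `ack_fun` standing in for `apply_ack_func` and assumed to return `:ok` or `{:error, reason}`, or to raise, throw or exit.

```elixir
defmodule AckPartitionSketch do
  # Hypothetical module; `ack_fun` is an assumed stand-in for apply_ack_func.
  # Any return value other than :ok or {:error, reason} would raise a CaseClauseError.
  def ack_messages(messages, ack_fun) do
    acc =
      Enum.reduce(messages, %{successes: [], failures: []}, fn message, acc ->
        case safe_ack(message, ack_fun) do
          :ok -> %{acc | successes: [message | acc.successes]}
          {:error, _reason} -> %{acc | failures: [message | acc.failures]}
        end
      end)

    # Messages were prepended, so reverse to restore the original order.
    %{successes: Enum.reverse(acc.successes), failures: Enum.reverse(acc.failures)}
  end

  # Converts raises, throws and exits into an error tuple instead of swallowing them.
  defp safe_ack(message, ack_fun) do
    try do
      ack_fun.(message)
    catch
      kind, reason -> {:error, {kind, reason}}
    end
  end
end

# Fake ack function: tag 2 returns an error, tag 3 exits, everything else succeeds.
ack_fun = fn
  %{tag: 2} -> {:error, :closing}
  %{tag: 3} -> exit(:noproc)
  _message -> :ok
end

messages = [%{tag: 1}, %{tag: 2}, %{tag: 3}, %{tag: 4}]
result = AckPartitionSketch.ack_messages(messages, ack_fun)
```

Both an `{:error, reason}` return and a caught exit land in `failures`, so the caller no longer has to rely on the logs to learn what happened.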
Return an acknowledgement status for each message passed to `Broadway.Acknowledger.ack_messages/2`
Replace the `Enum.each` in `BroadwayRabbitMQ.Producer.ack_messages/3` with an `Enum.reduce` which merges the maps returned from `Broadway.Acknowledger.ack_messages/2` together, and return it.
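A minimal sketch of the merging step, assuming each acknowledger call returns a map with `successes` and `failures` lists as proposed above. `AckMergeSketch` and `merge_results/1` are hypothetical names used only for illustration.

```elixir
defmodule AckMergeSketch do
  # Hypothetical module; the map shape follows the proposal, not an existing API.
  @empty %{successes: [], failures: []}

  # `results` is a list of per-acknowledger status maps.
  def merge_results(results) do
    Enum.reduce(results, @empty, fn result, acc ->
      # Both maps share the same keys, so concatenate the lists key by key.
      Map.merge(acc, result, fn _key, left, right -> left ++ right end)
    end)
  end
end

merged =
  AckMergeSketch.merge_results([
    %{successes: [:a, :b], failures: []},
    %{successes: [:c], failures: [:d]}
  ])
```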
Return the acknowledgement status for each message passed to Broadway.Message.ack_immediately/1
Stop ignoring the return value of `Broadway.Acknowledger.ack_messages/2` (because it is no longer always `nil`) and pass that to the caller.
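From the consumer's side, the proposed return value might be used roughly as in the sketch below. Everything here is hypothetical: `DropUnackedSketch`, `handle/3` and `fake_ack` only model an `ack_immediately`-style function that returns the proposed status map. Broadway does not currently return one.

```elixir
defmodule DropUnackedSketch do
  # `ack_fun` models a hypothetical ack function returning the proposed status map.
  def handle(messages, ack_fun, process_fun) do
    %{successes: acked, failures: unacked} = ack_fun.(messages)

    # Only process what was definitely acknowledged; hand back the rest as dropped.
    processed = Enum.map(acked, process_fun)
    {processed, unacked}
  end
end

# Fake ack function: a message counts as acknowledged when its :acked field is true.
fake_ack = fn messages ->
  {acked, unacked} = Enum.split_with(messages, fn message -> message.acked end)
  %{successes: acked, failures: unacked}
end

messages = [%{id: 1, acked: true}, %{id: 2, acked: false}]

{processed, dropped} =
  DropUnackedSketch.handle(messages, fake_ack, fn message -> message.id * 10 end)
```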
Repeat this process for Broadway's other connectors
The other Broadway connectors will need to have their `Producer.ack_messages/3` function updated to return this acknowledgement status map.
Conclusion
We are happy to submit fixes as outlined in this issue (or a different approach which might come out of conversation here). I'm going to craft a draft so that there is something a bit more concrete to pick at as a straw man.
Hi @joeLepper, thanks for the detailed wrap-up.
> Digging into this we realized that when `Message.ack_immediately/1` gets called it does not ensure that the ack was successful, nor does it alert our consumer that the ack was unsuccessful.
This is not supposed to happen. The `ack` callback should fail if it cannot acknowledge a message. My suggestion is to make sure the RabbitMQ driver is raising in these scenarios, which will surface it enough for you to pick it up.
@josevalim so you advise going with an exception if broadway_rabbitmq can't ack the message? I think that's the quickest path to success here, but maybe by returning whether the ack callback succeeded or not we're not gonna paint ourselves in a corner for the future if we want to do it at some point, since the acker is a behavior IIRC.
we already try/catch in the other places, so broadway_rabbitmq should definitely raise if it failed. I agree this is not ideal but this is a non-breaking change we can do right now.
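For illustration only, the raising approach discussed here might look roughly like the sketch below. `RaisingAckSketch`, `ack!/2` and `ack_fun` are hypothetical names, and this is not the change that was released in `broadway_rabbitmq`; it assumes the underlying ack call returns `:ok` or `{:error, reason}`.

```elixir
defmodule RaisingAckSketch do
  # Hypothetical helper: turns an error tuple from the ack call into an exception.
  def ack!(message, ack_fun) do
    case ack_fun.(message) do
      :ok ->
        :ok

      {:error, reason} ->
        raise "could not ack message #{inspect(message)}: #{inspect(reason)}"
    end
  end
end

# A caller that wants to drop unacknowledged messages can rescue the failure.
outcome =
  try do
    RaisingAckSketch.ack!(%{tag: 7}, fn _message -> {:error, :closing} end)
  rescue
    error in RuntimeError -> {:ack_failed, Exception.message(error)}
  end
```

Raising keeps the acknowledger callback's return type unchanged, which is why it can be described as non-breaking.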
Alright, that makes sense, we'll get on it :)
@josevalim we fixed broadway_rabbitmq and released v0.6.3. Do we need to do anything here in Broadway or in other drivers, or can we close this?