Potential deadlock in managePublishQueue / Consider how `paho.Publish` should handle errors
MattBrittan opened this issue · comments
Describe the bug
In autopaho
managePublishQueue
we call PublishWithOptions
and if an error is returned wait on the connection to drop before trying again (so we assume that if PublishWithOptions
returns an error then it's fatal). This is not a valid assumption because publish may return at other times; for instance if c.PacketTimeout
is exceeded.
Solving this is going to require a bit of thought and modifications to paho
; consider publishQoS12
:
func (c *Client) publishQoS12(ctx context.Context, pb *packets.Publish, o PublishOptions) (*PublishResponse, error) {
c.debug.Println("sending QoS12 message")
pubCtx, cf := context.WithTimeout(ctx, c.PacketTimeout)
defer cf()
ret := make(chan packets.ControlPacket, 1)
if err := c.Session.AddToSession(pubCtx, pb, ret); err != nil {
return nil, err
}
So if we are attempting to send a lot of messages and c.Session.AddToSession
blocks for, by default, 10 seconds then publishQoS12
returns an error without actually making an attempt to send the message (due to c.PacketTimeout
). This makes it difficult for the caller to know what to do when it receives an error (should it retry or should it expect the connection to drop due to a protocol error).
I think there is a quick fix and a longer term project here:
- Quick fix - Autopaho should retry publishing from the queue if
PublishWithOptions
returns an error (checking the connection status before doing so) - Longer term - Modify Publish such that it's clear if the error is fatal (connection will drop) or can be retried. It would also be good to clarify the meaning of
c.PacketTimeout
(personally I don't like the way it's used inPublish
; the function already takes aContext
so adding an additional timeout is a bit confusing).
Software used:
The fix I have just merged should prevent this from being an issue with autopaho. However it would be good to review this code further and see if there is a better way of handling things (this might include looking at what to do if the client gets stalled due
to receive maximum depletion - should we drop the connection if this happens?).