Potential deadlock in managePublishQueue / Consider how `paho.Publish` should handle errors

MattBrittan opened this issue 7 months ago · comments

Describe the bug

In autopaho, managePublishQueue calls PublishWithOptions and, if an error is returned, waits for the connection to drop before trying again (that is, it assumes that any error returned by PublishWithOptions is fatal). This is not a valid assumption because PublishWithOptions may also return an error in other situations, for instance when c.PacketTimeout is exceeded. If the connection stays up after such an error, the queue is never processed again.

Solving this is going to require a bit of thought and modifications to paho; consider publishQoS12:

func (c *Client) publishQoS12(ctx context.Context, pb *packets.Publish, o PublishOptions) (*PublishResponse, error) {
	c.debug.Println("sending QoS12 message")
	pubCtx, cf := context.WithTimeout(ctx, c.PacketTimeout)
	defer cf()

	ret := make(chan packets.ControlPacket, 1)
	if err := c.Session.AddToSession(pubCtx, pb, ret); err != nil {
		return nil, err
	}

So if we are attempting to send a lot of messages and c.Session.AddToSession blocks for longer than c.PacketTimeout (10 seconds by default), publishQoS12 returns an error without ever attempting to send the message. This makes it difficult for the caller to know what to do when it receives an error: should it retry, or should it expect the connection to drop due to a protocol error?

I think there is a quick fix and a longer-term project here:

Quick fix - Autopaho should retry publishing from the queue if PublishWithOptions returns an error (checking the connection status before doing so).
Longer term - Modify Publish so that it is clear whether an error is fatal (the connection will drop) or can be retried. It would also be good to clarify the meaning of c.PacketTimeout (personally I don't like the way it's used in Publish; the function already takes a Context, so adding an additional timeout is a bit confusing).

Software used:

@master

Matt Brittan · Answer 1 · Tue Nov 14 2023 09:51:26 GMT+0800 (China Standard Time)

The fix I have just merged should prevent this from being an issue with autopaho. However, it would be good to review this code further and see if there is a better way of handling things (this might include looking at what to do if the client gets stalled due to Receive Maximum depletion - should we drop the connection if this happens?).