eclipse / paho.golang

Incorrect assumption about PUBLISH errors in autopaho can cause indefinite head-of-queue looping

vishnureddy17 opened this issue

In autopaho, managePublishQueue assumes that if paho.PublishWithOptions() returns an error that is not paho.ErrNetworkErrorAfterStored, the error is either temporary or the connection will drop:

if _, err = cli.PublishWithOptions(ctx, &pub2, paho.PublishOptions{Method: paho.PublishMethod_AsyncSend}); err != nil {
	c.errors.Printf("error publishing from queue: %s", err)
	if errors.Is(err, paho.ErrNetworkErrorAfterStored) { // Message in session so remove from queue
		if err := entry.Remove(); err != nil {
			c.errors.Printf("error removing queue entry: %s", err)
		}
	} else {
		if err := entry.Leave(); err != nil { // the message was not sent, so leave it in the queue
			c.errors.Printf("error leaving queue entry: %s", err)
		}
	}
	// The error might be fatal (connection will drop) or could be temporary (i.e. PacketTimeout exceeded)
	// as a result we currently retry unless we know the connection has dropped, or it's time to exit
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-connDown:
		continue connectionLoop
	default: // retry
		continue
	}
}

However, this is not always the case. The errors returned in the code snippet below by paho.PublishWithOptions() are neither temporary nor do they imply a pending disconnection:

paho.golang/paho/client.go

Lines 748 to 762 in a6def52

func (c *Client) PublishWithOptions(ctx context.Context, p *Publish, o PublishOptions) (*PublishResponse, error) {
	if p.QoS > c.serverProps.MaximumQoS {
		return nil, fmt.Errorf("cannot send Publish with QoS %d, server maximum QoS is %d", p.QoS, c.serverProps.MaximumQoS)
	}
	if p.Properties != nil && p.Properties.TopicAlias != nil {
		if c.serverProps.TopicAliasMaximum > 0 && *p.Properties.TopicAlias > c.serverProps.TopicAliasMaximum {
			return nil, fmt.Errorf("cannot send publish with TopicAlias %d, server topic alias maximum is %d", *p.Properties.TopicAlias, c.serverProps.TopicAliasMaximum)
		}
	}
	if !c.serverProps.RetainAvailable && p.Retain {
		return nil, fmt.Errorf("cannot send Publish with retain flag set, server does not support retained messages")
	}
	if (p.Properties == nil || p.Properties.TopicAlias == nil) && p.Topic == "" {
		return nil, fmt.Errorf("cannot send a publish with no TopicAlias and no Topic set")
	}

As a result, managePublishQueue will loop indefinitely on the first entry in the queue in these cases.

Proposed Solution
To solve this, I propose creating a custom error type called PahoArgumentError, which will be returned by paho.PublishWithOptions() in these cases (this error type will likely be relevant elsewhere in paho). Then autopaho can do a type assertion and quarantine any publish in the queue that returned PahoArgumentError.
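
A minimal sketch of how this could look (PahoArgumentError, handlePublishError, and the "quarantine" handling are hypothetical names used for illustration, not the existing paho/autopaho API):

package main

import (
	"errors"
	"fmt"
)

// PahoArgumentError would signal that the packet itself is invalid (e.g. QoS
// above the server maximum), so retrying the identical publish can never succeed.
type PahoArgumentError struct {
	Reason string
}

func (e *PahoArgumentError) Error() string {
	return "invalid argument: " + e.Reason
}

// handlePublishError shows how the queue loop could branch on the new error
// type: quarantine entries that can never be sent, retry everything else.
func handlePublishError(err error) string {
	var argErr *PahoArgumentError
	if errors.As(err, &argErr) {
		return "quarantine" // park the queue entry instead of retrying it forever
	}
	return "retry" // temporary error or pending disconnect; keep at head of queue
}

func main() {
	err := fmt.Errorf("publish failed: %w", &PahoArgumentError{Reason: "QoS 2 exceeds server maximum"})
	fmt.Println(handlePublishError(err)) // prints "quarantine"
}

Using errors.As (rather than a bare type assertion) also keeps the check working if higher layers wrap the error.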

Agreed; I did consider doing something similar when introducing this, but felt that it was OK for an initial PR (hard to know when to stop and submit the PR). My thought was something like paho.FatalError (indicating that it's pointless to retry), because this could also apply if the broker returns something like "Payload format invalid". Happy with either option (will try to get a bit more work done on this over the next couple of days).
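
For comparison, a rough sketch of the FatalError idea (the type name and shape are assumptions, not the merged implementation): a wrapping error would also let a broker reason code such as "Payload format invalid" be flagged as not worth retrying.

package main

import (
	"errors"
	"fmt"
)

// FatalError wraps any error for which a retry is pointless, whether it was
// detected client-side (bad arguments) or reported by the broker.
type FatalError struct {
	Err error
}

func (e *FatalError) Error() string { return "fatal: " + e.Err.Error() }
func (e *FatalError) Unwrap() error { return e.Err }

func main() {
	err := &FatalError{Err: errors.New("payload format invalid")}

	var fatal *FatalError
	if errors.As(err, &fatal) {
		fmt.Println("do not retry:", fatal) // prints "do not retry: fatal: payload format invalid"
	}
}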

That sounds good. I also think it's worth returning such an error from Subscribe and Unsubscribe, where it could be used to distinguish errors caused by a client failure/disconnection from errors caused by something fatally wrong with the subscribe/unsubscribe request.

I've had an initial go at this; went with your solution (it's probably useful to be able to differentiate between a range of "fatal errors"). Ideally this would be extended to look at broker response codes (but I don't think that is as urgent).

Closing due to #226; @MattBrittan feel free to reopen if you still think it's important to keep around.

Ideally this would be extended to look at broker response codes (but I don't think that is as urgent).

This is really only doable if #216 is addressed.

Also, even if it were possible, I don't think that autopaho should retry if the ack came back with an error reason code. What if we're close to the server's receive maximum, or we are about to run out of packet IDs? Retrying could then cause issues.