eclipse / paho.golang

Manual ACK's, session state, and connection reestablishment

MattBrittan opened this issue · comments

Whilst working on session state persistence I believe (happy to be proven wrong!) that I have come across a small issue with the design of the client API, so I am raising it for discussion. I'm not sure how frequently manual acknowledgements are used (@fracasula this may be one for you?). Note that this is not currently an issue with autopaho because it does not support manual acknowledgements (there is no way to call ACK).

While this is a fairly niche issue, I think it's important that we come up with a solution before v1 because I believe an API change will be required.

Consider the following situation:

  • Manual ACK enabled
  • QOS1+ message received and passed to client
  • Connection drops and is re-established
  • client.ACK() called (either on the old or the new client)

There are issues here regardless of whether the session state survives the reconnection:

  • If the session is deleted then we may be acknowledging the wrong message (the broker may have reused the ID in the interim)
  • If the session state remains then it's not really clear what action to take (the message may well have been retransmitted, and the library has no way of knowing whether the client will eventually acknowledge it, potentially leading to a duplicate ACK).

As mentioned above, the user may call ACK on the original Client or the new one; there are issues in both cases:

  • Client that dropped - The ACK cannot be sent (due to connection loss) and we cannot reliably update the session state (because the server may have retransmitted the message).
  • New Client - If the session was not retained (e.g. due to CleanSession) then sending the ACK may acknowledge a different message (as the server is free to reuse the ID).

I guess we could just say "this is the user's problem - they should use a WaitGroup to track handlers and only reconnect when they have all completed" (sketched below), but I'm not sure that is the best option...
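
For illustration, a minimal sketch of that WaitGroup approach (nothing below is paho API; the message type and the reconnect step are placeholders):

```go
package main

import (
	"fmt"
	"sync"
)

// message is a stand-in for an inbound QOS1+ publish; the real paho packet
// type is not used here, to keep the sketch self-contained.
type message struct{ id uint16 }

func main() {
	var inFlight sync.WaitGroup

	handle := func(m message) {
		inFlight.Add(1)
		go func() {
			defer inFlight.Done()
			// process the message and (manually) ack it here
			fmt.Println("handled and acked message", m.id)
		}()
	}

	handle(message{id: 1})
	handle(message{id: 2})

	// Connection has dropped: wait for every in-flight handler before
	// reconnecting, so no ack can be issued against the wrong connection.
	inFlight.Wait()
	fmt.Println("all handlers complete; safe to reconnect")
}
```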

I cannot see a good way of handling this at present because there is no reliable way of working out which connection the ACK relates to. A couple of ways this could be solved are:

  • Add a NoAck (or similar) function that indicates that the message will not be acknowledged. Then provide a way for the library to tell when all handlers have completed (i.e. either Ack or NoAck has been called for every QOS1+ publish message passed to the user) and only reconnect after that time.
  • Use the V3 client approach where the struct passed to Router.Route() implements Ack() and the user calls that instead of client.Ack(). This would enable the library to prevent the Ack from being sent if the session was cleared (not a perfect solution as it's not really safe to assume a QOS1 resend is a duplicate). A rough sketch of this option follows below.
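
For the second option, a hypothetical sketch of what the value handed to the router might look like (none of these names are paho's actual API):

```go
package sketch

// Publish stands in for paho's *Publish packet; it is defined here only to
// keep the sketch self-contained.
type Publish struct {
	PacketID uint16
	Topic    string
	Payload  []byte
}

// ReceivedPublish is what the router would hand to user code instead of a
// bare packet.
type ReceivedPublish interface {
	Packet() *Publish
	// Ack acknowledges this specific delivery. Because the value is bound to
	// the connection (and session) that delivered it, the library can refuse
	// to send an acknowledgement if that connection has since been replaced,
	// rather than acknowledging an unrelated packet ID on a new connection.
	Ack() error
}
```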

This is not a problem that impacts me personally (I don't use manual Acks) so I'm not intending to put much time into the issue and don't really have a preferred solution. Note that some of the above assumes that paho has a session state mechanism that can outlast a connection (i.e. a struct that is added to ClientConfig) - this is my current plan re managing session state.

Connection drops and is re-established

Since users have to bring their own connection, we don't deal with re-establishing connections at the paho level, right? In this case I think that the client is no longer in a usable state when the network connection drops, and users would have to create a new client with the same client ID and pass the session information to the new client to continue the session.

For the sake of this discussion, I'm going to assume that paho will require the user to create a new client to continue the session (if there's a reason why this shouldn't be the case, I'd be interested in hearing about it). In this case, the manual Acks sent on the previous client may have failed. Since the MQTT 5 spec requires the broker to redeliver the unacknowledged PUBLISHes, the new client will get those publishes again, and the user can ack those. If the session was not retained, then the Ack is lost.
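
To make that flow concrete, a hedged sketch assuming the planned session-state struct can be supplied to each new client via its config (all types below are placeholders rather than the real paho API, and the Session field is an assumption):

```go
package sketch

import "net"

// SessionState is a placeholder for whatever paho ends up persisting between
// connections (packet IDs, unacknowledged QOS1+/QOS2 state, ...).
type SessionState struct{}

// ClientConfig and Client are placeholders for the real paho types; the
// Session field reflects the ClientConfig plan mentioned above.
type ClientConfig struct {
	ClientID string
	Conn     net.Conn
	Session  *SessionState // outlives any individual client/connection
}

type Client struct{ cfg ClientConfig }

func NewClient(cfg ClientConfig) *Client { return &Client{cfg: cfg} }

func resumeExample() error {
	sess := &SessionState{} // created once, shared across connections

	conn1, err := net.Dial("tcp", "broker.example.com:1883")
	if err != nil {
		return err
	}
	_ = NewClient(ClientConfig{ClientID: "device-1", Conn: conn1, Session: sess})

	// ... the network connection drops; the first client is no longer usable ...

	conn2, err := net.Dial("tcp", "broker.example.com:1883")
	if err != nil {
		return err
	}
	_ = NewClient(ClientConfig{ClientID: "device-1", Conn: conn2, Session: sess})
	// The second client continues the session; the broker redelivers any
	// QOS1+ publishes that were never acknowledged on the first connection.
	return nil
}
```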

we don't deal with re-establishing connections at the paho level, right?

Correct, but we do need to consider how the end users (or autopaho) will work with paho to meet the requirements of the MQTT spec. When making changes to paho (e.g. persistence) I use autopaho as a test case; i.e. how will autopaho handle the receipt of a QOS2 message where the ACK comes after a reconnection. I do this on the basis that if the feature is usable in autopaho then it's probably OK for other use-cases too (users importing paho directly).

Session management gets tricky because the session state survives disconnection/reconnection, so its lifecycle differs from paho's.

the new client will get those publishes again, and the user can ack those.

This is exactly the problem. QOS2 = "Exactly once delivery"; if we redeliver a message that the client has already ACK'd then QOS2 becomes pretty much worthless. My point here is that an inbound QOS2 message should be delivered once; it should only be redelivered should the user's application make the explicit decision not to ACK the message.

After seeing the recent push for session persistence, I have attempted to migrate to autopaho and have run into this issue. I'm currently trying to assess a workaround but it's non-trivial. Our use case is a client that connects to a cluster, MQTT v5 shared topics, auto reconnect, QOS 2, with manual ack; we also have session persistence on the cluster. This combo seems to be the only way to handle and ensure QOS 2. The v3 driver also lacks a reconnect option.

Can I help?

@ashtonian you are most welcome to help. My implementation is almost feature complete (ignore the readme, I need to find time to do some more testing and tidy things up a bit). I believe that the way I have implemented this will work with your setup (but it's not something I have tried) - perhaps you could assist with testing this under your environment?

The v3 driver also lacks a reconnect option.
The V3 library does have auto-reconnect functionality so I'm assuming you have an issue with this in your specific use case?

Didn't realize that supported ack() already, thanks, will test in QA.

It's not something I've really tested (but I believe it should work). Would really appreciate any feedback (I've added a lot more code than I would have liked but cannot see a better way of implementing this).

I have file store code written and will try to push that out today (along with some bug fixes and additional tests). Hopefully over the weekend I'll get time to do some further testing (I'm sure there are bugs but need to decide how much time to spend before creating a PR as this is a large change!).

@MattBrittan this is great work, thank you for this. Since this is still in beta, some bugs are acceptable, I think. If your implementation becomes the path forward for the project, I'd be willing to help work out any bugs that are found.

I made a draft PR in my fork for the sake of providing feedback. Curious to hear your thoughts on what I said there.

@vishnureddy17 - I'd really appreciate your review; have put a lot of work into this but struggling to find the time/motivation to finish it off (I think it's pretty much there but needs a bit more tidying/testing - I am conscious of the fact that the library is used in production in a number of code bases despite being pre 1.0 so want to be careful!). I've had a quick look at your comments but will be able to put more time into this next week (also keen to review with @alsm).

Sounds good. If this does end up getting merged into here with a new release tag, I'll be using it for a project we're dogfooding internally at my company. Hopefully that can provide some real-world feedback if there are any issues or bugs.

Testing out the fork. Couple of things I've noticed, not all to do with the fork.

  • if keepalive isn't set pinger will panic
  • if brokerUrl isn't set the client will just hang
  • keep alive is a uint16 type - this prevents long keep alives. Our prod server-side service keep alive is 10 minutes on the consumption service, as it lives alongside the broker cluster, and we noticed that at QOS 0 we had contention between the keep alive and consuming incoming messages. Using the lib device-side, on metered NB-IoT connections the devices cannot broadcast on the network more than once per a certain number of hours, and we like to keep the connection open for long periods if possible as the SSL handshake can be larger than the payload for simple stuff.
  • there was previously a SetOrderMatters(false) which I think handled ordering of pingresp/ack but I could be wrong.

Thanks very much for taking a look! (really keen to get feedback; I realise some tidying up is needed but want to prove the concept).

I believe that these are all issues with the existing library (but see comments below) as opposed to something introduced in the fork (so it's probably best to raise separate issues to track them). I'm keen to land the fork before adding too many further changes (as adding persistence is a huge change).

* if keepalive isn't set pinger will panic

Bug in current code (and also here). Probably should not start the pinger if keep alive is 0.
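
A minimal sketch of the kind of guard being suggested (the pinger below is illustrative, not paho's actual ping handler):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runPinger stands in for the client's ping handler; called with a keep alive
// of 0 it would create a zero-interval ticker and panic.
func runPinger(ctx context.Context, keepAlive uint16) error {
	t := time.NewTicker(time.Duration(keepAlive) * time.Second)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-t.C:
			fmt.Println("send PINGREQ")
		}
	}
}

// startPinger applies the suggested guard: a keep alive of 0 disables
// keepalives (per the MQTT spec), so the pinger is simply never started.
func startPinger(ctx context.Context, keepAlive uint16) {
	if keepAlive == 0 {
		return
	}
	go func() { _ = runPinger(ctx, keepAlive) }()
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	startPinger(ctx, 0) // no-op rather than a panic
	time.Sleep(10 * time.Millisecond)
}
```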

* if brokerUrl isn't set the client will just hang

This would refer to autopaho - and yes, it does hang (in the existing version). I guess we could check the length in NewConnection and return an error.
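
Something along these lines could work; this is a hypothetical check, and the exact config field and NewConnection signature in autopaho may differ:

```go
package sketch

import (
	"errors"
	"net/url"
)

// validateBrokerURLs is the kind of up-front check NewConnection could run so
// that a missing broker URL fails fast with an error rather than hanging.
func validateBrokerURLs(urls []*url.URL) error {
	if len(urls) == 0 {
		return errors.New("autopaho: at least one broker URL must be provided")
	}
	return nil
}
```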

* keep alive is a uint16 type - this prevents long keep alive [...] we noticed that in qos 0 we had contention with the keep alive and consuming messages coming in.

The KeepAlive unit is seconds, so a uint16 supports circa 18 hours (65535 seconds ≈ 18.2 hours); this is dictated by the MQTT v5 spec. Currently the ping handler only considers the keep alive period (it is not given information on when other packets are sent). Changing this would be an enhancement (but a worthwhile one).

* there was previously a SetOrderMatters(false) which I think handled ordering of pingresp/ack but I could be wrong.

The v3 client's SetOrderMatters just set whether the publish callbacks were called in a goroutine or not (it was not directly related to keepalives - but with SetOrderMatters(true), long-running handlers would cause keepalive issues). The v5 client always passes messages sequentially (this is really necessary to meet spec requirements).

RE: manual acks, another solution could be to have a reconnection counter and pass the value to the publish callback along with the publish packet. The Ack method would take the counter value as an argument. If the user calls Ack with a counter value that is lower than the client's current counter value, no PUBACK is sent.

For what it's worth, I think this is less elegant for the user than the two ideas proposed initially in this issue, but is a solution.
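
A rough sketch of how that counter could look (illustrative names only, not a proposed paho API):

```go
package sketch

import (
	"errors"
	"sync/atomic"
)

// Client tracks a connection "generation" that is bumped on every reconnect.
type Client struct {
	generation atomic.Uint64
}

func (c *Client) onReconnect() { c.generation.Add(1) }

// Ack sends an acknowledgement only if the message was received on the current
// connection; a stale generation value means the packet ID may now refer to a
// different message, so the ack is dropped (the broker will redeliver any
// unacknowledged QOS1+ messages on the new connection).
func (c *Client) Ack(packetID uint16, gen uint64) error {
	if gen != c.generation.Load() {
		return errors.New("ack dropped: message was received on a previous connection")
	}
	_ = packetID // ... send PUBACK/PUBREC for packetID here ...
	return nil
}
```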

Reflecting on a comment I added to #185

Recent changes also provide a way for users (including autopaho) to partially handle this. Routers are now optional, having been replaced with OnPublishReceived []func(PublishReceived) (bool, error) where PublishReceived is:

type PublishReceived struct {
    Packet *Publish
    Client *Client // The Client that received the message (note that the connection may have been lost post-receipt)

    AlreadyHandled bool    // Set to true if a previous callback has returned true (indicating some action has already been taken re the message)
    Errs           []error // Errors returned by previous handlers (if any).
}

This means that the handler has access to the Client that received the message and can use that to call Ack. The Ack will fail if the connection has dropped/been re-established, meaning there is a chance of double-ups at QOS2. To avoid that I think we would need to provide a way to delay the reconnection until all handlers have completed processing (this would require some thought and is really only an issue at QOS 2, so will probably not impact too many users).
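
As an illustration, a handler along these lines acks via the client that actually received the message (the OnPublishReceived shape and PublishReceived fields follow the description above; the EnableManualAcknowledgment field name and the exact Ack signature are assumptions here):

```go
package sketch

import "github.com/eclipse/paho.golang/paho"

// newClientConfig shows manual acknowledgement via the OnPublishReceived
// callback described above.
func newClientConfig() paho.ClientConfig {
	return paho.ClientConfig{
		EnableManualAcknowledgment: true,
		OnPublishReceived: []func(paho.PublishReceived) (bool, error){
			func(pr paho.PublishReceived) (bool, error) {
				// ... process pr.Packet ...

				// Ack on the client that received the message. If that
				// connection has since dropped or been re-established this
				// returns an error instead of acknowledging an unrelated
				// packet ID on the new connection.
				if err := pr.Client.Ack(pr.Packet); err != nil {
					return true, err
				}
				return true, nil
			},
		},
	}
}
```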