zevv / nmqtt

Native Nim MQTT client library

Publish 100 msg with qos=2 fails

ThomasTJdev opened this issue

Issue:
When publishing multiple messages (100-ish) with qos=2, not all messages are sent. This can be confirmed by checking the length of ctx.workQueue or by subscribing to the topic with Mosquitto. There is no consistency in which messages are not sent.

Possible solutions:
My first thought is that this is a blocking issue in the handling of the packets and their ordering. It needs debugging in nmqtt and a check of the packet order on the broker.

Test suite:
This currently fails in test "publish multiple message fast qos=2"

Example:
Example for publishing - monitor your broker while it runs.

import nmqtt, asyncdispatch
let ctx = newMqttCtx("hallopub")
ctx.set_host("127.0.0.1", 1883)

proc conn() {.async.} =
  await ctx.start()
  var msg: int
  # fire off 100 qos=2 publishes back to back
  for i in 1 .. 100:
    await ctx.publish("test1", $msg, 2)
    msg += 1

asyncCheck conn()
runForever()

Another async problem with work() being called too often, I guess. This causes send() to be called while it is already busy awaiting the header being sent. Before the rest of the packet goes out, another header gets sent, messing up the protocol. send() is not meant to be reentrant, as it can await multiple times itself.
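
To see why re-entering send() corrupts the stream, here is a minimal standalone sketch modeled on that description (plain asyncdispatch, with a string standing in for the socket - illustrative only, not nmqtt's code):

import asyncdispatch

var wire = ""  # stands in for the TCP stream

# Write a header, suspend, then write the payload - the shape described above.
proc send(header, payload: string) {.async.} =
  wire.add header
  await sleepAsync 10   # suspension point inside send()
  wire.add payload

proc main() {.async.} =
  let a = send("<H1>", "<P1>")
  let b = send("<H2>", "<P2>")  # re-entered while `a` is still suspended
  await a
  await b
  echo wire  # <H1><H2><P1><P2> - the second header splits the first packet

waitFor main()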

I don't know what caused this. I just verified that this worked as it should at version a78eeca.

No, that commit also only works if the messages are sent right after ctx.start(). If you insert a sleep after start(), the same problem arises.

import nmqtt, asyncdispatch
let ctx = newMqttCtx("hallopub")
ctx.set_host("127.0.0.1", 1883)

proc conn() {.async.} =
  await ctx.start()
  await sleepAsync 2300 # pausing here before publishing triggers the lost messages
  var msg: int
  for i in 1 .. 100:
    await ctx.publish("test1", $msg, 2)
    msg += 1

asyncCheck conn()
runForever()

> Another async problem with work() being called too often, I guess. This causes send() to be called while it is already busy awaiting the header being sent. Before the rest of the packet goes out, another header gets sent, messing up the protocol. send() is not meant to be reentrant, as it can await multiple times itself.

TL;DR

Yes, you are right. This is due to using send() in multiple procs instead of only in work().

I'll prepare a PR.

The journey

I went back through all the commits - and the problem was introduced with my commit ddc0cc2, where I wait for the connection to be established. This is a simple sleepAsync which just looks at the ctx.state... hmm..
Going back to the master branch and commenting out the two "waiting" lines fixes the problem at first sight. This makes the example code above work, but.. if I introduce another kind of wait, e.g. computing Fibonacci numbers after await ctx.start() and before publishing, we still have the crash.
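
For context, the wait introduced there looks roughly like this (a minimal sketch with assumed names - MqttState, waitForConnection - not nmqtt's actual definitions):

import asyncdispatch

type
  MqttState = enum Disconnected, Connecting, Connected
  MqttCtx = ref object
    state: MqttState

# Poll ctx.state until the connection is established - the "simple
# sleepAsync which just looks at the ctx.state" from above.
proc waitForConnection(ctx: MqttCtx) {.async.} =
  while ctx.state != Connected:
    await sleepAsync 100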

Waiting for the connection to be established

The reason it works when the wait is removed is that all the messages are actually inserted into the workQueue before the connection is established. The instant the connection is established, ctx.work() is called and all the messages are fired away.
The two outputs below show this - to indicate when the procs are called, a debug message is printed when they are initiated.
The "Not working" example tries to send the messages, but the connection is closed by the first instance of r = await ctx.s.recvInto(b.addr, b.sizeof). That causes runConnect to reconnect - and after the reconnection the rest of the messages are sent flawlessly, because our custom waiting time is gone.

Working - no wait after `ctx.start()`
connecting to 127.0.0.1:1883
tx> Connect(00): 00 04 4D 51 54 54 04 02 00 3C 00 08 68 61 6C 6C 6F 70 75 62 
runRx() - start
runping
rx> ConnAck(00): 00 00 
handle() - start
Connection established
work() - loop msg: 1
tx> Publish(04): 00 05 74 65 73 74 31 00 01 30 
work() - loop msg: 2
tx> Publish(04): 00 05 74 65 73 74 31 00 02 31 
work() - loop msg: 3
tx> Publish(04): 00 05 74 65 73 74 31 00 03 32 
work() - loop msg: 4
tx> Publish(04): 00 05 74 65 73 74 31 00 04 33 
work() - loop msg: 5
tx> Publish(04): 00 05 74 65 73 74 31 00 05 34 
work() - loop msg: 6
tx> Publish(04): 00 05 74 65 73 74 31 00 06 35 
work() - loop msg: 7
tx> Publish(04): 00 05 74 65 73 74 31 00 07 36 
work() - loop msg: 8
tx> Publish(04): 00 05 74 65 73 74 31 00 08 37 

[..truncated..]

work() - loop msg: 99
tx> Publish(04): 00 05 74 65 73 74 31 00 63 39 38 
work() - loop msg: 100
tx> Publish(04): 00 05 74 65 73 74 31 00 64 39 39 

runRx() - start
rx> PubRec(00): 00 01 # PUBREC/PUBREL START
handle() - start
tx> PubRel(02): 00 01 
runRx() - start
rx> PubRec(00): 00 02 
handle() - start
tx> PubRel(02): 00 02 

[..truncated..]

runRx() - start
rx> PubRec(00): 00 63 
handle() - start
tx> PubRel(02): 00 63 
runRx() - start
rx> PubRec(00): 00 64 
handle() - start
tx> PubRel(02): 00 64 

runRx() - start
rx> PubComp(00): 00 01 # PUBCOMP START
handle() - start
runRx() - start
rx> PubComp(00): 00 02 
handle() - start
[..truncated..]
runRx() - start
rx> PubComp(00): 00 63 
handle() - start
runRx() - start
rx> PubComp(00): 00 64 
handle() - start
runRx() - start

ctx.workQueue.len is: 0
Not working - 300ms wait after `ctx.start()`
connecting to 127.0.0.1:1883
tx> Connect(00): 00 04 4D 51 54 54 04 02 00 3C 00 08 68 61 6C 6C 6F 70 75 62 
runRx() - start
runping
rx> ConnAck(00): 00 00 
handle() - start
Connection established
runRx() - start
work() - loop msg: 1
tx> Publish(04): 00 05 74 65 73 74 31 00 01 30 
work() - loop msg: 1
work() - loop msg: 2
tx> Publish(04): 00 05 74 65 73 74 31 00 02 31 
rx> PubRec(00): 00 01 
handle() - start
tx> PubRel(02): 00 01 
work() - loop msg: 1
work() - loop msg: 2
work() - loop msg: 3
tx> Publish(04): 00 05 74 65 73 74 31 00 03 32 
runRx() - start
rx> PubRec(00): 00 02 
handle() - start
tx> PubRel(02): 00 02 
work() - loop msg: 1
work() - loop msg: 2
work() - loop msg: 3
work() - loop msg: 4
tx> Publish(04): 00 05 74 65 73 74 31 00 04 33 
runRx() - start
rx> PubComp(00): 00 01 
handle() - start
runRx() - start
work() - loop msg: 2
work() - loop msg: 3
work() - loop msg: 4
work() - loop msg: 5
tx> Publish(04): 00 05 74 65 73 74 31 00 05 34 
rx> PubRec(00): 00 03 
handle() - start
tx> PubRel(02): 00 03 
work() - loop msg: 2
work() - loop msg: 3
work() - loop msg: 4
work() - loop msg: 5
work() - loop msg: 6
tx> Publish(04): 00 05 74 65 73 74 31 00 06 35 
runRx() - start
rx> PubComp(00): 34 0A 
handle() - start
runRx() - start
Closing: remote closed connection
tx> Disconnect(00): 
connecting to 127.0.0.1:1883
tx> Connect(00): 00 04 4D 51 54 54 04 02 00 3C 00 08 68 61 6C 6C 6F 70 75 62 
runRx() - start
runping
rx> ConnAck(00): 00 00 
handle() - start
Connection established
work() - loop msg: 2
work() - loop msg: 3
work() - loop msg: 4
work() - loop msg: 5
work() - loop msg: 6
work() - loop msg: 7
tx> Publish(04): 00 05 74 65 73 74 31 00 07 36 
work() - loop msg: 8
tx> Publish(04): 00 05 74 65 73 74 31 00 08 37 
work() - loop msg: 9
tx> Publish(04): 00 05 74 65 73 74 31 00 09 38 

[..truncated..]

work() - loop msg: 99
tx> Publish(04): 00 05 74 65 73 74 31 00 63 39 38 
work() - loop msg: 100
tx> Publish(04): 00 05 74 65 73 74 31 00 64 39 39 

runRx() - start
rx> PubRec(00): 00 07 # PUBREC/PUBREL start
handle() - start
tx> PubRel(02): 00 07 
runRx() - start
rx> PubRec(00): 00 08 
handle() - start
tx> PubRel(02): 00 08 

[..truncated..]

runRx() - start
rx> PubRec(00): 00 63 
handle() - start
tx> PubRel(02): 00 63 
runRx() - start
rx> PubRec(00): 00 64 
handle() - start
tx> PubRel(02): 00 64 

runRx() - start
rx> PubComp(00): 00 07 # PUBCOMP START
handle() - start
runRx() - start
rx> PubComp(00): 00 08 
handle() - start
runRx() - start

[..truncated..]

runRx() - start
rx> PubComp(00): 00 63 
handle() - start
runRx() - start
rx> PubComp(00): 00 64 
handle() - start
runRx() - start

ctx.workQueue.len is: 5

Recv PUBREL & PUBCOMP

The problem only arises with qos=2 messages, so let's take a closer look at the PUBREL and PUBCOMP messages.

If we look at what the broker receives, we can see that on the second PUBREL the message id is wrong: instead of Mid: 2 we are getting Mid: 13322.

Output from broker including msgid
1584863298: mosquitto version 1.6.9 starting
1584863298: Config loaded from /etc/mosquitto/mosquitto.conf.
1584863298: Opening ipv4 listen socket on port 1883.
1584863298: Opening ipv6 listen socket on port 1883.
1584863300: New connection from 127.0.0.1 on port 1883.
1584863300: New client connected from 127.0.0.1 as hallopub (p2, c1, k60).
1584863300: No will message specified.
1584863300: Sending CONNACK to hallopub (0, 0)
1584863300: Received PUBLISH from hallopub (d0, q2, r0, m1, 'test1', ... (1 bytes))
1584863300: Sending PUBREC to hallopub (m1, rc0)
1584863300: Received PUBLISH from hallopub (d0, q2, r0, m2, 'test1', ... (1 bytes))
1584863300: Sending PUBREC to hallopub (m2, rc0)
1584863300: Received PUBREL from hallopub (Mid: 1)
1584863300: Sending PUBCOMP to hallopub (m1)
1584863300: Received PUBLISH from hallopub (d0, q2, r0, m3, 'test1', ... (1 bytes))
1584863300: Sending PUBREC to hallopub (m3, rc0)
1584863300: Received PUBREL from hallopub (Mid: 13322)
1584863300: Sending PUBCOMP to hallopub (m13322)
1584863300: Client hallopub disconnected due to protocol error.

Inspecting the packets with Wireshark confirms the broker's output. This can be seen at No. 37: the broker receives a PUBREL with id=13322 and, as requested, returns a PUBCOMP with the same id (the Publish Complete in No. 39).

Wireshark output
No.	Time	    Source	    Dst port	Destination	Proto	Len	Info
06	0.000370867	127.0.0.1	1883	127.0.0.1	MQTT	86	Connect Command
08	0.000433631	127.0.0.1	57964	127.0.0.1	MQTT	70	Connect Ack
13	0.301515532	127.0.0.1	1883	127.0.0.1	MQTT	76	Publish Message (id=1) [test1]
15	0.301742342	127.0.0.1	57964	127.0.0.1	MQTT	70	Publish Received (id=1)
19	0.302150872	127.0.0.1	1883	127.0.0.1	MQTT	76	Publish Message (id=2) [test1]
23	0.302339634	127.0.0.1	57964	127.0.0.1	MQTT	70	Publish Received (id=2)
25	0.302436370	127.0.0.1	1883	127.0.0.1	MQTT	68	Publish Release (id=1)
29	0.302595442	127.0.0.1	57964	127.0.0.1	MQTT	70	Publish Complete (id=1)
31	0.302693887	127.0.0.1	1883	127.0.0.1	MQTT	76	Publish Message (id=3) [test1]
33	0.302831604	127.0.0.1	57964	127.0.0.1	MQTT	70	Publish Received (id=3)
37	0.303215958	127.0.0.1	1883	127.0.0.1	MQTT	68	Publish Release (id=13322)
39	0.303325285	127.0.0.1	57964	127.0.0.1	MQTT	70	Publish Complete (id=13322)
43	0.303423014	127.0.0.1	1883	127.0.0.1	MQTT	76	Reserved [TCP segment of a reassembled PDU]

So we are sending a PubRel with a wrong msgId: 13322. When viewing the raw messages in recv(), this corresponds exactly to the moment the PubComp is received, whose content is 34 0A = 13322.
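
The numbers line up: the two payload bytes of that PubComp, read as a big-endian 16-bit message id, give exactly the bogus value:

# 0x34 and 0x0A are the two payload bytes of the received PubComp;
# as a big-endian uint16 they form the bogus message id 13322.
echo (0x34 shl 8) or 0x0A  # 13322

This is simply the broker echoing the corrupted id back: it received a PUBREL with id 13322 (No. 37 above) and replies with a PUBCOMP carrying the same id (No. 39).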

And just one more check: debugging all the messages from send() showed that they contained exactly what they should! No 13322 msgId was to be found there.

The hidden sender

Since our problem only occurs with qos=2, i.e. when we are sending the PubRel, let's take a look at what happens when a PubRec is received. When we receive a PubRec, the handler calls onPubRec(), which eventually calls ctx.send(pkt).

If we comment out this send(), we can actually send all of the messages - the error is gone. Well, since this is qos=2, we still need to send a correct PubRel, so that in the end we can confirm we are getting a corresponding PubComp. We do that by adding the PubRel pkg to the queue in work(). This way, only work() calls send().

Taadaaa - it was a problem due to reentrancy of send().

But but but, our last enemy is the ping - if we are sending 1000 msgs and the ping interval hits in the middle, we are still in trouble.

Conclusion

I had 2 paths - investigate from bottom to top myself, or just follow @zevv hunch. I choose the first path, which eventually led me to the result of path 2....

Besides the pretty long investigation, I now have a pretty good feel on nmqtt ;-)

Solution

  1. onPubRec() no longer sends the PubRel pkg itself; it adds the pkg to the workQueue.
  2. sendPubRel() fires when a pkg in work() fulfills work.state == PubRelSendReady (sketched below).
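
A minimal sketch of that flow, with simplified, assumed types (Work, WorkState, sendPubRel and the queue layout here are stand-ins, not nmqtt's actual definitions):

import asyncdispatch, tables

type
  WorkState = enum WorkQos2Sent, PubRelSendReady, WorkDone
  Work = ref object
    state: WorkState
  Ctx = ref object
    workQueue: Table[uint16, Work]

# Step 1: on PUBREC, only flag the work item - no send() call here.
proc onPubRec(ctx: Ctx, msgId: uint16) =
  if msgId in ctx.workQueue:
    ctx.workQueue[msgId].state = PubRelSendReady

# Step 2: work() is the single proc that transmits, so send() can
# never be re-entered.
proc work(ctx: Ctx) {.async.} =
  for msgId, w in ctx.workQueue:
    if w.state == PubRelSendReady:
      # await ctx.sendPubRel(msgId)  # the only place packets go out
      w.state = WorkDone

Routing the keep-alive ping through this same single sender would close the remaining hole mentioned above.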

Ah, right, will take a look (not today tho!)

That's fine. I'll take a break for now. I added PR #21 for inspiration - it should fix the above and some more.

This has been fixed in #21.

nim c -r test/tester "publish multiple message fast qos=2"
  [OK]