Publish 100 msg with qos=2 fails
ThomasTJdev opened this issue
Issue:
When publishing multiple messages (100-ish) with qos=2, not all messages are sent. This can be confirmed by checking the length of ctx.workQueue or by subscribing to the topic with Mosquitto. There is no consistency in which messages are not sent.
Possible solutions:
First thought is that this is a blocking problem in the handling of the packet and its ordering. Needs debugging in nmqtt and a check of the packet order in the broker.
Test suite:
This currently fails in the test "publish multiple message fast qos=2".
Example:
Example for publishing - monitor your broker.
import nmqtt, asyncdispatch

let ctx = newMqttCtx("hallopub")
ctx.set_host("127.0.0.1", 1883)

proc conn() {.async.} =
  await ctx.start()
  var msg: int
  for i in 1 .. 100:
    await ctx.publish("test1", $msg, 2)
    msg += 1

asyncCheck conn()
runForever()
Another async problem with work() being called too often, I guess. This causes send() to be called while it is already busy awaiting the header being sent. Before the rest of the packet goes out, another header gets sent, messing up the protocol. send() is not meant to be reentrant, as it can itself await more than once.
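This failure mode can be reproduced outside nmqtt. Below is a minimal Python asyncio sketch (hypothetical code, not nmqtt's Nim implementation) of the reentrancy problem described above: two concurrent calls to a send() that awaits between writing the header and the body end up interleaving their bytes on the wire.

```python
import asyncio

async def send(wire: list, packet: bytes) -> None:
    # Non-reentrant send: writes the fixed header, then may yield to
    # the event loop (e.g. socket backpressure) before writing the rest.
    wire.append(packet[:1])    # header byte
    await asyncio.sleep(0)     # another task can run here
    wire.append(packet[1:])    # remainder of the packet

async def main() -> bytes:
    wire: list = []
    # Two publishes fired concurrently, as happens when send() is
    # called from several procs at once:
    await asyncio.gather(send(wire, b"\x30A"), send(wire, b"\x30B"))
    return b"".join(wire)

out = asyncio.run(main())
# Both header bytes hit the wire before either body, corrupting the framing.
print(out)
```

With a single caller the wire would read header, body, header, body; with two concurrent callers both 0x30 header bytes are written back to back, which is exactly the kind of protocol corruption a broker rejects.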
I don't know what caused this. I just verified this worked like it should at version a78eeca.
No, that commit also only works if the messages are sent right after ctx.start(). If you insert a sleep after start(), the same problem arises.
import nmqtt, asyncdispatch

let ctx = newMqttCtx("hallopub")
ctx.set_host("127.0.0.1", 1883)

proc conn() {.async.} =
  await ctx.start()
  await sleepAsync 2300
  var msg: int
  for i in 1 .. 100:
    await ctx.publish("test1", $msg, 2)
    msg += 1

asyncCheck conn()
runForever()
Another async problem with work() being called too often, I guess. This causes send() to be called while it is already busy awaiting the header being sent. Before the rest of the packet goes out, another header gets sent, messing up the protocol. send() is not meant to be reentrant, as it can itself await more than once.
TL;DR
Yes, you are right. This is due to using send() in multiple procs, instead of only using it in work().
I'll prepare a PR.
The journey
I went back through all the commits, and the problem was introduced with my commit ddc0cc2, where I wait for the connection to be established. This is a simple sleepAsync which just looks at ctx.state... hmm..
Going back to the master branch and commenting out the two "waiting" lines fixes the problem at first sight. This makes the example code above work, but if I introduce another kind of wait, e.g. computing Fibonacci after await ctx.start() and before publishing, we still get the crash.
The waiting for connection established
The reason it works when removing the wait is that all the messages are actually inserted into the workQueue before the connection is established. The moment the connection is established, ctx.work() is called and all the messages are fired away.
The two outputs below show this; to indicate when the procs are called, a debug message is printed when each one is initiated.
The "Not working" example tries to send the messages, but the connection gets closed by the first instance of r = await ctx.s.recvInto(b.addr, b.sizeof). That causes runConnect to reconnect, and after the reconnection the rest of the messages are sent flawlessly because our custom waiting time is gone.
Working - no wait after `ctx.start()`
connecting to 127.0.0.1:1883
tx> Connect(00): 00 04 4D 51 54 54 04 02 00 3C 00 08 68 61 6C 6C 6F 70 75 62
runRx() - start
runping
rx> ConnAck(00): 00 00
handle() - start
Connection established
work() - loop msg: 1
tx> Publish(04): 00 05 74 65 73 74 31 00 01 30
work() - loop msg: 2
tx> Publish(04): 00 05 74 65 73 74 31 00 02 31
work() - loop msg: 3
tx> Publish(04): 00 05 74 65 73 74 31 00 03 32
work() - loop msg: 4
tx> Publish(04): 00 05 74 65 73 74 31 00 04 33
work() - loop msg: 5
tx> Publish(04): 00 05 74 65 73 74 31 00 05 34
work() - loop msg: 6
tx> Publish(04): 00 05 74 65 73 74 31 00 06 35
work() - loop msg: 7
tx> Publish(04): 00 05 74 65 73 74 31 00 07 36
work() - loop msg: 8
tx> Publish(04): 00 05 74 65 73 74 31 00 08 37
[..truncated..]
work() - loop msg: 99
tx> Publish(04): 00 05 74 65 73 74 31 00 63 39 38
work() - loop msg: 100
tx> Publish(04): 00 05 74 65 73 74 31 00 64 39 39
runRx() - start
rx> PubRec(00): 00 01 # PUBREC/PUBREL START
handle() - start
tx> PubRel(02): 00 01
runRx() - start
rx> PubRec(00): 00 02
handle() - start
tx> PubRel(02): 00 02
[..truncated..]
runRx() - start
rx> PubRec(00): 00 63
handle() - start
tx> PubRel(02): 00 63
runRx() - start
rx> PubRec(00): 00 64
handle() - start
tx> PubRel(02): 00 64
runRx() - start
rx> PubComp(00): 00 01 # PUBCOMP START
handle() - start
runRx() - start
rx> PubComp(00): 00 02
handle() - start
[..truncated..]
runRx() - start
rx> PubComp(00): 00 63
handle() - start
runRx() - start
rx> PubComp(00): 00 64
handle() - start
runRx() - start
ctx.workQueue.len is: 0
Not working - 300ms wait after `ctx.start()`
connecting to 127.0.0.1:1883
tx> Connect(00): 00 04 4D 51 54 54 04 02 00 3C 00 08 68 61 6C 6C 6F 70 75 62
runRx() - start
runping
rx> ConnAck(00): 00 00
handle() - start
Connection established
runRx() - start
work() - loop msg: 1
tx> Publish(04): 00 05 74 65 73 74 31 00 01 30
work() - loop msg: 1
work() - loop msg: 2
tx> Publish(04): 00 05 74 65 73 74 31 00 02 31
rx> PubRec(00): 00 01
handle() - start
tx> PubRel(02): 00 01
work() - loop msg: 1
work() - loop msg: 2
work() - loop msg: 3
tx> Publish(04): 00 05 74 65 73 74 31 00 03 32
runRx() - start
rx> PubRec(00): 00 02
handle() - start
tx> PubRel(02): 00 02
work() - loop msg: 1
work() - loop msg: 2
work() - loop msg: 3
work() - loop msg: 4
tx> Publish(04): 00 05 74 65 73 74 31 00 04 33
runRx() - start
rx> PubComp(00): 00 01
handle() - start
runRx() - start
work() - loop msg: 2
work() - loop msg: 3
work() - loop msg: 4
work() - loop msg: 5
tx> Publish(04): 00 05 74 65 73 74 31 00 05 34
rx> PubRec(00): 00 03
handle() - start
tx> PubRel(02): 00 03
work() - loop msg: 2
work() - loop msg: 3
work() - loop msg: 4
work() - loop msg: 5
work() - loop msg: 6
tx> Publish(04): 00 05 74 65 73 74 31 00 06 35
runRx() - start
rx> PubComp(00): 34 0A
handle() - start
runRx() - start
Closing: remote closed connection
tx> Disconnect(00):
connecting to 127.0.0.1:1883
tx> Connect(00): 00 04 4D 51 54 54 04 02 00 3C 00 08 68 61 6C 6C 6F 70 75 62
runRx() - start
runping
rx> ConnAck(00): 00 00
handle() - start
Connection established
work() - loop msg: 2
work() - loop msg: 3
work() - loop msg: 4
work() - loop msg: 5
work() - loop msg: 6
work() - loop msg: 7
tx> Publish(04): 00 05 74 65 73 74 31 00 07 36
work() - loop msg: 8
tx> Publish(04): 00 05 74 65 73 74 31 00 08 37
work() - loop msg: 9
tx> Publish(04): 00 05 74 65 73 74 31 00 09 38
[..truncated..]
work() - loop msg: 99
tx> Publish(04): 00 05 74 65 73 74 31 00 63 39 38
work() - loop msg: 100
tx> Publish(04): 00 05 74 65 73 74 31 00 64 39 39
runRx() - start
rx> PubRec(00): 00 07 # PUBREC/PUBREL start
handle() - start
tx> PubRel(02): 00 07
runRx() - start
rx> PubRec(00): 00 08
handle() - start
tx> PubRel(02): 00 08
[..truncated..]
runRx() - start
rx> PubRec(00): 00 63
handle() - start
tx> PubRel(02): 00 63
runRx() - start
rx> PubRec(00): 00 64
handle() - start
tx> PubRel(02): 00 64
runRx() - start
rx> PubComp(00): 00 07 # PUBCOMP START
handle() - start
runRx() - start
rx> PubComp(00): 00 08
handle() - start
runRx() - start
[..truncated..]
runRx() - start
rx> PubComp(00): 00 63
handle() - start
runRx() - start
rx> PubComp(00): 00 64
handle() - start
runRx() - start
ctx.workQueue.len is: 5
Recv PUBREL & PUBCOMP
The problem only arises with qos=2 messages, so let's take a closer look at the PUBREL and PUBCOMP messages.
If we look at what the broker receives, we can see that on the second PUBREL the message id is wrong. Instead of getting Mid: 2 we are getting Mid: 13322.
Output from broker including msgid
1584863298: mosquitto version 1.6.9 starting
1584863298: Config loaded from /etc/mosquitto/mosquitto.conf.
1584863298: Opening ipv4 listen socket on port 1883.
1584863298: Opening ipv6 listen socket on port 1883.
1584863300: New connection from 127.0.0.1 on port 1883.
1584863300: New client connected from 127.0.0.1 as hallopub (p2, c1, k60).
1584863300: No will message specified.
1584863300: Sending CONNACK to hallopub (0, 0)
1584863300: Received PUBLISH from hallopub (d0, q2, r0, m1, 'test1', ... (1 bytes))
1584863300: Sending PUBREC to hallopub (m1, rc0)
1584863300: Received PUBLISH from hallopub (d0, q2, r0, m2, 'test1', ... (1 bytes))
1584863300: Sending PUBREC to hallopub (m2, rc0)
1584863300: Received PUBREL from hallopub (Mid: 1)
1584863300: Sending PUBCOMP to hallopub (m1)
1584863300: Received PUBLISH from hallopub (d0, q2, r0, m3, 'test1', ... (1 bytes))
1584863300: Sending PUBREC to hallopub (m3, rc0)
1584863300: Received PUBREL from hallopub (Mid: 13322)
1584863300: Sending PUBCOMP to hallopub (m13322)
1584863300: Client hallopub disconnected due to protocol error.
When inspecting the packets with Wireshark, the output from the broker is confirmed. This can be seen at No. 37: the broker receives a PUBREL with id=13322 and, as requested, returns a PUBCOMP with the same id (the Publish Complete in No. 39).
Wireshark output
No. Time Source DstPort Destination Proto Len Info
06 0.000370867 127.0.0.1 1883 127.0.0.1 MQTT 86 Connect Command
08 0.000433631 127.0.0.1 57964 127.0.0.1 MQTT 70 Connect Ack
13 0.301515532 127.0.0.1 1883 127.0.0.1 MQTT 76 Publish Message (id=1) [test1]
15 0.301742342 127.0.0.1 57964 127.0.0.1 MQTT 70 Publish Received (id=1)
19 0.302150872 127.0.0.1 1883 127.0.0.1 MQTT 76 Publish Message (id=2) [test1]
23 0.302339634 127.0.0.1 57964 127.0.0.1 MQTT 70 Publish Received (id=2)
25 0.302436370 127.0.0.1 1883 127.0.0.1 MQTT 68 Publish Release (id=1)
29 0.302595442 127.0.0.1 57964 127.0.0.1 MQTT 70 Publish Complete (id=1)
31 0.302693887 127.0.0.1 1883 127.0.0.1 MQTT 76 Publish Message (id=3) [test1]
33 0.302831604 127.0.0.1 57964 127.0.0.1 MQTT 70 Publish Received (id=3)
37 0.303215958 127.0.0.1 1883 127.0.0.1 MQTT 68 Publish Release (id=13322)
39 0.303325285 127.0.0.1 57964 127.0.0.1 MQTT 70 Publish Complete (id=13322)
43 0.303423014 127.0.0.1 1883 127.0.0.1 MQTT 76 Reserved [TCP segment of a reassembled PDU]
So, we are sending a PubRel with a wrong msgId: 13322. When viewing the raw messages in recv(), this corresponds to the moment we receive the PubComp, whose content is 34 0A = 13322.
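As a quick sanity check (a hypothetical Python snippet, not part of nmqtt), decoding those two payload bytes as a big-endian MQTT message id gives exactly the bogus value the broker reported:

```python
# The MQTT message id is a two-byte big-endian integer; the PubComp
# payload bytes 0x34 0x0A decode to exactly the bogus id the broker saw.
payload = bytes([0x34, 0x0A])
msg_id = int.from_bytes(payload, "big")   # 0x34 * 256 + 0x0A
print(msg_id)  # 13322
```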
And just one more check: debugging all the messages from send() showed they contained exactly what they should! So no 13322 msgId was to be found there.
The hidden sender
Since our problem only occurs with qos=2, when we are sending the PubRel, let's take a look at what happens when the PubRec is received. When we receive a PubRec, the handler calls onPubRec(), which eventually calls ctx.send(pkt).
If we comment out this send(), we can actually send all of the messages; the error is gone. Well, since this is qos=2, we still need to send a correct PubRel so that we can confirm, in the end, that we get a corresponding PubComp. We do that by adding the PubRel pkg to the queue in work(). This way, only work() calls send().
Taadaaa, it was a problem due to the reentrance of send().
But but but, our last enemy is the ping: if we are sending 1000 msgs and the ping interval hits in the middle, we are still getting in trouble.
Conclusion
I had two paths: investigate from bottom to top myself, or just follow @zevv's hunch. I chose the first path, which eventually led me to the result of path 2...
Besides the pretty long investigation, I now have a pretty good feel for nmqtt ;-)
Solution
- onPubRec() does not send the PubRel pkg; instead the pkg is added to the workQueue.
- sendPubRel() is enabled when a pkg in work() fulfills work.state == PubRelSendReady.
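That solution can be sketched conceptually like this (a hypothetical Python asyncio sketch, not nmqtt's actual Nim code): handlers only enqueue packets, and a single work() task is the sole caller of send(), so writes can never interleave.

```python
import asyncio

async def work(queue: asyncio.Queue, wire: list) -> None:
    # Sole owner of the socket: even though the write below awaits,
    # packets cannot interleave because only this task ever sends.
    while True:
        pkt = await queue.get()
        if pkt is None:            # shutdown sentinel
            break
        wire.append(pkt[:1])       # fixed header
        await asyncio.sleep(0)     # a real socket write may yield here
        wire.append(pkt[1:])       # rest of the packet

async def on_pub_rec(queue: asyncio.Queue, msg_id: int) -> None:
    # Instead of calling send() itself, the PubRec handler just marks
    # the PubRel as ready to send and lets work() transmit it.
    await queue.put(b"\x62" + msg_id.to_bytes(2, "big"))  # PUBREL (type 6, flags 02)

async def main() -> bytes:
    queue: asyncio.Queue = asyncio.Queue()
    wire: list = []
    worker = asyncio.create_task(work(queue, wire))
    await on_pub_rec(queue, 1)     # broker acked publish 1
    await on_pub_rec(queue, 2)     # broker acked publish 2
    await queue.put(None)
    await worker
    return b"".join(wire)

out = asyncio.run(main())
print(out)  # two well-formed PUBRELs: 62 00 01, 62 00 02
```

Routing the ping through the same queue would close the remaining hole mentioned above, since the ping would then also go through the single sender.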
Ah, right, will take a look (not today tho!)
That's fine. I'll take a break for now. I added PR #21 for inspiration - it should fix the above and some more.
This has been fixed in #21.
nim c -r test/tester "publish multiple message fast qos=2"
[OK]