Last Will with QoS 2 fails with an exception instead of being downgraded to QoS 1
pbilstein opened this issue · comments
Hi,
whenever we are setting a MQTT Last Will with QoS 2, the plugin crashes when it tries to fulfill it:
2019-11-04 23:09:43 =CRASH REPORT====
crasher:
initial call: rabbit_channel:init/1
pid: <0.11035.1>
registered_name: []
exception exit: {{function_clause,[{rabbit_mqtt_processor,delivery_mode,[2],[{file,"src/rabbit_mqtt_processor.erl"},{line,718}]},{rabbit_mqtt_processor,amqp_pub,2,[{file,"src/rabbit_mqtt_processor.erl"},{line,836}]},{rabbit_mqtt_processor,send_will,1,[{file,"src/rabbit_mqtt_processor.erl"},{line,784}]},{rabbit_web_mqtt_handler,terminate,3,[{file,"src/rabbit_web_mqtt_handler.erl"},{line,163}]},{cowboy_websocket,terminate,3,[{file,"src/cowboy_websocket.erl"},{line,617}]},{proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,259}]}]},[{gen_server2,terminate,3,[{file,"src/gen_server2.erl"},{line,1183}]},{proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,259}]}]}
ancestors: [<0.11033.1>,rabbit_direct_client_sup,rabbit_sup,<0.265.0>]
message_queue_len: 0
messages: []
links: [<0.11033.1>]
dictionary: [{process_name,{rabbit_channel,{<<"172.24.0.15:37672 -> 172.24.0.3:9001">>,1}}},{guid_secure,{{31,'rabbit@rabbitmq-1',#Ref<0.1828259457.2187591682.167217>},0}},{channel_operation_timeout,15000},{rand_seed,{#{jump => #Fun<rand.3.8986388>,max => 288230376151711743,next => #Fun<rand.2.8986388>,type => exsplus},[80411076728963622|136058777223923014]}}]
trap_exit: true
status: running
heap_size: 1598
stack_size: 27
reductions: 10759
neighbours:
2019-11-04 23:09:43 =SUPERVISOR REPORT====
Supervisor: {<0.11033.1>,rabbit_channel_sup}
Context: child_terminated
Reason: {function_clause,[{rabbit_mqtt_processor,delivery_mode,[2],[{file,"src/rabbit_mqtt_processor.erl"},{line,718}]},{rabbit_mqtt_processor,amqp_pub,2,[{file,"src/rabbit_mqtt_processor.erl"},{line,836}]},{rabbit_mqtt_processor,send_will,1,[{file,"src/rabbit_mqtt_processor.erl"},{line,784}]},{rabbit_web_mqtt_handler,terminate,3,[{file,"src/rabbit_web_mqtt_handler.erl"},{line,163}]},{cowboy_websocket,terminate,3,[{file,"src/cowboy_websocket.erl"},{line,617}]},{proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,259}]}]}
Offender: [{pid,<0.11035.1>},{name,channel},{mfargs,{rabbit_channel,start_link,[1,<0.11032.1>,<0.11032.1>,<0.11026.1>,<<"172.24.0.15:37672 -> 172.24.0.3:9001">>,rabbit_framing_amqp_0_9_1,{user,<<"mqtt_default">>,[mqtt_default],[{rabbit_auth_backend_internal,none}]},<<"mqtt">>,[{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,bool,true},{<<"consumer_cancel_notify">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"authentication_failure_close">>,bool,true}],<0.11029.1>,<0.11034.1>]}},{restart_type,intrinsic},{shutdown,70000},{child_type,worker}]
RabbitMq version: Docker rabbitmq:3.7.20-management-alpine
Client Lib: mqttjs 3.0.0
When the Last Will has set qos 0 or 1, it works.
QoS 2 is not supported (it was explained many times why but long story short: it promises something it cannot deliver without cluster-wide coordination of clients and nodes, which is not practical and no MQTT implementation does) and is downgraded to QoS 1 in RabbitMQ. In most places but not for Last Will messages, apparently.
Ok, yes I understand the fact that qos 2 is not fully supported and that it should only be downgraded to 1. Thanks for clarifying the issue.
In our case we have a couple of clients out there setting last will with qos 2, which we do not have under control at the moment. Therefore, it would be very nice to have a fix here.
We understand that clients will [try to] use QoS 2. It should be downgraded everywhere gracefully.
Ok great. I had a short look at the code and changed the following part (in branch 3.7.20), which seems to fix the issue for us and our tests are still all green:
# rabbit_mqtt_processor.erl
718 delivery_mode(?QOS_0) -> 1;
719 delivery_mode(?QOS_1) -> 2.
->
# rabbit_mqtt_processor.erl
718 delivery_mode(?QOS_0) -> 1;
719 delivery_mode(?QOS_1) -> 2;
720 delivery_mode(?QOS_2) -> 2.
Maybe it helps.
I guess a PR is too heavy for it, right? Or would it be appreciated?
I guess a PR is too heavy for it, right? Or would it be appreciated?
Yes, it would be appreciated, thank you
I've started looking at an integration test. We'd consider a PR (thank you!) but would only consider this issue resolve after we add a test.
Wow, that was fast. I just wanted to prepare a PR with a test and saw you already did it. Next time, I will try to be first ;-)
No worries :) You have contributed by identifying the problem and even narrowing it down to one function. That's very valuable 👍
We'll post an update to this issue when Concourse produces 3.8 and 3.7 alphas that you can use to verify.