michaelklishin / langohr

A small, feature-complete Clojure client for RabbitMQ that embraces the AMQP 0.9.1 model

Home Page: http://clojurerabbitmq.info

Execution hangs when publishing inside a consumer handler function

visibletrap opened this issue

This issue could just be me doing something weird that I shouldn't be doing. Please let me know if that's the case.

This issue happens in more complicated code, but I simplified it to this: consuming from a queue named "in", then republishing to a queue named "out". It works fine if there are only a few messages in the "in" queue. But if there are more messages in the queue (more than 10k on my machine), evaluating the snippet below hangs at the "out" queue declaration expression, (lq/declare out-ch "out"):

(let [in-ch (lch/open conn)]
  (lq/declare in-ch "in")
  (lcon/subscribe in-ch "in" (fn [channel meta message]
                               (let [out-ch (lch/open conn)]
                                 (lq/declare out-ch "out")
                                 (lb/publish out-ch "" "out" (str message))
                                 (lcore/close out-ch))
                               (lb/ack channel (:delivery-tag meta)))))

A few messages are successfully published to the "out" queue. A few thousand messages are Unacked and the remaining messages are Ready in the "in" queue. Note that the RabbitMQ server hasn't reached its memory or disk limits.

If I move the queue declaration outside the consumer function, as below, it hangs at the channel close line instead, (lcore/close out-ch):

(let [in-ch (lch/open conn)]
  (lq/declare in-ch "in")
  (lq/declare in-ch "out")
  (lcon/subscribe in-ch "in" (fn [channel meta message]
                               (let [out-ch (lch/open conn)]
                                 (lb/publish out-ch "" "out" (str message))
                                 (lcore/close out-ch))
                               (lb/ack channel (:delivery-tag meta)))))

Any idea what I have done incorrectly?

Publishing or declaring queues in a consumer callback is perfectly fine (it has been possible in the Java client since at least 2.x). This could be temporary flow control, e.g. because the queue had to begin moving messages to disk. Does it eventually unblock?

Can you post RabbitMQ log files and a JVM thread dump (when the blocking happens)?

Questions belong on the mailing list, by the way, so I'm labelling this as Question and closing it, but feel free to post the logs and thread dump here.

If channel closure hangs the same way when you change the order of operations this certainly suggests flow control, temporary or resource-driven.

After the client connection is established there is nothing else in the RabbitMQ log, so I won't copy it here. For the thread dump, I'm copying only the part that differs from the thread dump taken before executing the code that causes the blocking.

"pool-3-thread-6" #28 prio=5 os_prio=31 tid=0x00007fefe354f000 nid=0x5007 waiting on condition [0x000000012b98d000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x000000076d5127f0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

"pool-3-thread-5" #27 prio=5 os_prio=31 tid=0x00007fefe19cc800 nid=0x5e07 waiting on condition [0x000000012b6ba000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x000000076d5127f0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

"pool-3-thread-4" #26 prio=5 os_prio=31 tid=0x00007fefe1c0d000 nid=0x6507 in Object.wait() [0x000000012b39a000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x000000076da0e150> (a com.rabbitmq.utility.BlockingValueOrException)
    at java.lang.Object.wait(Object.java:502)
    at com.rabbitmq.utility.BlockingCell.get(BlockingCell.java:50)
    - locked <0x000000076da0e150> (a com.rabbitmq.utility.BlockingValueOrException)
    at com.rabbitmq.utility.BlockingCell.uninterruptibleGet(BlockingCell.java:89)
    - locked <0x000000076da0e150> (a com.rabbitmq.utility.BlockingValueOrException)
    at com.rabbitmq.utility.BlockingValueOrException.uninterruptibleGetValue(BlockingValueOrException.java:33)
    at com.rabbitmq.client.impl.AMQChannel$BlockingRpcContinuation.getReply(AMQChannel.java:348)
    at com.rabbitmq.client.impl.AMQChannel.privateRpc(AMQChannel.java:221)
    at com.rabbitmq.client.impl.AMQChannel.exnWrappingRpc(AMQChannel.java:118)
    at com.rabbitmq.client.impl.ChannelN.queueDeclare(ChannelN.java:834)
    at com.rabbitmq.client.impl.recovery.AutorecoveringChannel.queueDeclare(AutorecoveringChannel.java:258)
    at langohr.queue$declare.invoke(queue.clj:72)
    at gs.playground$eval1410$fn__1411.invoke(playground.clj:86)
    at langohr.consumers$create_default$fn__280.invoke(consumers.clj:84)
    at langohr.consumers.proxy$com.rabbitmq.client.DefaultConsumer$ff19274a.handleDelivery(Unknown Source)
    at com.rabbitmq.client.impl.ConsumerDispatcher$5.run(ConsumerDispatcher.java:144)
    at com.rabbitmq.client.impl.ConsumerWorkService$WorkPoolRunnable.run(ConsumerWorkService.java:99)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

"pool-3-thread-3" #25 prio=5 os_prio=31 tid=0x00007fefe1aaf800 nid=0x4f0b waiting on condition [0x000000012a077000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x000000076d5127f0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

"AMQP Connection 192.168.4.3:5671" #23 prio=5 os_prio=31 tid=0x00007fefe1859800 nid=0x560b waiting on condition [0x000000012a421000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x000000076d7c8e00> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at com.rabbitmq.client.impl.VariableLinkedBlockingQueue.put(VariableLinkedBlockingQueue.java:280)
    at com.rabbitmq.client.impl.WorkPool.addWorkItem(WorkPool.java:197)
    at com.rabbitmq.client.impl.ConsumerWorkService.addWork(ConsumerWorkService.java:76)
    at com.rabbitmq.client.impl.ConsumerDispatcher.execute(ConsumerDispatcher.java:208)
    at com.rabbitmq.client.impl.ConsumerDispatcher.executeUnlessShuttingDown(ConsumerDispatcher.java:203)
    at com.rabbitmq.client.impl.ConsumerDispatcher.handleDelivery(ConsumerDispatcher.java:140)
    at com.rabbitmq.client.impl.ChannelN.processDelivery(ChannelN.java:418)
    at com.rabbitmq.client.impl.recovery.RecoveryAwareChannelN.processDelivery(RecoveryAwareChannelN.java:41)
    at com.rabbitmq.client.impl.ChannelN.processAsync(ChannelN.java:323)
    at com.rabbitmq.client.impl.AMQChannel.handleCompleteInboundCommand(AMQChannel.java:144)
    at com.rabbitmq.client.impl.AMQChannel.handleFrame(AMQChannel.java:91)
    at com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:552)
    at java.lang.Thread.run(Thread.java:745)

Nothing unusual in the state of the threads, and you are using a fairly recent version. Have you checked I/O rates (e.g. with iostat) around the time this happens, or tried capturing a network trace with Wireshark?

Thanks @michaelklishin
I've just recreated a minimal code snippet that causes the issue, and removed the SSL connection from my original code so that I can understand the network trace. Here it is.

(ns hang.core
  (:require [langohr.core :as rmq]
            [langohr.channel :as lch]
            [langohr.queue :as lq]
            [langohr.consumers :as lc]
            [langohr.basic :as lb])
  (:gen-class))

(defn connect []
  (rmq/connect {:host "hostname"}))

(defn populate-source-queue []
  (let [conn (connect)
        ch (lch/open conn)]
    (lq/declare ch "source")
    (dotimes [_ 15000] ;; If the blocking doesn't happen to you, please try again with this number increased
      (lb/publish ch "" "source" "a string"))
    (lch/close ch)))

(defn consume-source-publish-destination []
  (let [conn (connect)
        channel (lch/open conn)]
    (lq/declare channel "destination")
    (lc/subscribe channel "source" (fn [ch meta message]
                                     (let [pub-ch (lch/open conn)]
                                       (lb/publish pub-ch "" "destination" (str message))
                                       (lch/close pub-ch))
                                     (lb/ack ch (:delivery-tag meta))))))

(defn -main [& _args]
  (println "Start populating source queue")
  (populate-source-queue)
  (println "Start consuming source queue")
  (consume-source-publish-destination))

project.clj

(defproject hang "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.7.0"]
                 [com.novemberain/langohr "3.2.0"]]
  :aot [hang.core]
  :main hang.core)

I will examine iostat and a network trace as you suggested.

There are TCP Keep-Alive packets that are sent when the blocking happens but aren't sent when the code runs successfully.

[Screenshot, 2015-07-22 at 11:26:23 AM]

No.     Time           Protocol Length 
   3073 15.068784000   AMQP     834    Basic.Deliver Content-Header Content-Body Basic.Deliver Content-Header Content-Body Basic.Deliver Content-Header Content-Body Basic.Deliver Content-Header Content-Body Basic.Deliver Content-Header Content-Body Basic.Deliver Content-Header Content-Body Basic.Deliver Content-Header Content-Body Basic.Deliver

Frame 3073: 834 bytes on wire (6672 bits), 834 bytes captured (6672 bits) on interface 0
Ethernet II, Src: CadmusCo_58:c2:50 (08:00:27:58:c2:50), Dst: 0a:00:27:00:00:00 (0a:00:27:00:00:00)
Internet Protocol Version 4, Src: 192.168.59.103 (192.168.59.103), Dst: 192.168.59.3 (192.168.59.3)
Transmission Control Protocol, Src Port: 5672 (5672), Dst Port: 49929 (49929), Seq: 775766, Ack: 3525, Len: 768
    Source Port: 5672 (5672)
    Destination Port: 49929 (49929)
    [Stream index: 2]
    [TCP Segment Len: 768]
    Sequence number: 775766    (relative sequence number)
    [Next sequence number: 776534    (relative sequence number)]
    Acknowledgment number: 3525    (relative ack number)
    Header Length: 32 bytes
    .... 0000 0001 1000 = Flags: 0x018 (PSH, ACK)
    Window size value: 235
    [Calculated window size: 30080]
    [Window size scaling factor: 128]
    Checksum: 0x9dbe [validation disabled]
    Urgent pointer: 0
    Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
    [SEQ/ACK analysis]
    [PDU Size: 61]
    [PDU Size: 23]
    [PDU Size: 16]
    [PDU Size: 61]
    [PDU Size: 23]
    [PDU Size: 16]
    [PDU Size: 61]
    [PDU Size: 23]
    [PDU Size: 16]
    [PDU Size: 61]
    [PDU Size: 23]
    [PDU Size: 16]
    [PDU Size: 61]
    [PDU Size: 23]
    [PDU Size: 16]
    [PDU Size: 61]
    [PDU Size: 23]
    [PDU Size: 16]
    [PDU Size: 61]
    [PDU Size: 23]
    [PDU Size: 16]
    [PDU Size: 61]
    TCP segment data (7 bytes)
Advanced Message Queueing Protocol
Advanced Message Queueing Protocol
Advanced Message Queueing Protocol
Advanced Message Queueing Protocol
Advanced Message Queueing Protocol
Advanced Message Queueing Protocol
Advanced Message Queueing Protocol
Advanced Message Queueing Protocol
Advanced Message Queueing Protocol
Advanced Message Queueing Protocol
Advanced Message Queueing Protocol
Advanced Message Queueing Protocol
Advanced Message Queueing Protocol
Advanced Message Queueing Protocol
Advanced Message Queueing Protocol
Advanced Message Queueing Protocol
Advanced Message Queueing Protocol
Advanced Message Queueing Protocol
Advanced Message Queueing Protocol
Advanced Message Queueing Protocol
Advanced Message Queueing Protocol
Advanced Message Queueing Protocol

No.     Time           Protocol Length 
   3074 15.279386000   TCP      66     [TCP Keep-Alive] 5672→49929 [ACK] Seq=776533 Ack=3525 Win=30080 Len=0 TSval=11511364 TSecr=650867805

Frame 3074: 66 bytes on wire (528 bits), 66 bytes captured (528 bits) on interface 0
Ethernet II, Src: CadmusCo_58:c2:50 (08:00:27:58:c2:50), Dst: 0a:00:27:00:00:00 (0a:00:27:00:00:00)
Internet Protocol Version 4, Src: 192.168.59.103 (192.168.59.103), Dst: 192.168.59.3 (192.168.59.3)
Transmission Control Protocol, Src Port: 5672 (5672), Dst Port: 49929 (49929), Seq: 776533, Ack: 3525, Len: 0
    Source Port: 5672 (5672)
    Destination Port: 49929 (49929)
    [Stream index: 2]
    [TCP Segment Len: 0]
    Sequence number: 776533    (relative sequence number)
    Acknowledgment number: 3525    (relative ack number)
    Header Length: 32 bytes
    .... 0000 0001 0000 = Flags: 0x010 (ACK)
    Window size value: 235
    [Calculated window size: 30080]
    [Window size scaling factor: 128]
    Checksum: 0x76a9 [validation disabled]
    Urgent pointer: 0
    Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
    [SEQ/ACK analysis]

No.     Time           Protocol Length 
   3075 15.700030000   TCP      66     [TCP Keep-Alive] 5672→49929 [ACK] Seq=776533 Ack=3525 Win=30080 Len=0 TSval=11511406 TSecr=650868015

Frame 3075: 66 bytes on wire (528 bits), 66 bytes captured (528 bits) on interface 0
Ethernet II, Src: CadmusCo_58:c2:50 (08:00:27:58:c2:50), Dst: 0a:00:27:00:00:00 (0a:00:27:00:00:00)
Internet Protocol Version 4, Src: 192.168.59.103 (192.168.59.103), Dst: 192.168.59.3 (192.168.59.3)
Transmission Control Protocol, Src Port: 5672 (5672), Dst Port: 49929 (49929), Seq: 776533, Ack: 3525, Len: 0
    Source Port: 5672 (5672)
    Destination Port: 49929 (49929)
    [Stream index: 2]
    [TCP Segment Len: 0]
    Sequence number: 776533    (relative sequence number)
    Acknowledgment number: 3525    (relative ack number)
    Header Length: 32 bytes
    .... 0000 0001 0000 = Flags: 0x010 (ACK)
    Window size value: 235
    [Calculated window size: 30080]
    [Window size scaling factor: 128]
    Checksum: 0x75ad [validation disabled]
    Urgent pointer: 0
    Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
    [SEQ/ACK analysis]

No.     Time           Protocol Length 
   3076 16.538918000   TCP      66     [TCP Keep-Alive] 5672→49929 [ACK] Seq=776533 Ack=3525 Win=30080 Len=0 TSval=11511490 TSecr=650868435

Frame 3076: 66 bytes on wire (528 bits), 66 bytes captured (528 bits) on interface 0
Ethernet II, Src: CadmusCo_58:c2:50 (08:00:27:58:c2:50), Dst: 0a:00:27:00:00:00 (0a:00:27:00:00:00)
Internet Protocol Version 4, Src: 192.168.59.103 (192.168.59.103), Dst: 192.168.59.3 (192.168.59.3)
Transmission Control Protocol, Src Port: 5672 (5672), Dst Port: 49929 (49929), Seq: 776533, Ack: 3525, Len: 0
    Source Port: 5672 (5672)
    Destination Port: 49929 (49929)
    [Stream index: 2]
    [TCP Segment Len: 0]
    Sequence number: 776533    (relative sequence number)
    Acknowledgment number: 3525    (relative ack number)
    Header Length: 32 bytes
    .... 0000 0001 0000 = Flags: 0x010 (ACK)
    Window size value: 235
    [Calculated window size: 30080]
    [Window size scaling factor: 128]
    Checksum: 0x73b5 [validation disabled]
    Urgent pointer: 0
    Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
    [SEQ/ACK analysis]

No.     Time           Protocol Length 
   3077 18.219945000   TCP      66     [TCP Keep-Alive] 5672→49929 [ACK] Seq=776533 Ack=3525 Win=30080 Len=0 TSval=11511658 TSecr=650869273

Frame 3077: 66 bytes on wire (528 bits), 66 bytes captured (528 bits) on interface 0
Ethernet II, Src: CadmusCo_58:c2:50 (08:00:27:58:c2:50), Dst: 0a:00:27:00:00:00 (0a:00:27:00:00:00)
Internet Protocol Version 4, Src: 192.168.59.103 (192.168.59.103), Dst: 192.168.59.3 (192.168.59.3)
Transmission Control Protocol, Src Port: 5672 (5672), Dst Port: 49929 (49929), Seq: 776533, Ack: 3525, Len: 0
    Source Port: 5672 (5672)
    Destination Port: 49929 (49929)
    [Stream index: 2]
    [TCP Segment Len: 0]
    Sequence number: 776533    (relative sequence number)
    Acknowledgment number: 3525    (relative ack number)
    Header Length: 32 bytes
    .... 0000 0001 0000 = Flags: 0x010 (ACK)
    Window size value: 235
    [Calculated window size: 30080]
    [Window size scaling factor: 128]
    Checksum: 0x6fc7 [validation disabled]
    Urgent pointer: 0
    Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
    [SEQ/ACK analysis]

No.     Time           Protocol Length 
   3078 19.979403000   TCP      268    [TCP segment of a reassembled PDU]

Frame 3078: 268 bytes on wire (2144 bits), 268 bytes captured (2144 bits) on interface 0
Ethernet II, Src: CadmusCo_58:c2:50 (08:00:27:58:c2:50), Dst: 0a:00:27:00:00:00 (0a:00:27:00:00:00)
Internet Protocol Version 4, Src: 192.168.59.103 (192.168.59.103), Dst: 192.168.59.3 (192.168.59.3)
Transmission Control Protocol, Src Port: 15672 (15672), Dst Port: 49297 (49297), Seq: 1845, Ack: 1413, Len: 202
    Source Port: 15672 (15672)
    Destination Port: 49297 (49297)
    [Stream index: 0]
    [TCP Segment Len: 202]
    Sequence number: 1845    (relative sequence number)
    [Next sequence number: 2047    (relative sequence number)]
    Acknowledgment number: 1413    (relative ack number)
    Header Length: 32 bytes
    .... 0000 0001 1000 = Flags: 0x018 (PSH, ACK)
    Window size value: 6515
    [Calculated window size: 6515]
    [Window size scaling factor: -1 (unknown)]
    Checksum: 0x66a2 [validation disabled]
    Urgent pointer: 0
    Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
    [SEQ/ACK analysis]
    TCP segment data (202 bytes)

If I use separate connections for consuming and publishing, the blocking issue disappears.

(defn consume-source-publish-destination []
  (let [conn1 (connect)
        channel (lch/open conn1)
        conn2 (connect)]
    (lq/declare channel "destination")
    (lc/subscribe channel "source" (fn [ch meta message]
                                     (let [pub-ch (lch/open conn2)]
                                       (lb/publish pub-ch "" "destination" (str message))
                                       (lch/close pub-ch))
                                     (lb/ack ch (:delivery-tag meta))))))

Only connections that publish are blocked.

TCP keep-alive packets are a red herring. We need a pcap capture (a binary file).

Everything posted today suggests you are running into a resource-driven alarm. Are you 100% sure there are no alarms mentioned in the log? Can you at least send it to me privately? (michael in RabbitMQ domain).

I just sent it to you along with the Wireshark capture file.

For anyone following this issue: we decided to avoid the problem by not opening or closing channels inside the consumer handler function. You can see how we reached that conclusion at https://groups.google.com/forum/#!topic/clojure-rabbitmq/YRhDDMTOFWQ.
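
A minimal sketch of that workaround, assuming the same queue names as the reproduction above: the publishing channel is opened once, before subscribing, and the handler only publishes and acknowledges. The namespace hang.workaround and the names sub-ch and pub-ch are made up for illustration. The sketch also assumes a single consumer, so the publishing channel is only ever used by one handler invocation at a time; it has not been tested against the setup described in this thread.

```clojure
(ns hang.workaround ; hypothetical namespace, for illustration only
  (:require [langohr.channel :as lch]
            [langohr.queue :as lq]
            [langohr.consumers :as lc]
            [langohr.basic :as lb]))

(defn consume-source-publish-destination
  "Republishes every message from \"source\" to \"destination\" using one
  long-lived publishing channel instead of one channel per delivery.
  Assumes conn is an open connection and the \"source\" queue already exists."
  [conn]
  (let [sub-ch (lch/open conn)
        pub-ch (lch/open conn)] ; opened once, outside the handler
    (lq/declare sub-ch "destination")
    (lc/subscribe sub-ch "source"
                  (fn [ch meta ^bytes payload]
                    ;; No channel open/close and no queue declaration here,
                    ;; so the handler never waits for a reply from the broker.
                    (lb/publish pub-ch "" "destination" (String. payload "UTF-8"))
                    (lb/ack ch (:delivery-tag meta))))
    ;; Return the publishing channel so the caller can close it on shutdown.
    pub-ch))
```

It would be called as (consume-source-publish-destination (connect)), with connect as defined in the reproduction. If several consumers or threads were to publish concurrently, each would probably need its own publishing channel, since sharing one channel between concurrently publishing threads is generally discouraged.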

Thanks again, @michaelklishin.