mosquito / aio-pika

AMQP 0.9 client designed for asyncio and humans.

Home Page: https://aio-pika.readthedocs.org/

RobustChannel is not restoring after "Connection was stuck" error.

aozupek opened this issue

aio-pika==9.2.2
aiormq==6.7.7
RabbitMQ: 3.12.2 in local docker container

Related issues: mosquito/aiormq#165, #575, #563

This problem arises together with the "connection was stuck" bug in aiormq (mosquito/aiormq#165). If I manually restart RabbitMQ from Docker, or toggle the container's network off and on with the docker network disconnect/connect bridge rabbitmq-server commands, everything works as expected: all connections, channels, queues and exchanges are restored successfully. However, when the "connection was stuck" error occurs, only the connection is restored and the channels remain closed. Since RobustChannel is responsible for restoring the queues and exchanges, they are not restored either.

Reproducing the "connection was stuck" bug is not easy because it only shows up after a period without any operations. Sometimes it takes 15 minutes and sometimes 4-5 hours. I've tried changing the heartbeat duration, letting my computer sleep for a few minutes or hours, and keeping it open for hours without any operations, but unfortunately I couldn't find a reliable way to reproduce the bug.

Here are the logs from the last time I caught it (heartbeat was 1):

DEBUG:aiormq.connection:Received frame <pamqp.heartbeat.Heartbeat object at 0x10327d1d0> in channel #0 weight=8 on <Connection: "amqp://guest:******@localhost:5672//?heartbeat=1" at 0x103262030>
DEBUG:aiormq.connection:Received frame <pamqp.heartbeat.Heartbeat object at 0x10327fe90> in channel #0 weight=8 on <Connection: "amqp://guest:******@localhost:5672//?heartbeat=1" at 0x103262030>
DEBUG:aiormq.connection:Prepare to send ChannelFrame(payload=b'\x08\x00\x00\x00\x00\x00\x00\xce', should_close=False, drain_future=None)
DEBUG:aiormq.connection:Received frame <pamqp.heartbeat.Heartbeat object at 0x10327d1d0> in channel #0 weight=8 on <Connection: "amqp://guest:******@localhost:5672//?heartbeat=1" at 0x103262030>
DEBUG:aiormq.connection:Received frame <pamqp.heartbeat.Heartbeat object at 0x10327fe90> in channel #0 weight=8 on <Connection: "amqp://guest:******@localhost:5672//?heartbeat=1" at 0x103262030>
DEBUG:aiormq.connection:Prepare to send ChannelFrame(payload=b'\x08\x00\x00\x00\x00\x00\x00\xce', should_close=False, drain_future=None)
WARNING:aiormq.connection:Server connection <Connection: "amqp://guest:******@localhost:5672//?heartbeat=1" at 0x103262030> was stuck. No frames were received in 6 seconds.
DEBUG:aiormq.connection:Writer exited for <Connection: "amqp://guest:******@localhost:5672//?heartbeat=1" at 0x103262030>
DEBUG:aiormq.connection:Reader exited for <Connection: "amqp://guest:******@localhost:5672//?heartbeat=1" at 0x103262030>
DEBUG:aiormq.connection:Closing connection <Connection: "amqp://guest:******@localhost:5672//?heartbeat=1" at 0x103262030> cause: CancelledError()
INFO:aio_pika.robust_connection:Connection to amqp://guest:******@localhost:5672//?heartbeat=1 closed. Reconnecting after 2 seconds.
DEBUG:aio_pika.robust_connection:Connection attempt for <RobustConnection: "amqp://guest:******@localhost:5672//?heartbeat=1" 1 channels>
DEBUG:aiormq.connection:Connecting to: amqp://guest:******@localhost:5672//?heartbeat=1
DEBUG:aio_pika.robust_connection:Connection made on <RobustConnection: "amqp://guest:******@localhost:5672//?heartbeat=1" 1 channels>
DEBUG:aiormq.connection:Received frame <pamqp.heartbeat.Heartbeat object at 0x1032c5fd0> in channel #0 weight=8 on <Connection: "amqp://guest:******@localhost:5672//?heartbeat=1" at 0x1032cbbb0>
DEBUG:aiormq.connection:Received frame <pamqp.heartbeat.Heartbeat object at 0x1032c5b90> in channel #0 weight=8 on <Connection: "amqp://guest:******@localhost:5672//?heartbeat=1" at 0x1032cbbb0>
DEBUG:aiormq.connection:Prepare to send ChannelFrame(payload=b'\x08\x00\x00\x00\x00\x00\x00\xce', should_close=False, drain_future=None)
DEBUG:aiormq.connection:Received frame <pamqp.heartbeat.Heartbeat object at 0x1032c5fd0> in channel #0 weight=8 on <Connection: "amqp://guest:******@localhost:5672//?heartbeat=1" at 0x1032cbbb0>
DEBUG:aiormq.connection:Received frame <pamqp.heartbeat.Heartbeat object at 0x1032c5b90> in channel #0 weight=8 on <Connection: "amqp://guest:******@localhost:5672//?heartbeat=1" at 0x1032cbbb0>

Since I couldn't reproduce the bug, I tried to mimic it by calling await connection.reconnect() while the connection is open. Because this method closes the transport and then reconnects, it invokes the connection event callbacks.

async def reconnect(self) -> None:
    if self.transport:
        await self.transport.connection.close()

    await self.connect()
    await self.reconnect_callbacks()

Before the await self.connect() line executes, closing the transport invokes the __close_callback() method of the connection's channel instance with exc set to its default value, CancelledError, so the method just returns without clearing the __restored event.

async def __close_callback(self, _: Any, exc: BaseException) -> None:
    if isinstance(exc, asyncio.CancelledError):
        # This happens only if the channel is forced to close from the
        # outside, for example, if the connection is closed.
        # Of course, here you need to exit from this function
        # as soon as possible and to avoid a recovery attempt.
        return

    in_restore_state = not self.__restored.is_set()
    self.__restored.clear()

    if self._closed or in_restore_state:
        return

    await self.restore()

The problem is that after the connection reopens successfully, it invokes the restore() method of its channel, but because the __restored event was not cleared when the channel was closed, restore() just returns without invoking reopen().

async def restore(self, channel: Any = None) -> None:
    if channel is not None:
        warnings.warn(
            "Channel argument will be ignored because you "
            "don't need to pass this anymore.",
            DeprecationWarning,
        )

    async with self.__restore_lock:
        if self.__restored.is_set():
            return

        await self.reopen()
        self.__restored.set()

So I think there is a minor logic bug in the __close_callback() method of the RobustChannel class. We should clear the __restored event before returning if the exc argument is a CancelledError.

async def __close_callback(self, _: Any, exc: BaseException) -> None:
    if isinstance(exc, asyncio.CancelledError):
        # This happens only if the channel is forced to close from the
        # outside, for example, if the connection is closed.
        # Of course, here you need to exit from this function
        # as soon as possible and to avoid a recovery attempt.

        # We should also clear the restored event
        self.__restored.clear()

        return

    in_restore_state = not self.__restored.is_set()
    self.__restored.clear()

    if self._closed or in_restore_state:
        return

    await self.restore()
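
To make the reasoning above easier to check without a broker, here is a minimal sketch of the same flag logic. It is not aio-pika code: ToyRobustChannel, clear_on_cancel, is_open, simulate() and run_both() are hypothetical names invented for this illustration, and reopen() is reduced to flipping a boolean. It only assumes that the close callback and restore() behave as in the snippets quoted in this issue.

```python
import asyncio
from typing import Tuple


class ToyRobustChannel:
    """Hypothetical stand-in for RobustChannel (illustration only)."""

    def __init__(self, clear_on_cancel: bool) -> None:
        # clear_on_cancel=False models the original callback,
        # clear_on_cancel=True models the proposed change.
        self.clear_on_cancel = clear_on_cancel
        self.is_open = True
        self._closed = False  # the user never closes the channel here
        self._restored = asyncio.Event()
        self._restored.set()
        self._restore_lock = asyncio.Lock()

    async def close_callback(self, exc: BaseException) -> None:
        self.is_open = False  # the underlying channel is gone
        if isinstance(exc, asyncio.CancelledError):
            if self.clear_on_cancel:
                # Proposed change: remember that a restore is needed.
                self._restored.clear()
            return

        in_restore_state = not self._restored.is_set()
        self._restored.clear()
        if self._closed or in_restore_state:
            return
        await self.restore()

    async def restore(self) -> None:
        async with self._restore_lock:
            if self._restored.is_set():
                return  # believed to be restored already, nothing to do
            self.is_open = True  # stands in for "await self.reopen()"
            self._restored.set()


async def simulate(clear_on_cancel: bool) -> bool:
    channel = ToyRobustChannel(clear_on_cancel)
    # The connection closes and the channel callback gets CancelledError.
    await channel.close_callback(asyncio.CancelledError())
    # The connection comes back and asks the channel to restore itself.
    await channel.restore()
    return channel.is_open


def run_both() -> Tuple[bool, bool]:
    """Return (is_open with original logic, is_open with proposed logic)."""
    return asyncio.run(simulate(False)), asyncio.run(simulate(True))


if __name__ == "__main__":
    print(run_both())
```

In this toy model the channel stays closed with the original callback, because restore() still sees the event as set, and it is reopened once the event is cleared in the CancelledError branch.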

@mosquito Clearing the __restored event fixes the issue for me and reopens the channel with its queues and exchanges successfully. However, I couldn't test it with the actual "connection was stuck" error.

@mosquito I've tested it with the actual "connection was stuck" error and I can confirm that it restores the channel with its underlying queues and exchanges.

Your fix was released as aio-pika==9.2.3