jetstream pull_subscribe() not reconnecting after server reboot
martinsson opened this issue · comments
What version were you using?
v2.9.6
What environment was the server running in?
docker
Is this defect reproducible?
When the NATS server becomes unavailable for a short while, any pull subscription is lost (i.e., it stops receiving messages), whereas push subscribers work as expected.
Reproduce
Run the following script. Reboot the NATS server. Emit messages on the subject testnats.reconnect. Nothing is received.
import asyncio

import nats

SUBJECT = "testnats.reconnect"
CLUSTER_ADDRESS = "nats://localhost:4222"


async def run(loop):
    async def disconnected_cb():
        print("Disconnected!")

    async def reconnected_cb():
        print("Reconnected!")

    async def error_cb(e):
        print("Error:", e)

    nc = await nats.connect(
        servers=[CLUSTER_ADDRESS],
        error_cb=error_cb,
        disconnected_cb=disconnected_cb,
        reconnected_cb=reconnected_cb,
        max_reconnect_attempts=-1,
    )
    js = nc.jetstream()

    async def message_handler(msg):
        subject = msg.subject
        reply = msg.reply
        data = msg.data.decode()
        print(
            "Received a message on '{subject} {reply}': {data}".format(
                subject=subject, reply=reply, data=data
            )
        )
        await msg.ack()

    # await js.subscribe(SUBJECT, cb=message_handler)
    subscription = await js.pull_subscribe(SUBJECT, "durable_name", "TEST_STREAM")
    while True:
        try:
            messages = await subscription.fetch(1, 10000000)
            for message in messages:
                await message_handler(message)
        except Exception as e:
            print(e)


if __name__ == "__main__":
    loop = asyncio.new_event_loop()
    loop.create_task(run(loop))
    loop.run_forever()
It works as expected if I replace the pull_subscribe() logic with the commented-out js.subscribe() line.
Given the capability you are leveraging, describe your expectation?
Pull subscribers should reconnect after a server reboot
Given the expectation, what is the defect you are observing?
Pull subscribers are not resubscribing after a reconnect
I tried to create a new test under tests/test_js.py:PullSubscribeTest and ran it:
@async_long_test
async def test_pull_subscribe_reconnect(self):
    srv = self.server_pool[0]
    nc = NATS()
    await nc.connect()
    js = nc.jetstream()
    await js.add_stream(name="TEST1", subjects=["foo.1", "bar"])
    for i in range(10):
        ack = await js.publish("foo.1", "Hello from NATS!".encode())
        assert ack.stream == "TEST1"
        assert ack.seq == i + 1
    consumer = await js.pull_subscribe("foo.1", "dur")
    for i in range(10):
        # Test fails if we uncomment the following lines:
        # srv.stop()
        # srv.start()
        msg, *rest = await consumer.fetch(1, timeout=10)
        assert msg.data == b"Hello from NATS!"
        assert msg.metadata.stream == "TEST1"
        await msg.ack()
This test passes, but if I uncomment the srv.stop() and srv.start() lines it does not. @wallyqs does this test correctly describe the auto-reconnect behavior we should expect from nats-py?
Edit
The following test passes successfully (the call to .fetch() is wrapped in a while True loop with a lower timeout):
@async_long_test
async def test_pull_subscribe_reconnect(self):
    srv = self.server_pool[0]
    nc = NATS()
    await nc.connect(
        max_reconnect_attempts=-1,
        allow_reconnect=True,
        reconnect_time_wait=1,
    )
    js = nc.jetstream()
    await js.add_stream(name="TEST1", subjects=["foo.1", "bar"])
    for i in range(3):
        ack = await js.publish("foo.1", "Hello from NATS!".encode())
        assert ack.stream == "TEST1"
        assert ack.seq == i + 1
    consumer = await js.pull_subscribe("foo.1", "dur")
    for i in range(3):
        srv.stop()
        srv.start()
        while True:
            try:
                msg, *rest = await consumer.fetch(1, timeout=0.5)
            except TimeoutError:
                continue
            break
        assert msg.data == b"Hello from NATS!"
        assert msg.metadata.stream == "TEST1"
        await msg.ack()
It seems that using a big or infinite timeout leads to the error in the first test: the fetch request is never sent to the server, so the fetch will never succeed, regardless of the timeout value.
For now, it seems preferable to use a lower timeout and retry the fetch operation.
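The retry loop from the edit above can be factored into a small helper. This is only a sketch: `fetch_with_retry` and its parameters are hypothetical names, and it assumes nats-py's fetch timeout error is (a subclass of) asyncio.TimeoutError:

```python
import asyncio


async def fetch_with_retry(fetch, batch=1, timeout=0.5, retry_delay=0.25):
    """Retry a pull fetch with a short per-attempt timeout.

    `fetch` is any coroutine function with the PullSubscription.fetch
    signature (e.g. `subscription.fetch`). Using a short timeout means
    that a next-message request lost during a disconnect is simply
    re-sent on the next attempt instead of blocking forever.
    """
    while True:
        try:
            return await fetch(batch, timeout=timeout)
        except asyncio.TimeoutError:
            # The request may have been dropped during a reconnect; retry.
            await asyncio.sleep(retry_delay)
```

A small delay between attempts avoids hammering the server while it is still down; the test above uses 250ms for the same reason.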
I'd say that the following lines are "responsible" for this behavior (nats/js/client.py:L828):
await self._nc.publish(
    self._nms,
    json.dumps(next_req).encode(),
    self._deliver,
)
# Wait for the response or raise timeout.
msg = await self._sub.next_msg(timeout)
This publish operation never fails, whether or not the message is actually sent to NATS, so we end up waiting for the whole timeout (at least that seems to be the case for the code sample provided in this issue).
I'm wondering whether it is normal that the NATS client does not send the publish once it is reconnected, though. Naively, I would expect the message to stay in the pending queue so that it is sent on reconnect; NATS would then publish it and the coroutine waiting on the subscription queue (e.g., PullSubscription._sub.next_msg) would still receive the response before the timeout. But that does not seem to be the case? Maybe that's because the flusher loop does not know that the server is disconnected, hits a BrokenPipe error while writing to the transport, and the message is lost 🤨?
It may be worth knowing that setting the connect option pending_size=0 will prevent the call to publish from succeeding and will cause the error nats.errors.OutboundBufferLimitError: nats: outbound buffer limit exceeded to be raised on L828. This way, the client never waits if the message was not published.
This last test illustrates very well what is happening (run with pytest tests/test_js.py -k test_pull_subscribe_reconnect -s to see captured stderr in the console):
@async_long_test
async def test_pull_subscribe_reconnect(self):
    srv = self.server_pool[0]
    nc = NATS()
    await nc.connect(
        max_reconnect_attempts=-1,
        allow_reconnect=True,
        reconnect_time_wait=1,
        pending_size=0,
    )
    js = nc.jetstream()
    await js.add_stream(name="TEST1", subjects=["foo.1", "bar"])
    for i in range(3):
        ack = await js.publish("foo.1", "Hello from NATS!".encode())
        assert ack.stream == "TEST1"
        assert ack.seq == i + 1
    consumer = await js.pull_subscribe("foo.1", "dur")
    for i in range(3):
        srv.stop()
        srv.start()
        await asyncio.sleep(0)
        while True:
            try:
                msg, *rest = await consumer.fetch(1, timeout=0.5)
            except TimeoutError:
                if nc.is_connected:
                    print("TIMEOUT - Retrying immediately", file=sys.stderr)
                else:
                    print(
                        "TIMEOUT - Retrying in 250ms (currently disconnected)",
                        file=sys.stderr,
                    )
                    await asyncio.sleep(0.250)
            except OutboundBufferLimitError:
                print("NOT CONNECTED - Retrying in 250ms", file=sys.stderr)
                await asyncio.sleep(0.250)
            else:
                print("OK - Got message", file=sys.stderr)
                break
        assert msg.data == b"Hello from NATS!"
        assert msg.metadata.stream == "TEST1"
        await msg.ack()
Captured Output:
[DEBUG] Server listening on port 4222 started.
[DEBUG] Server listening on 4222 will stop.
[DEBUG] Server listening on 4222 was stopped.
[DEBUG] Server listening on port 4222 started.
TIMEOUT - Retrying in 250ms (currently disconnected)
NOT CONNECTED - Retrying in 250ms
NOT CONNECTED - Retrying in 250ms
OK - Got message
[DEBUG] Server listening on 4222 will stop.
[DEBUG] Server listening on 4222 was stopped.
[DEBUG] Server listening on port 4222 started.
TIMEOUT - Retrying in 250ms (currently disconnected)
NOT CONNECTED - Retrying in 250ms
OK - Got message
[DEBUG] Server listening on 4222 will stop.
[DEBUG] Server listening on 4222 was stopped.
[DEBUG] Server listening on port 4222 started.
TIMEOUT - Retrying in 250ms (currently disconnected)
NOT CONNECTED - Retrying in 250ms
NOT CONNECTED - Retrying in 250ms
OK - Got message
[DEBUG] Server listening on 4222 will stop.
[DEBUG] Server listening on 4222 was stopped.
Second edit
I'm pretty sure that the request for the next message is not published due to an error encountered while flushing (broken pipe). When an error occurs within the flusher task, the pending buffer is lost. This has a much broader impact than just pull subscriptions, but if we change the _flusher method to something like this:
  try:
      if self._pending_data_size > 0:
-         self._transport.writelines(self._pending[:])
+         pending = self._pending[:]
+         self._pending = []
+         self._pending_data_size = 0
+         self._transport.writelines(pending)
-         await self._transport.drain()
+         try:
+             await self._transport.drain()
+         except BaseException:
+             if pending:
+                 self._pending = pending + self._pending
+                 self._pending_data_size += len(pending)
+             raise
  except OSError as e:
+     if pending:
+         self._pending_data_size += len(pending)
+         self._pending = pending + self._pending
then messages are indeed published on reconnect and the test passes. However, the asyncio.StreamWriter documentation (the transport used in the code samples in this issue) states that:
This is a flow control method that interacts with the underlying IO write buffer. When the size of the buffer reaches the high watermark, drain() blocks until the size of the buffer is drained down to the low watermark and writing can be resumed. When there is nothing to wait for, the drain() returns immediately.
I don't really understand what this means 😅 but I take it as "the caller doesn't know how much of the pending data has actually been written", so I guess the code above may send messages twice if an error occurred after some data was effectively written but before drain completed. In the end, it seems that users should always consider the possibility that a publish is never received by the server.
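Since the requeue sketch above can, as noted, deliver the same bytes twice, and since JetStream also redelivers unacknowledged messages after a reconnect, consumers may want to deduplicate on the stream sequence number (which is stable across redeliveries). A minimal sketch, where `dedupe_by_stream_seq` is a hypothetical helper assuming nats-py style msg.metadata.sequence.stream:

```python
def dedupe_by_stream_seq(messages, seen):
    """Drop messages whose stream sequence was already processed.

    `seen` is a set of stream sequence numbers owned by the caller;
    each message is assumed to expose nats-py style metadata
    (msg.metadata.sequence.stream).
    """
    fresh = []
    for msg in messages:
        seq = msg.metadata.sequence.stream
        if seq not in seen:
            seen.add(seq)
            fresh.append(msg)
    return fresh
```

Note this only guards against redelivery of the same stream entry; a PUB that is resent by the flusher would be stored twice with different sequences, and the standard guard for that on the publish side is JetStream's Nats-Msg-Id header together with the stream's deduplication window.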
Regarding the specific case of PullSubscription.fetch(), it seems that it's up to users to make sure they use small timeouts with retries instead of None, because it will always be possible that a message is considered published by the client but never received by the server.
On the nats-py side, it may be worth changing the PullSubscription.fetch() signature to not accept None as a timeout value, because it will wait forever if the request failed to be published?
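Until such a signature change lands, that guard is easy to apply in user code. A sketch, where `safe_fetch` is a hypothetical wrapper name:

```python
async def safe_fetch(sub, batch=1, timeout=5.0):
    """Fetch from a pull subscription, refusing timeout=None.

    `sub` is assumed to be a nats-py PullSubscription; rejecting None up
    front means a next-request that is never delivered to the server can
    never block the caller forever.
    """
    if timeout is None:
        raise ValueError(
            "timeout=None would wait forever if the fetch request "
            "is never delivered to the server"
        )
    return await sub.fetch(batch, timeout=timeout)
```

Combining this with a retry loop (as in the tests above) gives the behavior one would naively expect from an auto-reconnecting pull consumer.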
I can second this bug; it hit us in production when we last did an update.
We managed to catch it quickly and reboot, but I would expect nats-py to properly handle server reboots!
Some notes:
Latest nats-py version
NATS server 2.9.24.
On my end this possibly affects pull subscriptions as well as request/reply and standard subscriptions.