nats-io / nats.net

The official C# Client for NATS


MessageConsumer does not properly reconnect

sanjayesn opened this issue · comments

Defect

When constructing a MessageConsumer using the simplified API, turning the server off and on again results in the consumer not being able to receive any more messages. This seems to be a bug in how the consumer reconnects after a disconnect. Is reconnection supported in the simplified API, or should we use a different API for now if we need to be robust in the face of disconnects/reconnects?

Versions of NATS.Client and nats-server:

NATS.Client: 1.0.7
nats-server: 2.9.20

OS/Container environment:

WSL (Ubuntu 22.04 LTS).

Steps or code to reproduce the issue:

Using a modified version of the SimplificationMessageConsumer example:

using System;
using System.Diagnostics;
using System.Text;
using System.Threading;
using NATS.Client;
using NATS.Client.JetStream;

namespace NATSExamples
{
    internal static class MessageConsumerExample
    {
        private static readonly string STREAM = "consume-handler-stream";
        private static readonly string SUBJECT = "consume-handler-subject";
        private static readonly string CONSUMER_NAME = "consume-handler-consumer";
        private static readonly int STOP_COUNT = 5;

        private static readonly string SERVER = "nats://localhost:4222";

        public static void Main(String[] args)
        {
            Options opts = ConnectionFactory.GetDefaultOptions(SERVER);

            using (IConnection c = new ConnectionFactory().CreateConnection(opts))
            {
                IJetStreamManagement jsm = c.CreateJetStreamManagementContext();
                IJetStream js = c.CreateJetStreamContext();

                // sets up the stream and publishes data
                // JsUtils.CreateOrReplaceStream(jsm, STREAM, SUBJECT);
                // in case the stream was here before, we want a completely new one
                try
                {
                    jsm.DeleteStream(STREAM);
                }
                catch (Exception)
                {
                }

                jsm.AddStream(StreamConfiguration.Builder()
                    .WithName(STREAM)
                    .WithSubjects(SUBJECT)
                    .Build());
                // get stream context, create consumer and get the consumer context
                IStreamContext streamContext;
                IConsumerContext consumerContext;
                try
                {
                    streamContext = c.CreateStreamContext(STREAM);
                    consumerContext = streamContext.CreateOrUpdateConsumer(ConsumerConfiguration.Builder().WithDurable(CONSUMER_NAME).Build());
                }
                catch (Exception)
                {
                    // possible exceptions
                    // - a connection problem
                    // - the stream or consumer did not exist
                    return;
                }

                CountdownEvent latch = new CountdownEvent(1);
                int count = 0;
                Stopwatch sw = Stopwatch.StartNew();
                EventHandler<MsgHandlerEventArgs> handler = (s, e) =>
                {
                    Console.WriteLine("Handler got a message...");
                    Thread.Sleep(1000);
                    e.Message.Ack();
                    if (++count == STOP_COUNT)
                    {
                        latch.Signal();
                    }
                };

                using (IMessageConsumer consumer = consumerContext.Consume(handler))
                {
                    latch.Wait();
                    // once the consumer is stopped, the client will drain messages
                    Console.WriteLine("Stop the consumer...");
                    consumer.Stop(1000);
                    Thread.Sleep(1000); // enough for messages to drain after stop
                }

                Console.WriteLine("Done!");
            }
        }
    }
}

Start running the example, then publish messages via the CLI using the command nats pub consume-handler-subject Hello-World. After doing this a few times, stop the NATS server and restart it, then run the CLI publish command a few more times.

Expected result:

I expected that after stopping and restarting the NATS server and publishing more messages, the message consumer would receive the newly published messages.

Actual result:

Handler got a message...
Handler got a message...
Handler got a message...
PullStatusError, Connection: 30, Subscription: 2, ConsumerName:consume-handler-consumer, Status: Status 409 Server Shutdown
DisconnectedEvent, Connection: 30
ReconnectedEvent, Connection: 18
HeartbeatAlarm, Connection: 18, Subscription: 2, ConsumerName:consume-handler-consumer, lastStreamSequence: 3, lastConsumerSequence: 3
HeartbeatAlarm, Connection: 18, Subscription: 2, ConsumerName:consume-handler-consumer, lastStreamSequence: 3, lastConsumerSequence: 3
HeartbeatAlarm, Connection: 18, Subscription: 2, ConsumerName:consume-handler-consumer, lastStreamSequence: 3, lastConsumerSequence: 3

The consumer properly receives the first few messages and upon server restart, the consumer seems to reconnect with the ReconnectedEvent message. However, publishing more messages via the CLI does not result in the consumer getting the messages, and instead HeartbeatAlarm messages start popping up in the logs at a frequency of ~40 seconds. In other words, the MessageConsumer fails to properly reconnect.

How many servers in your cluster? Can you try these experiments?

Experiment 1. Kill server with the consumer leader if it's not the same server you are connected to...
Experiment 2. Kill server you are connected to if it's not the same as the consumer leader...
Experiment 3. Kill server you are connected to if it is the same as the consumer leader...

It's possible that the server with your consumer leader is gone and until it comes back, the consumer will not continue. There is nothing that the client can do in that case except wait for that server to come up, or to make a new almost identical consumer except starting at stream sequence 1 greater than the last message you successfully read.
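The "new almost identical consumer" idea could be sketched roughly like this (a hedged sketch, not an official recovery mechanism; the new durable name and the way `lastStreamSequence` is tracked are assumptions, and `WithDeliverPolicy`/`WithStartSequence` are the NATS.Client builder methods for starting delivery at a specific stream sequence):

```csharp
// Hedged sketch: create a near-identical consumer that resumes one past the
// last stream sequence the application successfully processed.
// "lastStreamSequence" must be tracked by the application itself, e.g. read
// from msg.MetaData.StreamSequence inside the message handler.
IConsumerContext RecreateConsumerAfter(IStreamContext streamContext, ulong lastStreamSequence)
{
    ConsumerConfiguration cc = ConsumerConfiguration.Builder()
        .WithDurable("consume-handler-consumer-2") // hypothetical new name
        .WithDeliverPolicy(DeliverPolicy.ByStartSequence)
        .WithStartSequence(lastStreamSequence + 1)
        .Build();
    return streamContext.CreateOrUpdateConsumer(cc);
}
```

The new name avoids colliding with the original server-side consumer, which may still exist with its old delivery state.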

I was just trying this locally with one server - in this case when the server comes back up, it is by default the leader right?

Is it a memory stream as in the code? How can you tell that the CLI is publishing? Can you run nats stream info or nats s info?

Good point, I just removed the .WithStorageType(StorageType.Memory) line and the issue is the same.

I can confirm the cli is publishing because when I publish and then run nats stream info the message count goes up.

Information for Stream consume-handler-stream created 2023-07-31 19:37:15

             Subjects: consume-handler-subject
             Replicas: 1
              Storage: File

Options:

            Retention: Limits
     Acknowledgements: true
       Discard Policy: Old
     Duplicate Window: 2m0s
    Allows Msg Delete: true
         Allows Purge: true
       Allows Rollups: false

Limits:

     Maximum Messages: unlimited
  Maximum Per Subject: unlimited
        Maximum Bytes: unlimited
          Maximum Age: unlimited
 Maximum Message Size: unlimited
    Maximum Consumers: unlimited


State:

             Messages: 5
                Bytes: 320 B
             FirstSeq: 1 @ 2023-08-01T02:37:31 UTC
              LastSeq: 5 @ 2023-08-01T02:40:51 UTC
     Active Consumers: 1
   Number of Subjects: 1

Just following up on this - are you able to recreate this issue locally? I also confirmed that we can circumvent this issue using the legacy API, with this slightly modified snippet, so would you recommend reverting to the legacy API?

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text;
using System.Threading;
using NATS.Client;
using NATS.Client.JetStream;

namespace NATSExamples
{
    internal static class MessageConsumerExample
    {
        private static readonly string STREAM = "consume-handler-stream";
        private static readonly string SUBJECT = "consume-handler-subject";
        private static readonly string CONSUMER_NAME = "consume-handler-consumer";
        private static readonly int STOP_COUNT = 5;

        private static readonly string SERVER = "nats://localhost:4222";

        public static void Main(String[] args)
        {
            Options opts = ConnectionFactory.GetDefaultOptions(SERVER);

            using (IConnection c = new ConnectionFactory().CreateConnection(opts))
            {
                IJetStreamManagement jsm = c.CreateJetStreamManagementContext();
                IJetStream js = c.CreateJetStreamContext();

                // sets up the stream and publishes data
                // JsUtils.CreateOrReplaceStream(jsm, STREAM, SUBJECT);
                // in case the stream was here before, we want a completely new one
                try
                {
                    jsm.DeleteStream(STREAM);
                }
                catch (Exception)
                {
                }

                jsm.AddStream(StreamConfiguration.Builder()
                    .WithName(STREAM)
                    .WithSubjects(SUBJECT)
                    .Build());
                // get stream context, create consumer and get the consumer context
                IStreamContext streamContext;
                IConsumerContext consumerContext;
                try
                {
                    streamContext = c.CreateStreamContext(STREAM);
                    consumerContext = streamContext.CreateOrUpdateConsumer(ConsumerConfiguration.Builder().WithDurable(CONSUMER_NAME).Build());
                }
                catch (Exception)
                {
                    // possible exceptions
                    // - a connection problem
                    // - the stream or consumer did not exist
                    return;
                }

                CountdownEvent latch = new CountdownEvent(1);
                int count = 0;
                Stopwatch sw = Stopwatch.StartNew();
                EventHandler<MsgHandlerEventArgs> handler = (s, e) =>
                {
                    Console.WriteLine("Handler got a message...");
                    Thread.Sleep(1000);
                    e.Message.Ack();
                    if (++count == STOP_COUNT)
                    {
                        latch.Signal();
                    }
                };

                Console.WriteLine("\nC. Legacy Pull Subscription then Iterate");
                PullSubscribeOptions pullSubscribeOptions = PullSubscribeOptions.Builder().Build();
                IJetStreamPullSubscription usub = js.PullSubscribe(SUBJECT, pullSubscribeOptions);

                while (true) 
                {
                    try 
                    {
                        IList<Msg> messages = usub.Fetch(10, 2000);
                        foreach (Msg msg in messages)
                        {
                            Console.WriteLine("Handler got a message...");
                            msg.Ack();
                        }
                    }
                    catch
                    {
                        continue;
                    }
                }
            }
        }
    }
}

Using the legacy API and turning the NATS server off and on again results in the consumer properly reconnecting and getting new messages after the server restart:

C. Legacy Pull Subscription then Iterate
Handler got a message...
Handler got a message...
Handler got a message...

PullStatusError, Connection: 27, Subscription: 2, ConsumerName:CZn0tJ_WAU, Status: Status 409 Server Shutdown
DisconnectedEvent, Connection: 27
ReconnectedEvent, Connection: 25
Handler got a message...
Handler got a message...
Handler got a message...
^C

So I think you demonstrated the problem. It's not really a difference between the legacy API and the new one: with the legacy API, you simply issue another pull if there is an issue, so in the "new" API, if there is an issue, you have to restart the consumer.
I'll mention this to the other client devs and ask if we want that level of recovery in the new API.

I posted this in our private channel:

How is recovery handled while (endless) consuming? For example, let's say I start an endless consume, and the server I'm connected to goes down. This means the pull or pulls that were in progress are gone. Is it expected that the endless consume stays aware of the connection and restarts pulls? Or is it up to the user to handle the error (they can listen to connection errors and heartbeat errors) and restart the consume?

Awesome, thanks! I'm just wondering how to restart the MessageConsumer when one of these errors occurs, because the heartbeat/disconnect handlers are set at a higher level (the NATS client) than the MessageConsumer (ConsumeOptions passed to the MessageConsumer don't allow us to configure these handlers). With the legacy API, we directly get an exception but there doesn't seem to be such a direct way of knowing an issue with a particular consumer has happened in the Simplified API.

If it's a durable consumer... just do this again.

using (IMessageConsumer consumer = consumerContext.Consume(handler))

The consumerContext is just a representation of the real server side consumer.

For an ephemeral consumer, you would probably need an entirely new ConsumerContext. If the actual consumer's inactive threshold is long enough, the server that went down was not the consumer leader, and you know the consumer name, you might be able to just call .Consume again.

As far as error handlers, it's the same connection object that you configured, so they are still there.

Heartbeat is restarted every time a raw pull request is made anyway, which happens repeatedly inside the Consume.
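Putting the "just do this again" advice together, a durable-consumer restart might look roughly like this (a sketch under assumptions: `restartNeeded` would be set from one of the connection-level handlers, and `done`, `handler`, and `consumerContext` come from the surrounding program):

```csharp
// Sketch, assuming a durable consumer: "restartNeeded" is a flag the
// application sets from a connection-level handler (e.g. heartbeat alarm)
// and clears here after restarting the consume.
IMessageConsumer consumer = consumerContext.Consume(handler);
while (!done)
{
    if (restartNeeded)
    {
        try { consumer.Stop(1000); } catch (Exception) { /* already stalled */ }
        consumer.Dispose();
        // Same durable server-side consumer; this just starts fresh pulls.
        consumer = consumerContext.Consume(handler);
        restartNeeded = false;
    }
    Thread.Sleep(500);
}
consumer.Stop(1000);
```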

We're using durable consumers, but the place where we would have to restart the consumer by calling Consume again would be in the error handler right? This is problematic for us because we have a NATS client library that creates durable MessageConsumers and returns them to other services (which handle stopping the consumers). Moreover, if we have multiple MessageConsumers listening to different streams/subjects, how can we tell in the error handler which one needs to be restarted? It sounds like we'll need to maintain a lot more state to support this, so it seems like reverting to the legacy API is our best bet until this process is simplified further.

Your suggestion also seems to contradict the docs around automatic reconnections: https://docs.nats.io/using-nats/developer/connecting/reconnect. If new pulls are happening repeatedly inside the Consume, why don't these pulls handle reconnection in the same way as doing another pull in the legacy API?

I think it's reasonable for the service class to be the error handler. It can start and stop consumers as it needs to and have all the state it needs. I think your "service" needs to own everything, not just pass a MessageConsumer to something else; otherwise, as you've discovered, you will have to do some code gymnastics.

I would instead just have my services be extensions of a base service that knows how to connect and recover, make a consumer, etc. Depending on your use case, it's not unreasonable for each service to have its own connection. This will distribute connections/load across servers, reducing problems to only the services that are connected to the downed server and also probably making things faster.

How can you tell which need to be restarted? Probably all of them. I think all would be in a similar state after a disconnection on the same connection.

As far as contradicting the docs, the legacy fetch is exactly one raw pull.
In your example you saw that the fetch/pull didn't survive the disconnect.
The new simplification API uses continuous raw pulls under the covers and is more complicated, but again it could be better aware of failing heartbeats; that's the question I posed. There is a fine line as to how much we should build into the API, though. I promise you we are discussing it as a team to find that line.

What exactly would we need to be listening to to trigger a restart of the consumers? It seems like every user of long lived / durable consumers would need to implement such client-side functionality. Is there an example of this usage anywhere?

I'm working on adding better recovery, but the DisconnectedEventHandler, ReconnectedEventHandler and HeartbeatAlarmEventHandler are places to start.
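Those three handlers are configured on the connection Options before connecting. A minimal sketch (the handler bodies are placeholders; how the application flags its consumers for restart is up to it):

```csharp
// Sketch: the three connection-level handlers named above, set on Options
// before the connection is created. The bodies are application-specific;
// flagging consumers for a Consume restart is just one possibility.
Options opts = ConnectionFactory.GetDefaultOptions("nats://localhost:4222");
opts.DisconnectedEventHandler += (s, e) =>
    Console.WriteLine("Disconnected - in-flight pulls are likely lost");
opts.ReconnectedEventHandler += (s, e) =>
    Console.WriteLine("Reconnected - consider restarting Consume on each consumer");
opts.HeartbeatAlarmEventHandler += (s, e) =>
    Console.WriteLine("Heartbeat alarm - the underlying pull has probably stalled");
IConnection c = new ConnectionFactory().CreateConnection(opts);
```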

Do you mean that all three of those handlers need to be implemented to restart the consumer if they fire?

Is there any timeline for when this API might be updated?

Your suggestion also seems to contradict the docs around automatic reconnections: https://docs.nats.io/using-nats/developer/connecting/reconnect. If new pulls are happening repeatedly inside the Consume, why don't these pulls handle reconnection in the same way as doing another pull in the legacy API?

Yeah the docs led me down the wrong path too. If it wasn't for this issue I'd still be sure I'd hit a bug.
If I have to reconnect that is fine, but the documentation explicitly says I don't have to.

There certainly is confusion in the docs. But when you make a raw pull, it is a one time command to the server to send messages with certain time/byte/message count parameters. If that pull is cutoff in the middle, there is no way for it to resume.
The plan for simplification, which makes repeated use of raw pulls, is to recognize this situation and react to it by issuing a new pull to replace the failed one. Currently the plan is to wait for a [raw pull request] heartbeat error to know that the pull is not going to finish.

Maybe I'm hitting something different then? I have a push consumer. On reconnect I'd expect future messages to get picked up by my message handler.

First, let's clarify: this whole time I thought we were talking about pull consumers.

  • There is the push consumer, where the server "pushes" messages to the client.
  • There is the pull consumer, where the client must request messages in batches that can be limited by time, count, and bytes.
  • The simplification macro consumers use pull consumers under the covers.

Regarding push consumers... As it turns out I'm working on a push consumers example. Take a look at this piece of code I put together: https://github.com/nats-io/java-nats-examples/tree/main/robust-push-subscription

The thing is, many times the subscription will be able to resume, but there are several variables to consider.

  1. What is the inactive threshold of the consumer, if it's ephemeral rather than durable?
  2. What type of stream storage is there, file or memory?
  3. Was the server that went down the stream leader or the consumer leader?
  4. Was that server also the one the client was connected to?
  5. What is the replication factor of the stream?

The example I'm building relies on the heartbeat alarm warning to recognize when the consumer is no longer receiving messages. This is the same way the simplification consumers will recognize a stalled consumer. Pull consumers do not recover like push consumers because a pull request gets lost when its server goes down. The simplification endless consumers will try to recognize this and make a fresh pull.
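A C# analog of the linked push-subscription example might look like this (a sketch; the 3-second idle heartbeat and the resubscribe-on-alarm strategy are assumptions, not built-in recovery):

```csharp
// Sketch of a push subscription with an idle heartbeat, so a stall is
// detectable via the heartbeat alarm. "js", "SUBJECT", and "handler" are
// assumed to come from the surrounding program, as in the earlier snippets.
ConsumerConfiguration cc = ConsumerConfiguration.Builder()
    .WithIdleHeartbeat(3000) // server sends heartbeats while idle
    .Build();
PushSubscribeOptions pso = PushSubscribeOptions.Builder()
    .WithConfiguration(cc)
    .Build();
IJetStreamPushAsyncSubscription sub = js.PushSubscribeAsync(SUBJECT, handler, false, pso);
// When a heartbeat alarm fires for this subscription, tear it down and
// subscribe again so the server resumes delivery:
// sub.Unsubscribe();
// sub = js.PushSubscribeAsync(SUBJECT, handler, false, pso);
```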