nats-io / nack

NATS Controllers for Kubernetes (NACK)

Deleting a nats cluster pod results in peer log: error sending snapshot to follower [xyz]: raft: no snapshot available

joriatyBen opened this issue · comments

Issue:

When killing the stream leader of a NATS JetStream cluster, it seems that RAFT does not synchronize correctly. After the deleted peer (typically a StatefulSet pod such as nats-0, nats-1, nats-2, ..., nats-n) has recovered, it logs the warning JetStream cluster consumer '$G > data > some_durable-connection' has NO quorum, stalled. The peers (pods) that were not deleted log the error Error sending snapshot to follower [xyz]: raft: no snapshot available.

This issue results in a stream that no longer works correctly. It seems that the clients are not connected to JetStream properly.

When requesting consumer info in nats-box, the following is printed out.
error: could not load Consumer consumer-xyz > consumer_durable-connection: JetStream system temporarily unavailable (10008)

╭─────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                   Stream Report                                                 │
├──────────┬─────────┬───────────┬──────────┬─────────┬──────┬─────────┬──────────────────────────┤
│ Stream   │ Storage │ Consumers │ Messages │ Bytes   │ Lost │ Deleted │ Replicas                 │
├──────────┼─────────┼───────────┼──────────┼─────────┼──────┼─────────┼──────────────────────────┤
│ stream-1 │ Memory  │ 2         │ 23,753   │ 48 MiB  │ 0    │ 0       │ nats-0, nats-1, nats-2*  │
│ stream-2 │ Memory  │ 1         │ 91,459   │ 238 MiB │ 0    │ 0       │ nats-0, nats-1!, nats-2* │
│ stream-3 │ Memory  │ 4         │ 201,431  │ 477 MiB │ 0    │ 0       │ nats-0, nats-1, nats-2*  │
╰──────────┴─────────┴───────────┴──────────┴─────────┴──────┴─────────┴──────────────────────────╯
Experienced with:

Kubernetes 1.21.2 with a NATS JetStream cluster (3 and 5 StatefulSet pods) and 3 streams (3 replicas each), NATS JetStream 2.7.2, Java client with jnats 2.13.2.
NATS JetStream was installed via the official Helm chart.

Reproduce:

Kill the peer pod that is marked as the current leader for a stream (kubectl delete pod), then check the peer pod logs; a sketch follows below.
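A minimal sketch of that reproduction, assuming the chart's default names (a nats-box Deployment, a nats container, pods nats-0/1/2) and one of the streams from the report above; adjust the names for your setup:

    # Show which peer currently leads the stream (listed under Cluster Information):
    kubectl exec -it deployment/nats-box -- nats stream info stream-1

    # Delete the leader pod reported above (pod name is illustrative):
    kubectl delete pod nats-2

    # Follow the logs of a surviving peer and watch for the quorum/snapshot messages:
    kubectl logs -f nats-0 -c nats | grep -iE "quorum|snapshot"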

Hi guys, were you at least able to reproduce it, or do I have to provide more information?

Hi @bajoben, do you have file storage enabled in your system?
https://github.com/nats-io/k8s/tree/main/helm/charts/nats#setting-up-memory-and-file-storage

Currently you still need to configure file storage for the RAFT logs, even if you are only using memory-based streams.

@wallyqs Thanks for replying. I added this to the Helm chart values:

    fileStorage:
      enabled: true
      storageDirectory: /data/
      size: 1Gi

Still, if I kill a stream leader, the stream does not recover properly. nats: error: could not load Consumer consumer > consumer_durable-connection: JetStream system temporarily unavailable (10008).

Edit: I don't want to cause confusion, but it seems to work from time to time. I tested it a lot with:

  • 3 NATS peers and 3 streams
  • 9 NATS peers and 3 streams

When I kill the leader for the first time, the streams reconnect and the peers synchronize properly. When I then kill another leader, it does not recover. But if I then kill another arbitrary NATS peer, they recover properly again. This just seems flaky.
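For what it's worth, a rough loop to exercise this repeatedly and catch the flaky case (pod, stream, and durable names are illustrative; assumes the nats CLI is available in the nats-box pod):

    # Kill a peer, wait for the election, then check whether the consumer answers.
    # In the broken state the consumer info call returns error 10008.
    for i in 1 2 3 4 5; do
      kubectl delete pod nats-1   # replace with the current leader from 'nats stream report'
      sleep 30
      kubectl exec deployment/nats-box -- nats consumer info stream-1 some_durable-connection
    done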

Logs from a non-connecting NATS peer pod:


    reloader 2022/03/17 16:21:41 Starting NATS Server Reloader v0.6.3
    reloader 2022/03/17 16:21:41 Live, ready to kick pid 7 (live, from 7 spec) based on any of 1 files
    metrics [36] 2022/03/17 16:21:42.115372 [INF] Prometheus exporter listening at http://0.0.0.0:7777/metrics
    nats [7] 2022/03/17 16:54:31.025778 [DBG] 192.168.7.192:37544 - rid:36 - Router Ping Timer
    nats [7] 2022/03/17 16:54:31.025806 [DBG] 192.168.7.192:37544 - rid:36 - Delaying PING due to client activity 0s ago
    nats [7] 2022/03/17 16:54:32.073033 [DBG] RAFT [T8Dbtd3v - C-R3M-RJ0zrdYG] Switching to candidate
    nats [7] 2022/03/17 16:54:32.073144 [DBG] RAFT [T8Dbtd3v - C-R3M-RJ0zrdYG] Sending out voteRequest {term:98 lastTerm:95 lastIndex:67 candidate:T8Dbtd3v reply:}
    nats [7] 2022/03/17 16:54:32.073934 [DBG] RAFT [T8Dbtd3v - C-R3M-RJ0zrdYG] Received a voteResponse &{term:97 peer:yrzKKRBu granted:true}
    nats [7] 2022/03/17 16:54:32.074976 [DBG] RAFT [T8Dbtd3v - C-R3M-RJ0zrdYG] Received a voteResponse &{term:97 peer:gcSbH0gR granted:true}
    nats [7] 2022/03/17 16:54:32.075071 [DBG] RAFT [T8Dbtd3v - C-R3M-RJ0zrdYG] Switching to leader
    nats [7] 2022/03/17 16:54:32.075314 [INF] JetStream cluster new consumer leader for '$G > data-svc > data_durable-connection'
    nats [7] 2022/03/17 16:54:32.076485 [DBG] RAFT [T8Dbtd3v - C-R3M-RJ0zrdYG] Being asked to catch up follower: "yrzKKRBu"
    nats [7] 2022/03/17 16:54:32.076519 [DBG] RAFT [T8Dbtd3v - C-R3M-RJ0zrdYG] Need to send snapshot to follower
    nats [7] 2022/03/17 16:54:32.076528 [ERR] RAFT [T8Dbtd3v - C-R3M-RJ0zrdYG] Error sending snapshot to follower [yrzKKRBu]: raft: no snapshot available
    nats [7] 2022/03/17 16:54:32.076552 [DBG] RAFT [T8Dbtd3v - C-R3M-RJ0zrdYG] Switching to follower
    nats [7] 2022/03/17 16:54:35.185446 [DBG] RAFT [T8Dbtd3v - C-R3M-RJ0zrdYG] Received a voteRequest &{term:99 lastTerm:0 lastIndex:0 candidate:yrzKKRBu reply:$NRG.R.j6iaHEeS}
    nats [7] 2022/03/17 16:54:35.185489 [DBG] RAFT [T8Dbtd3v - C-R3M-RJ0zrdYG] Sending a voteResponse &{term:98 peer:T8Dbtd3v granted:false} -> "$NRG.R.j6iaHEeS"
    nats [7] 2022/03/17 16:54:38.721064 [DBG] RAFT [T8Dbtd3v - C-R3M-RJ0zrdYG] Switching to candidate
    nats [7] 2022/03/17 16:54:38.721104 [DBG] RAFT [T8Dbtd3v - C-R3M-RJ0zrdYG] Sending out voteRequest {term:100 lastTerm:98 lastIndex:68 candidate:T8Dbtd3v reply:}
    nats [7] 2022/03/17 16:54:38.722007 [DBG] RAFT [T8Dbtd3v - C-R3M-RJ0zrdYG] Received a voteResponse &{term:99 peer:yrzKKRBu granted:true}
    nats [7] 2022/03/17 16:54:38.722960 [DBG] RAFT [T8Dbtd3v - C-R3M-RJ0zrdYG] Received a voteResponse &{term:99 peer:gcSbH0gR granted:true}
    nats [7] 2022/03/17 16:54:38.722993 [DBG] RAFT [T8Dbtd3v - C-R3M-RJ0zrdYG] Switching to leader
    nats [7] 2022/03/17 16:54:38.723111 [INF] JetStream cluster new consumer leader for '$G > data-svc > data_durable-connection'
    nats [7] 2022/03/17 16:54:38.727490 [DBG] RAFT [T8Dbtd3v - C-R3M-RJ0zrdYG] Being asked to catch up follower: "yrzKKRBu"
    nats [7] 2022/03/17 16:54:38.727507 [DBG] RAFT [T8Dbtd3v - C-R3M-RJ0zrdYG] Need to send snapshot to follower
    nats [7] 2022/03/17 16:54:38.738656 [ERR] RAFT [T8Dbtd3v - C-R3M-RJ0zrdYG] Error sending snapshot to follower [yrzKKRBu]: raft: no snapshot available

We have the exact same problem. Is there any update on this? If JetStream stays like this, I don't think it is reliable enough to use in production in a Kubernetes cluster. P.S. this doesn't happen with streams that use file-based storage.

Does the same issue present with 2.7.4, our latest release? We are nearing the release of 2.8.0 as well. Might be worth checking out the release notes, etc.

Also, there was a bug we discovered just recently that allowed a stream to take over as leader while it was still trying to catch up, causing some instability in the system. We observed this under similar situations where pods were being randomly deleted or migrated during a user's torture test.

That code has been fixed; the fix is in main and the nightly builds and will be part of 2.8.0.

I will try to recreate. A few questions.

  1. Are messages still flowing into the system as you delete the pods?
  2. Do the pods automatically come back up? The updated Helm chart IIRC will watch for /healthz to return ok.
  3. What does nats stream info report for those streams? (The NATS CLI is the preferred way to interact with the system non-programmatically.)
  4. How many consumers per stream? What does nats consumer info show for those? (A sketch for gathering this information follows below.)

Thanks for your patience.
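For anyone collecting the information requested above, a minimal sketch to complement the nats stream info / nats consumer info commands already shown earlier (pod names and the 8222 monitoring port are the chart's defaults and may differ):

    # 2. The readiness probe target on the server's monitoring port:
    kubectl port-forward nats-0 8222:8222 &
    sleep 2
    curl -s http://localhost:8222/healthz

    # 3./4. A JetStream overview (streams, consumers, cluster/raft info) from the same port:
    curl -s "http://localhost:8222/jsz?consumers=true"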

I've reproduced the same issue with the 2.7.4 release. We have a test cluster, so we can try 2.8.0 too when it gets released.

  1. Yes, the problem arises when a new cluster node tries to join the cluster and the other members can't send the snapshot to the new member.
  2. Yes, pods automatically come back up.
  3. It doesn't show anything unusual. Do you expect anything specific?
  4. 10 consumers per stream. It logs an error:
    nats: error: could not load Consumer audio > audio-34: JetStream system temporarily unavailable (10008)

Thanks for your help!

If the storage is not a persistent volume, the RAFT logs backing the memory stream will be lost.
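A quick way to confirm that the RAFT state really survives a pod restart, assuming the fileStorage values shown above and the chart's default container name (illustrative names):

    # The StatefulSet should have created one PersistentVolumeClaim per peer:
    kubectl get pvc

    # The JetStream store (streams plus RAFT meta/group state) lives under the mounted path:
    kubectl exec nats-0 -c nats -- ls /data/jetstream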

For instance, I just ran a 3-node cluster on my machine, created an R3 memory stream, put 100k messages in it, then killed S1 (which has a storage directory). Restarting it immediately recreates the last replica for the stream.

If I kill it and remove the contents of S1's storage directory, it still works fine for me.

However, if I kill S1, remove the storage directory, and either give the new server a new name (server_name) or allow it to pick one itself, it complains about meta RAFT snapshots and nats stream info TEST always shows the replica as OFFLINE.

Is it possible that the server names are not consistent in this scenario?
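A sketch of that local test with the nats CLI (flag names per a recent CLI and may differ slightly by version; assumes a local 3-node JetStream cluster S1/S2/S3 is already running and the CLI context points at it):

    # Create an R3 in-memory stream and fill it with 100k messages:
    nats stream add TEST --subjects "test.>" --storage memory --replicas 3 --defaults
    nats pub test.1 --count 100000 "msg {{Count}}"

    # Stop S1 (optionally wiping its store directory), restart it, then check
    # whether the replica comes back online or stays OFFLINE:
    nats stream info TEST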

Storage is persisted within a PVC in Kubernetes.

Server names are consistent because we installed the NATS JetStream cluster via the Helm chart, which uses StatefulSets in Kubernetes. That gives every pod a unique, stable name. For example:

Jetstream Cluster member 1 server/pod name: nats-jetstream-0
Jetstream Cluster member 2 server/pod name: nats-jetstream-1
Jetstream Cluster member 3 server/pod name: nats-jetstream-2

Also, this is the config I'm using for Jetstream deployment:
https://gist.github.com/furkansb/e74384d3119935783bf80cced3ff2176

Ah, OK, my apologies. I will take a look at the config.

Hey @caleblloyd and @wallyqs, could you look at the config used above? I placed some comments there and was curious.