[BUG] Upgrade stuck at node `Pre-draining` waiting for volume rebuilding
albinsun opened this issue · comments
Describe the bug
Upgrading from `v1.2.1` to `v1.2.2-rc3` gets stuck at node `Pre-draining` because it keeps waiting for volume rebuilding.
To Reproduce
Note: not always reproducible; current reproducibility is 1/5.
- Set up 3 nodes with `harvester-v1.2.1`
- Enable the `rancher-monitoring` addon
- Import Harvester into `rancher-v2.7.11`
- Create an RKE2 cluster, deploy nginx and an LB
- 🔴 Upgrade Harvester to `v1.2.2-rc3`; the upgrade gets stuck in `Pre-draining`, waiting for volume rebuilding
Expected behavior
Upgrade completes successfully.
Support bundle
support-bundle-stuckRebuilding.zip
Upgrade log
hvst-upgrade-p4kfq-upgradelog-archive-stuckRebuilding.zip
Environment
- Harvester
  - Version: `v1.2.1` -> `v1.2.2-rc3`
  - Profile: QEMU/KVM, 3 nodes (8C/16G/500G)
  - ui-source: Auto
- Rancher
  - Version: `v2.7.11`
  - Profile: Helm (K3s) in QEMU/KVM (2C/4G)
Additional context
- `instance-manager-df64d0429b56f6f2f48b2c7150c32f38`

  ```
  [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-e-1] time="2024-05-11T11:46:35Z" level=warning msg="Failed to unmap" func="controller.(*Controller).UnmapAt" file="control.go:959" error="cannot unmap 188416 bytes at offset 15943786496 while rebuilding is in progress"
  ```

- `longhorn-manager`

  ```
  time="2024-05-11T11:48:42Z" level=info msg="Skipped rebuilding of replica because there is another rebuild in progress" func="controller.(*EngineController).rebuildNewReplica" file="engine_controller.go:1702" controller=longhorn-engine node=harvester-node-0 volume=pvc-9201003a-3a04-4ab2-bb17-0e505447dc80
  ```

Rebuilding becomes `Replica scheduling failed` afterward.
hvst-upgrade-replicaSchedulingFailed.zip
supportbundle_replicaSchedulingFailed.zip
FYI, I ran 3 more trials and did not hit this issue, so reproducibility is decreasing.
Still analyzing from the Longhorn side. No conclusion yet. Sorry for the delay!
@albinsun Was this tested on an air-gapped environment?
@starbops Could this be related to the auto-cleaned images, and your latest PR #5750 adding LH-related images to the reserved list?
Such logs are observed:
name: pvc-556f655d-7008-4750-a48c-99416b19dd8f
replica on node2: pvc-556f655d-7008-4750-a48c-99416b19dd8f-r-f3e4dede
engineName: pvc-556f655d-7008-4750-a48c-99416b19dd8f-e-487f7adb

```
longhorn-manager-9tbdg/longhorn-manager.log:2024-05-11T10:04:49.369802669Z time="2024-05-11T10:04:49Z" level=error msg="Failed to sync Longhorn replica" func=controller.handleReconcileErrorLogging file="utils.go:67" Replica=longhorn-system/pvc-556f655d-7008-4750-a48c-99416b19dd8f-r-f3e4dede controller=longhorn-replica error="failed to sync replica for longhorn-system/pvc-556f655d-7008-4750-a48c-99416b19dd8f-r-f3e4dede: failed to get instance manager for instance pvc-556f655d-7008-4750-a48c-99416b19dd8f-r-f3e4dede: cannot find the only available instance manager for instance pvc-556f655d-7008-4750-a48c-99416b19dd8f-r-f3e4dede, node harvester-node-2, instance manager image longhornio/longhorn-instance-manager:v1.5.5, type aio" node=harvester-node-2
longhorn-manager-m82f9/longhorn-manager.log.1:2024-05-11T10:05:33.731269148Z time="2024-05-11T10:05:33Z" level=warning msg="Failed to get engine proxy of pvc-556f655d-7008-4750-a48c-99416b19dd8f-e-487f7adb for volume pvc-556f655d-7008-4750-a48c-99416b19dd8f" func="metrics_collector.(*VolumeCollector).Collect" file="volume_collector.go:192" collector=volume error="failed to get binary client for engine pvc-556f655d-7008-4750-a48c-99416b19dd8f-e-487f7adb: cannot get client for engine pvc-556f655d-7008-4750-a48c-99416b19dd8f-e-487f7adb: engine is not running" node=harvester-node-0
```

The LH engine `pvc-556f655d-7008-4750-a48c-99416b19dd8f-e-487f7adb` has `currentState: stopped`:
```yaml
name: pvc-556f655d-7008-4750-a48c-99416b19dd8f-e-487f7adb
namespace: longhorn-system
ownerReferences:
- apiVersion: longhorn.io/v1beta2
  kind: Volume
  name: pvc-556f655d-7008-4750-a48c-99416b19dd8f
  uid: d5e0a640-93d8-49d4-af6d-66f4522843b0
resourceVersion: "948228"
uid: e71be425-99ad-42e8-bd11-92e2718e6b53
spec:
  active: true
  backendStoreDriver: v1
  backupVolume: "null"
  desireState: stopped
  disableFrontend: false
  engineImage: longhornio/longhorn-engine:v1.5.5
  frontend: blockdev
  logRequested: false
  nodeID: "null"
  replicaAddressMap:
    pvc-556f655d-7008-4750-a48c-99416b19dd8f-r-38fd9502: 10.52.1.14:10075
    pvc-556f655d-7008-4750-a48c-99416b19dd8f-r-3e3032fc: 10.52.0.50:10075
    pvc-556f655d-7008-4750-a48c-99416b19dd8f-r-f3e4dede: 10.52.2.11:10090
  requestedBackupRestore: "null"
currentSize: "10737418240"
currentState: stopped
endpoint: "null"
```
> Such logs are observed.
> @albinsun Was this tested on air-gapped environment?
> ...

No, but an ipxe-example env. on my local machine.
@albinsun The root cause of the failed rebuild of replica `pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-46c81deb` may be related to the message below:

```
longhorn-manager-m82f9/longhorn-manager.log:2024-05-11T13:32:02.444613957Z time="2024-05-11T13:32:02Z" level=error msg="There's no available disk for replica pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-46c81deb, size 42949672960" func="scheduler.(*ReplicaScheduler).ScheduleReplica" file="replica_scheduler.go:101"
```
Oh OK, that's quite possible, since each node only has a 500G disk and the failing run does include more backup/restore tests than the others.
It's good if this is just an environment issue.
Thank you @w13915984028.
BTW, in the support bundle there are some middle-state-related error messages which distracted my attention.
But for `pvc-9201003a-3a04-4ab2-bb17-0e505447dc80`, the `no available disk` error is the root cause. We are safe to proceed.
Closing as an environment issue.
Will pay more attention to this kind of exception next time.
Sorry for the inconvenience, and thank you for the help.
Thanks @w13915984028,
I focused on the wrong SB. The latest one is in #5789 (comment) instead of #5789 (comment).
And the root cause, as @w13915984028 mentioned, is related to the test environment. Thanks!
Thanks @w13915984028 and @Vicente-Cheng, I removed the milestone from the issue.
@w13915984028, good catch! I agree that the reason for the later "replica scheduling failed" is a lack of space as you described. It's a bit weird that we only hit it after the upgrade IMO, but I didn't investigate this too much.
@bk201 and @Vicente-Cheng, after a full analysis of the first support bundle, I think I have identified two Longhorn-related issues that led to the behavior @albinsun observed. I ran out of time to organize my notes into a detailed writeup, but for now, they are:
- When the migration engine `pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-e-1` was created, it almost immediately set two of its migration replicas to ERR due to a revision counter mismatch. Longhorn (probably correctly) did not update the fields of the migration replica CRs to reflect this while the migration was ongoing, but these two replicas immediately failed when the migration was complete. This is an indirect cause of the behavior @albinsun observed, because it led to the rebuilding that never completed.
- Once the migration was complete (from `harvester-node-1` to `harvester-node-0`), the two failed replicas immediately started to be rebuilt from the surviving replica (which happened to be on `harvester-node-1`). One replica was successfully rebuilt, but `harvester-node-1` was restarted after the second rebuild started and before it could complete. Because of the way files are synced during a rebuild (details to come), the loss of the source replica did not trigger a rebuild failure in a way that propagated up the stack to longhorn-manager. It did not realize the rebuild had failed for ~2 hours and 15 minutes. (I am pretty sure longhorn-manager was finally notified when the Linux TCP stack closed the connection between the rebuilding replica and its source client-side.) This is the direct cause of the behavior @albinsun observed, which didn't resolve itself until well into the second support bundle. I think it is because we only cancel the file-syncing connection if:
  - The source replica does not contact the receiver AT ALL before its 1m30s idle timer expires (https://github.com/longhorn/longhorn-engine/blob/a807f0fd6bfd9c4700f2c19808038e87a2ab814e/vendor/github.com/longhorn/sparse-tools/sparse/rest/server.go#L78-L117). This causes us to cancel from the receiving side.
  - The source replica times out while trying to send a chunk of the file to the receiver according to `HTTPClientTimeout`, which defaults to 30 seconds (https://github.com/longhorn/longhorn-engine/blob/a807f0fd6bfd9c4700f2c19808038e87a2ab814e/vendor/github.com/longhorn/sparse-tools/sparse/client.go#L97-L120). This causes us to cancel from the sending side.
  - In this case, the sender went away after establishing a connection, so I guess we did not detect it? We can investigate more and see if it is reproducible.
Additionally, we see a lot of contention between longhorn-managers trying to take ownership of the same object back and forth from each other at times. We may be able to improve something here as well.
I don't think these issues are likely to be caused by a regression. It seems likely to me that they combined to produce a behavior that is not very reproducible. I will create corresponding Longhorn issues as soon as I can.
@ejweber Excellent. The support bundle gives a lot of interesting information from the LH CRD objects and logs. It is worth filtering each clue and optimizing/enhancing correspondingly. Thanks.
Thanks, @ejweber
I also noticed that the client restarted and did not give any response, from the investigation below.
Let's focus on the replica re-creation at 2024-05-11T10:33:19, replica name: `pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0`.
```
2024-05-11T10:33:19.023742976Z [longhorn-instance-manager] time="2024-05-11T10:33:19Z" level=info msg="Adding replica" func="proxy.(*Proxy).ReplicaAdd" file="replica.go:33" currentSize=42949672960 engineName=pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-e-1 fastSync=true replicaAddress="tcp://10.52.2.56:10031" replicaName=pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0 restore=false serviceURL="10.52.0.128:10092" size=42949672960 volumeName=pvc-9201003a-3a04-4ab2-bb17-0e505447dc80
2024-05-11T10:33:19.043624243Z [longhorn-instance-manager] time="2024-05-11T10:33:19Z" level=info msg="Adding replica tcp://10.52.2.56:10031 in WO mode" func="sync.(*Task).AddReplica" file="sync.go:422"
...
2024-05-11T10:33:19.097305657Z [longhorn-instance-manager] time="2024-05-11T10:33:19Z" level=info msg="Using replica tcp://10.52.1.62:10080 as the source for rebuild" func="sync.(*Task).getTransferClients" file="sync.go:574"
2024-05-11T10:33:19.097616219Z [longhorn-instance-manager] time="2024-05-11T10:33:19Z" level=info msg="Using replica tcp://10.52.2.56:10031 as the target for rebuild" func="sync.(*Task).getTransferClients" file="sync.go:579"
...
// rebuilding
2024-05-11T10:33:19.180751990Z [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-3d3731c3] time="2024-05-11T10:33:19Z" level=info msg="Syncing file volume-head-003.img.meta to 10.52.2.56:10034" func="rpc.(*SyncAgentServer).FileSend" file="server.go:342"
2024-05-11T10:33:19.180807240Z time="2024-05-11T10:33:19Z" level=info msg="Syncing file volume-head-003.img.meta to 10.52.2.56:10034: size 178, directIO false, fastSync false" func=sparse.SyncFile file="client.go:110"
```

Note: `10.52.1.62:10080` is the replica `pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-4da6951a`.
Until now, we have generated the sync file list (the snap chain). Then, we try to sync files between node1 (source) and node2 (target).
For the file-sync behavior: the target (receiver) side launches an ssync server, and the source side acts as a client that sends the file to that server.
Server: https://github.com/longhorn/longhorn-engine/blob/master/pkg/sync/rpc/server.go#L466
Client: https://github.com/longhorn/longhorn-engine/blob/master/pkg/sync/rpc/server.go#L470-L472
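To make the direction concrete, here is a minimal Go sketch of the same shape using plain HTTP: the target replica launches a receiver server, and the source replica pushes the file as a client. This is illustrative only; the real ssync protocol in sparse-tools is chunked and sparse-aware, and `pushFile` is a made-up helper, not a Longhorn API.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// pushFile illustrates the receiver-server / sender-client shape: the
// target launches a server, the source pushes data to it as a client,
// and the function returns what the receiver stored.
func pushFile(data string) (string, error) {
	received := &bytes.Buffer{}

	// Target replica side: launch a receiver for the file contents.
	receiver := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.Copy(received, r.Body)
	}))
	defer receiver.Close()

	// Source replica side: act as the client and push the file.
	resp, err := http.Post(receiver.URL, "application/octet-stream", bytes.NewBufferString(data))
	if err != nil {
		return "", err
	}
	resp.Body.Close()

	return received.String(), nil
}

func main() {
	got, err := pushFile("volume-snap contents")
	if err != nil {
		panic(err)
	}
	fmt.Println(got) // prints "volume-snap contents"
}
```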
We can find the corresponding logs on:
- the target side (10.52.2.56), `instance-manager-dac598cd4a1b746493fc409f60eaf07a`
- the source side (10.52.1.62), `instance-manager-`
```
// server
2024-05-11T10:33:19.304817291Z [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0] time="2024-05-11T10:33:19Z" level=info msg="Running ssync server for file volume-snap-38575222-e610-4632-8b6c-622faa205a55.img at port 10035" func="rpc.(*SyncAgentServer).launchReceiver.func1" file="server.go:406"

// client
[pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-4da6951a] time="2024-05-11T10:33:19Z" level=info msg="Syncing file volume-snap-38575222-e610-4632-8b6c-622faa205a55.img to 10.52.2.56:10035" func="rpc.(*SyncAgentServer).FileSend" file="server.go:342"
time="2024-05-11T10:33:19Z" level=info msg="Syncing file volume-snap-38575222-e610-4632-8b6c-622faa205a55.img to 10.52.2.56:10035: size 42949672960, directIO true, fastSync true" func=sparse.SyncFile file="client.go:110"
[pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-4da6951a] time="2024-05-11T10:33:19Z" level=warning msg="Failed to get change time and checksum of local file volume-snap-38575222-e610-4632-8b6c-622faa205a55.img" func=sparse.SyncContent file="client.go:149" error="failed to open checksum file: open volume-snap-38575222-e610-4632-8b6c-622faa205a55.img.checksum: no such file or directory"

// server
2024-05-11T10:33:20.038236353Z [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0] time="2024-05-11T10:33:19Z" level=info msg="Done running ssync server for file volume-snap-38575222-e610-4632-8b6c-622faa205a55.img at port 10035" func="rpc.(*SyncAgentServer).launchReceiver.func1" file="server.go:411"
2024-05-11T10:33:20.038295404Z time="2024-05-11T10:33:19Z" level=info msg="Running ssync server for file volume-snap-38575222-e610-4632-8b6c-622faa205a55.img.meta at port 10036" func="rpc.(*SyncAgentServer).launchReceiver.func1" file="server.go:406"
2024-05-11T10:33:20.066432364Z time="2024-05-11T10:33:20Z" level=info msg="Done running ssync server for file volume-snap-38575222-e610-4632-8b6c-622faa205a55.img.meta at port 10036" func="rpc.(*SyncAgentServer).launchReceiver.func1" file="server.go:411"
2024-05-11T10:33:20.083299991Z [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0] time="2024-05-11T10:33:20Z" level=info msg="Running ssync server for file volume-snap-76e04b0f-baf1-47ee-b7cd-c941937bac73.img at port 10037" func="rpc.(*SyncAgentServer).launchReceiver.func1" file="server.go:406"

// looks like a server timeout here
2024-05-11T10:36:00.997806717Z [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0] time="2024-05-11T10:36:00Z" level=error msg="Shutting down the server since it is idle for 1m30s" func=rest.Server.func1 file="server.go:111"
2024-05-11T10:36:01.011643136Z [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0] time="2024-05-11T10:36:00Z" level=info msg="Done running ssync server for file volume-snap-76e04b0f-baf1-47ee-b7cd-c941937bac73.img at port 10037" func="rpc.(*SyncAgentServer).launchReceiver.func1" file="server.go:411"

// finally get the error, then the replica is recreated
2024-05-11T12:45:27.560746459Z [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0] time="2024-05-11T12:45:27Z" level=error msg="Sync agent gRPC server failed to rebuild replica/sync files" func="rpc.(*SyncAgentServer).FilesSync.func1" file="server.go:427" error="replica tcp://10.52.1.62:10080 failed to send file volume-snap-76e04b0f-baf1-47ee-b7cd-c941937bac73.img to 10.52.2.56:10037: failed to send file volume-snap-76e04b0f-baf1-47ee-b7cd-c941937bac73.img to 10.52.2.56:10037: rpc error: code = Unavailable desc = error reading from server: read tcp 10.52.2.56:40316->10.52.1.62:10082: read: connection timed out"
```
There are some points I would like to figure out; maybe @ejweber can give some perspective from the LH side.
- Is the following log harmless? It looks like just a warning, and there is no error after it.

  ```
  [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-4da6951a] time="2024-05-11T10:33:19Z" level=warning msg="Failed to get change time and checksum of local file volume-snap-38575222-e610-4632-8b6c-622faa205a55.img" func=sparse.SyncContent file="client.go:149" error="failed to open checksum file: open volume-snap-38575222-e610-4632-8b6c-622faa205a55.img.checksum: no such file or directory"
  ```

- I saw `instance-manager-fc152a26b22a0d1d244d139fc8acceda` restart around `2024-05-11T10:33:31Z` - `2024-05-11T10:42:20Z`. I wonder if that caused the client not to report the error, because the gRPC server was already gone (with the restart). But the timeout looks like 2 hours from this log:

  ```
  2024-05-11T12:45:27.560746459Z [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0] time="2024-05-11T12:45:27Z" level=error msg="Sync agent gRPC server failed to rebuild replica/sync files" func="rpc.(*SyncAgentServer).FilesSync.func1" file="server.go:427" error="replica tcp://10.52.1.62:10080 failed to send file volume-snap-76e04b0f-baf1-47ee-b7cd-c941937bac73.img to 10.52.2.56:10037: failed to send file volume-snap-76e04b0f-baf1-47ee-b7cd-c941937bac73.img to 10.52.2.56:10037: rpc error: code = Unavailable desc = error reading from server: read tcp 10.52.2.56:40316->10.52.1.62:10082: read: connection timed out"
  ```

  I thought the gRPC client timeout was set to 24 hours, so I have no idea where the 2-hour timeout above comes from.

And I think this is an edge case; it's hard to reproduce without any specific config.
You are right, @Vicente-Cheng. I had noticed the following log before but didn't fit it into my analysis.

```
2024-05-11T10:36:01.011643136Z [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0] time="2024-05-11T10:36:00Z" level=info msg="Done running ssync server for file volume-snap-76e04b0f-baf1-47ee-b7cd-c941937bac73.img at port 10037" func="rpc.(*SyncAgentServer).launchReceiver.func1" file="server.go:411"
```

This log indicates the ssync server (file receiver) running in pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0 timed out. However, since the server launched successfully, it is not an error within `SyncFiles`. The error only occurs when the file sender, pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-3d3731c3, finally fails to send the file AND the file receiver recognizes the failure. It's something like this:
- pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0 is the `SyncAgentServer` responsible for handling the file sync. It is the replica that needs to be rebuilt.
- pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0 launches a receiver. https://github.com/longhorn/longhorn-engine/blob/7dbeb34fb049b1b0ca80c76d5c684b09c6d8b097/pkg/sync/rpc/server.go#L464
- pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0 sends a `SendFile` request to pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-3d3731c3. https://github.com/longhorn/longhorn-engine/blob/7dbeb34fb049b1b0ca80c76d5c684b09c6d8b097/pkg/sync/rpc/server.go#L468-L470
- As you mentioned, the `SendFile` request has the 24-hour `GRPCServiceLongTimeout`, which means the request will be canceled if it does not complete within 24 hours. https://github.com/longhorn/longhorn-engine/blob/7dbeb34fb049b1b0ca80c76d5c684b09c6d8b097/pkg/replica/client/client.go#L441-L460
- Presumably a TCP connection is established between the two replicas as part of the `SendFile` request, before pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-3d3731c3 disappears.
- At this point, pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0 is just waiting for the `SendFile` request to complete. There is nothing for it to do. pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-3d3731c3 is gone, but it wasn't necessarily expected to return anything over the connection for a long time.
- I think we finally receive an error after ~2 hours and 15 minutes because that is how long it takes Linux to complete its TCP keepalive behavior. It is not something currently built into our code. I don't know the values on the QA system, but on mine, they are:
```
eweber@laptop:~/longhorn-engine> cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
eweber@laptop:~/longhorn-engine> cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
eweber@laptop:~/longhorn-engine> cat /proc/sys/net/ipv4/tcp_keepalive_probes
9
```
- That is: 120 minutes before the first probe, then 9 probes at 75-second intervals, for roughly 11 additional minutes.
- After approximately this amount of time, Linux informed pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0 that the TCP connection was dead and we got the error.
- In instance-manager, we use gRPC keepalives to ensure something like this can't happen, but this connection is deeper into the stack.
Thanks for the clarification, @ejweber!
I thought the 2-hour timeout was related to the TCP keepalive mechanism, as you mentioned.
So the problem occurs when two replicas establish a connection and try to sync a file, the source (sender) replica goes away, and the target (receiver) side does not receive anything.
In this case, we can only rely on the TCP timeout. Do we need another mechanism to monitor the connection status when rebuilding?
Yes, I think so. The gRPC keepalive I mentioned could probably be used on the `SendFile`/`FileSend` RPC between the destination replica and the source replica. In this case, a keepalive could have recognized very quickly that the source replica was gone. And in the case where the source replica is NOT gone, but `SyncContent` is genuinely taking a very long time, the source replica would respond to the keepalive ping and the intended behavior would be maintained. I will create an issue in the Longhorn repo about this.
> Additionally, we see a lot of contention between longhorn-managers trying to take ownership of the same object back and forth from each other at times. We may be able to improve something here as well.
For this one, I think longhorn/longhorn#7531 is probably related.