harvester / harvester

Open source hyperconverged infrastructure (HCI) software

Home Page: https://harvesterhci.io/


[BUG] Upgrade stuck at node `Pre-draining` due to waiting for volume rebuilding

albinsun opened this issue · comments

Describe the bug
The v1.2.1 -> v1.2.2-rc3 upgrade gets stuck at node Pre-draining while waiting for volume rebuilding.

To Reproduce

Note

Not always reproducible; current reproducibility is 1/5.

  1. Set up a 3-node harvester-v1.2.1 cluster
  2. Enable the rancher-monitoring addon
  3. Import Harvester into rancher-v2.7.11
  4. Create an RKE2 cluster, deploy nginx and a load balancer (LB)
  5. 🔴 Upgrade Harvester to v1.2.2-rc3

    The upgrade gets stuck in Pre-draining while waiting for volume rebuilding

Expected behavior
The upgrade completes successfully.

Support bundle
support-bundle-stuckRebuilding.zip

Upgrade log
hvst-upgrade-p4kfq-upgradelog-archive-stuckRebuilding.zip

Environment

  • Harvester
    • Version: v1.2.1 -> v1.2.2-rc3
    • Profile: QEMU/KVM, 3 nodes (8C/16G/500G)
    • ui-source: Auto
  • Rancher
    • Version: v2.7.11
    • Profile: Helm(K3s) in QEMU/KVM (2C/4G)

Additional context

  1. node-2 is Cordoned and stuck in Pre-draining

  2. A volume replica on node-2 is stuck in Rebuilding...

  3. instance-manager-df64d0429b56f6f2f48b2c7150c32f38

    [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-e-1] time="2024-05-11T11:46:35Z" 
    level=warning msg="Failed to unmap" func="controller.(*Controller).UnmapAt" 
    file="control.go:959" error="cannot unmap 188416 bytes at offset 15943786496 while rebuilding is in progress"
    


  4. longhorn-manager

    time="2024-05-11T11:48:42Z" level=info msg="Skipped rebuilding of replica because there is another rebuild in progress" 
    func="controller.(*EngineController).rebuildNewReplica" 
    file="engine_controller.go:1702" controller=longhorn-engine node=harvester-node-0 volume=pvc-9201003a-3a04-4ab2-bb17-0e505447dc80
    


Rebuilding becomes Replica scheduling failed afterward.
hvst-upgrade-replicaSchedulingFailed.zip
supportbundle_replicaSchedulingFailed.zip


FYI, I ran 3 more trials and did not hit this issue, so reproducibility is decreasing.

Well tested, @albinsun.

Still analyzing from the Longhorn side. No conclusion yet. Sorry for the delay!

Such logs are observed.

@albinsun Was this tested on air-gapped environment?

@starbops Could this be related to the auto-cleaned images, given that your latest PR #5750 adds LH-related images to the reserved list?

pvc-556f655d-7008-4750-a48c-99416b19dd8f-e-487f7adb: engine is not running" node=harvester-node-0

 name: pvc-556f655d-7008-4750-a48c-99416b19dd8f

rep on node2: pvc-556f655d-7008-4750-a48c-99416b19dd8f-r-f3e4dede

engineName: pvc-556f655d-7008-4750-a48c-99416b19dd8f-e-487f7adb

longhorn-manager-9tbdg/longhorn-manager.log:2024-05-11T10:04:49.369802669Z time="2024-05-11T10:04:49Z" level=error msg="Failed to sync Longhorn replica" func=controller.handleReconcileErrorLogging file="utils.go:67" Replica=longhorn-system/pvc-556f655d-7008-4750-a48c-99416b19dd8f-r-f3e4dede controller=longhorn-replica error="failed to sync replica for longhorn-system/pvc-556f655d-7008-4750-a48c-99416b19dd8f-r-f3e4dede: failed to get instance manager for instance pvc-556f655d-7008-4750-a48c-99416b19dd8f-r-f3e4dede: cannot find the only available instance manager for instance pvc-556f655d-7008-4750-a48c-99416b19dd8f-r-f3e4dede, node harvester-node-2, instance manager image longhornio/longhorn-instance-manager:v1.5.5, type aio" node=harvester-node-2

longhorn-manager-m82f9/longhorn-manager.log.1:2024-05-11T10:05:33.731269148Z time="2024-05-11T10:05:33Z" level=warning msg="Failed to get engine proxy of pvc-556f655d-7008-4750-a48c-99416b19dd8f-e-487f7adb for volume pvc-556f655d-7008-4750-a48c-99416b19dd8f" func="metrics_collector.(*VolumeCollector).Collect" file="volume_collector.go:192" collector=volume error="failed to get binary client for engine pvc-556f655d-7008-4750-a48c-99416b19dd8f-e-487f7adb: cannot get client for engine pvc-556f655d-7008-4750-a48c-99416b19dd8f-e-487f7adb: engine is not running" node=harvester-node-0

The LH engine pvc-556f655d-7008-4750-a48c-99416b19dd8f-e-487f7adb has currentState: stopped.

    name: pvc-556f655d-7008-4750-a48c-99416b19dd8f-e-487f7adb
    namespace: longhorn-system
    ownerReferences:
    - apiVersion: longhorn.io/v1beta2
      kind: Volume
      name: pvc-556f655d-7008-4750-a48c-99416b19dd8f
      uid: d5e0a640-93d8-49d4-af6d-66f4522843b0
    resourceVersion: "948228"
    uid: e71be425-99ad-42e8-bd11-92e2718e6b53
  spec:
    active: true
    backendStoreDriver: v1
    backupVolume: "null"
    desireState: stopped
    disableFrontend: false
    engineImage: longhornio/longhorn-engine:v1.5.5
    frontend: blockdev
    logRequested: false
    nodeID: "null"
    replicaAddressMap:
      pvc-556f655d-7008-4750-a48c-99416b19dd8f-r-38fd9502: 10.52.1.14:10075
      pvc-556f655d-7008-4750-a48c-99416b19dd8f-r-3e3032fc: 10.52.0.50:10075
      pvc-556f655d-7008-4750-a48c-99416b19dd8f-r-f3e4dede: 10.52.2.11:10090
    requestedBackupRestore: "null"

    currentSize: "10737418240"
    currentState: stopped
    endpoint: "null"    

Such logs are observed.

@albinsun Was this tested on air-gapped environment?
...

No, but an ipxe-example env. on my local machine.

@albinsun The root cause of the failed rebuild of replica pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-46c81deb may be related to the message below:

longhorn-manager-m82f9/longhorn-manager.log:2024-05-11T13:32:02.444613957Z time="2024-05-11T13:32:02Z" level=error msg="There's no available disk for replica pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-46c81deb, size 42949672960" func="scheduler.(*ReplicaScheduler).ScheduleReplica" file="replica_scheduler.go:101"

Oh ok, it's quite possible, since the node only has a 500G disk and the failing run did have more backup/restore tests than the others.
It would be good if this is just an env. issue.
Thank you @w13915984028.

BTW, in the support bundle there are some middle-state-related error messages which distracted my attention.

But for pvc-9201003a-3a04-4ab2-bb17-0e505447dc80, the lack of an available disk is the root cause. We are safe to proceed.

Closing as an environment issue.
Will pay more attention to this kind of exception next time.
Sorry for the inconvenience, and thank you for the help.

Thanks @w13915984028,

I focused on the wrong SB. The latest one is #5789 (comment) instead of #5789 (comment).

And the root cause, as @w13915984028 mentioned, is related to the test environment. Thanks!

Thanks @w13915984028 and @Vicente-Cheng, I removed the milestone from the issue.

@w13915984028, good catch! I agree that the reason for the later "replica scheduling failed" is a lack of space as you described. It's a bit weird that we only hit it after the upgrade IMO, but I didn't investigate this too much.

@bk201 and @Vicente-Cheng, after a full analysis of the first support bundle, I think I have identified two Longhorn-related issues that led to the behavior @albinsun observed. I ran out of time to organize my notes into a detailed writeup, but for now, they are:

  1. When the migration engine pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-e-1 was created, it almost immediately set two of its migration replicas to ERR due to a revision counter mismatch. Longhorn (probably correctly) did not update the fields of the migration replica CRs to reflect this while the migration was ongoing, but these two replicas immediately failed once the migration was complete. This is an indirect cause of the behavior @albinsun observed, because it led to the rebuilding that never completed.
  2. Once the migration was complete (from harvester-node-1 to harvester-node-0), the two failed replicas immediately started to be rebuilt from the surviving replica (which happened to be on harvester-node-1). One replica was successfully rebuilt, but harvester-node-1 was restarted after the second rebuild started and before it could complete. Because of the way files are synced during a rebuild (details to come), the loss of the source replica did not trigger a rebuild failure in a way that propagated up the stack to longhorn-manager. It did not realize the rebuild had failed for ~2 hours and 15 minutes. (I am pretty sure longhorn-manager was finally notified when the Linux TCP stack closed the connection between the rebuilding replica and its source on the client side.) This is the direct cause of the behavior @albinsun observed, which didn't resolve itself until well into the second support bundle. I think it is because we only cancel the file syncing connection if:

Additionally, we see a lot of contention between longhorn-managers trying to take ownership of the same object back and forth from each other at times. We may be able to improve something here as well.

I don't think these issues are likely to be caused by a regression. It seems likely to me that they combined to produce a behavior that is not very reproducible. I will create corresponding Longhorn issues as soon as I can.

@ejweber Excellent. The support bundle gives a lot of interesting information from LH CRD objects and logs. It is worth filtering each clue and optimizing/enhancing accordingly. Thanks.

Thanks, @ejweber.
I also noticed, from the investigation below, that the client restarted and did not give any response.

Let's focus on the replica re-creation at 2024-05-11T10:33:19, replica name: pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0

2024-05-11T10:33:19.023742976Z [longhorn-instance-manager] time="2024-05-11T10:33:19Z" level=info msg="Adding replica" func="proxy.(*Proxy).ReplicaAdd" file="replica.go:33" currentSize=42949672960 engineName=pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-e-1 fastSync=true replicaAddress="tcp://10.52.2.56:10031" replicaName=pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0 restore=false serviceURL="10.52.0.128:10092" size=42949672960 volumeName=pvc-9201003a-3a04-4ab2-bb17-0e505447dc80
2024-05-11T10:33:19.043624243Z [longhorn-instance-manager] time="2024-05-11T10:33:19Z" level=info msg="Adding replica tcp://10.52.2.56:10031 in WO mode" func="sync.(*Task).AddReplica" file="sync.go:422"
.
.
.
2024-05-11T10:33:19.097305657Z [longhorn-instance-manager] time="2024-05-11T10:33:19Z" level=info msg="Using replica tcp://10.52.1.62:10080 as the source for rebuild" func="sync.(*Task).getTransferClients" file="sync.go:574"
2024-05-11T10:33:19.097616219Z [longhorn-instance-manager] time="2024-05-11T10:33:19Z" level=info msg="Using replica tcp://10.52.2.56:10031 as the target for rebuild" func="sync.(*Task).getTransferClients" file="sync.go:579"
.
.
.
// rebuilding
2024-05-11T10:33:19.180751990Z [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-3d3731c3] time="2024-05-11T10:33:19Z" level=info msg="Syncing file volume-head-003.img.meta to 10.52.2.56:10034" func="rpc.(*SyncAgentServer).FileSend" file="server.go:342"
2024-05-11T10:33:19.180807240Z time="2024-05-11T10:33:19Z" level=info msg="Syncing file volume-head-003.img.meta to 10.52.2.56:10034: size 178, directIO false, fastSync false" func=sparse.SyncFile file="client.go:110"

Note: 10.52.1.62:10080 is the replica pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-4da6951a.
Up to this point we have generated the sync file list (the snapshot chain). Then we try to sync the files between node1 (source) and node2 (target).

For the file sync behavior, the target (receiver) side launches an ssync server, and the source side acts as the client, sending the file to that server.
Server: https://github.com/longhorn/longhorn-engine/blob/master/pkg/sync/rpc/server.go#L466
Client: https://github.com/longhorn/longhorn-engine/blob/master/pkg/sync/rpc/server.go#L470-L472
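For illustration, here is a minimal sketch of that pattern in Go. This is not the actual longhorn-engine code: the file name, port, and data are made up, and the real ssync protocol transfers sparse-file intervals rather than copying the whole file. The point is the failure mode discussed later in this thread: if the sender dies mid-transfer, the receiver just sits in the copy until the kernel declares the TCP connection dead.

```go
// Illustrative sketch only: target = ssync server (receiver), source = client (sender).
package main

import (
	"io"
	"log"
	"net"
	"os"
)

// receiveFile mirrors the "Running ssync server for file ... at port ..." step:
// the target accepts one connection and writes whatever it receives into dst.
func receiveFile(conn net.Conn, dst string) error {
	defer conn.Close()
	f, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, conn)
	return err
}

// sendFile mirrors the "Syncing file ... to <target ip>:<port>" step:
// the source dials the target's receiver and streams the local file to it.
// If the sender process disappears mid-copy, the receiver only finds out
// once the kernel eventually declares the TCP connection dead.
func sendFile(src, targetAddr string) error {
	conn, err := net.Dial("tcp", targetAddr)
	if err != nil {
		return err
	}
	defer conn.Close()
	f, err := os.Open(src)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(conn, f)
	return err
}

func main() {
	// Hypothetical snapshot file standing in for volume-snap-<uuid>.img.
	if err := os.WriteFile("volume-snap-example.img", []byte("example data"), 0o644); err != nil {
		log.Fatal(err)
	}

	ln, err := net.Listen("tcp", "127.0.0.1:10035") // target side
	if err != nil {
		log.Fatal(err)
	}
	defer ln.Close()

	done := make(chan error, 1)
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			done <- err
			return
		}
		done <- receiveFile(conn, "volume-snap-example.img.received")
	}()

	if err := sendFile("volume-snap-example.img", ln.Addr().String()); err != nil { // source side
		log.Printf("sender: %v", err)
	}
	if err := <-done; err != nil {
		log.Printf("receiver: %v", err)
	}
}
```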

We can find the corresponding logs on:

  • target side (10.52.2.56): instance-manager-dac598cd4a1b746493fc409f60eaf07a
  • source side (10.52.1.62): `instance-manager-

// server
2024-05-11T10:33:19.304817291Z [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0] time="2024-05-11T10:33:19Z" level=info msg="Running ssync server for file volume-snap-38575222-e610-4632-8b6c-622faa205a55.img at port 10035" func="rpc.(*SyncAgentServer).launchReceiver.func1" file="server.go:406"

// client
[pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-4da6951a] time="2024-05-11T10:33:19Z" level=info msg="Syncing file volume-snap-38575222-e610-4632-8b6c-622faa205a55.img to 10.52.2.56:10035" func="rpc.(*SyncAgentServer).FileSend" file="server.go:342"
time="2024-05-11T10:33:19Z" level=info msg="Syncing file volume-snap-38575222-e610-4632-8b6c-622faa205a55.img to 10.52.2.56:10035: size 42949672960, directIO true, fastSync true" func=sparse.SyncFile file="client.go:110"
[pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-4da6951a] time="2024-05-11T10:33:19Z" level=warning msg="Failed to get change time and checksum of local file volume-snap-38575222-e610-4632-8b6c-622faa205a55.img" func=sparse.SyncContent file="client.go:149" error="failed to open checksum file: open volume-snap-38575222-e610-4632-8b6c-622faa205a55.img.checksum: no such file or directory"

// server
2024-05-11T10:33:20.038236353Z [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0] time="2024-05-11T10:33:19Z" level=info msg="Done running ssync server for file volume-snap-38575222-e610-4632-8b6c-622faa205a55.img at port 10035" func="rpc.(*SyncAgentServer).launchReceiver.func1" file="server.go:411"

2024-05-11T10:33:20.038295404Z time="2024-05-11T10:33:19Z" level=info msg="Running ssync server for file volume-snap-38575222-e610-4632-8b6c-622faa205a55.img.meta at port 10036" func="rpc.(*SyncAgentServer).launchReceiver.func1" file="server.go:406"
2024-05-11T10:33:20.066432364Z time="2024-05-11T10:33:20Z" level=info msg="Done running ssync server for file volume-snap-38575222-e610-4632-8b6c-622faa205a55.img.meta at port 10036" func="rpc.(*SyncAgentServer).launchReceiver.func1" file="server.go:411"


2024-05-11T10:33:20.083299991Z [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0] time="2024-05-11T10:33:20Z" level=info msg="Running ssync server for file volume-snap-76e04b0f-baf1-47ee-b7cd-c941937bac73.img at port 10037" func="rpc.(*SyncAgentServer).launchReceiver.func1" file="server.go:406"

// looks like server timeout here
2024-05-11T10:36:00.997806717Z [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0] time="2024-05-11T10:36:00Z" level=error msg="Shutting down the server since it is idle for 1m30s" func=rest.Server.func1 file="server.go:111"
2024-05-11T10:36:01.011643136Z [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0] time="2024-05-11T10:36:00Z" level=info msg="Done running ssync server for file volume-snap-76e04b0f-baf1-47ee-b7cd-c941937bac73.img at port 10037" func="rpc.(*SyncAgentServer).launchReceiver.func1" file="server.go:411"


// finally get the error, then replica recreate
2024-05-11T12:45:27.560746459Z [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0] time="2024-05-11T12:45:27Z" level=error msg="Sync agent gRPC server failed to rebuild replica/sync files" func="rpc.(*SyncAgentServer).FilesSync.func1" file="server.go:427" error="replica tcp://10.52.1.62:10080 failed to send file volume-snap-76e04b0f-baf1-47ee-b7cd-c941937bac73.img to 10.52.2.56:10037: failed to send file volume-snap-76e04b0f-baf1-47ee-b7cd-c941937bac73.img to 10.52.2.56:10037: rpc error: code = Unavailable desc = error reading from server: read tcp 10.52.2.56:40316->10.52.1.62:10082: read: connection timed out"

There are some points I would like to figure out more; maybe @ejweber can give some perspective from the LH side.

  1. Is the following log harmless? It looks like just a warning, with no error after it.
[pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-4da6951a] time="2024-05-11T10:33:19Z" level=warning msg="Failed to get change time and checksum of local file volume-snap-38575222-e610-4632-8b6c-622faa205a55.img" func=sparse.SyncContent file="client.go:149" error="failed to open checksum file: open volume-snap-38575222-e610-4632-8b6c-622faa205a55.img.checksum: no such file or directory"
  2. I saw instance-manager-fc152a26b22a0d1d244d139fc8acceda restart around 2024-05-11T10:33:31Z - 2024-05-11T10:42:20Z.
    I wonder if that caused the client not to report the error, because the gRPC server was already gone (due to the restart).
    But the timeout value looks like 2 hours from this log.
2024-05-11T12:45:27.560746459Z [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0] time="2024-05-11T12:45:27Z" level=error msg="Sync agent gRPC server failed to rebuild replica/sync files" func="rpc.(*SyncAgentServer).FilesSync.func1" file="server.go:427" error="replica tcp://10.52.1.62:10080 failed to send file volume-snap-76e04b0f-baf1-47ee-b7cd-c941937bac73.img to 10.52.2.56:10037: failed to send file volume-snap-76e04b0f-baf1-47ee-b7cd-c941937bac73.img to 10.52.2.56:10037: rpc error: code = Unavailable desc = error reading from server: read tcp 10.52.2.56:40316->10.52.1.62:10082: read: connection timed out"

I thought the gRPC client timeout was set to 24 hours, so I have no idea where the 2-hour timeout above comes from.
I also thought this was an edge case; it's hard to reproduce without any specific config.

You are right @Vicente-Cheng. I had noticed the following log before, but didn't fit it into my analysis.

2024-05-11T10:36:01.011643136Z [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0] time="2024-05-11T10:36:00Z" level=info msg="Done running ssync server for file volume-snap-76e04b0f-baf1-47ee-b7cd-c941937bac73.img at port 10037" func="rpc.(*SyncAgentServer).launchReceiver.func1" file="server.go:411"

This log indicates the ssync server (file receiver) running in pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0 timed out. However, since the server launched successfully, it is not an error within SyncFiles. The error only occurs when the file sender, pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-3d3731c3, finally fails to send the file AND the file receiver recognizes the failure. It's something like this:

eweber@laptop:~/longhorn-engine> cat /proc/sys/net/ipv4/tcp_keepalive_time 
7200
eweber@laptop:~/longhorn-engine> cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
eweber@laptop:~/longhorn-engine> cat /proc/sys/net/ipv4/tcp_keepalive_probes
9
  • That is: 120 min before the first probe, then 9 probes at 75-second intervals, for roughly 11 additional minutes (see the quick check after this list).
  • After approximately this amount of time, Linux informed pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0 that the TCP connection was dead and we got the error.
  • In instance-manager, we use gRPC keepalives to ensure something like this can't happen, but this connection is deeper into the stack.
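
As a quick sanity check of the math (my own arithmetic, using the sysctl values above), the dead-connection detection time works out to about 2h11m, which lines up with the gap between the rebuild starting at 10:33:19 and the error at 12:45:27:

```go
// Back-of-the-envelope check using the sysctl values shown above.
package main

import (
	"fmt"
	"time"
)

func main() {
	idle := 7200 * time.Second   // net.ipv4.tcp_keepalive_time
	probes := 9                  // net.ipv4.tcp_keepalive_probes
	interval := 75 * time.Second // net.ipv4.tcp_keepalive_intvl

	// The connection is declared dead after the idle period plus all
	// unanswered probes: 2h + 9*75s = 2h11m15s.
	fmt.Println(idle + time.Duration(probes)*interval) // 2h11m15s
}
```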

(Attached diagram: "Basic file sync" drawio)

Thanks for the clarification, @ejweber!

I think the 2-hour timeout is indeed related to the TCP keepalive mechanism, as you mentioned.
So the problem arises when two replicas establish a connection and try to sync a file, the source (sender) replica goes away, and the target (receiver) side never receives anything more.

In this case we can only rely on the TCP timeout. Do we need another mechanism to monitor the connection status while rebuilding?

Yes, I think so. The gRPC keepalive I mentioned could probably be used on the SendFile/FileSend RPC between the destination replica and the source replica. In this case a keepalive could have recognized very quickly that the source replica was gone. And in the case where the source replica is NOT gone but SyncContent is simply taking a very long time, the source replica would respond to the keepalive ping and the intended behavior would be maintained. I will create an issue in the Longhorn repo about this.
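
For reference, a rough sketch of what enabling client-side gRPC keepalives could look like in grpc-go. This is my assumption of the approach, not Longhorn's actual configuration: the address comes from the error log above, and the Time/Timeout values are illustrative.

```go
// Hedged sketch: gRPC client keepalives so a vanished peer is detected in
// seconds instead of waiting ~2h for the kernel's TCP keepalive.
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

func main() {
	conn, err := grpc.Dial(
		"10.52.1.62:10082", // source replica sync agent address from the logs (example only)
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // ping when the connection is idle this long
			Timeout:             10 * time.Second, // declare the peer dead if the ping is not acked
			PermitWithoutStream: true,             // keep pinging even with no active RPC
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// A long-running FileSend/SyncContent RPC issued over conn would now fail
	// within ~40s of the source replica vanishing, while a slow-but-alive
	// source would keep answering the pings and the transfer would continue.
}
```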

Additionally, we see a lot of contention between longhorn-managers trying to take ownership of the same object back and forth from each other at times. We may be able to improve something here as well.

For this one, I think longhorn/longhorn#7531 is probably related.