[BUG] Upgrade stuck at node `Pre-draining` waiting for volume rebuilding
albinsun opened this issue · comments
Describe the bug
Upgrading from `v1.2.1` to `v1.2.2-rc3` gets stuck at node `Pre-draining` because it keeps waiting for volume rebuilding.
To Reproduce
Note: not always reproducible; current reproducibility is 1/5.
- Set up 3 nodes with `harvester-v1.2.1`
- Enable the `rancher-monitoring` addon
- Import Harvester into `rancher-v2.7.11`
- Create an RKE2 cluster, deploy nginx and an LB
- 🔴 Upgrade Harvester to `v1.2.2-rc3`; the upgrade gets stuck in `Pre-draining`, waiting for volume rebuilding
Expected behavior
Upgrade completes successfully.
Support bundle
support-bundle-stuckRebuilding.zip
Upgrade log
hvst-upgrade-p4kfq-upgradelog-archive-stuckRebuilding.zip
Environment
- Harvester
  - Version: `v1.2.1` -> `v1.2.2-rc3`
  - Profile: QEMU/KVM, 3 nodes (8C/16G/500G)
  - ui-source: Auto
- Rancher
  - Version: `v2.7.11`
  - Profile: Helm (K3s) in QEMU/KVM (2C/4G)
Additional context
- `instance-manager-df64d0429b56f6f2f48b2c7150c32f38`

  ```
  [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-e-1] time="2024-05-11T11:46:35Z" level=warning msg="Failed to unmap" func="controller.(*Controller).UnmapAt" file="control.go:959" error="cannot unmap 188416 bytes at offset 15943786496 while rebuilding is in progress"
  ```

- `longhorn-manager`

  ```
  time="2024-05-11T11:48:42Z" level=info msg="Skipped rebuilding of replica because there is another rebuild in progress" func="controller.(*EngineController).rebuildNewReplica" file="engine_controller.go:1702" controller=longhorn-engine node=harvester-node-0 volume=pvc-9201003a-3a04-4ab2-bb17-0e505447dc80
  ```

Rebuilding becomes `Replica scheduling failed` afterward.
hvst-upgrade-replicaSchedulingFailed.zip
supportbundle_replicaSchedulingFailed.zip
FYI, I ran 3 more trials and did not hit this issue, so reproducibility is decreasing.
Still analyzing from the Longhorn side. No conclusion yet. Sorry for the delay!
@albinsun Was this tested on an air-gapped environment?
@starbops Could this be related to the auto-cleaned images, and your latest PR #5750 adding LH-related images to the reserved list?
Such logs are observed:
name: pvc-556f655d-7008-4750-a48c-99416b19dd8f
replica on node2: pvc-556f655d-7008-4750-a48c-99416b19dd8f-r-f3e4dede
engineName: pvc-556f655d-7008-4750-a48c-99416b19dd8f-e-487f7adb

```
longhorn-manager-9tbdg/longhorn-manager.log:2024-05-11T10:04:49.369802669Z time="2024-05-11T10:04:49Z" level=error msg="Failed to sync Longhorn replica" func=controller.handleReconcileErrorLogging file="utils.go:67" Replica=longhorn-system/pvc-556f655d-7008-4750-a48c-99416b19dd8f-r-f3e4dede controller=longhorn-replica error="failed to sync replica for longhorn-system/pvc-556f655d-7008-4750-a48c-99416b19dd8f-r-f3e4dede: failed to get instance manager for instance pvc-556f655d-7008-4750-a48c-99416b19dd8f-r-f3e4dede: cannot find the only available instance manager for instance pvc-556f655d-7008-4750-a48c-99416b19dd8f-r-f3e4dede, node harvester-node-2, instance manager image longhornio/longhorn-instance-manager:v1.5.5, type aio" node=harvester-node-2
longhorn-manager-m82f9/longhorn-manager.log.1:2024-05-11T10:05:33.731269148Z time="2024-05-11T10:05:33Z" level=warning msg="Failed to get engine proxy of pvc-556f655d-7008-4750-a48c-99416b19dd8f-e-487f7adb for volume pvc-556f655d-7008-4750-a48c-99416b19dd8f" func="metrics_collector.(*VolumeCollector).Collect" file="volume_collector.go:192" collector=volume error="failed to get binary client for engine pvc-556f655d-7008-4750-a48c-99416b19dd8f-e-487f7adb: cannot get client for engine pvc-556f655d-7008-4750-a48c-99416b19dd8f-e-487f7adb: engine is not running" node=harvester-node-0
```

The LH engine `pvc-556f655d-7008-4750-a48c-99416b19dd8f-e-487f7adb` has `currentState: stopped`:
```yaml
name: pvc-556f655d-7008-4750-a48c-99416b19dd8f-e-487f7adb
namespace: longhorn-system
ownerReferences:
- apiVersion: longhorn.io/v1beta2
  kind: Volume
  name: pvc-556f655d-7008-4750-a48c-99416b19dd8f
  uid: d5e0a640-93d8-49d4-af6d-66f4522843b0
resourceVersion: "948228"
uid: e71be425-99ad-42e8-bd11-92e2718e6b53
spec:
  active: true
  backendStoreDriver: v1
  backupVolume: "null"
  desireState: stopped
  disableFrontend: false
  engineImage: longhornio/longhorn-engine:v1.5.5
  frontend: blockdev
  logRequested: false
  nodeID: "null"
  replicaAddressMap:
    pvc-556f655d-7008-4750-a48c-99416b19dd8f-r-38fd9502: 10.52.1.14:10075
    pvc-556f655d-7008-4750-a48c-99416b19dd8f-r-3e3032fc: 10.52.0.50:10075
    pvc-556f655d-7008-4750-a48c-99416b19dd8f-r-f3e4dede: 10.52.2.11:10090
  requestedBackupRestore: "null"
currentSize: "10737418240"
currentState: stopped
endpoint: "null"
```
> Such logs are observed.
> @albinsun Was this tested on air-gapped environment?
> ...

No, but an ipxe-example env. on my local machine.
@albinsun The root cause of the failed rebuild of replica `pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-46c81deb` may be related to the message below:

```
longhorn-manager-m82f9/longhorn-manager.log:2024-05-11T13:32:02.444613957Z time="2024-05-11T13:32:02Z" level=error msg="There's no available disk for replica pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-46c81deb, size 42949672960" func="scheduler.(*ReplicaScheduler).ScheduleReplica" file="replica_scheduler.go:101"
```
Oh OK, that's quite possible, since each node only has a 500G disk and the failing run does include more backup/restore tests than the others.
It's good if this is just an environment issue.
Thank you @w13915984028.
BTW, in the support bundle there are some middle-state-related error messages which distracted my attention.
But for `pvc-9201003a-3a04-4ab2-bb17-0e505447dc80`, the `no available disk` error is the root cause. We are safe to proceed.
Closing as an environment issue.
Will pay more attention to this kind of exception next time.
Sorry for the inconvenience, and thank you for the help.
Thanks @w13915984028,
I focused on the wrong SB. The latest one is in #5789 (comment) instead of #5789 (comment).
And the root cause, as @w13915984028 mentioned, is related to the test environment. Thanks!
Thanks @w13915984028 and @Vicente-Cheng, I removed the milestone from the issue.
@w13915984028, good catch! I agree that the reason for the later "replica scheduling failed" is a lack of space as you described. It's a bit weird that we only hit it after the upgrade IMO, but I didn't investigate this too much.
@bk201 and @Vicente-Cheng, after a full analysis of the first support bundle, I think I have identified two Longhorn-related issues that led to the behavior @albinsun observed. I ran out of time to organize my notes into a detailed writeup, but for now, they are:
- When the migration engine `pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-e-1` was created, it almost immediately set two of its migration replicas to ERR due to a revision counter mismatch. Longhorn (probably correctly) did not update the fields of the migration replica CRs to reflect this while the migration was ongoing, but these two replicas immediately failed when the migration was complete. This is an indirect cause of the behavior @albinsun observed, because it led to the rebuilding that never completed.
- Once the migration was complete (from `harvester-node-1` to `harvester-node-0`), the two failed replicas immediately started to be rebuilt from the surviving replica (which happened to be on `harvester-node-1`). One replica was successfully rebuilt, but `harvester-node-1` was restarted after the second rebuild started and before it could complete. Because of the way files are synced during a rebuild (details to come), the loss of the source replica did not trigger a rebuild failure in a way that propagated up the stack to longhorn-manager. It did not realize the rebuild had failed for ~2 hours and 15 minutes. (I am pretty sure longhorn-manager was finally notified when the Linux TCP stack closed the connection between the rebuilding replica and its source client-side.) This is the direct cause of the behavior @albinsun observed, which didn't resolve itself until well into the second support bundle. I think it is because we only cancel the file-syncing connection if:
  - The source replica does not contact the receiver AT ALL before its 1m30s idle timer expires (https://github.com/longhorn/longhorn-engine/blob/a807f0fd6bfd9c4700f2c19808038e87a2ab814e/vendor/github.com/longhorn/sparse-tools/sparse/rest/server.go#L78-L117). This causes us to cancel from the receiving side.
  - The source replica times out while trying to send a chunk of the file to the receiver according to `HTTPClientTimeout`, which defaults to 30 seconds (https://github.com/longhorn/longhorn-engine/blob/a807f0fd6bfd9c4700f2c19808038e87a2ab814e/vendor/github.com/longhorn/sparse-tools/sparse/client.go#L97-L120). This causes us to cancel from the sending side.
  - In this case, the sender went away after establishing a connection, so I guess we did not detect it? We can investigate more and see if it is reproducible.
Additionally, we see a lot of contention between longhorn-managers trying to take ownership of the same object back and forth from each other at times. We may be able to improve something here as well.
I don't think these issues are likely to be caused by a regression. It seems likely to me that they combined to produce a behavior that is not very reproducible. I will create corresponding Longhorn issues as soon as I can.
@ejweber Excellent. The support bundle gives a lot of interesting information from the LH CRD objects and logs. It is worth filtering each clue and optimizing/enhancing correspondingly. Thanks.
Thanks, @ejweber
I also noticed that the client restarted and did not give any response, from the investigation below.
Let's focus on the replica re-creation at 2024-05-11T10:33:19, replica name: `pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0`.
```
2024-05-11T10:33:19.023742976Z [longhorn-instance-manager] time="2024-05-11T10:33:19Z" level=info msg="Adding replica" func="proxy.(*Proxy).ReplicaAdd" file="replica.go:33" currentSize=42949672960 engineName=pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-e-1 fastSync=true replicaAddress="tcp://10.52.2.56:10031" replicaName=pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0 restore=false serviceURL="10.52.0.128:10092" size=42949672960 volumeName=pvc-9201003a-3a04-4ab2-bb17-0e505447dc80
2024-05-11T10:33:19.043624243Z [longhorn-instance-manager] time="2024-05-11T10:33:19Z" level=info msg="Adding replica tcp://10.52.2.56:10031 in WO mode" func="sync.(*Task).AddReplica" file="sync.go:422"
...
2024-05-11T10:33:19.097305657Z [longhorn-instance-manager] time="2024-05-11T10:33:19Z" level=info msg="Using replica tcp://10.52.1.62:10080 as the source for rebuild" func="sync.(*Task).getTransferClients" file="sync.go:574"
2024-05-11T10:33:19.097616219Z [longhorn-instance-manager] time="2024-05-11T10:33:19Z" level=info msg="Using replica tcp://10.52.2.56:10031 as the target for rebuild" func="sync.(*Task).getTransferClients" file="sync.go:579"
...
// rebuilding
2024-05-11T10:33:19.180751990Z [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-3d3731c3] time="2024-05-11T10:33:19Z" level=info msg="Syncing file volume-head-003.img.meta to 10.52.2.56:10034" func="rpc.(*SyncAgentServer).FileSend" file="server.go:342"
2024-05-11T10:33:19.180807240Z time="2024-05-11T10:33:19Z" level=info msg="Syncing file volume-head-003.img.meta to 10.52.2.56:10034: size 178, directIO false, fastSync false" func=sparse.SyncFile file="client.go:110"
```

Note: `10.52.1.62:10080` is the replica `pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-4da6951a`.
Until now, we have generated the sync file list (the snap chain). Then, we try to sync files between node1 (source) and node2 (target).
For the file-sync behavior: the target (receiver) side launches an ssync server, and the source side acts as a client that sends the file to that server.
Server: https://github.com/longhorn/longhorn-engine/blob/master/pkg/sync/rpc/server.go#L466
Client: https://github.com/longhorn/longhorn-engine/blob/master/pkg/sync/rpc/server.go#L470-L472
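To make the direction concrete, here is a minimal Go sketch of the same shape using plain HTTP: the target replica launches a receiver server, and the source replica pushes the file as a client. This is illustrative only; the real ssync protocol in sparse-tools is chunked and sparse-aware, and `pushFile` is a made-up helper, not a Longhorn API.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// pushFile illustrates the receiver-server / sender-client shape: the
// target launches a server, the source pushes data to it as a client,
// and the function returns what the receiver stored.
func pushFile(data string) (string, error) {
	received := &bytes.Buffer{}

	// Target replica side: launch a receiver for the file contents.
	receiver := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.Copy(received, r.Body)
	}))
	defer receiver.Close()

	// Source replica side: act as the client and push the file.
	resp, err := http.Post(receiver.URL, "application/octet-stream", bytes.NewBufferString(data))
	if err != nil {
		return "", err
	}
	resp.Body.Close()

	return received.String(), nil
}

func main() {
	got, err := pushFile("volume-snap contents")
	if err != nil {
		panic(err)
	}
	fmt.Println(got) // prints "volume-snap contents"
}
```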
We can find the corresponding logs on:
- the target side (10.52.2.56), `instance-manager-dac598cd4a1b746493fc409f60eaf07a`
- the source side (10.52.1.62), `instance-manager-`
```
// server
2024-05-11T10:33:19.304817291Z [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0] time="2024-05-11T10:33:19Z" level=info msg="Running ssync server for file volume-snap-38575222-e610-4632-8b6c-622faa205a55.img at port 10035" func="rpc.(*SyncAgentServer).launchReceiver.func1" file="server.go:406"

// client
[pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-4da6951a] time="2024-05-11T10:33:19Z" level=info msg="Syncing file volume-snap-38575222-e610-4632-8b6c-622faa205a55.img to 10.52.2.56:10035" func="rpc.(*SyncAgentServer).FileSend" file="server.go:342"
time="2024-05-11T10:33:19Z" level=info msg="Syncing file volume-snap-38575222-e610-4632-8b6c-622faa205a55.img to 10.52.2.56:10035: size 42949672960, directIO true, fastSync true" func=sparse.SyncFile file="client.go:110"
[pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-4da6951a] time="2024-05-11T10:33:19Z" level=warning msg="Failed to get change time and checksum of local file volume-snap-38575222-e610-4632-8b6c-622faa205a55.img" func=sparse.SyncContent file="client.go:149" error="failed to open checksum file: open volume-snap-38575222-e610-4632-8b6c-622faa205a55.img.checksum: no such file or directory"

// server
2024-05-11T10:33:20.038236353Z [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0] time="2024-05-11T10:33:19Z" level=info msg="Done running ssync server for file volume-snap-38575222-e610-4632-8b6c-622faa205a55.img at port 10035" func="rpc.(*SyncAgentServer).launchReceiver.func1" file="server.go:411"
2024-05-11T10:33:20.038295404Z time="2024-05-11T10:33:19Z" level=info msg="Running ssync server for file volume-snap-38575222-e610-4632-8b6c-622faa205a55.img.meta at port 10036" func="rpc.(*SyncAgentServer).launchReceiver.func1" file="server.go:406"
2024-05-11T10:33:20.066432364Z time="2024-05-11T10:33:20Z" level=info msg="Done running ssync server for file volume-snap-38575222-e610-4632-8b6c-622faa205a55.img.meta at port 10036" func="rpc.(*SyncAgentServer).launchReceiver.func1" file="server.go:411"
2024-05-11T10:33:20.083299991Z [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0] time="2024-05-11T10:33:20Z" level=info msg="Running ssync server for file volume-snap-76e04b0f-baf1-47ee-b7cd-c941937bac73.img at port 10037" func="rpc.(*SyncAgentServer).launchReceiver.func1" file="server.go:406"

// looks like a server timeout here
2024-05-11T10:36:00.997806717Z [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0] time="2024-05-11T10:36:00Z" level=error msg="Shutting down the server since it is idle for 1m30s" func=rest.Server.func1 file="server.go:111"
2024-05-11T10:36:01.011643136Z [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0] time="2024-05-11T10:36:00Z" level=info msg="Done running ssync server for file volume-snap-76e04b0f-baf1-47ee-b7cd-c941937bac73.img at port 10037" func="rpc.(*SyncAgentServer).launchReceiver.func1" file="server.go:411"

// finally get the error, then the replica is recreated
2024-05-11T12:45:27.560746459Z [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0] time="2024-05-11T12:45:27Z" level=error msg="Sync agent gRPC server failed to rebuild replica/sync files" func="rpc.(*SyncAgentServer).FilesSync.func1" file="server.go:427" error="replica tcp://10.52.1.62:10080 failed to send file volume-snap-76e04b0f-baf1-47ee-b7cd-c941937bac73.img to 10.52.2.56:10037: failed to send file volume-snap-76e04b0f-baf1-47ee-b7cd-c941937bac73.img to 10.52.2.56:10037: rpc error: code = Unavailable desc = error reading from server: read tcp 10.52.2.56:40316->10.52.1.62:10082: read: connection timed out"
```
There are some points I would like to figure out; maybe @ejweber can give some perspective from the LH side.
- Is the following log harmless? It looks like just a warning, and there is no error after it.

  ```
  [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-4da6951a] time="2024-05-11T10:33:19Z" level=warning msg="Failed to get change time and checksum of local file volume-snap-38575222-e610-4632-8b6c-622faa205a55.img" func=sparse.SyncContent file="client.go:149" error="failed to open checksum file: open volume-snap-38575222-e610-4632-8b6c-622faa205a55.img.checksum: no such file or directory"
  ```

- I saw `instance-manager-fc152a26b22a0d1d244d139fc8acceda` restart around `2024-05-11T10:33:31Z` - `2024-05-11T10:42:20Z`. I wonder if that caused the client not to report the error, because the gRPC server was already gone (with the restart). But the timeout looks like 2 hours from this log:

  ```
  2024-05-11T12:45:27.560746459Z [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0] time="2024-05-11T12:45:27Z" level=error msg="Sync agent gRPC server failed to rebuild replica/sync files" func="rpc.(*SyncAgentServer).FilesSync.func1" file="server.go:427" error="replica tcp://10.52.1.62:10080 failed to send file volume-snap-76e04b0f-baf1-47ee-b7cd-c941937bac73.img to 10.52.2.56:10037: failed to send file volume-snap-76e04b0f-baf1-47ee-b7cd-c941937bac73.img to 10.52.2.56:10037: rpc error: code = Unavailable desc = error reading from server: read tcp 10.52.2.56:40316->10.52.1.62:10082: read: connection timed out"
  ```

  I thought the gRPC client timeout was set to 24 hours, so I have no idea where the 2-hour timeout above comes from.

And I think this is an edge case; it's hard to reproduce without any specific config.
You are right, @Vicente-Cheng. I had noticed the following log before but didn't fit it into my analysis.

```
2024-05-11T10:36:01.011643136Z [pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0] time="2024-05-11T10:36:00Z" level=info msg="Done running ssync server for file volume-snap-76e04b0f-baf1-47ee-b7cd-c941937bac73.img at port 10037" func="rpc.(*SyncAgentServer).launchReceiver.func1" file="server.go:411"
```

This log indicates the ssync server (file receiver) running in pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0 timed out. However, since the server launched successfully, it is not an error within `SyncFiles`. The error only occurs when the file sender, pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-3d3731c3, finally fails to send the file AND the file receiver recognizes the failure. It's something like this:
- pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0 is the `SyncAgentServer` responsible for handling the file sync. It is the replica that needs to be rebuilt.
- pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0 launches a receiver. https://github.com/longhorn/longhorn-engine/blob/7dbeb34fb049b1b0ca80c76d5c684b09c6d8b097/pkg/sync/rpc/server.go#L464
- pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0 sends a `SendFile` request to pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-3d3731c3. https://github.com/longhorn/longhorn-engine/blob/7dbeb34fb049b1b0ca80c76d5c684b09c6d8b097/pkg/sync/rpc/server.go#L468-L470
- As you mentioned, the `SendFile` request has the 24-hour `GRPCServiceLongTimeout`, which means the request will be canceled if it does not complete within 24 hours. https://github.com/longhorn/longhorn-engine/blob/7dbeb34fb049b1b0ca80c76d5c684b09c6d8b097/pkg/replica/client/client.go#L441-L460
- Presumably a TCP connection is established between the two replicas as part of the `SendFile` request, before pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-3d3731c3 disappears.
- At this point, pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0 is just waiting for the `SendFile` request to complete. There is nothing for it to do. pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-3d3731c3 is gone, but it wasn't necessarily expected to return anything over the connection for a long time.
- I think we finally receive an error after ~2 hours and 15 minutes because that is how long it takes Linux to complete its TCP keepalive behavior. It is not something currently built into our code. I don't know the values on the QA system, but on mine, they are:
```
eweber@laptop:~/longhorn-engine> cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
eweber@laptop:~/longhorn-engine> cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
eweber@laptop:~/longhorn-engine> cat /proc/sys/net/ipv4/tcp_keepalive_probes
9
```
- That is: 120 minutes before the first probe, then 9 probes at 75-second intervals, for roughly 11 additional minutes.
- After approximately this amount of time, Linux informed pvc-9201003a-3a04-4ab2-bb17-0e505447dc80-r-782731c0 that the TCP connection was dead and we got the error.
- In instance-manager, we use gRPC keepalives to ensure something like this can't happen, but this connection is deeper into the stack.
Thanks for the clarification, @ejweber!
I thought the 2-hour timeout was related to the TCP keepalive mechanism, as you mentioned.
So the problem occurs when two replicas establish a connection and try to sync a file, the source (sender) replica goes away, and the target (receiver) side does not receive anything.
In this case, we can only rely on the TCP timeout. Do we need another mechanism to monitor the connection status when rebuilding?
Yes, I think so. The gRPC keepalive I mentioned could probably be used on the `SendFile`/`FileSend` RPC between the destination replica and the source replica. In this case, a keepalive could have recognized very quickly that the source replica was gone. And in the case where the source replica is NOT gone, but `SyncContent` is genuinely taking a very long time, the source replica would respond to the keepalive ping and the intended behavior would be maintained. I will create an issue in the Longhorn repo about this.
> Additionally, we see a lot of contention between longhorn-managers trying to take ownership of the same object back and forth from each other at times. We may be able to improve something here as well.
For this one, I think longhorn/longhorn#7531 is probably related.