longhorn / longhorn

Cloud-Native distributed storage built on and for Kubernetes

Home Page: https://longhorn.io

[BUG] Uninstallation will fail if an invalid backup target is set.

chriscchien opened this issue

Describe the bug

Uninstallation will fail if an invalid backup target is set.

The following logs appeared repeatedly in the longhorn-uninstall pod:

" controller=longhorn-uninstall error="failed to touch the backup target CR for API version migration: Operation cannot be fulfilled on backuptargets.longhorn.io \"default\": the object has been modified; please apply your changes to the latest version and try again"
time="2024-06-20T04:16:12Z" level=warning msg="Failed to uninstall" func="controller.(*UninstallController).handleErr" file="uninstall_controller.go:293" controller=longhorn-uninstall error="failed to touch the backup target CR for API version migration: Operation cannot be fulfilled on backuptargets.longhorn.io \"default\": the object has been modified; please apply your changes to the latest version and try again"

To Reproduce

  1. Set up an invalid backup target (one way to script this is sketched after these steps).
  2. Uninstall Longhorn:
kubectl create -f https://raw.githubusercontent.com/longhorn/longhorn/master/uninstall/uninstall.yaml
kubectl get job/longhorn-uninstall -n longhorn-system -w
  3. The uninstallation job never finishes.
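
For step 1, a minimal sketch of how the invalid target could be set programmatically, assuming the backup target is still configured through the backup-target Setting CR (settings.longhorn.io/v1beta2) in the longhorn-system namespace, as on master at the time; the unresolvable host name is made up:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the kubeconfig from the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(cfg)

	// Point the backup-target setting at a host that cannot be resolved,
	// which makes the backup target invalid.
	gvr := schema.GroupVersionResource{Group: "longhorn.io", Version: "v1beta2", Resource: "settings"}
	patch := []byte(`{"value":"nfs://no-such-host.invalid:/opt/backupstore"}`)
	if _, err := client.Resource(gvr).Namespace("longhorn-system").Patch(
		context.TODO(), "backup-target", types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("backup target set to an unresolvable NFS URL")
}

The same change can also be made by hand through the Backup Target field in the Longhorn UI settings.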

Expected behavior

Uninstallation finishes successfully.

Support bundle for troubleshooting

supportbundle_7c841ce4-1dcf-488a-b3aa-052ae4784882_2024-06-20T04-19-00Z.zip

Environment

  • Longhorn version: master

Additional context

N/A

@chriscchien Does this also happen in v1.6.x and v1.5.x?

Quickly scanning the longhorn-manager logs in the support bundle, within a single second there are lots of these:

2024-06-20T04:18:44.848117979Z time="2024-06-20T04:18:44Z" level=error msg="Failed to get info from backup store" func="controller.(*BackupTargetController).reconcile" file="backup_target_controller.go:389" controller=longhorn-backup-target cred= error="failed to list backup volumes in nfs://longhorn-test-nfs-svc.defsdfsfdault:/opt/backupstore: error listing backup volume names: failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn [/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-master-head/longhorn backup ls --volume-only nfs://longhorn-test-nfs-svc.defsdfsfdault:/opt/backupstore], output cannot mount nfs longhorn-test-nfs-svc.defsdfsfdault:/opt/backupstore, options [nfsvers=4.0 actimeo=1 soft timeo=300 retry=2]: vers=4.0: mount failed: exit status 32\nMounting command: mount\nMounting arguments: -t nfs4 -o nfsvers=4.0,actimeo=1,soft,timeo=300,retry=2 longhorn-test-nfs-svc.defsdfsfdault:/opt/backupstore /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_defsdfsfdault/opt/backupstore\nOutput: mount.nfs4: Failed to resolve server longhorn-test-nfs-svc.defsdfsfdault: Name or service not known\n: vers=4.1: mount failed: exit status 32\nMounting command: mount\nMounting arguments: -t nfs4 -o nfsvers=4.1,actimeo=1,soft,timeo=300,retry=2 longhorn-test-nfs-svc.defsdfsfdault:/opt/backupstore /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_defsdfsfdault/opt/backupstore\nOutput: mount.nfs4: Failed to resolve server longhorn-test-nfs-svc.defsdfsfdault: Name or service not known\n: vers=4.2: mount failed: exit status 32\nMounting command: mount\nMounting arguments: -t nfs4 -o nfsvers=4.2,actimeo=1,soft,timeo=300,retry=2 longhorn-test-nfs-svc.defsdfsfdault:/opt/backupstore /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_defsdfsfdault/opt/backupstore\nOutput: mount.nfs4: Failed to resolve server longhorn-test-nfs-svc.defsdfsfdault: Name or service not known\n: cannot mount using NFSv4\n, stderr warning: GOCOVERDIR not set, no coverage data emitted\ntime=\"2024-06-20T04:18:44Z\" level=warning msg=\"Trying reading mount point /var/lib/longhorn-backupstore-mounts/longhorn-test-nfs-svc_defsdfsfdault/opt/backupstore to make sure it is healthy\" func=util.EnsureMountPoint file=\"util.go:309\" pkg=nfs\ntime=\"2024-06-20T04:18:44Z\" l
...

And the frequent updates to the backup target status block the uninstall procedure.
The controller should only poll the backup target status at the configured pollInterval.
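
A rough sketch of that gating, assuming the BackupTarget CRD exposes spec.pollInterval and status.lastSyncedAt as in the v1beta2 API (illustrative only, not the actual controller code):

// Skip hitting the backup store until pollInterval has elapsed since the last sync.
interval := backupTarget.Spec.PollInterval.Duration
if interval > 0 && time.Since(backupTarget.Status.LastSyncedAt.Time) < interval {
	return nil // nothing to refresh yet; wait for the next poll tick
}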

What is the poll interval set to that causes this frequent update? The default should be 300 seconds.

Would the workaround be to disable it by setting it to 0?

I think the frequent update is caused by the error messages not being identical (they include timestamps like E0620 04:18:45.681625 20213 mount_linux.go:236]).

The workaround would be to empty the backup target URL first.
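
To illustrate the point above about the messages never matching, a toy example (not Longhorn code): two otherwise identical error messages that embed a timestamp never compare equal, so a controller that updates the status whenever the stored message differs will write an update on every reconcile.

package main

import (
	"fmt"
	"time"
)

func main() {
	// Simulate two consecutive reconciles producing the "same" mount error.
	msgA := fmt.Sprintf("E%s 20213 mount_linux.go:236] cannot mount nfs", time.Now().Format("0102 15:04:05.000000"))
	time.Sleep(10 * time.Millisecond)
	msgB := fmt.Sprintf("E%s 20213 mount_linux.go:236] cannot mount nfs", time.Now().Format("0102 15:04:05.000000"))

	// The embedded timestamps differ, so a naive "did the status change?" check always says yes.
	fmt.Println(msgA == msgB) // false -> status update -> more chances for a conflict during uninstall
}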

@chriscchien Does this also happen in v1.6.x and v1.5.x?

This issue cannot be reproduced on v1.5.5 and v1.6.2.

" controller=longhorn-uninstall error="failed to touch the backup target CR for API version migration: Operation cannot be fulfilled on backuptargets.longhorn.io \"default\": the object has been modified; please apply your changes to the latest version and try again"

If it runs into the object has been modified error, the failed update can be ignored. The purpose of the touch (update) is to trigger the API version migration, and the object has been modified error indicates the resource has already been updated, so it will still be migrated.
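
A minimal sketch of that handling in the uninstall controller's touch loop (an assumption about the shape of the fix; the actual change is in longhorn/longhorn-manager#2897), where apierrors is k8s.io/apimachinery/pkg/api/errors:

// A conflict means another writer already updated the CR, which is enough to
// trigger the API version migration, so it does not need to fail the uninstall.
for _, bt := range backupTargets {
	if _, err := c.ds.UpdateBackupTarget(bt); err != nil && !apierrors.IsConflict(err) {
		return errors.Wrap(err, "failed to touch the backup target CR for API version migration")
	}
}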

Pre Ready-For-Testing Checklist

  • Where are the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:
  1. Set up an invalid backup target
  2. Uninstall Longhorn
  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:
  • Empty the backup target URL.
  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore, etc.) (including backport-needed/*)?
    The PR is at
    longhorn/longhorn-manager#2897

  • Which areas/issues might this PR have potential impacts on?
    Area
    Issues

I think the frequent update is caused by the error messages not being identical (they include timestamps like E0620 04:18:45.681625 20213 mount_linux.go:236]).

Related to #8224.

@chriscchien Does this also happen in v1.6.x and v1.5.x?

This issue cannot be reproduced on v1.5.5 and v1.6.2.

Should we backport this, since longhorn/longhorn-manager#2812 was already backported to v1.6.2 and v1.5.6 (unreleased)?

" controller=longhorn-uninstall error="failed to touch the backup target CR for API version migration: Operation cannot be fulfilled on backuptargets.longhorn.io \"default\": the object has been modified; please apply your changes to the latest version and try again"

Does this mean that if Longhorn is installed from the master branch (w/o the fix) with a valid/invalid backup target configured, the installation will always fail?

Uninstallation? If the backup target is valid, it won't trigger the issue.

ah, typo..

Whether the backup target is valid or invalid, both cases go through the UpdateBackupTarget call below. Why does only the invalid case trigger the issue? Is there anything I missed?

/controller/uninstall_controller.go#L322-L328

		} else if len(backupTargets) > 0 {
			for _, bt := range backupTargets {
				if _, err = c.ds.UpdateBackupTarget(bt); err != nil {
					return errors.Wrap(err, "failed to touch the backup target CR for API version migration")
				}
			}
		}

Whether the backup target is valid or invalid, both cases go through the UpdateBackupTarget call below. Why does only the invalid case trigger the issue? Is there anything I missed?

I think it is very unlikely for there to be a conflict in the case of a valid BackupTarget. But for the invalid case, #8224 causes frequent updates, so a conflict is quite likely.
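
For reference, another common way to make such a touch update robust against conflicts is to re-read and retry (illustrative only, not necessarily what the fix does; GetBackupTarget is assumed here as a datastore getter that returns the latest revision), using retry from k8s.io/client-go/util/retry:

for _, bt := range backupTargets {
	if err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Re-read the latest revision before each attempt so the update does not
		// carry a stale resourceVersion (GetBackupTarget is assumed here).
		fresh, getErr := c.ds.GetBackupTarget(bt.Name)
		if getErr != nil {
			return getErr
		}
		_, updateErr := c.ds.UpdateBackupTarget(fresh)
		return updateErr
	}); err != nil {
		return errors.Wrap(err, "failed to touch the backup target CR for API version migration")
	}
}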

Whether the backup target is valid or invalid, both cases go through the UpdateBackupTarget call below. Why does only the invalid case trigger the issue? Is there anything I missed?

https://github.com/longhorn/longhorn-manager/blob/master/controller/backup_target_controller.go#L394-L397
It is due to the frequent update of an invalid backup target.
Although the error message is the same, a different timestamp always leads to an update.

https://github.com/longhorn/longhorn-manager/blob/master/controller/backup_target_controller.go#L383-L391
It is due to the frequent update of an invalid backup target.
Although the error message is the same, a different timestamp always leads to an update.

Well explained @derekbit @ejweber

Verified pass on longhorn master (longhorn-manager b19161) with the test steps.

Uninstallation succeeds when an invalid backup target is set.