openshift / oadp-operator

OADP Operator

Test workloads may not clean up properly after each test

kaovilai opened this issue

#1223

ReportAfterEach, which contained the cleanup code, was refactored recently, and some bugs may have been introduced:

https://github.com/openshift/oadp-operator/pull/1223/files#diff-f7f90d1b403b9620176bb058724a225eacdbf0fb56df0c9f77f7cd4f3152f917L68

	var _ = ReportAfterEach(func(report SpecReport) {
		if report.State == types.SpecStateSkipped || report.State == types.SpecStatePending {
			// do not run if the test is skipped
			return
		}
		GinkgoWriter.Println("Report after each: state: ", report.State.String())
		if report.Failed() {
			// print namespace error events for app namespace
			if lastBRCase.ApplicationNamespace != "" {
				GinkgoWriter.Println("Printing app namespace events")
				PrintNamespaceEventsAfterTime(kubernetesClientForSuiteRun, lastBRCase.ApplicationNamespace, lastInstallTime)
			}
			GinkgoWriter.Println("Printing oadp namespace events")
			PrintNamespaceEventsAfterTime(kubernetesClientForSuiteRun, namespace, lastInstallTime)
			baseReportDir := artifact_dir + "/" + report.LeafNodeText
			err := os.MkdirAll(baseReportDir, 0755)
			Expect(err).NotTo(HaveOccurred())
			err = SavePodLogs(kubernetesClientForSuiteRun, namespace, baseReportDir)
			Expect(err).NotTo(HaveOccurred())
			err = SavePodLogs(kubernetesClientForSuiteRun, lastBRCase.ApplicationNamespace, baseReportDir)
			Expect(err).NotTo(HaveOccurred())
		}
		// remove app namespace if leftover (likely previously failed before reaching uninstall applications) to clear items such as PVCs which are immutable so that next test can create new ones
		err := dpaCR.Client.Delete(context.Background(), &corev1.Namespace{ObjectMeta: v1.ObjectMeta{
			Name:      lastBRCase.ApplicationNamespace,
			Namespace: lastBRCase.ApplicationNamespace,
		}}, &client.DeleteOptions{})
		if k8serror.IsNotFound(err) {
			err = nil
		}
		Expect(err).ToNot(HaveOccurred())

		err = dpaCR.Delete(runTimeClientForSuiteRun)
		Expect(err).ToNot(HaveOccurred())
		Eventually(IsNamespaceDeleted(kubernetesClientForSuiteRun, lastBRCase.ApplicationNamespace), timeoutMultiplier*time.Minute*2, time.Second*5).Should(BeTrue())
	})
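
For reference, IsNamespaceDeleted above is presumably a poll function handed to Eventually. A minimal sketch of what such a helper could look like, assuming a plain kubernetes.Interface client and the same k8serror/v1 aliases as in the snippet above (the actual helper in the repo may differ):

	// returns a func for Eventually that reports true once the namespace is gone
	func IsNamespaceDeleted(clientset kubernetes.Interface, namespace string) func() bool {
		return func() bool {
			_, err := clientset.CoreV1().Namespaces().Get(context.Background(), namespace, v1.GetOptions{})
			// the namespace counts as deleted once the API returns NotFound
			return k8serror.IsNotFound(err)
		}
	}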

I see this was moved into func tearDownBackupAndRestore(...),

	var _ = AfterEach(func(ctx SpecContext) {
		tearDownBackupAndRestore(lastBRCase, lastInstallTime, ctx.SpecReport())
	})

and it is being run here.
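
A rough sketch of what the refactored helper presumably looks like now. The function name and call site are from the snippet above; the parameter types and body are my assumptions, reusing the suite globals and the cleanup steps from the old ReportAfterEach, not the actual code from #1223:

	func tearDownBackupAndRestore(brCase BackupRestoreCase, installTime time.Time, report SpecReport) {
		// debug marker mentioned further down in this thread
		GinkgoWriter.Println("entering after each")
		if report.Failed() {
			// dump namespace events (and pod logs) as the old ReportAfterEach did
			PrintNamespaceEventsAfterTime(kubernetesClientForSuiteRun, brCase.ApplicationNamespace, installTime)
			PrintNamespaceEventsAfterTime(kubernetesClientForSuiteRun, namespace, installTime)
		}
		// ...then the same app namespace / DPA cleanup as in the ReportAfterEach body above...
	}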

This should still not run for anything that is skipped.

		if report.State == types.SpecStateSkipped || report.State == types.SpecStatePending {
			// do not run if the test is skipped
			return
		}

I'm not seeing the SpecStateSkipped check added.

I think this is not needed for AfterEach. The docs for ReportAfterEach say: "ReportAfterEach nodes are run for each spec, even if the spec is skipped or pending."
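
Rather than reasoning from the docs alone, a throwaway spec like the following (a sanity-check sketch, not part of the suite) makes it easy to observe which cleanup/report nodes actually fire for a runtime-skipped spec versus a pending one:

	var _ = AfterEach(func() {
		GinkgoWriter.Println("AfterEach ran")
	})

	var _ = ReportAfterEach(func(report SpecReport) {
		GinkgoWriter.Println("ReportAfterEach ran, state:", report.State.String())
	})

	var _ = Describe("cleanup node behavior", func() {
		It("is skipped at runtime", func() {
			Skip("skip inside the spec body")
		})
		PIt("is pending", func() {})
	})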

I'm not sure exactly what I spotted, but it was along the lines of failures somewhere in the logs where an existing workload deployment already existed and blocked creation of a new one.

I will run e2e later and check that cleanup still works, and will close this if so.

With the debug logs it should be easy to catch missed AfterEach calls; they show up with a message like "entering after each".

Example 2: #1218 (comment)
There are 9 results with - "velero.io/backup-name": string("
which is part of the output that shows an unexpected label diff against an existing workload that was not cleaned up.

The following backup-name label diff should not occur on e2e-app creation if cleanup was done (i.e. the object should not already exist in the cluster).

diff found for key: metadata
    map[string]any{
    	"creationTimestamp": string("2023-11-27T23:06:24Z"),
    	"generation":        int64(1),
    	"labels": map[string]any{
    		"e2e-app":                string("true"),
  - 		"velero.io/backup-name":  string("mongo-datamover-e2e-3ec695a1-8d79-11ee-8f5d-0a580a83a32f"),
  - 		"velero.io/restore-name": string("mongo-datamover-e2e-3ec695aa-8d79-11ee-8f5d-0a580a83a32f"),

/label CI

@mpryc: The label(s) /label CI cannot be applied. These labels are supported: acknowledge-critical-fixes-only, platform/aws, platform/azure, platform/baremetal, platform/google, platform/libvirt, platform/openstack, ga, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, px-approved, docs-approved, qe-approved, downstream-change-needed, rebase/manual, approved, backport-risk-assessed, bugzilla/valid-bug, cherry-pick-approved, jira/valid-bug, staff-eng-approved. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to this:

/label CI

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

/lifecycle frozen

Looks like we're not cleaning up SCCs. This log kept popping up when installing apps:

updating resource: security.openshift.io/v1, Kind=SecurityContextConstraints; name: mongo-persistent-scc

@kaovilai yeah, those are cluster-scoped, so namespace removal won't get rid of them.
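
Since SecurityContextConstraints are cluster-scoped, the teardown would have to delete them by name on top of removing the namespace. A hedged sketch using the controller-runtime client and the OpenShift security API types; whether the e2e suite should handle it exactly this way is not confirmed, and it assumes security.openshift.io/v1 is registered in the client's scheme:

	import (
		"context"

		securityv1 "github.com/openshift/api/security/v1"
		k8serror "k8s.io/apimachinery/pkg/api/errors"
		metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
		"sigs.k8s.io/controller-runtime/pkg/client"
	)

	// deleteLeftoverSCC removes a cluster-scoped SCC created by a test app;
	// deleting the app namespace alone will not remove it.
	func deleteLeftoverSCC(ctx context.Context, c client.Client, name string) error {
		scc := &securityv1.SecurityContextConstraints{ObjectMeta: metav1.ObjectMeta{Name: name}}
		if err := c.Delete(ctx, scc); err != nil && !k8serror.IsNotFound(err) {
			return err
		}
		return nil
	}

It could then be called from the teardown as, for example, deleteLeftoverSCC(context.Background(), dpaCR.Client, "mongo-persistent-scc"), using the SCC name from the log above.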