openshift / oadp-operator

OADP Operator

Test workloads may not clean up properly after each test

kaovilai opened this issue

#1223

ReportAfterEach, which contained the cleanup code, was refactored recently, and some bugs may have been introduced:

https://github.com/openshift/oadp-operator/pull/1223/files#diff-f7f90d1b403b9620176bb058724a225eacdbf0fb56df0c9f77f7cd4f3152f917L68

	var _ = ReportAfterEach(func(report SpecReport) {
		if report.State == types.SpecStateSkipped || report.State == types.SpecStatePending {
			// do not run if the test is skipped
			return
		}
		GinkgoWriter.Println("Report after each: state: ", report.State.String())
		if report.Failed() {
			// print namespace error events for app namespace
			if lastBRCase.ApplicationNamespace != "" {
				GinkgoWriter.Println("Printing app namespace events")
				PrintNamespaceEventsAfterTime(kubernetesClientForSuiteRun, lastBRCase.ApplicationNamespace, lastInstallTime)
			}
			GinkgoWriter.Println("Printing oadp namespace events")
			PrintNamespaceEventsAfterTime(kubernetesClientForSuiteRun, namespace, lastInstallTime)
			baseReportDir := artifact_dir + "/" + report.LeafNodeText
			err := os.MkdirAll(baseReportDir, 0755)
			Expect(err).NotTo(HaveOccurred())
			err = SavePodLogs(kubernetesClientForSuiteRun, namespace, baseReportDir)
			Expect(err).NotTo(HaveOccurred())
			err = SavePodLogs(kubernetesClientForSuiteRun, lastBRCase.ApplicationNamespace, baseReportDir)
			Expect(err).NotTo(HaveOccurred())
		}
		// remove app namespace if leftover (likely previously failed before reaching uninstall applications) to clear items such as PVCs which are immutable so that next test can create new ones
		err := dpaCR.Client.Delete(context.Background(), &corev1.Namespace{ObjectMeta: v1.ObjectMeta{
			Name:      lastBRCase.ApplicationNamespace,
			Namespace: lastBRCase.ApplicationNamespace,
		}}, &client.DeleteOptions{})
		if k8serror.IsNotFound(err) {
			err = nil
		}
		Expect(err).ToNot(HaveOccurred())

		err = dpaCR.Delete(runTimeClientForSuiteRun)
		Expect(err).ToNot(HaveOccurred())
		Eventually(IsNamespaceDeleted(kubernetesClientForSuiteRun, lastBRCase.ApplicationNamespace), timeoutMultiplier*time.Minute*2, time.Second*5).Should(BeTrue())
	})
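
For reference, IsNamespaceDeleted above is presumably a poll function handed to Eventually. A minimal sketch of what such a helper could look like, assuming a plain kubernetes.Interface client and the same k8serror/v1 aliases as in the snippet above (the actual helper in the repo may differ):

	// returns a func for Eventually that reports true once the namespace is gone
	func IsNamespaceDeleted(clientset kubernetes.Interface, namespace string) func() bool {
		return func() bool {
			_, err := clientset.CoreV1().Namespaces().Get(context.Background(), namespace, v1.GetOptions{})
			// the namespace counts as deleted once the API returns NotFound
			return k8serror.IsNotFound(err)
		}
	}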

I see this was moved into func tearDownBackupAndRestore(...),

	var _ = AfterEach(func(ctx SpecContext) {
		tearDownBackupAndRestore(lastBRCase, lastInstallTime, ctx.SpecReport())
	})

and it is being run here.
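
A rough sketch of what the refactored helper presumably looks like now. The function name and call site are from the snippet above; the parameter types and body are my assumptions, reusing the suite globals and the cleanup steps from the old ReportAfterEach, not the actual code from #1223:

	func tearDownBackupAndRestore(brCase BackupRestoreCase, installTime time.Time, report SpecReport) {
		// debug marker mentioned further down in this thread
		GinkgoWriter.Println("entering after each")
		if report.Failed() {
			// dump namespace events (and pod logs) as the old ReportAfterEach did
			PrintNamespaceEventsAfterTime(kubernetesClientForSuiteRun, brCase.ApplicationNamespace, installTime)
			PrintNamespaceEventsAfterTime(kubernetesClientForSuiteRun, namespace, installTime)
		}
		// ...then the same app namespace / DPA cleanup as in the ReportAfterEach body above...
	}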

This should still not run for anything that is skipped.

		if report.State == types.SpecStateSkipped || report.State == types.SpecStatePending {
			// do not run if the test is skipped
			return
		}

I'm not seeing the SpecStateSkipped check added.

I think this is not needed for AfterEach. The docs for ReportAfterEach say: "ReportAfterEach nodes are run for each spec, even if the spec is skipped or pending."
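
Rather than reasoning from the docs alone, a throwaway spec like the following (a sanity-check sketch, not part of the suite) makes it easy to observe which cleanup/report nodes actually fire for a runtime-skipped spec versus a pending one:

	var _ = AfterEach(func() {
		GinkgoWriter.Println("AfterEach ran")
	})

	var _ = ReportAfterEach(func(report SpecReport) {
		GinkgoWriter.Println("ReportAfterEach ran, state:", report.State.String())
	})

	var _ = Describe("cleanup node behavior", func() {
		It("is skipped at runtime", func() {
			Skip("skip inside the spec body")
		})
		PIt("is pending", func() {})
	})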

I'm not sure exactly what I spotted, but it was along the lines of failures somewhere in the logs where an existing workload deployment already existed and blocked creation of a new one.

I will run e2e later and check that cleanup still works, and will close this if so.

With the debug logs it should be easy to catch missed AfterEach calls; they show up with a message like "entering after each".

Example 2: #1218 (comment)
There are 9 results with - "velero.io/backup-name": string("
which is part of the output that shows an unexpected label diff against an existing workload that was not cleaned up.

The following backup-name label diff should not occur on e2e-app creation if cleanup was done (i.e. the object should not already exist in the cluster).

diff found for key: metadata
    map[string]any{
    	"creationTimestamp": string("2023-11-27T23:06:24Z"),
    	"generation":        int64(1),
    	"labels": map[string]any{
    		"e2e-app":                string("true"),
  - 		"velero.io/backup-name":  string("mongo-datamover-e2e-3ec695a1-8d79-11ee-8f5d-0a580a83a32f"),
  - 		"velero.io/restore-name": string("mongo-datamover-e2e-3ec695aa-8d79-11ee-8f5d-0a580a83a32f"),

/label CI

@mpryc: The label(s) /label CI cannot be applied. These labels are supported: acknowledge-critical-fixes-only, platform/aws, platform/azure, platform/baremetal, platform/google, platform/libvirt, platform/openstack, ga, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, px-approved, docs-approved, qe-approved, downstream-change-needed, rebase/manual, approved, backport-risk-assessed, bugzilla/valid-bug, cherry-pick-approved, jira/valid-bug, staff-eng-approved. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to this:

/label CI

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

/lifecycle frozen

Looks like we're not cleaning up SCCs. This log kept popping up when installing apps:

updating resource: security.openshift.io/v1, Kind=SecurityContextConstraints; name: mongo-persistent-scc

@kaovilai yeah, those are cluster-scoped, so namespace removal won't get rid of them.
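
Since SecurityContextConstraints are cluster-scoped, the teardown would have to delete them by name on top of removing the namespace. A hedged sketch using the controller-runtime client and the OpenShift security API types; whether the e2e suite should handle it exactly this way is not confirmed, and it assumes security.openshift.io/v1 is registered in the client's scheme:

	import (
		"context"

		securityv1 "github.com/openshift/api/security/v1"
		k8serror "k8s.io/apimachinery/pkg/api/errors"
		metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
		"sigs.k8s.io/controller-runtime/pkg/client"
	)

	// deleteLeftoverSCC removes a cluster-scoped SCC created by a test app;
	// deleting the app namespace alone will not remove it.
	func deleteLeftoverSCC(ctx context.Context, c client.Client, name string) error {
		scc := &securityv1.SecurityContextConstraints{ObjectMeta: metav1.ObjectMeta{Name: name}}
		if err := c.Delete(ctx, scc); err != nil && !k8serror.IsNotFound(err) {
			return err
		}
		return nil
	}

It could then be called from the teardown as, for example, deleteLeftoverSCC(context.Background(), dpaCR.Client, "mongo-persistent-scc"), using the SCC name from the log above.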