radius-project / radius

Radius is a cloud-native, portable application platform that makes app development easier for teams building cloud-native apps.

Home Page: https://radapp.io


[Proposal] Adjust failure criteria for functional test

youngbupark opened this issue · comments

Problem overview

The main goal of functional tests is to verify the end-to-end functionality and correctness of the system. Because our functional tests rely on external dependencies, intermittent failures are expected.

Functional test failures can be grouped into three primary categories:

  • External dependencies
  • Test framework issues
  • Bugs in Radius code

Based on previous observations, most failures are associated with external dependencies. However, issues in both the test framework and the Radius code are also significant because they directly impact test reliability. "Resource not found" is one of the known problems in the test framework category.

Experiments

To address intermittent failures, we experimented with two different strategies:

  • Adding a retry at each Go test function (see the sketch after this list).
  • Retrying the entire workflow step using make functional-tests.
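
Below is a minimal sketch of what the per-test retry in the first experiment could look like for a plain Go test. The `runWithRetry` helper, the test name, and the retry delay are illustrative assumptions, not the actual Radius test framework API.

```go
package functional

import (
	"testing"
	"time"
)

// runWithRetry re-runs fn up to maxAttempts times and fails the test only
// if every attempt fails. Each failed attempt is logged so the retry path
// can be audited later.
func runWithRetry(t *testing.T, maxAttempts int, fn func() error) {
	t.Helper()
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if lastErr = fn(); lastErr == nil {
			return
		}
		t.Logf("attempt %d/%d failed: %v", attempt, maxAttempts, lastErr)
		time.Sleep(10 * time.Second)
	}
	t.Fatalf("all %d attempts failed, last error: %v", maxAttempts, lastErr)
}

func Test_ContainerDeployment(t *testing.T) {
	runWithRetry(t, 3, func() error {
		// deploy the test application and validate its resources here
		return nil
	})
}
```

The maximum of three attempts matches the retry limit mentioned later in this thread.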

Alternative - adding API level retry in Radius service

Given that Radius calls the control plane APIs of each cloud provider, repeatedly invoking the CP APIs in some scenarios could severely impact ongoing services. Thus, adopting a fail-fast strategy is preferable to implementing retries.

Observations

Adding a retry at each Go test function

  • Over a two-week period, the failure rate was reduced to 0.5% (2 out of 362 tests); the two failures didn't stem from the functional tests themselves but were instead related to setup issues while configuring the cluster.
  • This approach significantly improved reliability but might hide underlying root causes within the functional tests.

Retrying the entire workflow step using make functional-tests

  • This method frequently failed because it reran the tests from the start without cleaning up resources in Kubernetes and Azure.
  • This approach can also hide underlying root causes within the functional tests.

Conclusion

While retries at the Go test function level improve apparent reliability, they can obscure systemic problems within the tests. Functional tests should be designed to readily expose underlying issues. This may seem counterproductive, but it is crucial for detecting bugs, such as race conditions, that could be missed during manual testing by contributors.

Proposal

The proposal aims to reduce failure noise while preserving early detection of real root causes, without hiding them behind excessive retries or alert noise, thereby supporting a more stable and reliable testing environment.

To better balance issue detection and noise reduction, I propose the following adjustments:

  • Only generate issues after at least two consecutive test failures (see the sketch after this list). This approach reduces noise while ensuring that recurring problems are addressed.
  • Reevaluate the alert triggering thresholds. Considering that no system achieves 100% SLO and that our actual SLO is a product of multiple dependent services' SLOs/SLAs, our current trigger for alerts appears overly aggressive. Allowing for a few failures could provide a more realistic measure of system robustness and service reliability.
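
As a rough illustration of the first point, the check below gates issue creation on consecutive failures. It assumes the recent workflow run conclusions are already available (for example, fetched from the GitHub API in the alerting workflow); the function and parameter names are hypothetical.

```go
package alerting

// shouldCreateIssue returns true only when the most recent `threshold`
// workflow runs (ordered newest first, e.g. ["failure", "failure", "success"])
// all concluded with "failure".
func shouldCreateIssue(recentConclusions []string, threshold int) bool {
	if threshold <= 0 || len(recentConclusions) < threshold {
		return false
	}
	for _, conclusion := range recentConclusions[:threshold] {
		if conclusion != "failure" {
			return false
		}
	}
	return true
}
```

With a threshold of 2, a single intermittent failure produces no issue, but two failures in a row do.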

AB#11792

👋 @youngbupark Thanks for filing this issue.

A project maintainer will review this issue and get back to you soon.

We also welcome community contributions! If you would like to pick this item up sooner and submit a pull request, please visit our contribution guidelines and assign this to yourself by commenting "/assign" on this issue.

For more information on our triage process please visit our triage overview

PR for this proposal - #7485

Can we apply this to samples repository too?

For the retries in each go test function, did you track the actual number of times it hit the retry path? Does it make sense to limit retries to functions that could be impacted by cloud service latency and not every function?

For the retries in each go test function, did you track the actual number of times it hit the retry path? Does it make sense to limit retries to functions that could be impacted by cloud service latency and not every function?

In the test, I set a maximum of 3 retries. Unfortunately, I didn't track those metrics.

My key concern with this approach is that it can hide actual bugs. So I would run the tests as-is, preserving the existing failure behavior in the action, but trigger the notification only when consecutive failures happen. This way, we keep the failures visible and can still detect the problem.

Can we apply this to samples repository too?

For the samples tests, I would first gather CPU/memory usage metrics to understand resource consumption. Then I would apply https://github.com/marketplace/actions/retry-step to the Playwright step to repeat the Playwright test. In other words, the second approach should work for the Playwright tests.

@youngbupark Could we do a lightweight experiment to emit a log entry on retry and have a short term manual effort to report on those entries? Then we would have the data to make a decision on a path forward.

👍 We've reviewed this issue and have agreed to add it to our backlog. Please subscribe to this issue for notifications, we'll provide updates when we pick it up.

We also welcome community contributions! If you would like to pick this item up sooner and submit a pull request, please visit our contribution guidelines and assign this to yourself by commenting "/assign" on this issue.

For more information on our triage process please visit our triage overview

@youngbupark Could we do a lightweight experiment to emit a log entry on retry and have a short term manual effort to report on those entries? Then we would have the data to make a decision on a path forward.

I do not think we need to add any changes for additional metrics because we already have this data. The metrics should be no different from the failed-action pattern in the functional test workflow, because a retry happens only when a test fails.

Looking at the workflow run history, consecutive failures rarely happen when the problem is an external dependency, e.g. network issues, cluster failures (including limited resources), or Docker registry throttling. cc/ @nicolejms @sylvainsf