radius-project / radius

Radius is a cloud-native, portable application platform that makes app development easier for teams building cloud-native apps.

Home Page: https://radapp.io


[Proposal] Adjust failure criteria for functional test

youngbupark opened this issue · comments

Problem overview

The main goal of functional tests is to verify the end-to-end functionality and correctness of the system. Because our functional tests rely on external dependencies, intermittent failures are expected.

Functional test failures can be grouped into three primary categories:

  • External dependencies
  • Test framework issues
  • Bugs in Radius code

Based on previous observations, most failures are associated with external dependencies. However, issues in both the test framework and the Radius code are also significant because they directly impact test reliability. "Resource not found" is one of the known problems in the test framework category.

Experiments

To address intermittent failures, we experimented with two different strategies:

  • Adding a retry at each Go test function (see the sketch after this list).
  • Retrying the entire workflow step using make functional-tests.
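
Below is a minimal sketch of what the per-test retry in the first experiment could look like for a plain Go test. The `runWithRetry` helper, the test name, and the retry delay are illustrative assumptions, not the actual Radius test framework API.

```go
package functional

import (
	"testing"
	"time"
)

// runWithRetry re-runs fn up to maxAttempts times and fails the test only
// if every attempt fails. Each failed attempt is logged so the retry path
// can be audited later.
func runWithRetry(t *testing.T, maxAttempts int, fn func() error) {
	t.Helper()
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if lastErr = fn(); lastErr == nil {
			return
		}
		t.Logf("attempt %d/%d failed: %v", attempt, maxAttempts, lastErr)
		time.Sleep(10 * time.Second)
	}
	t.Fatalf("all %d attempts failed, last error: %v", maxAttempts, lastErr)
}

func Test_ContainerDeployment(t *testing.T) {
	runWithRetry(t, 3, func() error {
		// deploy the test application and validate its resources here
		return nil
	})
}
```

The maximum of three attempts matches the retry limit mentioned later in this thread.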

Alternative - adding API level retry in Radius service

Given that Radius calls the control plane APIs of each cloud provider, repeatedly invoking the CP APIs in some scenarios could severely impact ongoing services. Thus, adopting a fail-fast strategy is preferable to implementing retries.

Observations

Adding a retry at each Go test function

  • Over a two-week period, the failure rate was reduced to 0.5% (2 out of 362 tests); the two failures didn't stem from the functional tests themselves but were instead related to setup issues while configuring the cluster.
  • This approach significantly improved reliability but might hide underlying root causes within the functional tests.

Retrying the entire workflow step using make functional-tests

  • This method frequently failed because it reran the tests from the start without cleaning up resources in Kubernetes and Azure.
  • This approach can also hide underlying root causes within the functional tests.

Conclusion

While retries at the Go test function level improve apparent reliability, they can obscure systemic problems within the tests. Functional tests should be designed to readily expose underlying issues. This may seem counterproductive, but it is crucial for detecting bugs, such as race conditions, that could be missed during manual testing by contributors.

Proposal

The proposal aims to reduce failure noise while preserving early detection of real root causes, without hiding them behind excessive retries or alert noise, thereby supporting a more stable and reliable testing environment.

To better balance issue detection and noise reduction, I propose the following adjustments:

  • Only generate issues after at least two consecutive test failures (see the sketch after this list). This approach reduces noise while ensuring that recurring problems are addressed.
  • Reevaluate the alert triggering thresholds. Considering that no system achieves 100% SLO and that our actual SLO is a product of multiple dependent services' SLOs/SLAs, our current trigger for alerts appears overly aggressive. Allowing for a few failures could provide a more realistic measure of system robustness and service reliability.
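
As a rough illustration of the first point, the check below gates issue creation on consecutive failures. It assumes the recent workflow run conclusions are already available (for example, fetched from the GitHub API in the alerting workflow); the function and parameter names are hypothetical.

```go
package alerting

// shouldCreateIssue returns true only when the most recent `threshold`
// workflow runs (ordered newest first, e.g. ["failure", "failure", "success"])
// all concluded with "failure".
func shouldCreateIssue(recentConclusions []string, threshold int) bool {
	if threshold <= 0 || len(recentConclusions) < threshold {
		return false
	}
	for _, conclusion := range recentConclusions[:threshold] {
		if conclusion != "failure" {
			return false
		}
	}
	return true
}
```

With a threshold of 2, a single intermittent failure produces no issue, but two failures in a row do.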

AB#11792

👋 @youngbupark Thanks for filing this issue.

A project maintainer will review this issue and get back to you soon.

We also welcome community contributions! If you would like to pick this item up sooner and submit a pull request, please visit our contribution guidelines and assign this to yourself by commenting "/assign" on this issue.

For more information on our triage process please visit our triage overview

PR for this proposal - #7485

Can we apply this to samples repository too?

For the retries in each go test function, did you track the actual number of times it hit the retry path? Does it make sense to limit retries to functions that could be impacted by cloud service latency and not every function?

For the retries in each go test function, did you track the actual number of times it hit the retry path? Does it make sense to limit retries to functions that could be impacted by cloud service latency and not every function?

In the test, I set a maximum of 3 retries. Unfortunately, I didn't track those metrics.

My key concern with this approach is that it can hide actual bugs. So I would run the tests as-is, preserving the existing failure behavior in the action, but trigger the notification only when consecutive failures happen. This way, we keep the failures visible and can still detect the problem.

Can we apply this to samples repository too?

For the samples tests, I would first gather CPU/memory usage metrics to understand resource consumption. Then I would apply https://github.com/marketplace/actions/retry-step to the Playwright step to repeat the Playwright test. In other words, the second approach should work for the Playwright tests.

@youngbupark Could we do a lightweight experiment to emit a log entry on retry and have a short term manual effort to report on those entries? Then we would have the data to make a decision on a path forward.

👍 We've reviewed this issue and have agreed to add it to our backlog. Please subscribe to this issue for notifications, we'll provide updates when we pick it up.

We also welcome community contributions! If you would like to pick this item up sooner and submit a pull request, please visit our contribution guidelines and assign this to yourself by commenting "/assign" on this issue.

For more information on our triage process please visit our triage overview

@youngbupark Could we do a lightweight experiment to emit a log entry on retry and have a short term manual effort to report on those entries? Then we would have the data to make a decision on a path forward.

I do not think we need to add any changes for additional metrics because we already have this data. The metrics should be no different from the failed-action pattern in the functional test workflow, because a retry happens only when a test fails.

Looking at the workflow run history, consecutive failures rarely happen when the problem is an external dependency, e.g. network issues, cluster failures (including limited resources), or Docker registry throttling. cc/ @nicolejms @sylvainsf