kubeflow / katib

Automated Machine Learning on Kubernetes

Home Page: https://www.kubeflow.org/docs/components/katib

Flaky Test: Trial status is succeeded and metrics are properly populated

tenzen-y opened this issue · comments

/kind bug

What steps did you take and what happened:
Flaky Test: "Expect that Trial status is succeeded and metrics are properly populated Metrics available because GetTrialObservationLog returns values":

// Expect that Trial status is succeeded and metrics are properly populated
// Metrics available because GetTrialObservationLog returns values
g.Eventually(func() bool {
	if err = c.Get(ctx, trialKey, trial); err != nil {
		return false
	}
	return trial.IsSucceeded() &&
		len(trial.Status.Observation.Metrics) > 0 &&
		trial.Status.Observation.Metrics[0].Min == "0.11" &&
		trial.Status.Observation.Metrics[0].Max == "0.99" &&
		trial.Status.Observation.Metrics[0].Latest == "0.11"
}, timeout).Should(gomega.BeTrue())

--- FAIL: TestReconcileBatchJob (82.19s)
    trial_controller_test.go:274: 
        Timed out after 80.001s.
        Expected
            <bool>: false
        to be true
FAIL
	github.com/kubeflow/katib/pkg/controller.v1beta1/trial	coverage: 83.6% of statements

https://github.com/kubeflow/katib/actions/runs/8125174959/job/22207477654?pr=2267#step:4:106

What did you expect to happen:
No errors occur.

Anything else you would like to add:

Environment:

  • Katib version (check the Katib controller image version):
  • Kubernetes version: (kubectl version):
  • OS (uname -a):

Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍

May I work on this bug?
/assign

I think simply raising the timeout threshold from 80 to 100 should fix it.

I don't think so. We already applied a similar approach, and the issue still remains.

This seems like an interesting issue. I'll investigate and try working on it.