kubeflow / katib

/kind bug

What steps did you take and what happened:
Flaky Test: "Expect that Trial status is succeeded and metrics are properly populated Metrics available because GetTrialObservationLog returns values":

katib/pkg/controller.v1beta1/trial/trial_controller_test.go

Lines 263 to 274 in 8df3c5c

    
    // Expect that Trial status is succeeded and metrics are properly populated
    // Metrics available because GetTrialObservationLog returns values
    g.Eventually(func() bool {
    	if err = c.Get(ctx, trialKey, trial); err != nil {
    		return false
    	}
    	return trial.IsSucceeded() &&
    		len(trial.Status.Observation.Metrics) > 0 &&
    		trial.Status.Observation.Metrics[0].Min == "0.11" &&
    		trial.Status.Observation.Metrics[0].Max == "0.99" &&
    		trial.Status.Observation.Metrics[0].Latest == "0.11"
    }, timeout).Should(gomega.BeTrue())

--- FAIL: TestReconcileBatchJob (82.19s)
    trial_controller_test.go:274: 
        Timed out after 80.001s.
        Expected
            <bool>: false
        to be true
FAIL
	github.com/kubeflow/katib/pkg/controller.v1beta1/trial	coverage: 83.6% of statements

https://github.com/kubeflow/katib/actions/runs/8125174959/job/22207477654?pr=2267#step:4:106

What did you expect to happen:
No errors occur.

Anything else you would like to add:

Environment:

Katib version (check the Katib controller image version):
Kubernetes version: (kubectl version):
OS (uname -a):

Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍

May I work on this bug?
/assign

I think it should work if we just raise the timeout threshold from 80 to 100 seconds.

I don't think so. We already applied a similar approach, and this issue still remains.

This seems like an interesting issue. I'll investigate and try working on it.

	// Expect that Trial status is succeeded and metrics are properly populated
	// Metrics available because GetTrialObservationLog returns values
	g.Eventually(func() bool {
		if err = c.Get(ctx, trialKey, trial); err != nil {
			return false
		}
		return trial.IsSucceeded() &&
			len(trial.Status.Observation.Metrics) > 0 &&
			trial.Status.Observation.Metrics[0].Min == "0.11" &&
			trial.Status.Observation.Metrics[0].Max == "0.99" &&
			trial.Status.Observation.Metrics[0].Latest == "0.11"
	}, timeout).Should(gomega.BeTrue())

Flaky Test: Trial status is succeeded and metrics are properly populated