kubeflow / katib

Automated Machine Learning on Kubernetes

Home Page: https://www.kubeflow.org/docs/components/katib

Update experiment instance status failed: the object has been modified

Antsypc opened this issue

/kind bug

What steps did you take and what happened:

I got an error when the experiment controller updated the experiment status.

{"level":"info","ts":"2024-03-04T01:39:38Z","logger":"experiment-controller","msg":"Update experiment instance status failed, reconciler requeued","Experiment":{"name":"a10702550312415232282375","namespace":"heros-user"},"err":"Operation cannot be fulfilled on experiments.kubeflow.org \"a10702550312415232282375\": the object has been modified; please apply your changes to the latest version and try again"}

What did you expect to happen:

The code for the experiment status update is shown below. I expected no error here, because the call only updates the status, even if the experiment object has been modified in the meantime. I'm not sure whether my understanding is correct.
https://github.com/kubeflow/katib/blob/master/pkg/controller.v1beta1/experiment/experiment_controller.go#L237

	if !equality.Semantic.DeepEqual(original.Status, instance.Status) {
		// assuming that only status change
		err = r.updateStatusHandler(instance)
		if err != nil {
			logger.Info("Update experiment instance status failed, reconciler requeued", "err", err)
			return reconcile.Result{
				Requeue: true,
			}, nil
		}
	}

Environment:

  • Katib version: v0.16
  • Kubernetes version: v1.25.13
  • OS: Linux 5.15.47-1.el7.x86_64 x86_64

Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍

@Antsypc Thank you for creating this issue!
This is intended behavior, not a bug. When the status update fails because of a conflict, the controller requeues the experiment and retries the reconcile.
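
For readers wondering why a status-only update can still conflict: in Kubernetes, the status write carries the object's resourceVersion just like any other update, so the API server rejects it if the stored object has changed since the controller read it. Below is a minimal sketch of that check and of the requeue path. It is a toy in-memory model, not Katib or client-go code; `Store`, `Experiment`, `UpdateSpec`, `UpdateStatus`, and `reconcile` are hypothetical names invented for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrConflict stands in for the API server's conflict response.
var ErrConflict = errors.New("the object has been modified; please apply your changes to the latest version and try again")

// Experiment is a toy object. Spec and Status share one ResourceVersion.
type Experiment struct {
	ResourceVersion int
	Spec            string
	Status          string
}

// Store is a hypothetical in-memory stand-in for the API server.
type Store struct {
	current Experiment
}

// Get returns a copy of the latest stored object.
func (s *Store) Get() Experiment { return s.current }

// UpdateSpec writes the spec and bumps the version if the caller's copy is current.
func (s *Store) UpdateSpec(obj Experiment) error {
	if obj.ResourceVersion != s.current.ResourceVersion {
		return ErrConflict
	}
	s.current.Spec = obj.Spec
	s.current.ResourceVersion++
	return nil
}

// UpdateStatus writes only the status, but still checks the version first.
func (s *Store) UpdateStatus(obj Experiment) error {
	if obj.ResourceVersion != s.current.ResourceVersion {
		return ErrConflict
	}
	s.current.Status = obj.Status
	s.current.ResourceVersion++
	return nil
}

// reconcile sets a new status on the given copy and tries to write it.
// It returns true when the work should be requeued, as in the snippet above.
func reconcile(s *Store, cached Experiment) (requeue bool) {
	cached.Status = "Running"
	if err := s.UpdateStatus(cached); err != nil {
		fmt.Println("status update failed, requeued:", err)
		return true
	}
	return false
}

func main() {
	s := &Store{current: Experiment{ResourceVersion: 1, Spec: "v1", Status: "Created"}}
	stale := s.Get() // the controller's view of the object

	// Another write lands before the controller's status update.
	other := s.Get()
	other.Spec = "v2"
	_ = s.UpdateSpec(other)

	// The first attempt uses the stale copy and conflicts.
	if reconcile(s, stale) {
		// The requeued reconcile reads the latest version and succeeds.
		reconcile(s, s.Get())
	}
	fmt.Printf("%+v\n", s.Get())
}
```

In this model the first `reconcile` call logs the same "object has been modified" message as in the issue, and the second call, working from a fresh read, ends with the status set to Running and the spec change preserved. The intervening write here is a spec edit, but any write that bumps the version, such as an earlier status write the controller has not yet observed, would presumably have the same effect.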

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.