kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle

Home Page: https://cluster-api.sigs.k8s.io


Report failures of periodic jobs to the cluster-api Slack channel

sbueringer opened this issue · comments

I noticed that CAPO is reporting periodic test failures to Slack, e.g.: https://kubernetes.slack.com/archives/CFKJB65G9/p1713540048571589

I think this is a great way to surface issues with CI (and folks can also directly start a thread from a Slack comment like this).

This could be configured ~ like this: https://github.com/kubernetes/test-infra/blob/5d7e1db75dce28537ba5f17476882869d1b94b0a/config/jobs/kubernetes-sigs/cluster-api-provider-openstack/cluster-api-provider-openstack-periodics.yaml#L48-L55
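For reference, a minimal sketch of what such a periodic job could look like with Slack reporting enabled, based on Prow's `reporter_config` mechanism (the job name, channel, and template text below are illustrative placeholders, not the actual CAPI or CAPO configuration):

```yaml
periodics:
- name: periodic-cluster-api-e2e-main   # hypothetical job name
  interval: 2h
  reporter_config:
    slack:
      # Channel to post failures to (illustrative value)
      channel: cluster-api
      # Only report terminal bad states, to keep noise down
      job_states_to_report:
      - failure
      - error
      # Go-template message; .Spec.Job, .Status.State and .Status.URL
      # are fields Prow exposes to the template
      report_template: 'Job {{.Spec.Job}} ended with state {{.Status.State}}: {{.Status.URL}}'
```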

What do you think?

This issue is currently awaiting triage.

CAPI contributors will take a look as soon as possible, apply one of the triage/* labels and provide further guidance.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Oh wow, yeah, that would be a great thing. I just fear that it may pollute the channel too much. But we could try it and fail fast: if it turns out to be too noisy, we can ask for feedback later on in the community meeting or via a Slack thread/poll.

do we know if this respects testgrid-num-failures-to-alert? If so it could be great.

I'm not sure if it respects that. We could try it and roll back if it doesn't?

If it still pollutes the channel too much after considering testgrid-num-failures-to-alert we have to focus more on CI :D

(I'm currently guessing that we would get one Slack message for every mail that we get today, but I don't know.)

One Slack message per mail would be perfect - more would disrupt the channel

WDYT about enabling it for CAPV first?

Also fine with making the change and rolling back if it doesn't work

One Slack message per mail would be perfect - more would disrupt the channel
WDYT about enabling it for CAPV first?

Fine for me. We can also ask the OpenStack folks how spammy it is for them today (cc @mdbooth @lentzi90)

For CAPO we get a Slack message for every failure, and an email only after 2 failures in a row. I think it has been tolerable for us, but that indicates it does not check testgrid-num-failures-to-alert (at least the way we have it configured)

Hm okay, every failure is just too much, so we should probably take a closer look at the configuration/implementation. One message for every failure just doesn't make sense for the number of tests/failures we have (the signal-to-noise ratio is just wrong)

+1 to testing this if we find a config that is reasonably noisy (but not too noisy)
cc @kubernetes-sigs/cluster-api-release-team

/priority backlog
/kind feature

+1 from my side too. Tagging CI lead @Sunnatillo
I will add this to improvement tasks for v1.8 cycle. CI team can look into this one.

Sounds great. I will take a look

I guess testgrid-num-failures-to-alert should help with the amount of noise. If we set it to, for example, 5, we can be sure that we will receive messages about constantly failing tests: the config would then send the alert only after 5 consecutive failures.
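A sketch of where that annotation would sit on a periodic job, using the standard TestGrid annotation keys (the job name, dashboard, and email address here are illustrative assumptions, not the real CAPI values):

```yaml
periodics:
- name: periodic-cluster-api-e2e-main   # hypothetical job name
  annotations:
    # Dashboard the job's results appear on (illustrative value)
    testgrid-dashboards: sig-cluster-lifecycle-cluster-api
    # Address that receives alert emails (illustrative value)
    testgrid-alert-email: capi-ci-alerts@example.com
    # Alert only after this many consecutive failures
    testgrid-num-failures-to-alert: "5"
```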

@Sunnatillo testgrid-num-failures-to-alert does not affect the slack messages for CAPO at least. Only emails are affected by that in my experience.

@Sunnatillo testgrid-num-failures-to-alert does not affect the slack messages for CAPO at least. Only emails are affected by that in my experience.

Thank you for the update. I will open an issue in test-infra and try to find a way to do it.

I opened an issue regarding this in test-infra:
kubernetes/test-infra#32687