Change Windows CI auto-retry to improve UX

Question


dagood opened this issue 10 months ago

During the last release, I hit a couple of issues:

- Test logs weren't uploaded when a timeout occurred.
- While a job is running, scrolling up in the currently running step doesn't work: every few seconds the view switches to the general job-init log and then back to the running log, which resets scroll position, search, and so on. (I believe this is a point-in-time bug, but I've seen this kind of behavior in other situations before.) A workaround is to cancel the job, but at that point I can't tell whether the job is simply slow or has already attempted the tests several times due to a non-flaky failure. I don't want to waste time by canceling a job that may succeed.

The first thing that comes to mind is reimplementing the retries as a sequence of pipeline steps. We do 5 retries right now, so we could have 5 "build" steps, each of which builds the repo if no earlier attempt has succeeded and is skipped otherwise.

Similarly, 5 "test" steps would run, and an "upload results" step could be added after each test step. This would also give us results sooner than we currently get them, which makes early investigation easier.
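To make the idea concrete, here is a minimal Python sketch that models the retry-as-steps behavior. It is not pipeline configuration and not how the current retry is implemented: `run_as_steps`, the `attempt` callback, and the "(attempt N)" step names are hypothetical, and the status strings are only illustrative labels.

```python
from typing import Callable, List, Tuple


def run_as_steps(
    name: str,
    attempt: Callable[[int], bool],
    max_attempts: int = 5,
) -> List[Tuple[str, str]]:
    """Model `max_attempts` separate pipeline steps for one phase.

    Each step runs only if no earlier step succeeded; once one succeeds,
    the remaining steps are reported as skipped.
    """
    results: List[Tuple[str, str]] = []
    done = False
    for i in range(1, max_attempts + 1):
        step = f"{name} (attempt {i})"  # hypothetical naming scheme
        if done:
            results.append((step, "skipped"))
            continue
        ok = attempt(i)  # run the real work for this attempt
        results.append((step, "succeeded" if ok else "failed"))
        done = ok
    return results


# Example: a build that only passes on its third attempt.
for step, status in run_as_steps("build", lambda i: i >= 3):
    print(f"{step}: {status}")
```

With separate steps, the per-attempt status is visible in the step list itself, so it would be clear at a glance whether a long-running job is slow or on its fourth attempt.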

A "complete" pipeline step has more stable logs, supports downloading raw logs, and so on, so splitting the work into steps should help in general.

This may also make it easier to set up tooling that tracks how much retrying happens in our builds, because retries would be visible at the AzDO API level rather than only by parsing logs.
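As a rough sketch of what such tooling could look like, the function below counts attempts from a list of step records. It assumes each record is a mapping with `name` and `result` fields, similar in shape to what a build timeline query might return, and that steps are named like "build (attempt 2)". Both the record shape and the naming scheme are assumptions for illustration, not something the current pipeline provides.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Mapping

# Hypothetical step naming scheme: "<phase> (attempt <n>)".
ATTEMPT = re.compile(r"^(?P<phase>\w+) \(attempt (?P<n>\d+)\)$")


def count_attempts(records: Iterable[Mapping]) -> Dict[str, int]:
    """Count, per phase, the attempt steps that finished without being skipped."""
    counts: Counter = Counter()
    for rec in records:
        match = ATTEMPT.match(rec.get("name") or "")
        # A missing result is treated as "not finished yet" and ignored.
        if match and rec.get("result") not in (None, "skipped"):
            counts[match.group("phase")] += 1
    return dict(counts)


# Example with hand-written records (not real pipeline output).
sample = [
    {"name": "build (attempt 1)", "result": "failed"},
    {"name": "build (attempt 2)", "result": "succeeded"},
    {"name": "build (attempt 3)", "result": "skipped"},
    {"name": "test (attempt 1)", "result": "succeeded"},
    {"name": "Checkout", "result": "succeeded"},
]
print(count_attempts(sample))
```

Anything above one attempt per phase would indicate that a retry happened, without any log parsing.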