Support for restart of a runner

Question

yorugac opened this issue 8 months ago

If a runner pod is restarted before test execution completes, the new runner does not receive a start signal from the starter and remains paused, while the operator waits indefinitely.

It's unclear how often this happens in real-life scenarios, but the possibility exists, and this is the expected behaviour at the moment.

What should the operator do in such a scenario? Some potential ideas:

1. Ignore the restarted pod and try to finish the test with n - 1 pods. Cons: this is hard to implement correctly (without breaking existing functionality), if it is possible at all.
2. Start the test on the restarted runner and let it finish. Cons: this makes the test run longer than estimated and will skew the results in many cases.
3. Fail the test after some timeout. This is the option most likely to be implemented, as a consequence of issue #222.

Another caveat is that more than one pod can restart. If a node group suffers a large failure, all runners could fail, in theory. In that case, the third option seems to make the most sense.

Opening this issue as a follow-up from #138. Feedback and thoughts are welcome.