Recover lost batches

Question

sleekweasel opened this issue 8 years ago

If a batch-run times out or otherwise fails to return results for some or all of the tests in the batch, we should consider re-queueing those tests. (Implies maintaining the currently running tests in the database.)

A test should not be re-queued endlessly - it's probably genuinely timing out or killing its agent. (Implies multiple queues.)

Re-queued tests are re-run in isolation, to separate bad tests from innocent batch-mates. (Implies workers know about secondary queuing.)

We should recover even (especially) if the agent is killed with extreme prejudice. (Implies worker-tracking.)

Workers should not terminate until the queue is empty and all workers are idle. (Implies coordination.)

Proposal:

Workers should use transactions (http://redis.io/topics/transactions) to pull 'n' tests off the primary queue (or only 1 from the requeue) and into their own set, and then run them. Once the run is finished, any tests from the primary queue that weren't executed for any reason are added to the requeue.
The worker-controller maintains a set listing each worker, removing a worker from the set when it terminates. If a worker terminates with tests still in its set, the worker-controller adds those tests to the requeue.
The worker-controller polls for an empty queue and requeue, and for all worker sets to be empty, whereupon the worker-controller puts a 'tests complete' marker in a controller set and workers terminate in response.
The various queues and sets have names based on that of the primary queue - e.g. queue, queue_requeue, queue_worker0, queue_control. The processes using them will ensure they are empty at start-up (in case of a previous catastrophic failure), but they should be naturally empty by the end of a normally completed run.
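
To make the flow above concrete, here is a rough in-memory sketch of the proposal. It is not a Redis implementation: `FakeStore` and its lock are a hypothetical stand-in for Redis lists, sets and a MULTI/EXEC transaction, and the function names (`claim_tests`, `finish_run`, `reap_worker`, `run_complete`) are made up for illustration. It also assumes workers drain the primary queue before touching the requeue, which the proposal does not spell out, and it leaves out the set that lists the workers.

```python
import threading
from collections import deque


class FakeStore:
    """Hypothetical in-memory stand-in for Redis lists and sets.

    Holding `lock` plays the role of a Redis transaction: everything done
    while it is held is atomic with respect to other workers.
    """

    def __init__(self):
        self.lock = threading.Lock()
        self.lists = {}  # name -> deque of test names
        self.sets = {}   # name -> set of test names

    def list_of(self, name):
        return self.lists.setdefault(name, deque())

    def set_of(self, name):
        return self.sets.setdefault(name, set())


def claim_tests(store, primary, worker_id, n):
    """Move up to n tests from the primary queue, or 1 from the requeue,
    into the worker's own set. Returns (tests, source)."""
    with store.lock:
        main = store.list_of(primary)
        requeue = store.list_of(primary + "_requeue")
        own = store.set_of(f"{primary}_worker{worker_id}")
        if main:  # assumption: primary queue is drained first
            claimed = [main.popleft() for _ in range(min(n, len(main)))]
            source = "primary"
        elif requeue:  # requeued tests run one at a time, in isolation
            claimed, source = [requeue.popleft()], "requeue"
        else:
            claimed, source = [], None
        own.update(claimed)
        return claimed, source


def finish_run(store, primary, worker_id, executed, source):
    """Empty the worker's set; primary-queue tests that did not execute
    go to the requeue. Tests that came from the requeue are not requeued."""
    with store.lock:
        own = store.set_of(f"{primary}_worker{worker_id}")
        if source == "primary":
            store.list_of(primary + "_requeue").extend(sorted(own - set(executed)))
        own.clear()


def reap_worker(store, primary, worker_id):
    """Controller side: a worker terminated, so requeue whatever it still held."""
    with store.lock:
        own = store.set_of(f"{primary}_worker{worker_id}")
        store.list_of(primary + "_requeue").extend(sorted(own))
        own.clear()


def run_complete(store, primary):
    """Controller poll: both queues empty and every worker set empty.
    When true, leave the 'tests complete' marker in the control set."""
    with store.lock:
        idle = all(not members for name, members in store.sets.items()
                   if name.startswith(primary + "_worker"))
        done = (idle and not store.list_of(primary)
                and not store.list_of(primary + "_requeue"))
        if done:
            store.set_of(primary + "_control").add("tests complete")
        return done


store = FakeStore()
store.list_of("queue").extend(["t1", "t2", "t3", "t4"])

batch, source = claim_tests(store, "queue", 0, 3)  # worker 0 takes t1..t3
finish_run(store, "queue", 0, executed=["t1", "t2"], source=source)  # t3 never ran

claim_tests(store, "queue", 1, 3)  # worker 1 takes t4, then is killed
reap_worker(store, "queue", 1)

print(list(store.list_of("queue_requeue")))  # ['t3', 't4']
print(run_complete(store, "queue"))          # False: the requeue is not drained yet
```

One thing the sketch makes visible: as written, a test that kills its worker while running from the requeue would be put back on the requeue by the controller, so bounding retries would need something extra that the proposal does not yet describe.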