MissingBatchTask failure due to an order-of-operations error in BatchScheduler.
BMurri opened this issue · comments
Describe the bug
We are seeing several reports of MissingBatchTask failures. They are caused by an incorrect assumption that nothing can fail while status details are being updated in the TesTask object in preparation for updating the task repository. When that update does fail, the batch task has already been removed, so the task is still "active" in the repository yet is no longer present in the batch job.
Steps to Reproduce
Run a batch of tasks in which some fail on the compute node.
Expected behavior
All reportable state associated with the TesTask is written to the repository, whether or not subsequent failures happen while the rest of that state is being filled in.
Batch tasks are not removed from the batch job until the TesTask has been placed into a terminal status.
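The ordering described above could be sketched roughly as follows. This is an illustrative Python model only, not the actual BatchScheduler code: the `TesTask` fields, `InMemoryRepository`, `FakeBatchJob`, `handle_failed_batch_task`, and the `collect_details` callback are all hypothetical names invented for this example. The point it tries to show is that the state is recorded first, a failure while gathering extra details is contained, the repository is updated next, and the batch task is deleted only once the task is in a terminal state.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Illustrative set of terminal states (assumption for this sketch).
TERMINAL_STATES = {"COMPLETE", "EXECUTOR_ERROR", "SYSTEM_ERROR", "CANCELED"}


@dataclass
class TesTask:
    """Hypothetical, simplified stand-in for the real TesTask object."""
    id: str
    state: str = "RUNNING"
    failure_reason: Optional[str] = None
    logs: List[str] = field(default_factory=list)


class InMemoryRepository:
    """Hypothetical task repository that records the last persisted state."""
    def __init__(self) -> None:
        self.saved = {}

    def update(self, task: TesTask) -> None:
        self.saved[task.id] = (task.state, task.failure_reason)


class FakeBatchJob:
    """Hypothetical batch job holding the ids of its batch tasks."""
    def __init__(self, task_ids) -> None:
        self.task_ids = set(task_ids)

    def delete_task(self, task_id: str) -> None:
        self.task_ids.discard(task_id)


def handle_failed_batch_task(
    task: TesTask,
    repo: InMemoryRepository,
    batch_job: FakeBatchJob,
    reason: str,
    collect_details: Callable[[TesTask], List[str]],
) -> None:
    # 1. Record the reportable state before anything that might fail.
    task.state = "EXECUTOR_ERROR"
    task.failure_reason = reason

    # 2. Gathering extra details may itself fail; contain that failure
    #    so it cannot prevent the repository update below.
    try:
        task.logs.extend(collect_details(task))
    except Exception as exc:
        task.logs.append(f"could not collect failure details: {exc}")

    # 3. Persist the task state to the repository.
    repo.update(task)

    # 4. Only now, and only for a terminal state, remove the batch task.
    if task.state in TERMINAL_STATES:
        batch_job.delete_task(task.id)
```

With this ordering, a `collect_details` callback that raises still leaves the repository showing a terminal state before the batch task disappears, so the repository and the batch job cannot disagree in the way described in this report.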
Additional context
This was discovered while evaluating #618