microsoft / ga4gh-tes

C# implementation of the GA4GH TES API; provides distributed batch task execution on Microsoft Azure

MissingBatchTask failure due to an order-of-operations error in BatchScheduler.

BMurri opened this issue

Describe the bug
We are seeing several reports of MissingBatchTask failures. They are caused by an incorrect assumption in BatchScheduler: that updating the status details in the TesTask object, in preparation for updating the task repository, cannot fail. When that update does fail, the task remains "active" in the repository even though it is no longer present in the batch job.

Steps to Reproduce
Run a batch of tasks in which some fail on the compute node.

Expected behavior
All reportable state associated with the TesTask is written to the repository, whether or not subsequent failures occur while the remainder of that state is being determined.
Batch tasks are not removed from the batch job until the TesTask has been placed into a terminal status.
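
The ordering described by these two expectations can be sketched as follows. This is a minimal, language-neutral illustration in Python, not the project's C# code: `finalize_task`, `InMemoryRepository`, `InMemoryBatchJob`, the `collect_details` callback, and the simplified `TesTask` fields are all hypothetical names introduced here. The terminal state names are taken from the GA4GH TES specification.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple

# Terminal states as named in the GA4GH TES specification.
TERMINAL_STATES = {"COMPLETE", "EXECUTOR_ERROR", "SYSTEM_ERROR", "CANCELED"}


@dataclass
class TesTask:
    """Hypothetical, heavily simplified stand-in for the real TesTask."""
    id: str
    state: str = "RUNNING"
    failure_reason: Optional[str] = None
    system_logs: List[str] = field(default_factory=list)


class InMemoryRepository:
    """Hypothetical task repository that records the last persisted state."""
    def __init__(self) -> None:
        self.saved: Dict[str, Tuple[str, Optional[str], List[str]]] = {}

    def update(self, task: TesTask) -> None:
        self.saved[task.id] = (task.state, task.failure_reason, list(task.system_logs))


class InMemoryBatchJob:
    """Hypothetical batch job that tracks which batch tasks still exist."""
    def __init__(self, task_ids: List[str]) -> None:
        self.task_ids: Set[str] = set(task_ids)

    def delete_task(self, task_id: str) -> None:
        self.task_ids.discard(task_id)


def finalize_task(task: TesTask, new_state: str, failure_reason: Optional[str],
                  collect_details: Callable[[], List[str]],
                  repository: InMemoryRepository, batch_job: InMemoryBatchJob) -> None:
    # 1. Record the state that is already known before doing anything that can fail.
    task.state = new_state
    task.failure_reason = failure_reason

    # 2. Gathering extra detail is best effort: a failure here must not
    #    prevent the state recorded in step 1 from being persisted.
    try:
        task.system_logs.extend(collect_details())
    except Exception as exc:
        task.system_logs.append(f"Failed to collect details: {exc}")

    # 3. Persist. If this raises, the exception propagates and step 4 is skipped,
    #    so the batch task is still present for a later retry.
    repository.update(task)

    # 4. Remove the batch task only after a terminal state has been persisted.
    if task.state in TERMINAL_STATES:
        batch_job.delete_task(task.id)
```

With this ordering, a failure while collecting details still leaves the task in a terminal state in the repository, and a failure while writing to the repository leaves the batch task in place, so the two stores should not end up in the "active but missing" combination described above.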

Additional context
This was discovered while evaluating #618