Parsl / parsl

Parsl - a Python parallel scripting library

Home Page: http://parsl-project.org

Uncontrolled scaling behavior in response to scheduler blip mid-run

yadudoc opened this issue

Describe the bug

Erik Husby (@ehusby) and colleagues from the Arctic/EarthDEM team report that a Parsl run spammed the scheduler on Frontera with job submissions, and the admins later banned the accounts in response. Here are the highlights from the logs:

Parsl had 42 jobs running and processing tasks, followed by what I believe was a scheduler blip which threw off the rest of the run.

Log snippet showing 42 running blocks connected:

2023-06-09 16:13:12.017 parsl.dataflow.strategy:188 [DEBUG] Executor frontera_htex_s2s_normal has 3382 active tasks, 42/0 running/pending blocks, and 160 connected workers

Losing all of them in a few minutes:

2023-06-09 16:14:06.935 parsl.dataflow.strategy:188 [DEBUG] Executor frontera_htex_s2s_normal has 3382 active tasks, 42/0 running/pending blocks, and 160 connected workers
2023-06-09 16:14:06.935 parsl.dataflow.strategy:248 [DEBUG] Requesting 1 more blocks
2023-06-09 16:14:12.016 parsl.dataflow.strategy:188 [DEBUG] Executor frontera_htex_s2s_normal has 3382 active tasks, 0/0 running/pending blocks, and 160 connected workers
2023-06-09 16:14:12.016 parsl.dataflow.strategy:248 [DEBUG] Requesting 43 more blocks

This indicates that something broke the jobs, which is later reported:

2023-06-09 16:16:33.568 parsl.app.errors:126 [DEBUG] Reraising exception of type <class 'parsl.executors.high_throughput.interchange.ManagerLost'>

Several failed attempts to provision jobs:

2023-06-09 16:14:12.016 parsl.dataflow.strategy:248 [DEBUG] Requesting 43 more blocks
2023-06-09 16:14:16.935 parsl.dataflow.strategy:188 [DEBUG] Executor frontera_htex_s2s_normal has 3382 active tasks, 0/0 running/pending blocks, and 160 connected workers
2023-06-09 16:14:16.936 parsl.dataflow.strategy:248 [DEBUG] Requesting 43 more blocks
2023-06-09 16:14:21.936 parsl.dataflow.strategy:188 [DEBUG] Executor frontera_htex_s2s_normal has 3382 active tasks, 0/0 running/pending blocks, and 160 connected workers
2023-06-09 16:14:21.936 parsl.dataflow.strategy:248 [DEBUG] Requesting 43 more blocks
2023-06-09 16:14:26.936 parsl.dataflow.strategy:188 [DEBUG] Executor frontera_htex_s2s_normal has 3382 active tasks, 0/0 running/pending blocks, and 160 connected workers
2023-06-09 16:14:26.936 parsl.dataflow.strategy:248 [DEBUG] Requesting 43 more blocks

We can confirm these points:

  • We had 42 jobs running
  • Jobs had workers connected to the interchange over the network
  • Jobs died abruptly mid-run (I suspect the scheduler broke down)
  • Lost managers were detected via missing heartbeats
  • The strategy attempted to resubmit jobs, and those submissions failed repeatedly.
  • It is likely that the scheduler was borked for some period of time.

To Reproduce

I don't have a good way to simulate this failure.

Expected behavior

We expect Parsl's scaling logic to detect consecutive provisioning failures and shut down.

Distributed Environment

  • Observed on Frontera / Slurm
  • There are reports from @ryanchard of similar behavior on ALCF resources, although those were not necessarily failures mid-run.

Our default JobErrorHandler.simple_error_handler looks like the culprit in this case:

This function triggers the shutdown via set_bad_state_and_fail_all(..) only if the number of failed jobs equals the total number of jobs. If at least one job ran correctly and is not marked failed, this method will not trigger a shutdown.

    def simple_error_handler(self, executor: ParslExecutor, status: Dict[str, JobStatus], threshold: int):
        (total_jobs, failed_jobs) = self.count_jobs(status)
        if total_jobs >= threshold and failed_jobs == total_jobs:
            executor.set_bad_state_and_fail_all(self.get_error(status))
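
One possible direction, as a hypothetical sketch that reuses only the names visible in the snippet above (count_jobs, get_error, set_bad_state_and_fail_all) and is not an existing Parsl API: count consecutive status polls that report failed blocks, so a burst of failures trips the shutdown even when earlier blocks ran fine.

    def windowed_error_handler(self, executor, status, threshold: int):
        # Hypothetical: count consecutive polls that report at least one
        # failed block, instead of requiring the entire snapshot to be failed.
        (total_jobs, failed_jobs) = self.count_jobs(status)
        self._failed_polls = getattr(self, "_failed_polls", 0)
        if failed_jobs > 0:
            self._failed_polls += 1
        else:
            self._failed_polls = 0  # scheduler looks healthy again; reset
        if self._failed_polls >= threshold:
            executor.set_bad_state_and_fail_all(self.get_error(status))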

For synthetic reproductions I've sometimes subclassed or forked LocalProvider to give it the appropriate behaviour.
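
A rough sketch of that kind of fork, assuming the LocalProvider.submit(command, tasks_per_node, job_name) signature from recent Parsl releases; the class name and the fail_after knob are made up for illustration:

    from parsl.providers import LocalProvider


    class FlakyLocalProvider(LocalProvider):
        """LocalProvider that starts rejecting submissions after a few
        successes, to mimic a scheduler going away mid-run (illustrative
        only; not part of Parsl)."""

        def __init__(self, *args, fail_after: int = 5, **kwargs):
            super().__init__(*args, **kwargs)
            self._fail_after = fail_after
            self._submissions = 0

        def submit(self, command, tasks_per_node, job_name="parsl.flaky"):
            self._submissions += 1
            if self._submissions > self._fail_after:
                # Simulate the batch scheduler being unreachable.
                raise RuntimeError("simulated scheduler outage")
            return super().submit(command, tasks_per_node, job_name)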

Towards the end of the last century, exponential backoff in failure situations was all the rage, and that might be interesting here: rather than abandoning your workflow because the scheduler was not responding for 5h, slow down submission and potentially not start running again for 10h.

Jobs died abruptly mid-run (I suspect the scheduler broke down)

Do we have any indication in our framework of how many jobs have died recently? It seems like an opportunity to play nicely with the scheduler and the wider system if we recognize that a large number of jobs have recently failed: we could slow the exponential backoff down to a trickle, submitting one-off "test the waters" jobs to see whether the system is up and running again (a sketch of this loop follows the list below).

  1. Recognize lots of failures, trigger "trickle checker"/"test the waters" workflow
  2. Set wait_for_s = 1
  3. Set wait_for_s = min(300, wait_for_s * 2) # or some max "trickle-value"
  4. sleep(wait_for_s) (or the moral equivalent)
  5. Submit simple noop()
  6. If failure, go to step 3
  7. Success? Start submitting regular jobs again!
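
Roughly, as a standalone sketch of that loop (submit_probe is a placeholder for whatever cheap noop() submission the strategy would use; none of these names exist in Parsl today):

    import time


    def trickle_until_healthy(submit_probe, max_wait_s: float = 300.0) -> None:
        """Probe the scheduler with cheap test jobs, backing off exponentially
        between attempts, until one succeeds."""
        wait_for_s = 1.0
        while True:
            time.sleep(wait_for_s)
            if submit_probe():  # e.g. submit a trivial noop() and wait for it
                return  # scheduler looks healthy; resume regular submissions
            # Double the wait, capped at the maximum "trickle" rate.
            wait_for_s = min(max_wait_s, wait_for_s * 2)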

@khk-globus there's no exponential backoff. There should be, I think, as an alternative to "the batch system was down for 3 x 5 second periods, abandon hope".