Parsl / parsl

Parsl - a Python parallel scripting library

Home Page: http://parsl-project.org

Uncontrolled scaling behavior in response to scheduler blip mid-run

yadudoc opened this issue

Describe the bug

Erik Husby (@ehusby) and colleagues from the Arctic/EarthDEM team report that a Parsl run spammed the scheduler on Frontera with job submissions, and the admins later banned the accounts in response. Here are the highlights from the logs:

Parsl had 42 jobs running and processing tasks, followed by what I believe was a scheduler blip which threw off the rest of the run.

Log snippet showing 42 running blocks connected:

2023-06-09 16:13:12.017 parsl.dataflow.strategy:188 [DEBUG] Executor frontera_htex_s2s_normal has 3382 active tasks, 42/0 running/pending blocks, and 160 connected workers

Losing all of them in a few minutes:

2023-06-09 16:14:06.935 parsl.dataflow.strategy:188 [DEBUG] Executor frontera_htex_s2s_normal has 3382 active tasks, 42/0 running/pending blocks, and 160 connected workers
2023-06-09 16:14:06.935 parsl.dataflow.strategy:248 [DEBUG] Requesting 1 more blocks
2023-06-09 16:14:12.016 parsl.dataflow.strategy:188 [DEBUG] Executor frontera_htex_s2s_normal has 3382 active tasks, 0/0 running/pending blocks, and 160 connected workers
2023-06-09 16:14:12.016 parsl.dataflow.strategy:248 [DEBUG] Requesting 43 more blocks

This indicates that something broke the jobs, which is later reported:

2023-06-09 16:16:33.568 parsl.app.errors:126 [DEBUG] Reraising exception of type <class 'parsl.executors.high_throughput.interchange.ManagerLost'>

Several failed attempts to provision jobs:

2023-06-09 16:14:12.016 parsl.dataflow.strategy:248 [DEBUG] Requesting 43 more blocks
2023-06-09 16:14:16.935 parsl.dataflow.strategy:188 [DEBUG] Executor frontera_htex_s2s_normal has 3382 active tasks, 0/0 running/pending blocks, and 160 connected workers
2023-06-09 16:14:16.936 parsl.dataflow.strategy:248 [DEBUG] Requesting 43 more blocks
2023-06-09 16:14:21.936 parsl.dataflow.strategy:188 [DEBUG] Executor frontera_htex_s2s_normal has 3382 active tasks, 0/0 running/pending blocks, and 160 connected workers
2023-06-09 16:14:21.936 parsl.dataflow.strategy:248 [DEBUG] Requesting 43 more blocks
2023-06-09 16:14:26.936 parsl.dataflow.strategy:188 [DEBUG] Executor frontera_htex_s2s_normal has 3382 active tasks, 0/0 running/pending blocks, and 160 connected workers
2023-06-09 16:14:26.936 parsl.dataflow.strategy:248 [DEBUG] Requesting 43 more blocks

We can confirm these points:

  • We had 42 jobs running
  • Jobs had workers connected to the interchange over the network
  • Jobs died abruptly mid-run (I suspect the scheduler broke down)
  • Lost managers were detected via missing heartbeats
  • The strategy attempted to resubmit jobs, and those submissions failed repeatedly.
  • It is likely that the scheduler was borked for some period of time.

To Reproduce

I don't have a good way to simulate this failure.

Expected behavior

We expect Parsl's scaling logic to detect consecutive provisioning failures and shut down.

Distributed Environment

  • Observed on Frontera / Slurm
  • There are reports from @ryanchard of similar behavior on ALCF resources, although those were not necessarily failures mid-run.

Our default JobErrorHandler.simple_error_handler looks like the culprit in this case:

This function triggers the shutdown via set_bad_state_and_fail_all(..) only if the number of failed jobs equals the total number of jobs. If at least one job ran correctly and is not marked failed, this method will not trigger a shutdown.

    def simple_error_handler(self, executor: ParslExecutor, status: Dict[str, JobStatus], threshold: int):
        (total_jobs, failed_jobs) = self.count_jobs(status)
        if total_jobs >= threshold and failed_jobs == total_jobs:
            executor.set_bad_state_and_fail_all(self.get_error(status))
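
One possible direction, as a hypothetical sketch that reuses only the names visible in the snippet above (count_jobs, get_error, set_bad_state_and_fail_all) and is not an existing Parsl API: count consecutive status polls that report failed blocks, so a burst of failures trips the shutdown even when earlier blocks ran fine.

    def windowed_error_handler(self, executor, status, threshold: int):
        # Hypothetical: count consecutive polls that report at least one
        # failed block, instead of requiring the entire snapshot to be failed.
        (total_jobs, failed_jobs) = self.count_jobs(status)
        self._failed_polls = getattr(self, "_failed_polls", 0)
        if failed_jobs > 0:
            self._failed_polls += 1
        else:
            self._failed_polls = 0  # scheduler looks healthy again; reset
        if self._failed_polls >= threshold:
            executor.set_bad_state_and_fail_all(self.get_error(status))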

For synthetic reproductions I've sometimes subclassed or forked LocalProvider to give it the appropriate behaviour.
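
A rough sketch of that kind of fork, assuming the LocalProvider.submit(command, tasks_per_node, job_name) signature from recent Parsl releases; the class name and the fail_after knob are made up for illustration:

    from parsl.providers import LocalProvider


    class FlakyLocalProvider(LocalProvider):
        """LocalProvider that starts rejecting submissions after a few
        successes, to mimic a scheduler going away mid-run (illustrative
        only; not part of Parsl)."""

        def __init__(self, *args, fail_after: int = 5, **kwargs):
            super().__init__(*args, **kwargs)
            self._fail_after = fail_after
            self._submissions = 0

        def submit(self, command, tasks_per_node, job_name="parsl.flaky"):
            self._submissions += 1
            if self._submissions > self._fail_after:
                # Simulate the batch scheduler being unreachable.
                raise RuntimeError("simulated scheduler outage")
            return super().submit(command, tasks_per_node, job_name)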

Towards the end of the last century, exponential backoff in failure situations was all the rage, and that might be interesting here: rather than abandoning your workflow because the scheduler was not responding for 5h, slow down submission and potentially not start running again for 10h.

Jobs died abruptly mid-run (I suspect the scheduler broke down)

Do we have any indication in our framework of how many jobs have died recently? It seems like an opportunity to play nicely with the scheduler and the wider system if we recognize that a large number of jobs have recently failed: we could slow the exponential backoff down to a trickle, submitting one-off "test the waters" jobs to see whether the system is up and running again (a sketch of this loop follows the list below).

  1. Recognize lots of failures, trigger "trickle checker"/"test the waters" workflow
  2. Set wait_for_s = 1
  3. Set wait_for_s = min(300, wait_for_s * 2) # or some max "trickle-value"
  4. sleep(wait_for_s) (or the moral equivalent)
  5. Submit simple noop()
  6. If failure, go to step 3
  7. Success? Start submitting regular jobs again!
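
Roughly, as a standalone sketch of that loop (submit_probe is a placeholder for whatever cheap noop() submission the strategy would use; none of these names exist in Parsl today):

    import time


    def trickle_until_healthy(submit_probe, max_wait_s: float = 300.0) -> None:
        """Probe the scheduler with cheap test jobs, backing off exponentially
        between attempts, until one succeeds."""
        wait_for_s = 1.0
        while True:
            time.sleep(wait_for_s)
            if submit_probe():  # e.g. submit a trivial noop() and wait for it
                return  # scheduler looks healthy; resume regular submissions
            # Double the wait, capped at the maximum "trickle" rate.
            wait_for_s = min(max_wait_s, wait_for_s * 2)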

@khk-globus there's no exponential backoff. There should be, I think, as an alternative to "the batch system was down for 3 x 5 second periods, abandon hope".