Parsl / parsl

Parsl - a Python parallel scripting library

Home Page: http://parsl-project.org

Hanging batch system commands prevent further scaling and cause shutdown hangs

benclifford opened this issue

Describe the bug
When a batch system command hangs, it blocks the JobStatusPoller thread for as long as that command is hung. If the command hangs forever, as seems to happen on some batch systems, the thread is blocked forever.

This prevents anything that relies on up-to-date status information, such as further scaling, from happening, and it causes workflow shutdown to hang for as long as the command is hung.
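The blocking behaviour described above can be illustrated with a toy model. This is a minimal sketch, not Parsl's actual code: `ToyPoller`, `slow_status` and `measure_close_delay` are hypothetical names, and a 0.5 second sleep stands in for a hung batch system command.

```python
import threading
import time


class ToyPoller:
    """Simplified stand-in for a status poller thread (not Parsl's actual class)."""

    def __init__(self, status_fn, interval=0.05):
        self._status_fn = status_fn
        self._interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def _loop(self):
        while not self._stop.is_set():
            # While status_fn is blocked, the loop cannot notice a stop request.
            self._status_fn()
            self._stop.wait(self._interval)

    def close(self):
        self._stop.set()
        # join() has to wait for the in-flight status_fn call to return.
        self._thread.join()


def slow_status():
    time.sleep(0.5)  # stands in for a hung batch system command


def measure_close_delay():
    poller = ToyPoller(slow_status)
    time.sleep(0.1)  # give the poller time to enter slow_status
    start = time.monotonic()
    poller.close()
    return time.monotonic() - start


if __name__ == "__main__":
    print(f"close() blocked for {measure_close_delay():.2f}s")
```

In this model `close()` cannot return until the status call returns, so a status call that never returns would make shutdown hang forever.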

This comes from an investigation at IN2P3 of workflows that occasionally hang at exit.

To Reproduce

Apply this patch, which adds a 300-second sleep to LocalProvider:

--- a/parsl/providers/local/local.py
+++ b/parsl/providers/local/local.py
@@ -71,6 +71,7 @@ class LocalProvider(ExecutionProvider, RepresentationMixin):
             - List of status codes.
 
         '''
+        time.sleep(300)
 
         for job_id in self.resources:
             # This job dict should really be a class on its own

Run some tests:

pytest parsl/tests/ --config parsl/tests/configs/htex_local.py  -k 'not cleannet'

Observe that the test suite hangs at the end for around 300 seconds.

The logs show a delay like this:

1690550387.554728 2023-07-28 13:19:47 MainProcess-8351 MainThread-140645426521920 parsl.dataflow.dflow:1193 cleanup INFO: Closing job status poller
1690550681.530954 2023-07-28 13:24:41 MainProcess-8351 JobStatusPoller-Timer-Thread-140644975995408-140644973938368 parsl.jobs.strategy:142 _strategy_noop DEBUG: strategy_noop: doing nothing
1690550681.531402 2023-07-28 13:24:41 MainProcess-8351 MainThread-140645426521920 parsl.dataflow.dflow:1195 cleanup INFO: Terminated job status poller
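The length of the stall can be read off the epoch timestamps in the first field of the "Closing job status poller" and "Terminated job status poller" lines. A small sketch of that arithmetic (the variable names are only illustrative):

```python
# Epoch timestamps copied from the first field of the log lines above.
closing = 1690550387.554728     # "Closing job status poller"
terminated = 1690550681.531402  # "Terminated job status poller"

delay = terminated - closing
print(f"shutdown blocked for {delay:.1f} s")  # prints "shutdown blocked for 294.0 s"
```

That is roughly 294 seconds, consistent with the injected 300-second sleep having already been in progress when cleanup started.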

Expected behavior
Status polling and workflow shutdown shouldn't hang forever if a batch system command hangs forever.

Environment
parsl 2023.07.24, commit 41357c6
my laptop

Changed my mind: all of these batch system commands should be protected by cmd_timeout.
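As a rough illustration of what timeout protection for an external command can look like, here is a minimal sketch using only the Python standard library. It is not Parsl's implementation: `run_with_timeout` and `demo` are hypothetical names, the `cmd_timeout` argument only borrows the name used above, and a sleeping child process stands in for a hung batch system command.

```python
import subprocess
import sys


def run_with_timeout(argv, cmd_timeout):
    """Hypothetical helper: run a command, giving up after cmd_timeout seconds.

    Returns (returncode, stdout, stderr), or None if the command timed out.
    """
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=cmd_timeout
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return None
    return proc.returncode, proc.stdout, proc.stderr


def demo():
    # A child that sleeps for 5 seconds stands in for a hung command.
    hung = [sys.executable, "-c", "import time; time.sleep(5)"]
    quick = [sys.executable, "-c", "print('ok')"]
    return (
        run_with_timeout(hung, cmd_timeout=0.5),
        run_with_timeout(quick, cmd_timeout=10),
    )


if __name__ == "__main__":
    print(demo())
```

With this shape, a caller can treat a `None` result as "status unknown for now" and carry on, rather than blocking the polling thread. Commands run through a shell that spawn further child processes may need extra care to terminate fully.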