Parsl / parsl

Parsl - a Python parallel scripting library

Home Page: http://parsl-project.org


Newly frequent WorkQueueTaskFailure in CI

benclifford opened this issue · comments

Describe the bug

I'm seeing this WorkQueueExecutor heisenbug happen in CI a lot recently; I'm not clear what has changed to make it happen more often - for example in https://github.com/Parsl/parsl/actions/runs/6518865549/job/17704749713

ERROR    parsl.dataflow.dflow:dflow.py:350 Task 207 failed after 0 retry attempts
Traceback (most recent call last):
  File "/home/runner/work/parsl/parsl/parsl/dataflow/dflow.py", line 301, in handle_exec_update
    res = self._unwrap_remote_exception_wrapper(future)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/parsl/parsl/parsl/dataflow/dflow.py", line 571, in _unwrap_remote_exception_wrapper
    result = future.result()
             ^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.5/x64/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.5/x64/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
parsl.executors.workqueue.errors.WorkQueueTaskFailure: ('work queue result: The result file was not transfered from the worker.\nThis usually means that there is a problem with the python setup,\nor the wrapper that executes the function.\nTrace:\n', FileNotFoundError(2, 'No such file or directory'))
INFO     parsl.dataflow.dflow:dflow.py:1390 Standard output for task 207 available at std.out

I don't have any immediate strong ideas about what is going on - I've had a little poke around but can't see anything that sticks out right away.

I've opened:

  • PR #2912 to try a newer cctools
  • draft PR #2910 to try to capture more FileNotFoundError information in the output - there is more in that FileNotFoundError (such as the actual filename) that isn't rendered by the error reporting above (the sketch below shows what extra detail is available)
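As a minimal illustration (my sketch, not the actual PR #2910 change): a FileNotFoundError carries errno, strerror and filename attributes, but the repr-style rendering in the traceback above only shows the args tuple, so the filename is lost.

```python
# Minimal illustration only - not the actual PR #2910 change.
try:
    open("/tmp/definitely-missing-file")
except FileNotFoundError as e:
    print(repr(e))   # FileNotFoundError(2, 'No such file or directory') - no filename
    print(str(e))    # [Errno 2] No such file or directory: '/tmp/definitely-missing-file'
    print(e.errno, e.strerror, e.filename)  # the pieces worth including in the report
```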

I haven't been successful in recreating this on my laptop. However, I have seen a related error on Perlmutter under certain high-load / high-concurrency conditions which is a bit more reproducible, so maybe I can debug from there.

cc @dthain

Maybe related, maybe not: I've also seen this in CI - it looks like it's something to do with staging files in, not out? See https://github.com/Parsl/parsl/actions/runs/6519478342/job/17706018626

E               parsl.executors.errors.BadStateException: Executor WorkQueueExecutor failed due to: Error 1:
E               	EXIT CODE: 139
E               	STDOUT: Found cores : 2
E               Launching worker: 1
E               work_queue_worker: creating workspace /tmp/worker-1001-5848
E               work_queue_worker: using 2 cores, 6932 MB memory, 18382 MB disk, 0 gpus
E               connected to manager fv-az201-276:9000 via local address 10.1.0.39:38854
E               
E               	STDERR: Network function: connection from ('127.0.0.1', 50818)
E               Network function: recieved event: {'fn_kwargs': {}, 'fn_args': ['map', 'function', 'result'], 'remote_task_exec_method': 'direct'}
E               Network function: connection from ('127.0.0.1', 50824)
E               Network function: recieved event: {'fn_kwargs': {}, 'fn_args': ['map', 'function', 'result'], 'remote_task_exec_method': 'direct'}
E               Network function: connection from ('127.0.0.1', 50828)
E               Network function: recieved event: {'fn_kwargs': {}, 'fn_args': ['map', 'function', 'result'], 'remote_task_exec_method': 'direct'}
E               Network function: connection from ('127.0.0.1', 50834)
E               Network function: recieved event: {'fn_kwargs': {}, 'fn_args': ['map', 'function', 'result'], 'remote_task_exec_method': 'direct'}
E               Network function: connection from ('127.0.0.1', 40740)
E               Network function: recieved event: {'fn_kwargs': {}, 'fn_args': ['map', 'function', 'result'], 'remote_task_exec_method': 'direct'}
E               Network function: connection from ('127.0.0.1', 40756)
E               Network function: recieved event: {'fn_
E               ...
E               ': 'direct'}
E               Network function: connection from ('127.0.0.1', 38228)
E               Network function: recieved event: {'fn_kwargs': {}, 'fn_args': ['map', 'function', 'result'], 'remote_task_exec_method': 'direct'}
E               Network function: connection from ('127.0.0.1', 38236)
E               Network function: recieved event: {'fn_kwargs': {}, 'fn_args': ['map', 'function', 'result'], 'remote_task_exec_method': 'direct'}
E               Network function encountered exception  [Errno 2] No such file or directory: 't.271'
E               Traceback (most recent call last):
E                 File "/opt/hostedtoolcache/Python/3.8.18/x64/bin/parsl_coprocess.py", line 141, in <module>
E                   main()
E                 File "/opt/hostedtoolcache/Python/3.8.18/x64/bin/parsl_coprocess.py", line 69, in main
E                   task_id = int(input_spec[1])
E               IndexError: list index out of range
E               /home/runner/work/parsl/parsl/runinfo/003/submit_scripts/parsl.WorkQueueExecutor.block-0.1697310662.3360648.sh: line 10:  5848 Segmentation fault      (core dumped) PARSL_WORKER_BLOCK_ID=0 work_queue_worker --coprocess parsl_coprocess.py fv-az201-276 9000

I've tried my DESC development branch of parsl with ndcctools 7.7.0 and still experience sporadic FileNotFound errors as reported in the main body of this issue.

So that error is almost certainly coming from this line, where the coprocess attempts to chdir to the task directory (t.271) corresponding to the function-call task:
https://github.com/cooperative-computing-lab/cctools/blob/master/poncho/src/poncho/wq_network_code.py#L75

Now, it's hard for me to imagine that the directory does not really exist, because the worker creates it before sending the function to the coprocess. But it would be wise for the coprocess to check this and send back an error message.

But I think the problem is really that the coprocess doesn't do the complementary chdir("..") on all exit paths. For example, if the coprocess catches an exception, it skips the chdir("..") on the way out. So I think we need a more idempotent approach that always returns to the same absolute directory each time through the loop.
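A rough sketch of that idempotent approach (my illustration, not the actual cctools patch): record an absolute base directory once, check that the task directory exists before entering it, and always chdir back in a finally block so an exception cannot leave the coprocess stranded in the wrong directory for the next task.

```python
import os

BASE_DIR = os.getcwd()  # absolute directory to return to on every iteration

def run_in_task_dir(task_dir, fn):
    # task_dir is e.g. "t.271", created by the worker before the call arrives
    if not os.path.isdir(task_dir):
        # fail loudly with a useful message rather than a bare FileNotFoundError
        raise RuntimeError(f"task directory {task_dir!r} does not exist")
    os.chdir(task_dir)
    try:
        return fn()
    finally:
        # return to an absolute path rather than "..", so an exception (or a
        # task that itself changed directory) cannot skip the way back out
        os.chdir(BASE_DIR)
```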

@tphung3 what do you think?

@benclifford I just merged a fix to the chdir error (see cooperative-computing-lab/cctools#3542), what's the quickest way to see if it works?

@tphung3 if you have a URL for a binary of cctools (from anywhere - it doesn't need to be an official release), it should hopefully be easy to make a branch of parsl, edit the install path for ndcctools, work around the dependency problem mentioned elsewhere, and see what happens.

On the desc parsl branch, I'm still seeing some segfaults and other work queue problems, for example here:

https://github.com/Parsl/parsl/actions/runs/6668296438/job/18123571251?pr=2012#step:6:9134

I don't have a feel for whether this is something breaking in the branch-specific parsl functionality, which then breaks things in WQ, or something else going on - so I'm just noting that error here for now.

It looks like this test is running cctools 7.7.1, but the fix for that segfault is in 7.7.2:
https://github.com/cooperative-computing-lab/cctools/releases/tag/release%2F7.7.2

ok, easy to bump that branch up by 0.0.1 - I'll do that now

I'm still seeing this in the desc branch of parsl in CI sometimes:

Network function: connection from ('127.0.0.1', 60014)
Network function: recieved event: {'fn_kwargs': {}, 'fn_args': ['map', 'function', 'result', 'log'], 'remote_task_exec_method': 'direct'}
Network function: connection from ('127.0.0.1', 60020)
Network function: recieved event: {'fn_kwargs': {}, 'fn_args': ['map', 'function', 'result', 'log'], 'remote_task_exec_method': 'direct'}
Network function encountered exception  [Errno 2] No such file or directory: 't.107'
Traceback (most recent call last):
  File "/home/runner/work/parsl/parsl/.venv/bin/parsl_coprocess.py", line 135, in <module>
    main()
  File "/home/runner/work/parsl/parsl/.venv/bin/parsl_coprocess.py", line 68, in main
    task_id = int(input_spec[1])
IndexError: list index out of range
/home/runner/work/parsl/parsl/runinfo/003/submit_scripts/parsl.WorkQueueExecutor.block-0.1699912866.767312.sh: line 10:  6144 Segmentation fault      (core dumped) PARSL_WORKER_BLOCK_ID=0 work_queue_worker --coprocess parsl_coprocess.py fv-az340-503 9000

https://github.com/Parsl/parsl/actions/runs/6854767971/job/18642922623?pr=2012#step:7:1883

This is with CCTOOLS_VERSION=7.7.2

Hmm, that is surprising -- @tphung3 will look into it.
We are at Supercomputing in Denver this week, so this may be a bit delayed.

OK, I think we see where the problem is; let me bring in @colinthomas-z80, who is going to sort things out.

It appears this was fixed in the cctools library code but the fix didn't get moved over here. See the PR above.

Would it be feasible to include the generation of parsl_coprocess.py somewhere in the build process?

I would like that. I don't know enough about Python build/install machinery to know how to do it, but some packages manage to compile C etc. as part of their build, so I'd guess it's possible.

Let's do this generation at runtime by running poncho_package_serverize appropriately, which is what we do in native TaskVine applications.
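For illustration, one hypothetical shape of that runtime generation; the arguments passed to poncho_package_serverize below are placeholders, not the real invocation, which should be taken from the cctools documentation or source.

```python
import os
import subprocess

def ensure_coprocess_script(run_dir):
    """Generate parsl_coprocess.py at executor start-up rather than shipping
    a pre-generated copy with the parsl package (hypothetical sketch)."""
    dest = os.path.join(run_dir, "parsl_coprocess.py")
    if not os.path.exists(dest):
        # Placeholder arguments: consult poncho_package_serverize itself for
        # the actual command line it expects.
        subprocess.run(["poncho_package_serverize", dest], check=True)
    return dest
```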