possible race condition at startup of timeout test

Question

possible race condition at startup of timeout test

benclifford opened this issue a year ago · comments

Describe the bug
I saw this one off failure in CI - what I think happens is in very fast cases, the app timeout is firing "immediately", still inside the time thread starting code, and this is breaking things.

tag @khk-globus who touched this last.

PYTHONPATH=/tmp/cctools/lib/python3.8/site-packages pytest parsl/tests/ -k "not cleannet and not issue363" --config parsl/tests/configs/workqueue_ex.py --random-order --durations 10
============================= test session starts ==============================
platform linux -- Python 3.11.4, pytest-7.4.0, pluggy-1.2.0
Using --random-order-bucket=module
Using --random-order-seed=977312

rootdir: /home/runner/work/parsl/parsl/parsl/tests
configfile: pytest.ini
plugins: random-order-1.1.0, typeguard-4.0.0, cov-4.1.0
collected 268 items / 30 deselected / 2 skipped / 238 selected

parsl/tests/test_error_handling/test_retry_handler.py s parsl/tests/test_flux.py ssssssssss parsl/tests/test_error_handling/test_htex_missing_worker.py s parsl/tests/test_python_apps/test_memoize_joinapp.py . parsl/tests/test_python_apps/test_futures.py .... parsl/tests/test_python_apps/test_type5.py .. parsl/tests/test_checkpointing/test_python_checkpoint_2.py s parsl/tests/test_staging/test_docs_2.py s parsl/tests/test_docs/test_workflow3.py . parsl/tests/test_docs/test_from_slides.py . parsl/tests/test_checkpointing/test_regression_232.py ss parsl/tests/site_tests/test_provider.py s parsl/tests/test_docs/test_workflow2.py s parsl/tests/test_bash_apps/test_keyword_overlaps.py .. parsl/tests/test_python_apps/test_simple.py .. parsl/tests/test_error_handling/test_wrap_with_logs.py ss parsl/tests/test_staging/test_elaborate_noop_file.py ss parsl/tests/sites/test_affinity.py s parsl/tests/test_thread_parallelism.py ss parsl/tests/test_staging/test_staging_https_in_task.py s parsl/tests/test_regression/test_1606_wait_for_current_tasks.py parsl/tests/test_python_apps/test_memoize_bad_id_for_memo.py parsl/tests/test_error_handling/test_resource_spec.py . parsl/tests/test_regression/test_854.py ss parsl/tests/test_python_apps/test_pipeline.py .... parsl/tests/test_serialization/test_2555_caching_d parsl/tests/test_staging/test_1316.py ss parsl/tests/test_python_apps/test_overview.py . parsl/tests/test_scaling/test_scale_down.py s parsl/tests/test_python_apps/test_join.py ......... parsl/tests/test_python_apps/test_timeout.py [ 0%]
[ 4%]
[ 5%]
[ 5%]
[ 7%]
[ 7%]
[ 8%]
[ 8%]
[ 9%]
[ 9%]
[ 10%]
[ 10%]
[ 11%]
[ 12%]
[ 13%]
[ 13%]
[ 14%]
[ 15%]
[ 15%]
[ 16%]
s [ 16%]
.... [ 18%]
[ 18%]
[ 19%]
[ 21%]
eserializer.py s [ 21%]
[ 22%]
[ 23%]
[ 23%]
[ 27%]
.....F

=================================== FAILURES ===================================
_____________________________ test_timeout[0.005] ______________________________

timeout_dur = 0.005

@pytest.mark.parametrize("timeout_dur", (0.005, 0.01, 0.1))
def test_timeout(timeout_dur):

  timeout(kernel, timeout_dur)()

parsl/tests/test_python_apps/test_timeout.py:20:

parsl/app/python.py:31: in wrapper
with AutoCancelTimer(seconds, inject_exception, args=[thread]):
parsl/utils.py:371: in enter
self.start()
/opt/hostedtoolcache/Python/3.11.4/x64/lib/python3.11/threading.py:962: in start
self._started.wait()
/opt/hostedtoolcache/Python/3.11.4/x64/lib/python3.11/threading.py:622: in wait
signaled = self._cond.wait(timeout)

self = <Condition(<unlocked _thread.lock object at 0x7efcb0cad180>, 0)>
timeout = None

def wait(self, timeout=None):
    """Wait until notified or until a timeout occurs.

    If the calling thread has not acquired the lock when this method is
    called, a RuntimeError is raised.

    This method releases the underlying lock, and then blocks until it is
    awakened by a notify() or notify_all() call for the same condition
    variable in another thread, or until the optional timeout occurs. Once
    awakened or timed out, it re-acquires the lock and returns.

    When the timeout argument is present and not None, it should be a
    floating point number specifying a timeout for the operation in seconds
    (or fractions thereof).

    When the underlying lock is an RLock, it is not released using its
    release() method, since this may not actually unlock the lock when it
    was acquired multiple times recursively. Instead, an internal interface
    of the RLock class is used, which really unlocks it even when it has
    been recursively acquired several times. Another internal interface is
    then used to restore the recursion level when the lock is reacquired.

    """
    if not self._is_owned():
        raise RuntimeError("cannot wait on un-acquired lock")
    waiter = _allocate_lock()
    waiter.acquire()
    self._waiters.append(waiter)
    saved_state = self._release_save()
    gotit = False
    try:    # restore state no matter what (e.g., KeyboardInterrupt)
        if timeout is None:

          waiter.acquire()

E parsl.app.errors.AppTimeout

/opt/hostedtoolcache/Python/3.11.4/x64/lib/python3.11/threading.py:320: AppTimeout
---------------------------- Captured log teardown -----------------------------
INFO parsl.dataflow.dflow:dflow.py:1145 Waiting for all remaining tasks to complete
INFO parsl.dataflow.dflow:dflow.py:1158 All remaining tasks completed
=============================== warnings summary ===============================
parsl/executors/workqueue/executor.py:46
/home/runner/work/parsl/parsl/parsl/executors/workqueue/executor.py:46: DeprecationWarning: 'import work_queue' is deprecated. Please instead use: 'import ndcctools.work_queue'
import work_queue as wq

test_docs/test_workflow3.py::test_parallel_for
/opt/hostedtoolcache/Python/3.11.4/x64/lib/python3.11/site-packages/_pytest/python.py:198: PytestReturnNotNoneWarning: Expected None, but test_docs/test_workflow3.py::test_parallel_for returned [1, 2, 2, 1], which will be an error in a future version of pytest. Did you mean to use assert instead of return?
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================= slowest 10 durations =============================
8.08s call test_python_apps/test_simple.py::test_slow_increment
8.04s call test_python_apps/test_simple.py::test_increment
6.04s call test_python_apps/test_join.py::test_combine
6.03s call test_python_apps/test_pipeline.py::test_increment_slow[3]
6.03s call test_python_apps/test_pipeline.py::test_increment[3]
4.09s call test_python_apps/test_futures.py::test_fut_case_3
4.05s call test_python_apps/test_overview.py::test_1
4.04s call test_python_apps/test_futures.py::test_fut_case_4
4.03s call test_docs/test_from_slides.py::test_slides
4.03s call test_python_apps/test_type5.py::test_func_linked_to_multiple_previous_results
=========================== short test summary info ============================
SKIPPED [1] parsl/tests/configs/ec2_single_node.py:30: 'public_ip' not configured in user_opts.py
SKIPPED [1] parsl/tests/configs/local_threads_globus.py:14: 'globus' not configured in user_opts.py
SKIPPED [1] parsl/tests/test_error_handling/test_retry_handler.py:41: intended for local config
SKIPPED [1] parsl/tests/test_flux.py:47: Flux not available, test will fail
SKIPPED [1] parsl/tests/test_flux.py:120: intended for local config
SKIPPED [1] parsl/tests/test_flux.py:108: intended for local config
SKIPPED [1] parsl/tests/test_flux.py:165: intended for local config
SKIPPED [1] parsl/tests/test_flux.py:152: intended for local config
SKIPPED [1] parsl/tests/test_flux.py:58: Flux not available, test will fail
SKIPPED [1] parsl/tests/test_flux.py:34: Flux not available, test will fail
SKIPPED [1] parsl/tests/test_flux.py:92: intended for local config
SKIPPED [1] parsl/tests/test_flux.py:71: Flux not available, test will fail
SKIPPED [1] parsl/tests/test_flux.py:135: intended for local config
SKIPPED [1] parsl/tests/test_error_handling/test_htex_missing_worker.py:27: intended for local config
SKIPPED [1] parsl/tests/test_checkpointing/test_python_checkpoint_2.py:24: intended for local config
SKIPPED [1] parsl/tests/test_staging/test_docs_2.py:11: intended for local config
SKIPPED [1] parsl/tests/test_checkpointing/test_regression_232.py:16: fails on Travis in pytest
SKIPPED [1] parsl/tests/test_checkpointing/test_regression_232.py:57: fails on Travis in pytest
SKIPPED [1] parsl/tests/site_tests/test_provider.py:20: This test cannot run on sites which cannot be identified by site_config_selector
SKIPPED [1] parsl/tests/test_docs/test_workflow2.py:19: fails with pytest+xdist
SKIPPED [1] parsl/tests/test_error_handling/test_wrap_with_logs.py:27: intended for local config
SKIPPED [1] parsl/tests/test_error_handling/test_wrap_with_logs.py:17: intended for local config
SKIPPED [1] parsl/tests/test_staging/test_elaborate_noop_file.py:64: intended for local config
SKIPPED [1] parsl/tests/test_staging/test_elaborate_noop_file.py:34: intended for local config
SKIPPED [1] parsl/tests/sites/test_affinity.py:43: intended for local config
SKIPPED [1] parsl/tests/test_thread_parallelism.py:24: fails intermittently in pytest
SKIPPED [1] parsl/tests/test_thread_parallelism.py:45: fails intermittently in pytest
SKIPPED [1] parsl/tests/test_staging/test_staging_https_in_task.py:19: intended for local config
SKIPPED [1] parsl/tests/test_regression/test_1606_wait_for_current_tasks.py:15: intended for local config
SKIPPED [1] parsl/tests/test_regression/test_854.py:39: intended for local config
SKIPPED [1] parsl/tests/test_regression/test_854.py:19: intended for local config
SKIPPED [1] parsl/tests/test_serialization/test_2555_caching_deserializer.py:12: intended for local config
SKIPPED [1] parsl/tests/test_staging/test_1316.py:43: intended for local config
SKIPPED [1] parsl/tests/test_staging/test_1316.py:66: intended for local config
SKIPPED [1] parsl/tests/test_scaling/test_scale_down.py:63: intended for local config
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
= 1 failed, 37 passed, 35 skipped, 30 deselected, 2 warnings in 101.26s (0:01:41) =
make: *** [Makefile:73: wqex_local_test] Error 1
Error: Process completed with exit code 2.

To Reproduce

I can get a similar looking exception by setting the test timeout to 0, on parsl master d5db72f

-@pytest.mark.parametrize("timeout_dur", (0.005, 0.01, 0.1))
+@pytest.mark.parametrize("timeout_dur", (0.000000,) )

Expected behavior
test should pass. if this is breaking behaviour for actual user code, rather than incorrect test expectations, it should work right for users too

Environment
initial observation CI of a dev branch. reproduction on my laptop.

Ben Clifford · Answer 1 · Fri Jul 21 2023 22:01:02 GMT+0800 (China Standard Time)

#2841 should fix this test