ask / mode

Python AsyncIO Services

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

threads.ServiceThread hangs on exception

r313pp opened this issue · comments

Checklist

  • I have included information about relevant versions
  • I have verified that the issue persists when using the master branch of Mode.

Steps to reproduce

import asyncio
import mode.threads

class CrashingServiceThread(mode.threads.ServiceThread):
    async def on_start(self):
        raise RuntimeError('I am here to crash')

async def main():
    await CrashingServiceThread().start()

if __name__ == "__main__":
    asyncio.run(main())

Expected behavior

Process exits.

Actual behavior

Process hangs.

Full traceback

'CrashingServiceThread' crashed: RuntimeError('I am here to crash')
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/mode/threads.py", line 218, in _serve
    await self._default_start()
  File "/usr/local/lib/python3.7/site-packages/mode/services.py", line 739, in _default_start
    await self._actually_start()
  File "/usr/local/lib/python3.7/site-packages/mode/services.py", line 756, in _actually_start
    await self.on_start()
  File "app2.py", line 6, in on_start
    raise RuntimeError('I am here to crash')
RuntimeError: I am here to crash
--> Hangs here. ^C is pressed.
Traceback (most recent call last):
  File "app2.py", line 12, in <module>
    asyncio.run(main())
  File "/usr/local/lib/python3.7/asyncio/runners.py", line 43, in run
    return loop.run_until_complete(main)
  File "/usr/local/lib/python3.7/asyncio/base_events.py", line 570, in run_until_complete
    self.run_forever()
  File "/usr/local/lib/python3.7/asyncio/base_events.py", line 538, in run_forever
    self._run_once()
  File "/usr/local/lib/python3.7/asyncio/base_events.py", line 1746, in _run_once
    event_list = self._selector.select(timeout)
  File "/usr/local/lib/python3.7/selectors.py", line 468, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
--> Another ^C.
Exception ignored in: <module 'threading' from '/usr/local/lib/python3.7/threading.py'>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/threading.py", line 1307, in _shutdown
    lock.acquire()
KeyboardInterrupt

Versions

  • Python 3.7.4
  • Mode 4.1.6
  • Debian GNU/Linux 10 (buster), Windows 8.1

Additional Info

mode.threads.ServiceThread and mode.threads.WorkerThread locks each other when an exception is raised.
I've dug up the problem, but I don't have sufficient knowledge of this lib to fix this properly.
Hire is what I've found (my comments are denoted by ###):

class WorkerThread(threading.Thread):
    def run(self) -> None:
        try:
            self.service._start_thread()
        finally:
            ### self._is_stopped.set() done hire
            ### but WorkerThread.stop() is called inside self.service._start_thread()
            ### and infinitely waits, so we never reach this
            self._set_stopped()

    def stop(self) -> None:
        self._is_stopped.wait()  ### we infinitely wait here because run() is still running up the stack
        if self.is_alive():
            self.join(threading.TIMEOUT_MAX)

class ServiceThread(Service):
    def _start_thread(self) -> None:
        # set the default event loop for this thread
        asyncio.set_event_loop(self.thread_loop)
        try:
            self.thread_loop.run_until_complete(self._serve())
        except Exception:
            # if self._serve raises an exception we need to set
            # shutdown here, since _shutdown_thread will not execute.
            ### I think this is not true, _shutdown_thread is executed
            ### in case of the exception in _serve.
            self.set_shutdown()
            raise

    async def _shutdown_thread(self) -> None:
        await self._default_stop_children()
        await self.on_thread_stop()
        self.set_shutdown()
        await self._default_stop_futures()
        if self._thread is not None:
            ### problem is here
            ### May be we shouldn't call this from inside the thread
            self._thread.stop()
        await self._default_stop_exit_stacks()

    async def _serve(self) -> None:
        try:
            # start the service
            await self._default_start()
            # allow ServiceThread.start() to return
            # when wait_for_thread is enabled.
            await self.on_thread_started()
            notify(self._thread_running)
            await self.wait_until_stopped()
        except asyncio.CancelledError:
            raise
        except BaseException as exc:  # pylint: disable=broad-except
            self.on_crash('{0!r} crashed: {1!r}', self.label, exc)
            await self.crash(exc)
            if self.beacon.root is not None:
                await self.beacon.root.data.crash(exc)
            raise
        finally:
            await self._shutdown_thread()  ### this is called in case of the exception

It also causes faust to hang as well:

import faust
import mode

app = faust.App(
    "some_app",
    broker="kafka://localhost:9092",
    # agent_supervisor=mode.CrashingSupervisor,  # This won't help
)

@app.task
async def some_task():
    raise RuntimeError("Some Error")

if __name__ == "__main__":
    app.main()

Process will hang in crashed state.

Also, it seems to be causing this faust issue.

Also broken on: 3.8.0, using just the builtin examples, on master:

jbooth@m /home/jbooth/oss/mode (master) $ python examples/tutorial.py
[2020-01-02 10:54:16,276] [16560] [INFO] [^Worker]: Starting... 
[2020-01-02 10:54:16,276] [16560] [INFO] [^-App]: Starting... 
[2020-01-02 10:54:16,333] [16560] [WARNING] WRITING GRAPH TO image.png 
[2020-01-02 10:54:16,354] [16560] [INFO] [^--Websockets]: Starting... 
[2020-01-02 10:54:16,354] [16560] [WARNING] STARTING WEBSOCKET SERVER 
[2020-01-02 10:54:16,354] [16560] [INFO] [^--Webserver]: Starting... 
[2020-01-02 10:54:16,355] [16560] [WARNING] examples/tutorial.py:53: DeprecationWarning: Application.make_handler(...) is deprecated, use AppRunner API instead
  handler = self._handler = self._app.make_handler() 
[2020-01-02 10:54:16,356] [16560] [INFO] [^--Webserver]: Serving on port 8000 
[2020-01-02 10:54:17,357] [16560] [INFO] [^--UserCache]: Starting... 
^C-INT- -INT- -INT- -INT- -INT- -INT-
[2020-01-02 10:54:25,363] [16560] [INFO] [^Worker]: Signal received: Signals.SIGINT (2) 
[2020-01-02 10:54:25,364] [16560] [INFO] [^Worker]: Stopping... 
[2020-01-02 10:54:25,364] [16560] [INFO] [^-App]: Stopping... 
[2020-01-02 10:54:25,365] [16560] [INFO] [^--UserCache]: Stopping... 
[2020-01-02 10:54:25,366] [16560] [INFO] [^--Webserver]: Stopping... 
[2020-01-02 10:54:25,372] [16560] [INFO] [^--Webserver]: Closing server 
[2020-01-02 10:54:25,372] [16560] [INFO] [^--Webserver]: Waiting for server to close handle 
[2020-01-02 10:54:25,373] [16560] [INFO] [^--Webserver]: Shutting down web application 
[2020-01-02 10:54:25,373] [16560] [INFO] [^--Webserver]: Waiting for handler to shut down 
[2020-01-02 10:54:25,374] [16560] [INFO] [^--Webserver]: Cleanup 
^C-INT- -INT- -INT- -INT- -INT- -INT-
[2020-01-02 10:55:17,589] [16560] [INFO] [^--Websockets]: Stopping... 
[2020-01-02 10:55:17,589] [16560] [INFO] [^Worker]: Gathering service tasks... 
[2020-01-02 10:55:17,589] [16560] [INFO] [^Worker]: Gathering all futures... 
[2020-01-02 10:55:18,591] [16560] [INFO] [^Worker]: Closing event loop 
^CException ignored in: <module 'threading' from '/home/jbooth/.asdf/installs/python/3.8.0/lib/python3.8/threading.py'>
Traceback (most recent call last):
  File "/home/jbooth/.asdf/installs/python/3.8.0/lib/python3.8/threading.py", line 1388, in _shutdown
    lock.acquire()
KeyboardInterrupt: 
jbooth@m /home/jbooth/oss/mode (master) $ 

Also broken in python 3.6.9, but it exits fine in python 3.6.3. Aha! Going to trace that down to a single python version.

3.6.6 passes. 3.6.7 fails, requiring two ctrl-C's. 3.6.7 passes if I let it sit and idle long enough before the ctrl-C?

3.6.7 is heisenbug. Sometimes ctrl-C exits cleanly. Sometimes not. This is true back to at least 3.6.3.

Okay, having walked away in the middle of a test, here's what I've learned: there's two branch points for what looks like failure.

An early ctrl-C will result in a hanging process, for exactly one minute. In general, in the example, if you let it run until it prints the REMOVING EXPIRED USERS line and quickly ctrl-C you can avoid the first hang. If you hit the first hang it is between these two:

jbooth@m /home/jbooth/oss/mode (master) $ python --version;python examples/tutorial.py
Python 3.7.4
[2020-01-02 12:42:47,116] [10916] [INFO] [^Worker]: Starting... 
[2020-01-02 12:42:47,116] [10916] [INFO] [^-App]: Starting... 
[2020-01-02 12:42:47,178] [10916] [WARNING] WRITING GRAPH TO image.png 
[2020-01-02 12:42:47,198] [10916] [INFO] [^--Websockets]: Starting... 
[2020-01-02 12:42:47,198] [10916] [WARNING] STARTING WEBSOCKET SERVER 
[2020-01-02 12:42:47,199] [10916] [INFO] [^--Webserver]: Starting... 
[2020-01-02 12:42:47,199] [10916] [WARNING] examples/tutorial.py:53: DeprecationWarning: Application.make_handler(...) is deprecated, use AppRunner API instead
  handler = self._handler = self._app.make_handler() 
[2020-01-02 12:42:47,201] [10916] [INFO] [^--Webserver]: Serving on port 8000 
^C[2020-01-02 12:42:47,800] [10916] [INFO] [^--UserCache]: Starting... 
-INT- -INT- -INT- -INT- -INT- -INT-
[2020-01-02 12:42:47,801] [10916] [INFO] [^Worker]: Signal received: Signals.SIGINT (2) 
[2020-01-02 12:42:47,802] [10916] [INFO] [^Worker]: Stopping... 
[2020-01-02 12:42:47,802] [10916] [INFO] [^-App]: Stopping... 
[2020-01-02 12:42:47,803] [10916] [INFO] [^--UserCache]: Stopping... 
[2020-01-02 12:42:47,804] [10916] [INFO] [^--Webserver]: Stopping... 
[2020-01-02 12:42:48,203] [10916] [INFO] [^--Webserver]: Closing server 
[2020-01-02 12:42:48,203] [10916] [INFO] [^--Webserver]: Waiting for server to close handle 
[2020-01-02 12:42:48,204] [10916] [INFO] [^--Webserver]: Shutting down web application 
[2020-01-02 12:42:48,204] [10916] [INFO] [^--Webserver]: Waiting for handler to shut down 
[2020-01-02 12:42:48,205] [10916] [INFO] [^--Webserver]: Cleanup 
... a minute passes ...
[2020-01-02 12:43:47,865] [10916] [INFO] [^--Websockets]: Stopping... 

After that timeout, there's a second failure point, based around the threading changes in 3.7.4. In 3.7.4 and later with those changes, we hang as the interpreter is trying to exit:

[2020-01-02 12:43:47,865] [10916] [INFO] [^--Websockets]: Stopping... 
[2020-01-02 12:43:47,866] [10916] [INFO] [^Worker]: Gathering service tasks... 
[2020-01-02 12:43:47,866] [10916] [INFO] [^Worker]: Gathering all futures... 
[2020-01-02 12:43:48,869] [10916] [INFO] [^Worker]: Closing event loop 
... the interpreter hangs (hard) here ...

and then hang what looks like indefinitely (multi-minute) until a second ctrl-C is thrown, at which point the exception in the threading module happens:

[2020-01-02 12:43:47,865] [10916] [INFO] [^--Websockets]: Stopping... 
[2020-01-02 12:43:47,866] [10916] [INFO] [^Worker]: Gathering service tasks... 
[2020-01-02 12:43:47,866] [10916] [INFO] [^Worker]: Gathering all futures... 
[2020-01-02 12:43:48,869] [10916] [INFO] [^Worker]: Closing event loop 
^CException ignored in: <module 'threading' from '/home/jbooth/.asdf/installs/python/3.7.4/lib/python3.7/threading.py'>
Traceback (most recent call last):
  File "/home/jbooth/.asdf/installs/python/3.7.4/lib/python3.7/threading.py", line 1308, in _shutdown
    lock.acquire()
KeyboardInterrupt
jbooth@m /home/jbooth/oss/mode (master) $ date
Thu 02 Jan 2020 12:50:10 PM CST

Debugging this is super confusing in part because mode's WorkerThread is overriding threading.Thread's _is_stopped attribute. This means if you look at the thread state in the repl/pdb, it will erroneously report the thread as stopped because self._is_stopped is a threading.Event which is not falsey. That seems to make WorkerThread.stop incorrect -- self.is_alive() is always false, so it'll never execute the self.join()

Trying to get better debugging by running the example under python -X dev examples/tutorial.py I can't even get it launched. Python throws RuntimeErrors about non-thread-safe asyncio loop usage.

After fixing the thread safety issues, it tentatively looks like the thread_loop isn't getting properly stopped and closed. When it isn't closed, the thread doesn't properly stop. The unstopped thread prevents interpreter exit. This doesn't fix @r313pp's crashing example (although it also not properly exits post-crash via ctrl-c!), but it does fix the tutorial so it properly exits on ctrl-c. It is causing test failures I haven't yet investigated. I'll try to hit both of those, probably next week, and get a branch and PR up.

diff --git a/mode/threads.py b/mode/threads.py
index 4a5c7ca..635d5c8 100644
--- a/mode/threads.py
+++ b/mode/threads.py
@@ -189,6 +189,9 @@ class ServiceThread(Service):
             # shutdown here, since _shutdown_thread will not execute.
             self.set_shutdown()
             raise
+        finally:
+            self.loop.stop()
+            self.loop.close()
 
     async def stop(self) -> None:
         if self._started.is_set():
@@ -208,8 +211,6 @@ class ServiceThread(Service):
         await self.on_thread_stop()
         self.set_shutdown()
         await self._default_stop_futures()
-        if self._thread is not None:
-            self._thread.stop()
         await self._default_stop_exit_stacks()
 
     async def _serve(self) -> None:
diff --git a/mode/utils/futures.py b/mode/utils/futures.py
index 0f8e45a..1e280d5 100644
--- a/mode/utils/futures.py
+++ b/mode/utils/futures.py
@@ -155,7 +155,7 @@ def maybe_set_result(fut: Optional[asyncio.Future],
                      result: Any) -> bool:
     """Set future result if not already done."""
     if fut is not None and not fut.done():
-        fut.set_result(result)
+        notify(fut, result)
         return True
     return False
 
@@ -165,4 +165,5 @@ def notify(fut: Optional[asyncio.Future], result: Any = None) -> None:
     # can be used to turn a Future into a lockless, single-consumer condition,
     # for multi-consumer use asyncio.Condition
     if fut is not None and not fut.done():
-        fut.set_result(result)
+        loop = fut.get_loop()
+        loop.call_soon_threadsafe(fut.set_result, result)
diff --git a/mode/utils/locks.py b/mode/utils/locks.py
index 5c42ee4..eba6e64 100644
--- a/mode/utils/locks.py
+++ b/mode/utils/locks.py
@@ -45,12 +45,13 @@ class Event:
         All coroutines waiting for it to become true are awakened.
         Coroutine that call wait() once the flag is true will not block at all.
         """
+        from .futures import notify
+
         if not self._value:
             self._value = True
 
-            for fut in self._waiters:
-                if not fut.done():
-                    fut.set_result(True)
+            for fut in self._waiters.copy():
+                notify(fut, True)
 
     def clear(self) -> None:
         """Reset the internal flag to false.

@jbooth-mastery have you tried my PR #55?
Although I haven't addressed correct loop closing problem(may be I'll do it by the end of this week), but threadsafe async calls and thread hang are fixed, at least in my cases.

I'd missed your branch. It doesn't seem to work correctly for me:

jbooth@m /home/jbooth/oss/mode (service-thread-hangs-fix) $ python --version;python -X dev -W default examples/tutorial.py 
Python 3.8.0
/home/jbooth/oss/mode/.direnv/python-3.8.0/lib/python3.8/site-packages/aiohttp/helpers.py:107: DeprecationWarning: "@coroutine" decorator is deprecated since Python 3.8, use "async def" instead
  def noop(*args, **kwargs):  # type: ignore
[2020-01-06 09:29:17,035] [13861] [INFO] [^Worker]: Starting... 
[2020-01-06 09:29:17,036] [13861] [INFO] [^-App]: Starting... 
[2020-01-06 09:29:17,110] [13861] [WARNING] WRITING GRAPH TO image.png 
[2020-01-06 09:29:17,136] [13861] [INFO] [^--Websockets]: Starting... 
[2020-01-06 09:29:17,136] [13861] [WARNING] STARTING WEBSOCKET SERVER 
[2020-01-06 09:29:17,137] [13861] [WARNING] Executing <Task pending name='Task-1' coro=<Service.start() running at /home/jbooth/oss/mode/mode/services.py:732> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7fcc01a2d510>()] created at /home/jbooth/oss/mode/mode/threads.py:160> cb=[_run_until_complete_cb() at /home/jbooth/.asdf/installs/python/3.8.0/lib/python3.8/asyncio/base_events.py:180] created at /home/jbooth/oss/mode/mode/worker.py:263> took 0.103 seconds 
[2020-01-06 09:29:17,138] [13861] [INFO] [^--Webserver]: Starting... 
[2020-01-06 09:29:17,138] [13861] [WARNING] examples/tutorial.py:53: DeprecationWarning: Application.make_handler(...) is deprecated, use AppRunner API instead
  handler = self._handler = self._app.make_handler() 
[2020-01-06 09:29:17,141] [13861] [INFO] <Server sockets=(<asyncio.TransportSocket fd=9, family=AddressFamily.AF_INET6, type=SocketKind.SOCK_STREAM, proto=6, laddr=('::', 8000, 0, 0)>, <asyncio.TransportSocket fd=10, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('0.0.0.0', 8000)>)> is serving 
[2020-01-06 09:29:17,141] [13861] [INFO] [^--Webserver]: Serving on port 8000 
[2020-01-06 09:29:17,141] [13861] [WARNING] /home/jbooth/oss/mode/mode/services.py:719: DeprecationWarning: The loop argument is deprecated since Python 3.8, and scheduled for removal in Python 3.10.
  done, pending = await asyncio.wait( 
[2020-01-06 09:29:17,142] [13861] [INFO] [^--UserCache]: Starting... 
[2020-01-06 09:29:17,142] [13861] [WARNING] /home/jbooth/oss/mode/mode/services.py:625: DeprecationWarning: The loop argument is deprecated since Python 3.8, and scheduled for removal in Python 3.10.
  await asyncio.wait_for( 
^C-INT- -INT- -INT- -INT- -INT- -INT-
[2020-01-06 09:29:23,735] [13861] [INFO] [^Worker]: Signal received: Signals.SIGINT (2) 
[2020-01-06 09:29:23,736] [13861] [INFO] [^Worker]: Stopping... 
[2020-01-06 09:29:23,737] [13861] [INFO] [^-App]: Stopping... 
[2020-01-06 09:29:23,738] [13861] [INFO] [^--UserCache]: Stopping... 
[2020-01-06 09:29:23,740] [13861] [WARNING] /home/jbooth/oss/mode/mode/services.py:903: DeprecationWarning: The loop argument is deprecated since Python 3.8, and scheduled for removal in Python 3.10.
  await asyncio.wait( 
[2020-01-06 09:29:23,746] [13861] [INFO] [^--Webserver]: Stopping... 
[2020-01-06 09:29:23,747] [13861] [ERROR] [^-App]: Error while stopping child <Webserver: stopping >: RuntimeError('Non-thread-safe operation invoked on an event loop other than the current one') 
Traceback (most recent call last):
  File "/home/jbooth/oss/mode/mode/services.py", line 854, in _default_stop_children
    await asyncio.shield(child.stop())
  File "/home/jbooth/oss/mode/mode/threads.py", line 196, in stop
    await super().stop()
  File "/home/jbooth/oss/mode/mode/services.py", line 832, in stop
    self._stopped.set()
  File "/home/jbooth/oss/mode/mode/utils/locks.py", line 53, in set
    fut.set_result(True)
  File "/home/jbooth/.asdf/installs/python/3.8.0/lib/python3.8/asyncio/base_events.py", line 713, in call_soon
    self._check_thread()
  File "/home/jbooth/.asdf/installs/python/3.8.0/lib/python3.8/asyncio/base_events.py", line 750, in _check_thread
    raise RuntimeError(
RuntimeError: Non-thread-safe operation invoked on an event loop other than the current one
[2020-01-06 09:29:23,749] [13861] [INFO] [^--Websockets]: Stopping... 
[2020-01-06 09:29:23,751] [13861] [INFO] [^Worker]: Gathering service tasks... 
[2020-01-06 09:29:23,751] [13861] [INFO] [^Worker]: Gathering all futures... 
/home/jbooth/oss/mode/mode/worker.py:318: DeprecationWarning: The loop argument is deprecated since Python 3.8, and scheduled for removal in Python 3.10.
  await asyncio.sleep(1.0, loop=self.loop)
[2020-01-06 09:29:24,753] [13861] [INFO] [^Worker]: Closing event loop 
... process hangs ...

I'm going to try to fix the tests I broke and then will look at merging your changes into my branch and will put it up as a PR. I'm hoping to get that all done today or tomorrow as this bug is a blocker for my team.

Thanks for looking into this! Clearly a bug

I merged #55 for now as that one is easier to reason about, but hope to merge #56 as well

I'll update the conflicts out of it and let's see how it looks?