ARM-software / devlib

Library for interaction with and instrumentation of remote devices.

Target.execute() not reentrant

douglas-raillard-arm opened this issue

Target.execute() is not reentrant when using (at least) the LocalConnection connection type.

This manifests as a deadlock in devlib.misc.get_subprocess() when it tries to acquire check_output_lock. It can happen when an object's __del__ uses target.execute() and that object happens to be garbage collected while Target.execute() is already running: __del__ is then invoked at a point where the lock is already taken, and the GC deadlocks on it. Since the GC runs on the main thread, the whole interpreter hangs.
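A minimal, self-contained reproduction of the failure mode (the lock, execute() and Resource below are stand-ins, not devlib code):

import gc
import threading

check_output_lock = threading.Lock()   # stand-in for devlib's module-level lock

def execute(cmd):
    # Non-reentrant: if the same thread already holds the lock, this blocks forever.
    with check_output_lock:
        return 'ran {}'.format(cmd)

class Resource:
    def __init__(self):
        self._cycle = self          # reference cycle: only the cyclic GC frees this

    def __del__(self):
        execute('cleanup-command')  # last-resort cleanup going through execute()

Resource()                          # unreachable, but kept alive until a GC pass

with check_output_lock:
    # Simulate the GC kicking in while the lock is held: __del__ runs here,
    # execute() tries to re-acquire check_output_lock and deadlocks.
    gc.collect()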

In general, it's not a huge problem because __del__ should never be relied upon. However:

  1. It's sometimes useful to do some last-resort cleanup in __del__, which this problem makes impossible (failing to clean up is OK, deadlocking not so much)
  2. There is a useful idiom used to "package" setup/teardown code:
import contextlib

@contextlib.contextmanager
def setup_teardown():
    setup()
    try:
        yield
    finally:
        teardown()


with setup_teardown():
    ...

# But if we want to keep the CM as an implementation detail for backward
# compat and allow a manual setup()/teardown() API:

cm = setup_teardown()
setup = lambda: cm.__enter__()
teardown = lambda: cm.__exit__(None, None, None)

setup()
# May never happen at all if the user never calls it
teardown()

Now we have a problem: the user might never call teardown(). Since the context manager is based on a generator, it will eventually be closed (manually or via __del__), leading to a GeneratorExit being raised at the yield point. When this happens, it is basically as if teardown() had been called, except it can happen asynchronously at any point. Ideally, we would have a generator type that does not call close() from __del__, but there is nothing we can do for builtin types. The only "fix" is to keep a reference alive forever (and leak memory) so that generators are not gc'ed before __exit__ is called.
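To make that concrete, here is a small illustration (names are hypothetical) of the teardown code running at an arbitrary point when the generator is closed by __del__:

import contextlib

@contextlib.contextmanager
def setup_teardown():
    print('setup')
    try:
        yield
    finally:
        print('teardown')   # also runs on GeneratorExit

cm = setup_teardown()
cm.__enter__()   # the manual setup()
del cm           # the user never calls teardown(): the generator's __del__
                 # calls close(), GeneratorExit is raised at the yield, and
                 # the finally block runs right here, at an arbitrary point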

That leaves a few choices:

  • Make devlib's Target.execute() reentrant
  • Make devlib's Target.execute() sort-of-reentrant, i.e. detect re-entry and raise a RuntimeError (see the sketch after this list). This will randomly prevent __del__ from doing useful things and litter stderr with the backtraces of swallowed exceptions, but at least it won't deadlock.
  • Avoid calling target.execute() from __del__, which rules out the useful cases above.
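A sketch of what the "sort-of-reentrant" option could look like (hypothetical helper, not existing devlib code): a per-thread flag that fails fast on same-thread re-entry instead of blocking. Cross-thread serialization would still need the existing lock.

import threading

class ReentryGuard:
    # Hypothetical helper: refuse same-thread re-entry instead of deadlocking.
    def __init__(self):
        self._tls = threading.local()

    def __enter__(self):
        if getattr(self._tls, 'held', False):
            raise RuntimeError('Target.execute() re-entered from the same thread')
        self._tls.held = True
        return self

    def __exit__(self, *exc):
        self._tls.held = False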

Maybe connections are actually already reentrant and all we need is to turn the lock into a reentrant lock.
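If so, the fix is purely local (assuming the deadlock really is nothing more than the same thread re-acquiring the lock):

import threading

check_output_lock = threading.RLock()

with check_output_lock:
    # With an RLock the owning thread can re-acquire the lock, so a __del__
    # that calls execute() while the lock is already held no longer hangs;
    # other threads still block on it as before.
    with check_output_lock:
        print('nested acquisition works')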

Otherwise, we can probably create a new connection object for nested execute() and background() calls, so connection objects don't have to worry about that issue. It should only happen in exceptional circumstances so I don't think performance is a problem. We already do it for multithreading so I expect it to work.
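A rough sketch of that approach (names and structure are illustrative, not devlib's actual Target):

import threading
from contextlib import contextmanager

class Target:
    # Illustrative sketch only: cached per-thread connection for the common
    # case, a brand new connection for nested (re-entrant) calls.
    def __init__(self, conn_factory):
        self._conn_factory = conn_factory
        self._tls = threading.local()

    @contextmanager
    def _borrow_conn(self):
        depth = getattr(self._tls, 'depth', 0)
        if depth == 0:
            conn = getattr(self._tls, 'conn', None)
            if conn is None:
                conn = self._tls.conn = self._conn_factory()
        else:
            # Nested call, e.g. from __del__ while execute() is in flight.
            conn = self._conn_factory()
        self._tls.depth = depth + 1
        try:
            yield conn
        finally:
            self._tls.depth = depth

    def execute(self, cmd):
        with self._borrow_conn() as conn:
            return conn.execute(cmd)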

Another more invasive option is to implement Target.execute() on top of Target.background(), since the latter should already be reentrant. There might be some performance cost to that though, since background commands have to create their own channel for SSH so that multiple background commands can operate independently.
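That could look roughly like this (assuming background() hands back a Popen-like handle with communicate() and returncode; the class and the subprocess-based background() below are stand-ins for illustration):

import subprocess

class LocalTargetSketch:
    # Illustrative only: execute() layered on top of background().
    def background(self, cmd):
        # Stand-in for the real background(): returns a Popen-like handle.
        return subprocess.Popen(cmd, shell=True,
                                stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                                universal_newlines=True)

    def execute(self, cmd, timeout=None, check_exit_code=True):
        handle = self.background(cmd)
        out, err = handle.communicate(timeout=timeout)
        if check_exit_code and handle.returncode != 0:
            raise RuntimeError('{!r} exited with {}: {}'.format(
                cmd, handle.returncode, err))
        return out

print(LocalTargetSketch().execute('echo hello'))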

Hm, wouldn't making check_output_lock an RLock just fix this? Or am I misunderstanding the issue?

Hm wouldn't making check_output_lock an RLock just fix this?

Potentially, but I guess we should audit the rest of the code to make sure that no connection subclass assumes non-reentrancy. For example, I have no idea if paramiko.client.SSHClient.exec_command is reentrant either. The doc [1] seems to imply it's OK (as it creates a new channel for the call anyway), but I don't know how reliable that is. I don't think any "backend" function like SSHClient.exec_command will ever expect to be re-entered unless it takes a user callback.

[1] http://docs.paramiko.org/en/stable/api/client.html#paramiko.client.SSHClient.exec_command

After a bit of research it turns out that paramiko would indeed have a similar problem: SSHClient.exec_command will call Transport.open_session(), itself calling Transport.open_channel(), which uses a threading.Lock.

Do you think creating a separate connection instance for nested calls is OK? At first sight that seems to be the least-effort path, as connection instances are completely independent and we already dynamically swap the content of the "conn" property to have a separate one per thread. This would provide a blanket fix that is unlikely to break with future changes.