triton-lang / triton

Development repository for the Triton language and compiler

Home Page: https://triton-lang.org/

StackTrace handler on python module does not allow signal to propagate.

amjames opened this issue · comments

#4094 connected the LLVM handler for printing stack traces on fatal signals to the Python module. Importing triton makes the handler active, and the handler does not allow the signal to propagate correctly. This happens as long as triton is imported; the signal need not originate from Triton code.

Reproducer and notes below:

`python repro.py` will kill the script's process with `SIGTERM` which is not intercepted by the handler.

`python repro.py abort` will kill the script's process with `SIGABRT`, which is intercepted by the handler. A stack trace is printed, but the process does not exit: the `print('Still alive')` line executes and the script exits with exit code 0.

repro.py:

import signal
import os
import triton
import sys

if sys.argv[-1] == 'abort':
    os.kill(os.getpid(), signal.SIGABRT)
else:
    os.kill(os.getpid(), signal.SIGTERM)

print('Still alive')

I have confirmed locally that commenting out this line restores normal propagation of these signals.
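To make the "exit code 0" symptom easy to check, here is a minimal sketch of a baseline that runs a child process which sends itself `SIGABRT` and reports how it ended. It assumes a POSIX system, where `subprocess` reports death by signal N as return code `-N`. The names `describe_exit` and `CHILD` are made up for this illustration, and the child deliberately does not import triton.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Translate a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, a negative return code -N means "killed by signal N".
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with code {returncode}"


# Baseline child: sends SIGABRT to itself without importing triton.
# (Hypothetical inline script, modeled on repro.py from the report.)
CHILD = (
    "import os, signal; "
    "os.kill(os.getpid(), signal.SIGABRT); "
    "print('Still alive')"
)

result = subprocess.run(
    [sys.executable, "-c", CHILD], capture_output=True, text=True
)
print(describe_exit(result.returncode))
print("stdout:", repr(result.stdout))
```

Pointing the same `subprocess.run` call at `repro.py abort` in an environment with triton installed should show the difference described in the report: an exit code of 0 and `Still alive` on stdout instead of death by `SIGABRT`.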

You're totally right and this is not nice of us.

I actually don't see where in the LLVM signal handler we prevent propagation of the signal. Here's the signal handler on unix. https://github.com/llvm/llvm-project/blob/8d3ff601a307fa9d18f237903b298bb12b8b64cf/llvm/lib/Support/Unix/Signals.inc#L371

In any case the fix here appears to be somewhere in LLVM, and with regret I don't have cycles right now to investigate further. I'd take a patch to add an envvar that causes us not to register a signal handler.

I think the issue is that the LLVM signal handler takes precedence over the Python signal handler. I created my own signal handler function that calls `llvm::sys::PrintStackTrace(llvm::errs());` and then uses the Python C API to set an interrupt. This results in the program terminating after the stack trace is printed:

Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
0  libtriton.so 0x00007a065c21e970
1  libtriton.so 0x00007a0657fc3437
2  libtriton.so 0x00007a065c21bd7f
3  libtriton.so 0x00007a065c21bed5
4  libc.so.6    0x00007a065fe42520
5  libc.so.6    0x00007a065fe4275b kill + 11
6  python       0x00005bf615df8004
7  python       0x00005bf615cd7c59
8  python       0x00005bf615cc5cfa _PyEval_EvalFrameDefault + 24906
9  python       0x00005bf615cbc9c6
10 python       0x00005bf615db2256 PyEval_EvalCode + 134
11 python       0x00005bf615ddd108
12 python       0x00005bf615dd69cb
13 python       0x00005bf615ddce55
14 python       0x00005bf615ddc338 _PyRun_SimpleFileObject + 424
15 python       0x00005bf615ddbf83 _PyRun_AnyFileObject + 67
16 python       0x00005bf615dcea5e Py_RunMain + 702
17 python       0x00005bf615da502d Py_BytesMain + 45
18 libc.so.6    0x00007a065fe29d90
19 libc.so.6    0x00007a065fe29e40 __libc_start_main + 128
20 python       0x00005bf615da4f25 _start + 37
Traceback (most recent call last):
  File "/localdisk/abaden/Projects/triton/repro.py", line 7, in <module>
    os.kill(os.getpid(), signal.SIGABRT)
KeyboardInterrupt

There is a downside: the original signal gets swapped for a SIGINT. There is a signal-specific interrupt you can call, but it is only supported in Python 3.10 and later. That also assumes the original signal can be easily captured where I have intercepted the LLVM stack trace. I tried to duplicate the minimal amount of code, but that means I don't have a lot of the original context.

I can PR this change if it sounds like it might be acceptable.
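For comparison, a common way to keep the original signal is to print the stack, restore the default disposition, and re-send the same signal. The sketch below shows that pattern in pure Python. It is not the C++ change proposed in this comment and it does not involve LLVM or triton. It assumes a POSIX system, and `HANDLER_SCRIPT` and `print_then_reraise` are names invented for the illustration. The handler runs in a child process so the parent can inspect how the child ended.

```python
import signal
import subprocess
import sys
import textwrap

# Hypothetical child script: installs a Python-level SIGABRT handler that
# prints the Python stack and then lets the original signal kill the process.
HANDLER_SCRIPT = textwrap.dedent(
    """
    import os, signal, sys, traceback

    def print_then_reraise(signum, frame):
        # Print the Python stack at the point where the signal arrived.
        traceback.print_stack(frame, file=sys.stderr)
        # Restore the default action, then re-send the same signal so the
        # process terminates with the original signal, not SIGINT.
        signal.signal(signum, signal.SIG_DFL)
        os.kill(os.getpid(), signum)

    signal.signal(signal.SIGABRT, print_then_reraise)
    os.kill(os.getpid(), signal.SIGABRT)
    print("Still alive")
    """
)

result = subprocess.run(
    [sys.executable, "-c", HANDLER_SCRIPT], capture_output=True, text=True
)
print("return code:", result.returncode)
print(result.stderr, end="")
```

With this pattern the parent sees the child die from `SIGABRT` rather than from a `KeyboardInterrupt`. A Python-level handler runs between bytecodes on the main thread, so calling `traceback.print_stack` there is fine; a C-level handler like the one discussed in this thread is limited to async-signal-safe operations.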

> I created my own signal handler function that calls llvm::sys::PrintStackTrace(llvm::errs());

If this is in fact signal-safe, then sgtm.

> That also assumes the original signal can be easily captured where I have intercepted the llvm stack - I tried to duplicate the minimal amount of code, but it means I don't have a lot of the original context.

Not sure wym

> Not sure wym

LLVM registers the signal handlers internally, but I don't think the signal is actually passed to the handler function. We could write our own C signal handler, but using LLVM is nicer because we know it is thread-safe.

OK, happy to take a look at the PR. It would indeed be better to print both the Python and C++ stacks instead of having to pick one.