Segmentation fault with ITIMER_REAL
sternj opened this issue
🐛 Bug
PyTorch crashes with SIGSEGV when running alongside an ITIMER_REAL timer on macOS x86.
To Reproduce
Steps to reproduce the behavior:
- Run code located here on Mac x86
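The linked script is not included in this text. As a rough sketch of the kind of setup described (an ITIMER_REAL timer delivering SIGALRM while compute-heavy work runs), something like the following could be used. The names `run_with_itimer` and `busy_work` are hypothetical, and `busy_work` is a pure-Python stand-in: to exercise the reported crash, the workload would presumably need to be PyTorch tensor operations, which are not shown here.

```python
import signal
import time


def run_with_itimer(workload, interval=0.01):
    """Run workload() while ITIMER_REAL delivers SIGALRM every `interval` seconds.

    Returns (workload result, number of SIGALRMs handled at the Python level).
    Must be called from the main thread on a Unix-like system.
    """
    ticks = 0

    def on_alarm(signum, frame):
        # Python-level handler: just count the deliveries.
        nonlocal ticks
        ticks += 1

    previous = signal.signal(signal.SIGALRM, on_alarm)
    # First expiry after `interval` seconds, then repeat every `interval` seconds.
    signal.setitimer(signal.ITIMER_REAL, interval, interval)
    try:
        result = workload()
    finally:
        # Disarm the timer before restoring the old handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        if previous is not None:
            signal.signal(signal.SIGALRM, previous)
    return result, ticks


def busy_work(duration=0.2):
    # Hypothetical stand-in workload (assumption): spin on the CPU for `duration` seconds.
    deadline = time.monotonic() + duration
    total = 0
    while time.monotonic() < deadline:
        total += sum(range(1000))
    return total


if __name__ == "__main__":
    result, ticks = run_with_itimer(busy_work)
    print(f"workload finished; handled {ticks} SIGALRM deliveries")
```

With the stand-in workload this only demonstrates the timer setup and teardown; it is not expected to segfault on its own.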
Here is the stack trace from the crashed thread:
Thread 6 Crashed:
0 ??? 0x00007ffeee6d7138 0 + 140732898570552
1 libtorch_cpu.dylib 0x000000010392478c at::TensorIteratorBase::serial_for_each(c10::function_ref<void (char**, long long const*, long long, long long)>, at::Range) const + 588
2 libtorch_cpu.dylib 0x000000010390cdf2 std::__1::__function::__func<at::internal::_parallel_run(long long, long long, long long, std::__1::function<void (long long, long long, unsigned long)> const&)::$_1, std::__1::allocator<at::internal::_parallel_run(long long, long long, long long, std::__1::function<void (long long, long long, unsigned long)> const&)::$_1>, void (int, unsigned long)>::operator()(int&&, unsigned long&&) + 114
3 libtorch_cpu.dylib 0x000000010390b7ca std::__1::__function::__func<at::(anonymous namespace)::_run_with_pool(std::__1::function<void (int, unsigned long)> const&, unsigned long)::$_3, std::__1::allocator<at::(anonymous namespace)::_run_with_pool(std::__1::function<void (int, unsigned long)> const&, unsigned long)::$_3>, void ()>::operator()() + 42
4 libc10.dylib 0x00000001020996c9 c10::ThreadPool::main_loop(unsigned long) + 569
5 libc10.dylib 0x0000000102099d43 void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, c10::ThreadPool::ThreadPool(int, int, std::__1::function<void ()>)::$_0> >(void*) + 67
6 libsystem_pthread.dylib 0x00007fff5a16a2eb _pthread_body + 126
7 libsystem_pthread.dylib 0x00007fff5a16d249 _pthread_start + 66
8 libsystem_pthread.dylib 0x00007fff5a16940d thread_start + 13
Expected behavior
The program should either run without issue or pass the SIGALRM up.
Environment
Collecting environment information...
PyTorch version: 1.8.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 10.14.6 (x86_64)
GCC version: Could not collect
Clang version: 11.0.0 (clang-1100.0.33.12)
CMake version: Could not collect
Python version: 3.9 (64-bit runtime)
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.20.1
[pip3] torch==1.8.1
[conda] Could not collect
Reproduced on my machine as well. I get no segfaults when I run with version 1.5.1, but with 1.8.1, it segfaults on most executions.
Collecting environment information...
PyTorch version: 1.8.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 11.2.3 (x86_64)
GCC version: Could not collect
Clang version: 12.0.0 (clang-1200.0.32.29)
CMake version: version 3.19.1
Python version: 3.6 (64-bit runtime)
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.4
[pip3] torch==1.8.1
[conda] Could not collect
I'm having the same problem on PyTorch 1.8.1 as well.
Received Scalene error: received signal SIGSEGV when using TensorFlow.
Attaching the code for reference:
import tensorflow as tf

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(512, activation=tf.nn.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)
TensorFlow version used, with Python 3.8:
tensorflow==2.6.0
Same problem. Any update?
I could potentially have a look at this, but I don't have a lot of experience with the PyTorch codebase. It'd be lovely if someone with more experience there could point us in the right direction, at least.
FWIW this is now working for me.
Collecting environment information...
PyTorch version: 1.13.0.dev20220521
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 12.5 (arm64)
GCC version: Could not collect
Clang version: 13.1.6 (clang-1316.0.21.2.5)
CMake version: version 3.23.2
Libc version: N/A
Python version: 3.9.13 (main, May 24 2022, 21:13:51) [Clang 13.1.6 (clang-1316.0.21.2)] (64-bit runtime)
Python platform: macOS-12.5-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] mypy==0.920
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.1
[pip3] torch==1.13.0.dev20220521
[pip3] torchaudio==0.11.0
[pip3] torchvision==0.12.0
Same problem.
Same issue with PyTorch 2.0.0 and Python 3.11 on Mac.