pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Home Page: https://pytorch.org


Segmentation fault with ITIMER_REAL

sternj opened this issue

🐛 Bug

PyTorch throws SIGSEGV when running alongside an ITIMER_REAL timer on macOS x86

To Reproduce

Steps to reproduce the behavior:

  1. Run the code located here on Mac x86 (a minimal sketch of a similar setup is shown below)

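The reproduction script itself is only linked above, not inlined. As an illustration, a minimal sketch of this kind of setup (not the original linked script; names are hypothetical) arms a repeating ITIMER_REAL timer with a Python-level SIGALRM handler and then runs multi-threaded PyTorch CPU work:

import signal
import torch

# Hypothetical sketch: a repeating real-time timer delivers SIGALRM while
# PyTorch executes parallel CPU kernels. On affected builds (e.g. 1.8.1 on
# macOS x86_64) this kind of workload is reported to crash with SIGSEGV
# inside a worker thread.
tick_count = 0

def on_alarm(signum, frame):
    global tick_count
    tick_count += 1  # the handler itself does almost nothing

signal.signal(signal.SIGALRM, on_alarm)
signal.setitimer(signal.ITIMER_REAL, 0.01, 0.01)  # fire every 10 ms

a = torch.randn(1024, 1024)
b = torch.randn(1024, 1024)
for _ in range(500):
    (a @ b).sum().item()  # parallelized matmul/reduction on the CPU thread pool

signal.setitimer(signal.ITIMER_REAL, 0.0)  # disarm the timer
print(f"finished with {tick_count} SIGALRM ticks handled")
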
Here is the stack trace from the crashed thread:

Thread 6 Crashed:
0   ???                           	0x00007ffeee6d7138 0 + 140732898570552
1   libtorch_cpu.dylib            	0x000000010392478c at::TensorIteratorBase::serial_for_each(c10::function_ref<void (char**, long long const*, long long, long long)>, at::Range) const + 588
2   libtorch_cpu.dylib            	0x000000010390cdf2 std::__1::__function::__func<at::internal::_parallel_run(long long, long long, long long, std::__1::function<void (long long, long long, unsigned long)> const&)::$_1, std::__1::allocator<at::internal::_parallel_run(long long, long long, long long, std::__1::function<void (long long, long long, unsigned long)> const&)::$_1>, void (int, unsigned long)>::operator()(int&&, unsigned long&&) + 114
3   libtorch_cpu.dylib            	0x000000010390b7ca std::__1::__function::__func<at::(anonymous namespace)::_run_with_pool(std::__1::function<void (int, unsigned long)> const&, unsigned long)::$_3, std::__1::allocator<at::(anonymous namespace)::_run_with_pool(std::__1::function<void (int, unsigned long)> const&, unsigned long)::$_3>, void ()>::operator()() + 42
4   libc10.dylib                  	0x00000001020996c9 c10::ThreadPool::main_loop(unsigned long) + 569
5   libc10.dylib                  	0x0000000102099d43 void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, c10::ThreadPool::ThreadPool(int, int, std::__1::function<void ()>)::$_0> >(void*) + 67
6   libsystem_pthread.dylib       	0x00007fff5a16a2eb _pthread_body + 126
7   libsystem_pthread.dylib       	0x00007fff5a16d249 _pthread_start + 66
8   libsystem_pthread.dylib       	0x00007fff5a16940d thread_start + 13

Expected behavior

The program should either run without issue or pass the SIGALRM up to the Python-level handler, rather than crash with SIGSEGV.

Environment

Collecting environment information...
PyTorch version: 1.8.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 10.14.6 (x86_64)
GCC version: Could not collect
Clang version: 11.0.0 (clang-1100.0.33.12)
CMake version: Could not collect

Python version: 3.9 (64-bit runtime)
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.1
[pip3] torch==1.8.1
[conda] Could not collect

Reproduced on my machine as well. I get no segfaults when I run with version 1.5.1, but with 1.8.1, it segfaults on most executions.

Collecting environment information...
PyTorch version: 1.8.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 11.2.3 (x86_64)
GCC version: Could not collect
Clang version: 12.0.0 (clang-1200.0.32.29)
CMake version: version 3.19.1

Python version: 3.6 (64-bit runtime)
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.4
[pip3] torch==1.8.1
[conda] Could not collect

I'm having the same problem on PyTorch 1.8.1 as well.

Received a Scalene error ("received signal SIGSEGV") when using TensorFlow as well. Scalene profiles by sampling with interval-timer signals, the same mechanism implicated in this issue.
Attaching the code for reference:

import tensorflow as tf

mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(512, activation=tf.nn.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax),
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)

Version of TensorFlow being used with Python 3.8:
tensorflow==2.6.0

Same problem, any update?

I could potentially have a look at this, but I don't have a lot of experience with the PyTorch codebase. It'd be lovely if someone with more experience there could point us in the right direction, at least.

FWIW this is now working for me.

Collecting environment information...
PyTorch version: 1.13.0.dev20220521
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 12.5 (arm64)
GCC version: Could not collect
Clang version: 13.1.6 (clang-1316.0.21.2.5)
CMake version: version 3.23.2
Libc version: N/A

Python version: 3.9.13 (main, May 24 2022, 21:13:51)  [Clang 13.1.6 (clang-1316.0.21.2)] (64-bit runtime)
Python platform: macOS-12.5-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy==0.920
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.1
[pip3] torch==1.13.0.dev20220521
[pip3] torchaudio==0.11.0
[pip3] torchvision==0.12.0

Same problem.

Same issue with PyTorch 2.0.0, Python 3.11 on Mac.