pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Home Page: https://pytorch.org


Segmentation fault with ITIMER_REAL

sternj opened this issue

🐛 Bug

PyTorch throws SIGSEGV when running alongside an ITIMER_REAL timer on macOS x86

To Reproduce

Steps to reproduce the behavior:

  1. Run the code located here on Mac x86 (a minimal sketch of a similar setup is shown below)

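The reproduction script itself is only linked above, not inlined. As an illustration, a minimal sketch of this kind of setup (not the original linked script; names are hypothetical) arms a repeating ITIMER_REAL timer with a Python-level SIGALRM handler and then runs multi-threaded PyTorch CPU work:

import signal
import torch

# Hypothetical sketch: a repeating real-time timer delivers SIGALRM while
# PyTorch executes parallel CPU kernels. On affected builds (e.g. 1.8.1 on
# macOS x86_64) this kind of workload is reported to crash with SIGSEGV
# inside a worker thread.
tick_count = 0

def on_alarm(signum, frame):
    global tick_count
    tick_count += 1  # the handler itself does almost nothing

signal.signal(signal.SIGALRM, on_alarm)
signal.setitimer(signal.ITIMER_REAL, 0.01, 0.01)  # fire every 10 ms

a = torch.randn(1024, 1024)
b = torch.randn(1024, 1024)
for _ in range(500):
    (a @ b).sum().item()  # parallelized matmul/reduction on the CPU thread pool

signal.setitimer(signal.ITIMER_REAL, 0.0)  # disarm the timer
print(f"finished with {tick_count} SIGALRM ticks handled")
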
Here is the stack trace from the crashed thread:

Thread 6 Crashed:
0   ???                           	0x00007ffeee6d7138 0 + 140732898570552
1   libtorch_cpu.dylib            	0x000000010392478c at::TensorIteratorBase::serial_for_each(c10::function_ref<void (char**, long long const*, long long, long long)>, at::Range) const + 588
2   libtorch_cpu.dylib            	0x000000010390cdf2 std::__1::__function::__func<at::internal::_parallel_run(long long, long long, long long, std::__1::function<void (long long, long long, unsigned long)> const&)::$_1, std::__1::allocator<at::internal::_parallel_run(long long, long long, long long, std::__1::function<void (long long, long long, unsigned long)> const&)::$_1>, void (int, unsigned long)>::operator()(int&&, unsigned long&&) + 114
3   libtorch_cpu.dylib            	0x000000010390b7ca std::__1::__function::__func<at::(anonymous namespace)::_run_with_pool(std::__1::function<void (int, unsigned long)> const&, unsigned long)::$_3, std::__1::allocator<at::(anonymous namespace)::_run_with_pool(std::__1::function<void (int, unsigned long)> const&, unsigned long)::$_3>, void ()>::operator()() + 42
4   libc10.dylib                  	0x00000001020996c9 c10::ThreadPool::main_loop(unsigned long) + 569
5   libc10.dylib                  	0x0000000102099d43 void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, c10::ThreadPool::ThreadPool(int, int, std::__1::function<void ()>)::$_0> >(void*) + 67
6   libsystem_pthread.dylib       	0x00007fff5a16a2eb _pthread_body + 126
7   libsystem_pthread.dylib       	0x00007fff5a16d249 _pthread_start + 66
8   libsystem_pthread.dylib       	0x00007fff5a16940d thread_start + 13

Expected behavior

The program should either run without issue or pass the SIGALRM up to the Python-level handler, rather than crash with SIGSEGV.

Environment

Collecting environment information...
PyTorch version: 1.8.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 10.14.6 (x86_64)
GCC version: Could not collect
Clang version: 11.0.0 (clang-1100.0.33.12)
CMake version: Could not collect

Python version: 3.9 (64-bit runtime)
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.1
[pip3] torch==1.8.1
[conda] Could not collect

Reproduced on my machine as well. I get no segfaults when I run with version 1.5.1, but with 1.8.1, it segfaults on most executions.

Collecting environment information...
PyTorch version: 1.8.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 11.2.3 (x86_64)
GCC version: Could not collect
Clang version: 12.0.0 (clang-1200.0.32.29)
CMake version: version 3.19.1

Python version: 3.6 (64-bit runtime)
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.4
[pip3] torch==1.8.1
[conda] Could not collect

I'm having the same problem on PyTorch 1.8.1 as well.

Received a Scalene error ("received signal SIGSEGV") when using TensorFlow as well. Scalene profiles by sampling with interval-timer signals, the same mechanism implicated in this issue.
Attaching the code for reference:

import tensorflow as tf

mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(512, activation=tf.nn.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax),
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)

Version of TensorFlow being used with Python 3.8:
tensorflow==2.6.0

Same problem, any update?

I could potentially have a look at this, but I don't have a lot of experience with the PyTorch codebase. It'd be lovely if someone with more experience there could point us in the right direction, at least.

FWIW this is now working for me.

Collecting environment information...
PyTorch version: 1.13.0.dev20220521
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 12.5 (arm64)
GCC version: Could not collect
Clang version: 13.1.6 (clang-1316.0.21.2.5)
CMake version: version 3.23.2
Libc version: N/A

Python version: 3.9.13 (main, May 24 2022, 21:13:51)  [Clang 13.1.6 (clang-1316.0.21.2)] (64-bit runtime)
Python platform: macOS-12.5-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy==0.920
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.1
[pip3] torch==1.13.0.dev20220521
[pip3] torchaudio==0.11.0
[pip3] torchvision==0.12.0

Same problem.

Same issue with PyTorch 2.0.0, Python 3.11 on Mac.