[Regression] Likely data race after #161
roastduck opened this issue · comments
After #161, there seems to be a data race, which cause Driver
unable to be constructed concurrently.
PS. Driver
's constructor should be thread-safe, so we can call the backend compiler in parallel. Currently, all the function calls from Driver
's constructor are thread-safe.
To reproduce, change test/30.schedule/test_auto_fuse.py:114
(the second parameter of s.tune_auto_schedule
in the test test_tune
) from 1
to 50
. Run the test, and you will see an error message like
Inconsistency detected by ld.so: dl-open.c: 881: _dl_open: Assertion `_dl_debug_initialize (0, args.nsid)->r_state == RT_CONSISTENT' failed!
This means some data structure inside dlopen
is corrupted.
It works one commit (52b1627) before #161, but fails since #161.
Valgrind's thread analyzer suggest that it may be some co-influences among fork
, dlopen
and OpenMP, but can be other reasons since there are a lot of false positive from Valgrind.
Here is one of the stack traces from Valgrind:
==624918== Thread #79 unlocked lock at 0x402B968 currently held by thread #73
==624918== at 0x48439F7: mutex_unlock_WRK (hg_intercepts.c:1130)
==624918== by 0x48473F4: pthread_mutex_unlock (hg_intercepts.c:1151)
==624918== by 0x4881438: dlsym (dlsym.c:72)
==624918== by 0x391FE513: __kmp_init_target_mem (in /home/spack/spack/opt/spack/linux-debian11-zen2/clang-13.0.1/llvm-13.0.0-skvgau7fl4xbckqfhnb5wqo3fd55fh2k/lib/libomp.so)
==624918== by 0x392128C5: __kmp_do_serial_initialize() (in /home/spack/spack/opt/spack/linux-debian11-zen2/clang-13.0.1/llvm-13.0.0-skvgau7fl4xbckqfhnb5wqo3fd55fh2k/lib/libomp.so)
==624918== by 0x3921C44B: __kmp_serial_initialize (in /home/spack/spack/opt/spack/linux-debian11-zen2/clang-13.0.1/llvm-13.0.0-skvgau7fl4xbckqfhnb5wqo3fd55fh2k/lib/libomp.so)
==624918== by 0x4A9F292: __run_fork_handlers (register-atfork.c:135)
==624918== by 0x4AE5547: fork (fork.c:129)
==624918== by 0x6C7109A: freetensor::Driver::buildAndLoad() (driver.cc:215)
==624918== by 0x6C728C0: freetensor::Driver::Driver(freetensor::Ref<freetensor::FuncNode> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, freetensor::Ref<freetensor::Device> const&, freetensor::Ref<freetensor::Device> const&, bool) (driver.cc:84)
==624918== by 0x6D87929: freetensor::Driver::Driver(freetensor::Ref<freetensor::FuncNode> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, freetensor::Ref<freetensor::Device> const&, bool) (driver.h:62)
==624918== by 0x6D7CF5A: freetensor::Schedule::tuneAutoSchedule(int, int, freetensor::Ref<freetensor::Device> const&, std::vector<freetensor::Ref<freetensor::Array>, std::allocator<freetensor::Ref<freetensor::Array> > > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, freetensor::Ref<freetensor::Array>, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, freetensor::Ref<freetensor::Array> > > > const&, std::__cxx11::basic_regex<char, std::__cxx11::regex_traits<char> > const&)::{lambda(unsigned long)#1}::operator()(unsigned long) const (allocator.h:95)
==624918== Lock at 0x402B968 was first observed
==624918== at 0x48434E5: mutex_lock_WRK (hg_intercepts.c:942)
==624918== by 0x48473D3: pthread_mutex_lock (hg_intercepts.c:958)
==624918== by 0x4013895: _dl_open (dl-open.c:786)
==624918== by 0x4881257: dlopen_doit (dlopen.c:66)
==624918== by 0x4B5225F: _dl_catch_exception (dl-error-skeleton.c:208)
==624918== by 0x4B5231E: _dl_catch_error (dl-error-skeleton.c:227)
==624918== by 0x4881A64: _dlerror_run (dlerror.c:170)
==624918== by 0x48812E3: dlopen@@GLIBC_2.2.5 (dlopen.c:87)
==624918== by 0x61B610: ??? (in /usr/bin/python3.9)
==624918== by 0x61A2C9: ??? (in /usr/bin/python3.9)
==624918== by 0x5298C3: ??? (in /usr/bin/python3.9)
==624918== by 0x517B9A: _PyEval_EvalFrameDefault (in /usr/bin/python3.9)
==624918== Address 0x402b968 is 2312 bytes inside data symbol "_rtld_local"
It seems that dlopen
was holding a lock, but some function called by fork
destroyed the lock. This function was hooked by OpenMP into fork
, which was using dlsym
.
The above result is test against the OpenMP from LLVM 13. OpenMP from GCC 10.2.1 works fine.