Crashes when running predict_instances on a 3D image
ximion opened this issue
Hi!
Recently, with some newly trained models, but on data on which StarDist has worked before, I am observing crashes when running model.predict_instances (where model is a StarDist3D instance).
The inference will run to 100% completion, but then after being almost done, the software will crash with a segmentation fault.
To reproduce
The code used is basically from the example:

```python
model = StarDist3D(None, name=model_name, basedir=model_dir)
img = normalize(img_input, 1, 99.8, axis=axis_norm)
labels, details = model.predict_instances(img, n_tiles=model._guess_n_tiles(img))
```
This also worked fine before; the only thing that has changed is that more labeled data was added for model training. The model itself was trained without issues.
Expected behavior
I would expect no crash, or at least an actionable message on what to do next, in case there was something wrong with the input data, instead of a segmentation fault.
Data and screenshots
Here is a debug backtrace of the crash - fortunately Stardist had debug symbols:
```
Thread 149054 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffafdffb6c0 (LWP 1798761)]
0x00007ffeef9cf6b8 in _Z27_COMMON_polyhedron_to_labelPKfS0_S0_PKiiiiS2_iiiiiiiPi._omp_fn.0(void) () at stardist/lib/stardist3d_impl.cpp:1516
1516    stardist/lib/stardist3d_impl.cpp: No such file or directory.
(gdb) bt full
#0  0x00007ffeef9cf6b8 in _Z27_COMMON_polyhedron_to_labelPKfS0_S0_PKiiiiS2_iiiiiiiPi._omp_fn.0(void) () at stardist/lib/stardist3d_impl.cpp:1516
        new_label_if_labeled = <optimized out>
        inside = <optimized out>
        offset = <optimized out>
        x = 2434
        y = 38
        z = 229
        faces = <optimized out>
        n_rays = <optimized out>
        n_faces = <optimized out>
        labels = <optimized out>
        nz = <optimized out>
        ny = 4166
        nx = 4166
        render_mode = 0
        use_overlap_label = 0
        overlap_label = 0
        result = 0x7fef4f756010
        polyverts = <optimized out>
        bbox = {207, 317, 30, 247, 2357, 2626}
        i = <optimized out>
        curr_center = 0x77bcfe60
        hs_convex = {<std::_Vector_base<std::array<double, 4>, std::allocator<std::array<double, 4> > >> = {
            _M_impl = {<std::allocator<std::array<double, 4> >> = {<std::__new_allocator<std::array<double, 4> >> = {<No data fields>}, <No data fields>}, <std::_Vector_base<std::array<double, 4>, std::allocator<std::array<double, 4> > >::_Vector_impl_data> = {_M_start = 0x79857840,
              _M_finish = 0x798585c0, _M_end_of_storage = 0x79858840}, <No data fields>}}, <No data fields>}
        hs_kernel = {<std::_Vector_base<std::array<double, 4>, std::allocator<std::array<double, 4> > >> = {
            _M_impl = {<std::allocator<std::array<double, 4> >> = {<std::__new_allocator<std::array<double, 4> >> = {<No data fields>}, <No data fields>}, <std::_Vector_base<std::array<double, 4>, std::allocator<std::array<double, 4> > >::_Vector_impl_data> = {_M_start = 0x77d68760,
              _M_finish = 0x77d69ee0, _M_end_of_storage = 0x77d6a760}, <No data fields>}}, <No data fields>}
```
Environment:
- StarDist version: 0.8.3
- CSBDeep version: 0.7.3
- TensorFlow version: 2.12.0
- OS: Linux (Debian 12)
- Training & inference run on an AMD EPYC 7443P CPU
If needed, I can probably share some of the training or sample data, but it is fairly large.
Thank you for making Stardist! :-)
Hi @ximion and sorry for the late reply.
> The inference will run to 100% completion, but then after being almost done, the software will crash with a segmentation fault.
The segfault happens in one of StarDist's C++ extensions (used for post-processing) and might be related to shared-library issues and/or OpenMP parallelization.
You can control the number of threads with the environment variable OMP_NUM_THREADS. To debug this, try setting it from within Python, e.g. by pasting this at the very top of your script or notebook:
```python
import os
os.environ['OMP_NUM_THREADS'] = '1'
```
Even if this runs without crashing, it doesn't really solve the problem because the code will run much slower.
@uschmidt83 Hi, sorry for the long delay, I had to wait a bit for some hardware repairs, but everything is fully up and running again (except for StarDist ^^).
I tried your suggestion, but it still crashes the same way as before:
```
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fffdb7a605e in _Z27_COMMON_polyhedron_to_labelPKfS0_S0_PKiiiiS2_iiiiiiiPi._omp_fn.0(void) () at stardist/lib/stardist3d_impl.cpp:1516
1516    stardist/lib/stardist3d_impl.cpp: No such file or directory.
```
The only observable difference was that it ran slightly slower.
There is also no difference whether I execute it on the GPU or the CPU (I would not have assumed that to matter, but tried it just to be sure).
So, I've tracked the crash down to the qh_fprintf function of Qhull, where the newMessage C array gets implicitly converted into a std::string, which is subsequently deallocated and somehow frees memory that was never actually allocated. From my understanding of the code, this should not happen, so I suspect some stack corruption from elsewhere that is not caused by OpenMP (the bug happens even when compiling without it). So it's either a thread race condition in some other part of the code, or a good old regular stack corruption. Very fun.
It probably appeared now because some piece of the code relies on undefined behavior, and a compiler update has triggered it (at least that's my guess, I could be completely wrong).
It would likely take an extended Valgrind session to track this down, which is extra painful because TensorFlow is in the mix (and will yield tons of reports that may or may not be false positives), as is Python, and on top of that, even a small example allocates huge chunks of memory.
Somehow, though, upgrading Qhull made the issue appear only extremely rarely, and adding guards around the QhullQh message functions made me not see the crash again (indicating threading issues, but I'd be careful with that conclusion - this might just be coincidental). Unfortunately, AddressSanitizer is still unhappy with the code, so the problem likely wasn't fixed, just made to not manifest, or to manifest less often.
Regardless, applying the above PR resolves(-ish) the issue for me! :-)
One thing to note is that the code always crashes in qh_fprintf - that, together with the fact that the original stack trace pointed to some other place, makes me think we have memory corruption. AddressSanitizer didn't catch any bad writes though, so it may actually be a different library that I haven't checked that is corrupting memory here (because ASan really should have caught that!). So, looking into the places that call the message functions would be a good next step, followed by, unfortunately, Valgrind.
ASan trace for reference:

```
AddressSanitizer: CHECK failed: asan_errors.cpp:109 "((free_stack->size)) > ((0))" (0x0, 0x0) (tid=249280)
    #0 0x7f7befec22aa in CheckUnwind ../../../../src/libsanitizer/asan/asan_rtl.cpp:67
    #1 0x7f7befee2c55 in __sanitizer::CheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) ../../../../src/libsanitizer/sanitizer_common/sanitizer_termination.cpp:86
    #2 0x7f7befe3078c in __asan::ErrorFreeNotMalloced::Print() ../../../../src/libsanitizer/asan/asan_errors.cpp:109
    #3 0x7f7befec1e27 in __asan::ScopedInErrorReport::~ScopedInErrorReport() ../../../../src/libsanitizer/asan/asan_report.cpp:141
    #4 0x7f7befebef1e in __asan::ReportFreeNotMalloced(unsigned long, __sanitizer::BufferedStackTrace*) ../../../../src/libsanitizer/asan/asan_report.cpp:239
    #5 0x7f7befeb9ed7 in operator delete(void*) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:152
    #6 0x7f7bed967399 in std::__new_allocator<char>::deallocate(char*, unsigned long) /usr/include/c++/12/bits/new_allocator.h:158
    #7 0x7f7bed967399 in std::allocator_traits<std::allocator<char> >::deallocate(std::allocator<char>&, char*, unsigned long) /usr/include/c++/12/bits/alloc_traits.h:496
    #8 0x7f7bed967399 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_destroy(unsigned long) /usr/include/c++/12/bits/basic_string.h:292
    #9 0x7f7bed967399 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_dispose() /usr/include/c++/12/bits/basic_string.h:286
    #10 0x7f7bed967399 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() /usr/include/c++/12/bits/basic_string.h:795
    #11 0x7f7bed967399 in qh_fprintf stardist/lib/external/qhull_src/src/libqhullcpp/QhullUser.cpp:220
    #12 0x7f7bed87c359 in qh_sethalfspace stardist/lib/external/qhull_src/src/libqhull_r/geom2_r.c:2051
    #13 0x7f7bed87cb4a in qh_sethalfspace_all stardist/lib/external/qhull_src/src/libqhull_r/geom2_r.c:2104
    #14 0x7f7bed949c63 in orgQhull::Qhull::runQhull(char const*, int, int, double const*, char const*) stardist/lib/external/qhull_src/src/libqhullcpp/Qhull.cpp:326
    #15 0x7f7bed98e150 in qhull_volume_halfspace_intersection(double const*, double const*, int, float) stardist/lib/stardist3d_impl.cpp:692
    #16 0x7f7bed986f59 in qhull_overlap_kernel(float const*, float const*, float const*, float const*, int const*, int, int, int) stardist/lib/stardist3d_impl.cpp:862
    #17 0x7f7bed986f59 in _COMMON_non_maximum_suppression_sparse(float const*, float const*, float const*, int, int, int, float const*, int const*, float, int, int, int, bool*) [clone ._omp_fn.2] stardist/lib/stardist3d_impl.cpp:1261
    #18 0x7f7bfc7d4d8d (/lib/x86_64-linux-gnu/libgomp.so.1+0x1cd8d)
    #19 0x7f7d04e77043 (/lib/x86_64-linux-gnu/libc.so.6+0x89043)
    #20 0x7f7d04ef75fb (/lib/x86_64-linux-gnu/libc.so.6+0x1095fb)
```
Very interesting: since the issue was back with my large dataset, I decided to comment out the offending message-print lines.
As soon as they were gone, GDB gave me the original crash backtrace again:
```
#0  0x00007ffedf4d76f9 in _Z27_COMMON_polyhedron_to_labelPKfS0_S0_PKiiiiS2_iiiiiiiPi._omp_fn.0(void) () at stardist/lib/stardist3d_impl.cpp:1514
        new_label_if_labeled = <optimized out>
        inside = <optimized out>
        offset = <optimized out>
        x = <optimized out>
        y = 1148
        z = 131
        faces = 0xbe682c0
        n_rays = <optimized out>
        n_faces = 188
        labels = 0x1802d800
        nz = <optimized out>
        ny = 4070
        nx = 4153
        render_mode = 0
        use_overlap_label = 0
        overlap_label = 0
        result = 0x7ff0e9bf2010
        polyverts = 0x619000849e80
        bbox = {127, 135, 1147, 1192, 3717, 3758}
        i = <optimized out>
        curr_center = 0x1c6f9b8c
        hs_convex = {<std::_Vector_base<std::array<double, 4>, std::allocator<std::array<double, 4> > >> = {
            _M_impl = {<std::allocator<std::array<double, 4> >> = {<std::__new_allocator<std::array<double, 4> >> = {<No data fields>}, <No data fields>}, <std::_Vector_base<std::array<double, 4>, std::allocator<std::array<double, 4> > >::_Vector_impl_data> = {_M_start = 0x62500191e100,
              _M_finish = 0x62500191f800, _M_end_of_storage = 0x625001920100}, <No data fields>}}, <No data fields>}
        hs_kernel = {<std::_Vector_base<std::array<double, 4>, std::allocator<std::array<double, 4> > >> = {
            _M_impl = {<std::allocator<std::array<double, 4> >> = {<std::__new_allocator<std::array<double, 4> >> = {<No data fields>}, <No data fields>}, <std::_Vector_base<std::array<double, 4>, std::allocator<std::array<double, 4> > >::_Vector_impl_data> = {_M_start = 0x625000992100,
              _M_finish = 0x625000993880, _M_end_of_storage = 0x625000994100}, <No data fields>}}, <No data fields>}
#1  0x00007ffeef5720b6 in GOMP_parallel () from /lib/x86_64-linux-gnu/libgomp.so.1
No symbol table info available.
#2  0x00007ffedf4e8e6b in _COMMON_polyhedron_to_label (dist=dist@entry=0x1baeb450, points=points@entry=0x1c6f9430, verts=verts@entry=0xa527a80,
    faces=faces@entry=0xbe682c0, n_polys=n_polys@entry=839, n_rays=n_rays@entry=96, n_faces=<optimized out>, labels=<optimized out>,
    nz=<optimized out>, ny=<optimized out>, nx=<optimized out>, render_mode=<optimized out>, verbose=<optimized out>,
    use_overlap_label=<optimized out>, overlap_label=<optimized out>, result=<optimized out>) at stardist/lib/stardist3d_impl.cpp:1460
        curr_dist = 0x1baf9fd0
        curr_center = 0x1c6f9b8c
        hs_convex = <optimized out>
        hs_kernel = <optimized out>
        i = 157
        old_sigint_handler = 0x678c70
        polyverts = 0x619000849e80
        bbox = <optimized out>
#3  0x00007ffedf4be097 in c_polyhedron_to_label (self=<optimized out>, args=<optimized out>) at stardist/lib/stardist3d.cpp:130
        arr_dist = <optimized out>
        arr_points = <optimized out>
        arr_verts = <optimized out>
        arr_faces = <optimized out>
        arr_labels = <optimized out>
        arr_result = 0x7ffc640fcb10
        nz = <optimized out>
        ny = <optimized out>
        nx = <optimized out>
        render_mode = <optimized out>
        verbose = <optimized out>
        use_overlap_label = <optimized out>
        overlap_label = <optimized out>
        n_polys = 839
        n_rays = 96
        n_faces = 188
        dist = 0x1baeb450
        points = 0x1c6f9430
        verts = 0xa527a80
        faces = 0xbe682c0
        labels = 0x1802d800
        dims_result = <optimized out>
        result = <optimized out>
#4  0x0000000000547ae8 in ?? ()
No symbol table info available.
#5  0x0000000000517eb3 in _PyObject_MakeTpCall ()
```
The offending line is:

```cpp
int new_label_if_labeled = use_overlap_label?overlap_label:result[offset];
```

offset was unfortunately optimized out, but since result appears to be a valid pointer, the offset is probably wrong. Either that, or the array construction is wrong:

```cpp
npy_intp dims_result[3];
dims_result[0] = nz;
dims_result[1] = ny;
dims_result[2] = nx;
// result array
arr_result = (PyArrayObject*)PyArray_ZEROS(3,dims_result,NPY_INT32,0);
int * result = (int*) PyArray_DATA(arr_result);
@maweigert / @uschmidt83 If you could have a look at this, that would be much appreciated. I will probably also have a go at it next week, but you know the code a lot better and can navigate it more easily - so the issue might be more obvious to you. I would love to provide an example to reproduce this, but the crash only appears reliably with a very large dataset.
As soon as I turned off the computer yesterday, I realized the answer to this issue has been staring me right in the face...
It's just a classic integer overflow, easily fixed actually (and one of the really nice traps that C sets for programmers...).
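To illustrate the failure mode, here is a minimal sketch that simulates C `int` arithmetic with NumPy's int32, using the dimensions and coordinates from the gdb backtrace above: with ny = nx = 4166, the flat voxel index z*ny*nx + y*nx + x exceeds INT_MAX for larger z, wraps to a negative value, and result[offset] then reads far outside the allocated array.

```python
import numpy as np

# Dimensions and coordinates as reported in the gdb backtrace above.
ny, nx = 4166, 4166
z, y, x = 229, 38, 2434

# The flat voxel index, computed exactly (Python ints are arbitrary precision).
offset = z * ny * nx + y * nx + x
print(offset)            # 3974583066 - larger than INT_MAX (2147483647)

# The same computation in 32-bit arithmetic, as a C `int` performs it:
# the intermediate product z*ny*nx already wraps, and the index goes negative.
with np.errstate(over="ignore"):
    a = np.array([z, y, x], dtype=np.int32)
    offset32 = a[0] * np.int32(ny) * np.int32(nx) + a[1] * np.int32(nx) + a[2]
print(int(offset32))     # -320384230 - result[offset] is far out of bounds
```

Doing the index arithmetic in a 64-bit type (e.g. npy_intp or long) before the first multiplication avoids the wraparound.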
With the PR above, this issue is fully resolved. ASan still complains, but the program doesn't seem to crash anymore.
I still see a lot of compiler warnings about unused variables and integer comparisons, but those are (so far) unrelated :-)