Running all examples in parallel in trimgrit branch causes mysterious errors
Torrencem opened this issue
When I run mpirun -np <anything other than 1> ex-<1, 2 or 4>
in the examples folder on the trimgrit branch, I get mysterious runtime errors (usually double frees):
Example 1:
Braid: Begin simulation, 10 time steps
free(): double free detected in tcache 2
[XPS-8700:55570] *** Process received signal ***
[XPS-8700:55570] Signal: Aborted (6)
[XPS-8700:55570] Signal code: (-6)
Braid: || r_0 || not available, wall time = 1.15e-04
[XPS-8700:55570] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7fad9fcda210]
[XPS-8700:55570] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fad9fcda18b]
[XPS-8700:55570] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fad9fcb9859]
[XPS-8700:55570] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x903ee)[0x7fad9fd243ee]
[XPS-8700:55570] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x9847c)[0x7fad9fd2c47c]
[XPS-8700:55570] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x9a0ed)[0x7fad9fd2e0ed]
[XPS-8700:55570] [ 6] ex-01(+0x2745)[0x55fe646de745]
[XPS-8700:55570] [ 7] ex-01(+0x141a5)[0x55fe646f01a5]
[XPS-8700:55570] [ 8] ex-01(+0x8c58)[0x55fe646e4c58]
[XPS-8700:55570] [ 9] ex-01(+0xa8ae)[0x55fe646e68ae]
[XPS-8700:55570] [10] ex-01(+0x83ed)[0x55fe646e43ed]
[XPS-8700:55570] [11] ex-01(+0x2e85)[0x55fe646dee85]
[XPS-8700:55570] [12] ex-01(+0x2b2c)[0x55fe646deb2c]
[XPS-8700:55570] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fad9fcbb0b3]
[XPS-8700:55570] [14] ex-01(+0x250e)[0x55fe646de50e]
[XPS-8700:55570] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node XPS-8700 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
Example 2:
Braid: Begin simulation, 64 time steps
Braid: || r_0 || = 4.360990e+00, conv factor = 1.00e+00, wall time = 7.43e-04
free(): double free detected in tcache 2
[XPS-8700:55640] *** Process received signal ***
[XPS-8700:55640] Signal: Aborted (6)
[XPS-8700:55640] Signal code: (-6)
[XPS-8700:55640] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7f1ebecdb210]
[XPS-8700:55640] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f1ebecdb18b]
[XPS-8700:55640] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f1ebecba859]
[XPS-8700:55640] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x903ee)[0x7f1ebed253ee]
[XPS-8700:55640] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x9847c)[0x7f1ebed2d47c]
[XPS-8700:55640] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x9a0ed)[0x7f1ebed2f0ed]
[XPS-8700:55640] [ 6] ex-02(+0x36dc)[0x561b96ccc6dc]
[XPS-8700:55640] [ 7] ex-02(+0x19326)[0x561b96ce2326]
[XPS-8700:55640] [ 8] ex-02(+0xddd9)[0x561b96cd6dd9]
[XPS-8700:55640] [ 9] ex-02(+0xfa2f)[0x561b96cd8a2f]
[XPS-8700:55640] [10] ex-02(+0xd56e)[0x561b96cd656e]
[XPS-8700:55640] [11] ex-02(+0x5149)[0x561b96cce149]
[XPS-8700:55640] [12] ex-02(+0x4da8)[0x561b96ccdda8]
[XPS-8700:55640] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f1ebecbc0b3]
[XPS-8700:55640] [14] ex-02(+0x262e)[0x561b96ccb62e]
[XPS-8700:55640] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node XPS-8700 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
More details: I'm running Ubuntu 20, and I can provide a gdb stack trace if needed. This definitely does not happen on master. I can also check later which commit introduced it.
Thanks @Torrencem ! I'll look into it. I need to bring this in line with the recent changes on master anyway, so it will be a good time to verify the MGRIT code on this branch (I've been focusing on TriMGRIT here). I think it will be something minor.
Also, since I had already been debugging this earlier (I thought it was a problem with my code), here's a gdb stack trace from a double free in example 1 with -np 2:
Thread 1 "ex-01" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007ffff7b69859 in __GI_abort () at abort.c:79
#2 0x00007ffff7bd43ee in __libc_message (action=action@entry=do_abort,
    fmt=fmt@entry=0x7ffff7cfe285 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3 0x00007ffff7bdc47c in malloc_printerr (
    str=str@entry=0x7ffff7d005d0 "free(): double free detected in tcache 2")
    at malloc.c:5347
#4 0x00007ffff7bde0ed in _int_free (av=0x7ffff7d2fb80 <main_arena>,
    p=0x5555556ecde0, have_lock=0) at malloc.c:4201
#5 0x0000555555556745 in my_Free (app=0x55555569f170, u=0x5555556ecdf0)
    at ex-01.c:137
#6 0x00005555555681a5 in _braid_BaseFree (core=0x5555556ec710,
    app=0x55555569f170, u=0x5555556eccd0) at base.c:253
#7 0x000055555555cc58 in _braid_GridClean (core=0x5555556ec710,
    grid=0x5555556eca60) at grid.c:85
#8 0x000055555555e8ae in _braid_FInterp (core=0x5555556ec710, level=1)
    at interp.c:143
#9 0x000055555555c3ed in _braid_Drive (core=0x5555556ec710, localtime=0)
    at drive.c:532
#10 0x0000555555556e85 in braid_Drive (core=0x5555556ec710) at braid.c:160
#11 0x0000555555556b2c in main (argc=1, argv=0x7fffffffda08) at ex-01.c:258
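For anyone reading the trace: it shows the user free routine (my_Free) being reached from a grid cleanup (_braid_GridClean via _braid_BaseFree) on a vector that had already been freed. The thread does not say what the root cause was, so the following is only a generic sketch of that class of bug, not the actual XBraid code or the fix that was applied. It assumes a cleanup routine that can run more than once on the same grid, and shows one common guard: set each slot to NULL after freeing it. Every name here (toy_Grid, toy_Free, toy_GridClean, toy_demo) is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a user vector like the one freed in ex-01. */
typedef struct { double value; } my_Vector;

/* Hypothetical stand-in for a grid that owns one vector pointer per slot. */
typedef struct {
    my_Vector **slots;
    int         nslots;
} toy_Grid;

/* Counts how many times the user free routine actually runs. */
static int free_calls = 0;

/* Hypothetical user free routine: releases one vector. */
static void toy_Free(my_Vector *u)
{
    free_calls++;
    free(u);
}

/* Frees every vector the grid still owns, then clears the slot.
 * Without the NULL assignment, a second call would pass the same
 * stale pointers to free() again, which glibc reports as a double free. */
static void toy_GridClean(toy_Grid *g)
{
    for (int i = 0; i < g->nslots; i++) {
        if (g->slots[i] != NULL) {
            toy_Free(g->slots[i]);
            g->slots[i] = NULL;
        }
    }
}

/* Allocates two vectors, cleans the grid twice, and returns the number
 * of real frees. The second clean finds only NULL slots and does nothing. */
static int toy_demo(void)
{
    my_Vector *slots[2];
    toy_Grid g = { slots, 2 };

    for (int i = 0; i < g.nslots; i++) {
        slots[i] = malloc(sizeof(my_Vector));
        assert(slots[i] != NULL);
    }

    free_calls = 0;
    toy_GridClean(&g);
    toy_GridClean(&g);
    return free_calls;
}
```

Calling toy_demo() from a main function frees each vector exactly once. This guard does not help when two different slots alias the same vector; that case needs clear ownership rules or reference counting instead.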
This is definitely helpful. I should be able to fix this quickly, but it will have to wait until a little later today. Thanks again!
This should be fixed now. The change works differently from what is in the master branch, but I think it is a better approach. I checked all of the regression tests except for the MFEM related ones (I'll check those later, when it's time to merge into master). Thanks again @Torrencem !