XBraid / xbraid

XBraid Parallel-in-Time Solvers

Running all examples in parallel in trimgrit branch causes mysterious errors

Torrencem opened this issue

When I run mpirun -np <anything other than 1> ex-<1, 2 or 4> in the examples folder on the trimgrit branch, I get mysterious runtime errors (usually double frees):

Example 1:


  Braid: Begin simulation, 10 time steps
free(): double free detected in tcache 2
[XPS-8700:55570] *** Process received signal ***
[XPS-8700:55570] Signal: Aborted (6)
[XPS-8700:55570] Signal code:  (-6)
  Braid: || r_0 || not available, wall time = 1.15e-04
[XPS-8700:55570] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7fad9fcda210]
[XPS-8700:55570] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fad9fcda18b]
[XPS-8700:55570] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fad9fcb9859]
[XPS-8700:55570] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x903ee)[0x7fad9fd243ee]
[XPS-8700:55570] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x9847c)[0x7fad9fd2c47c]
[XPS-8700:55570] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x9a0ed)[0x7fad9fd2e0ed]
[XPS-8700:55570] [ 6] ex-01(+0x2745)[0x55fe646de745]
[XPS-8700:55570] [ 7] ex-01(+0x141a5)[0x55fe646f01a5]
[XPS-8700:55570] [ 8] ex-01(+0x8c58)[0x55fe646e4c58]
[XPS-8700:55570] [ 9] ex-01(+0xa8ae)[0x55fe646e68ae]
[XPS-8700:55570] [10] ex-01(+0x83ed)[0x55fe646e43ed]
[XPS-8700:55570] [11] ex-01(+0x2e85)[0x55fe646dee85]
[XPS-8700:55570] [12] ex-01(+0x2b2c)[0x55fe646deb2c]
[XPS-8700:55570] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fad9fcbb0b3]
[XPS-8700:55570] [14] ex-01(+0x250e)[0x55fe646de50e]
[XPS-8700:55570] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node XPS-8700 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

Example 2:


  Braid: Begin simulation, 64 time steps
  Braid: || r_0 || = 4.360990e+00, conv factor = 1.00e+00, wall time = 7.43e-04
free(): double free detected in tcache 2
[XPS-8700:55640] *** Process received signal ***
[XPS-8700:55640] Signal: Aborted (6)
[XPS-8700:55640] Signal code:  (-6)
[XPS-8700:55640] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7f1ebecdb210]
[XPS-8700:55640] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f1ebecdb18b]
[XPS-8700:55640] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f1ebecba859]
[XPS-8700:55640] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x903ee)[0x7f1ebed253ee]
[XPS-8700:55640] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x9847c)[0x7f1ebed2d47c]
[XPS-8700:55640] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x9a0ed)[0x7f1ebed2f0ed]
[XPS-8700:55640] [ 6] ex-02(+0x36dc)[0x561b96ccc6dc]
[XPS-8700:55640] [ 7] ex-02(+0x19326)[0x561b96ce2326]
[XPS-8700:55640] [ 8] ex-02(+0xddd9)[0x561b96cd6dd9]
[XPS-8700:55640] [ 9] ex-02(+0xfa2f)[0x561b96cd8a2f]
[XPS-8700:55640] [10] ex-02(+0xd56e)[0x561b96cd656e]
[XPS-8700:55640] [11] ex-02(+0x5149)[0x561b96cce149]
[XPS-8700:55640] [12] ex-02(+0x4da8)[0x561b96ccdda8]
[XPS-8700:55640] [13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f1ebecbc0b3]
[XPS-8700:55640] [14] ex-02(+0x262e)[0x561b96ccb62e]
[XPS-8700:55640] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node XPS-8700 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

More details: I'm running Ubuntu 20, and I can provide a gdb stack trace if needed. This definitely does not happen on master. I can also check later which commit is the cause.

Thanks @Torrencem! I'll look into it. I need to bring this branch in line with the recent changes on master anyway, so it will be a good time to verify the MGRIT code here (I've been focusing on TriMGRIT on this branch). I think it will turn out to be something minor.

If it helps, I just tested: it appears to work at commit 2ba3ff3 but breaks at commit f261355. Thanks for taking a look!

Also, since I had already been debugging this earlier (I thought the problem was in my own code), here's a gdb stack trace from a double free in example 1 with -np 2:

Thread 1 "ex-01" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7b69859 in __GI_abort () at abort.c:79
#2  0x00007ffff7bd43ee in __libc_message (action=action@entry=do_abort, 
    fmt=fmt@entry=0x7ffff7cfe285 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x00007ffff7bdc47c in malloc_printerr (
    str=str@entry=0x7ffff7d005d0 "free(): double free detected in tcache 2")
    at malloc.c:5347
#4  0x00007ffff7bde0ed in _int_free (av=0x7ffff7d2fb80 <main_arena>, 
    p=0x5555556ecde0, have_lock=0) at malloc.c:4201
#5  0x0000555555556745 in my_Free (app=0x55555569f170, u=0x5555556ecdf0)
    at ex-01.c:137
#6  0x00005555555681a5 in _braid_BaseFree (core=0x5555556ec710, 
    app=0x55555569f170, u=0x5555556eccd0) at base.c:253
#7  0x000055555555cc58 in _braid_GridClean (core=0x5555556ec710, 
    grid=0x5555556eca60) at grid.c:85
#8  0x000055555555e8ae in _braid_FInterp (core=0x5555556ec710, level=1)
    at interp.c:143
#9  0x000055555555c3ed in _braid_Drive (core=0x5555556ec710, localtime=0)
    at drive.c:532
#10 0x0000555555556e85 in braid_Drive (core=0x5555556ec710) at braid.c:160
#11 0x0000555555556b2c in main (argc=1, argv=0x7fffffffda08) at ex-01.c:258

This is definitely helpful. I should be able to fix this quickly, but it will have to wait until a little later today. Thanks again!

This should be fixed now. The change works differently from what is in the master branch, but I think it is a better approach. I checked all of the regression tests except for the MFEM-related ones (I'll check those later, when it's time to merge into master). Thanks again @Torrencem!