ofiwg / libfabric

Open Fabric Interfaces

Home Page: http://libfabric.org/

prov/verbs;ofi_rxm: fi_endpoint call after fork results in guaranteed mlx5 segfault

krehm opened this issue · comments

I am running the Argonne dlio_benchmark with DAOS as the storage backend; specifically, I am using a DAOS dfuse mountpoint plus the LD_PRELOAD=libpil4dfs.so library. Info on the libpil4dfs.so interception library is here, and the libfabric version is 1.20.0. Detailed background information on the setup is available at https://daosio.atlassian.net/browse/DAOS-15117. I am running the unet3d benchmark, which uses pytorch and tries to spawn (fork/exec) 4 python reader processes to read samples in parallel for pytorch to process.

libpil4dfs.so has a child_hdlr() function which gets called in the child process after a fork as the result of a pthread_atfork() call. The function makes a series of fi_fabric(), fi_domain() and fi_endpoint() calls to create a new infiniband DAOS endpoint, as the prior endpoint is still owned by the parent process. The fi_endpoint() call fails with a segfault every time.
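
For reference, here is a minimal sketch of that pattern; this is not the actual libpil4dfs code, and the helper names and the reuse of a saved fi_info are assumptions for illustration only.

/* Sketch only: hypothetical names, not the libpil4dfs child_hdlr() code.
 * The saved fi_info is assumed to come from an earlier fi_getinfo() call. */
#include <pthread.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

static struct fi_info    *saved_info;   /* from fi_getinfo() at startup */
static struct fid_fabric *fabric;
static struct fid_domain *domain;
static struct fid_ep     *ep;

/* Runs in the child after fork(); the parent's endpoint cannot be reused,
 * so a fresh fabric/domain/endpoint is created. */
static void reopen_fabric_after_fork(void)
{
        if (fi_fabric(saved_info->fabric_attr, &fabric, NULL))
                return;
        if (fi_domain(fabric, saved_info, &domain, NULL))
                return;
        fi_endpoint(domain, saved_info, &ep, NULL);  /* segfaults here per the report */
}

/* Registered once during initialization, as libpil4dfs does with child_hdlr(). */
void install_fork_handler(void)
{
        pthread_atfork(NULL, NULL, reopen_fabric_after_fork);
}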

The mlx5 driver allocates pages for the creation of SRQs with mlx5_alloc_srq_buf(), which calls ibv_dontfork_range() on the allocated page range. After the fork, the new fi_endpoint() call invokes __ibv_create_srq_1_1() to create an SRQ as part of endpoint setup; because of the earlier dontfork call, this is a guaranteed segfault.

Since libpil4dfs.so makes fresh fi_fabric(), fi_domain() and fi_endpoint() calls after the fork, one would expect the new connection to succeed. However, the verbs provider relies on librdmacm, which caches ibv_device structures in a static list called cma_dev_list. The dontfork pages allocated by mlx5 hang off the ibv_device for the requested domain. When fi_domain() attempts to create a new domain, vrb_open_device_by_name() is called; it finds the cached ibv_device that was created before the fork and associates it with the new domain. So the pages to be used for creating SRQs for the new connection are guaranteed not to be in the child's memory map, and a segfault occurs.

Here is a traceback showing the re-use of an ibv_device from prior to the fork when the fresh fi_domain call is made. The SRQ pages associated with the ibv_device are not in the child process's memory.

#0  rdma_get_devices (num_devices=0x0) at ../librdmacm/cma.c:504
#1  0x00007f69c8380de6 in vrb_open_device_by_name (domain=0x56538ae12f10,
    name=0x56538b3254b0 "mlx5_1") at prov/verbs/src/verbs_domain.c:241
#2  0x00007f69c83811ff in vrb_domain (fabric=0x56538ae127b0, info=0x565388bf3cd0,
    domain=0x56538ae12e80, context=0x0) at prov/verbs/src/verbs_domain.c:351
#3  0x00007f69c83c4e7c in fi_domain (fabric=0x56538ae127b0, info=0x565388bf3cd0,
    domain=0x56538ae12e80, context=0x0) at ./include/rdma/fi_domain.h:356
#4  0x00007f69c83c776f in rxm_domain_open (fabric=0x56538ae03d30,
    info=0x56538b1b38d0, domain=0x56538b115288, context=0x0)
    at prov/rxm/src/rxm_domain.c:880
#5  0x00007f69c8b0ef48 in fi_domain (context=0x0, domain=0x56538b115288,
    info=0x56538b1b38d0, fabric=<optimized out>)
    at /mnt/nvm/rehm/build/install/prereq/release/ofi/include/rdma/fi_domain.h:356

The subsequent fi_endpoint() call then segfaults here:

#0  mlx5_create_srq (pd=0x563efa705de0, attr=0x7ffd1a9d79d0) at ../providers/mlx5/verbs.c:1371
#1  0x00007fe66a39e8c7 in __ibv_create_srq_1_1 (pd=0x563efa705de0, srq_init_attr=0x7ffd1a9d79d0)
    at ../libibverbs/ibverbs.h:87
#2  0x00007fe66b01b2ee in vrb_srq_context (domain=0x563efac1c4f0, attr=0x563efa6e3490,
    srx_fid=0x563efac20a88, context=0x0) at prov/verbs/src/verbs_ep.c:1804
#3  0x00007fe66b0518fe in fi_srx_context (domain=0x563efac1c4f0, attr=0x563efa6e3490, rx_ep=0x563efac20a88,
    context=0x0) at ./include/rdma/fi_endpoint.h:290
#4  0x00007fe66b057526 in rxm_open_core_res (ep=0x563efac1d880) at prov/rxm/src/rxm_ep.c:1743
#5  0x00007fe66b057b71 in rxm_endpoint (domain=0x563efac1c360, info=0x563efa989c30, ep_fid=0x563ef8158580,
    context=0x0) at prov/rxm/src/rxm_ep.c:1913
#6  0x00007fe66b7946e3 in fi_endpoint (context=0x0, ep=0x563ef8158580, info=0x563efa989c30,
    domain=<optimized out>) at /mnt/nvm/rehm/build/install/prereq/release/ofi/include/rdma/fi_endpoint.h:178

It seems to me that after a fork, vrb_open_device_by_name() shouldn't return a pointer to an ibv_device created prior to the fork; a new ibv_device should be created after the fork, in which case the SRQ pages allocated by mlx5 will be in the child's memory and will work.
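
One possible shape for such a check, purely as a hypothetical sketch (this is not the libfabric or librdmacm code, and the cache variables are illustrative): remember which pid built the cached device list and rebuild it when the pid changes.

#include <unistd.h>
#include <sys/types.h>
#include <infiniband/verbs.h>

static struct ibv_device **cached_devs;   /* hypothetical cached list */
static pid_t                cache_owner_pid;

static struct ibv_device **get_device_cache(void)
{
        if (cached_devs && cache_owner_pid != getpid()) {
                /* A fork happened since the cache was built; the mlx5 pages
                 * hanging off these devices may be marked MADV_DONTFORK, so
                 * drop the stale list and rebuild it in the child. */
                ibv_free_device_list(cached_devs);
                cached_devs = NULL;
        }
        if (!cached_devs) {
                cached_devs = ibv_get_device_list(NULL);
                cache_owner_pid = getpid();
        }
        return cached_devs;
}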

The OS is Rocky Linux 8.7.

As a sanity test, I temporarily commented out the ibv_dontfork_range() call in mlx5_alloc_srq_buf(), and the fresh fi_endpoint call in the child process then succeeds.

Hi Kevan. Thank you for the detailed bug report and for describing what you have tried. I am also tracking https://daosio.atlassian.net/browse/DAOS-15117, but it seems the latest info is here. A few things to clarify: libfabric depends on libibverbs and librdmacm from rdma-core; they are not part of libfabric. If you believe there are issues in those libraries (and/or the mlx5 user-space driver), please email linux-rdma@vger.kernel.org. With that out of the way, I'm going through the code to see what's going on with ibv_device.

You make a good point; I was assuming that rdma/mlx5 was part of libfabric, but it isn't. And to make matters more complicated, I debugged this issue by downloading rdma-core from GitHub and installing it in my python virtual environment so that I could use gdb and view the code as I debugged. But in production on this machine we use MOFED, so it is likely that the mlx5 driver used in production is not the one I debugged. (Later: no, libmlx5.so.1 is still from the libibverbs rpm, not MOFED.)

I understand if you think this ticket should be closed. I will need to open a ticket with linux-rdma, and/or with Mellanox instead.

You can keep the issue open here. There may be things we need to do in libfabric to make fork better/easier. I am open to ideas.

Is it possible to share your changes to DAOS stack starting from child_hdlr? From my reading of the code and description of your changes, I believe it should work. You can find my email in libfabric git (git grep chien) if you need to get in touch privately.

The only modifications are the addition of print statements, plus a small spin loop in child_hdlr() that waits for a /tmp/ file to be removed, so that I can attach gdb and debug the failure.

diff --git a/src/client/dfuse/pil4dfs/int_dfs.c b/src/client/dfuse/pil4dfs/int_dfs.c
index 80ff59264..828de857b 100644
--- a/src/client/dfuse/pil4dfs/int_dfs.c
+++ b/src/client/dfuse/pil4dfs/int_dfs.c
@@ -7,6 +7,7 @@
 #define D_LOGFAC     DD_FAC(il)

 #include <stdio.h>
+#include <stdlib.h>
 #include <dirent.h>
 #include <dlfcn.h>
 #include <sys/types.h>
@@ -902,6 +903,14 @@ child_hdlr(void)
        if (!daos_inited)
                return;

+       fprintf(stderr, "child_hdlr() after daos_inited\n");
+       rc = system("touch /tmp/rehm_sleep");
+       struct stat statb;
+       while (stat("/tmp/rehm_sleep", &statb) == 0) {
+               fprintf(stderr, "sleep\n");
+               sleep(1);
+       }
+
        daos_eq_lib_reset_after_fork();
        daos_dti_reset();
        td_eqh = main_eqh = DAOS_HDL_INVAL;
@@ -911,6 +920,7 @@ child_hdlr(void)
        else
                main_eqh = td_eqh;
        context_reset = true;
+       fprintf(stderr, "KREHM: child_hdlr() at end\n");
 }

 /** determine whether a path (both relative and absolute) is on DAOS or not. If yes,
@@ -1020,6 +1030,7 @@ query_path(const char *szInput, int *is_target_path, dfs_obj_t **parent, char *i
                                main_eqh = td_eqh;
                                rc       = pthread_atfork(NULL, NULL, &child_hdlr);
                                D_ASSERT(rc == 0);
+                               fprintf(stderr, "KREHM: pthread_atfork(NULL, NULL, &child_hdlr)\n");
                        }

                        daos_inited = true;

The benchmark run looks like the following; IBV_FORK_SAFE is set.

mpirun -x IBV_FORK_SAFE=1 -x OMP_NUM_THREADS=1 -x MALLOC_CHECK_=3 -x LD_PRELOAD=/mnt/nvm/rehm/build/install/lib64/libpil4dfs.so -np 1 --map-by socket:PE=4 --display-map python3 dlio_benchmark/dlio_benchmark/main.py --config-path=/mnt/nvm/rehm/storage/storage-conf workload=unet3d ++workload.workflow.generate_data=False ++workload.workflow.train=True  ++workload.workflow.profiling=False ++workload.profiling.profiler=none ++hydra.output_subdir=configs ++hydra.run.dir=unet3d_results

I can tell that IBV_FORK_SAFE is working, because at the breakpoint in child_hdlr() the variable mm_root is non-NULL; ibv_fork_init() has allocated and filled it in.

Sorry, I was not clear. I was thinking of the call stack (from your other ticket in DAOS): all the changes starting from child_hdlr() -> daos_eq_lib_reset_after_fork() -> daos_eq_lib_init() -> ...
I want to recreate your seg fault.

Breakpointing in the libfabric file prov/verbs/src/verbs_domain.c, function vrb_open_device_by_name(), the code goes around the loop twice: the first device is mlx5_0, not a match; the second is mlx5_1, which matches, so domain->verbs gets set to dev_list[i], the ibv_context * for the mlx5_1 entry.

The list being searched comes from rdma_get_devices(), which is walking the cma_dev_list, which is unchanged from before the fork.
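
For illustration, the lookup works roughly like the sketch below (a simplification, not the actual vrb_open_device_by_name() source): each entry handed back by rdma_get_devices() is an ibv_context that librdmacm opened when it first scanned the devices, i.e. possibly before the fork.

#include <string.h>
#include <rdma/rdma_cma.h>

static struct ibv_context *find_verbs_ctx(const char *name)
{
        int num = 0, i;
        struct ibv_context *found = NULL;
        struct ibv_context **list = rdma_get_devices(&num);

        if (!list)
                return NULL;
        for (i = 0; i < num; i++) {
                /* The contexts come from the librdmacm cma_dev_list cache,
                 * which is untouched by fork(). */
                if (!strcmp(ibv_get_device_name(list[i]->device), name)) {
                        found = list[i];
                        break;
                }
        }
        rdma_free_devices(list);   /* frees the array, not the contexts */
        return found;
}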

I am not an expert on librdmacm. I have to believe that fork worked as-is at some point, so that leaves whatever is missing in your changes. Feel free to bring up your question about the cma_dev_list global with rdma-core.

While I did do some subsequent experimentation after first encountering the segfault, the actual segfault occurs without any changes to the libfabric, rdma, or dlio_benchmark code. When the segfault first happened, I added the spin loop mentioned above and then just breakpointed my way through the code, down into libfabric and rdma. Recreating this in your environment is going to be a pain: you will need a DAOS config using InfiniBand, you'd have to install the main branch of dlio_benchmark, pick up the configuration I used for pytorch for the unet3d benchmark, and generate sample data, then finally run the benchmark. Sounds like a heavy lift for you?

Yes, it is a heavy lift, but I don't know how to make progress on this without a reproducer.
BTW, does it work if you use mlx5_0? Or can you disable mlx5_0 so it does not get inserted into cma_dev_list?

OK, let me think about this, I'll see if I can find a way to reproduce this without a mountain of work.

Try reverting commit 0e3e5c527008bc9dfa08e4aa1730ba5d9c099f86 in librdmacm.

I can duplicate the segfault with this test program; I will work to reduce it a bit more.

(venv) [root@delphi-029 storage]# cat test.sh
#!/bin/bash

export LD_LIBRARY_PATH=/mnt/nvm/rehm/build/install/lib64:/mnt/nvm/rehm/rdma/rdma-core-47.1/build/lib:/mnt/nvm/rehm/build/install/prereq/release/ofi/lib:/usr/mpi/gcc/openmpi-4.1.7a1/lib64:/usr/mpi/gcc/openmpi-4.1.5rc2/lib64

export PIL4DFS=/mnt/nvm/rehm/build/install/lib64/libpil4dfs.so

export IBV_FORK_SAFE=1

cat > /tmp/test.py <<EOF
import os
import multiprocessing as mp

def child():
    print('child: pid {0} ppid {1} dir {2}'.format(os.getpid(), os.getppid(),
        os.listdir('/tmp/dfs24/dlio/dlio_benchmark/data/unet3d/')))

if __name__ == '__main__':
    d = os.listdir('/tmp/dfs24/dlio/dlio_benchmark/data/unet3d/')
    print('parent: pid {0} ppid {1} dir {2}'.format(os.getpid(), os.getppid(),
        os.listdir('/tmp/dfs24/dlio/dlio_benchmark/data/unet3d/')))

    mp.set_start_method('spawn')
    p = mp.Process(target=child)
    p.start()
    p.join()
EOF

LD_PRELOAD=$PIL4DFS python3 /tmp/test.py

You still have to build and install DAOS from source (I used the master branch), but other than that the above script should quickly segfault.

I tried it and it just listed the directory I specified. How does verbs come into play? Are you using a DAOS-mounted FS?

Yes, the long pathname in the script is a path into a dfuse-mounted DAOS filesystem mounted at /tmp/dfs24. The libpil4dfs.so library then intercepts most of the filesystem calls directly, bypassing the dfuse mount.

dfuse -t 12 --disable-caching  --pool=perfpool --container=perfcont -m /tmp/pil4dfs/

If you comment out the IBV_FORK_SAFE environment variable, then the script will start to work, because that disables the ibv_dontfork_range() calls in mlx5.

I have instrumented libpil4dfs.so, libfabric, and rdma/mlx5 with lots of print messages. I am attaching the output from a test run where the segfault occurs. If you look at lines starting with ## those are annotations which I added after the run to show you the exact flow through the code, and how I end up with a guaranteed segfault.
out-annotated.txt

Neither libfabric nor rdma/mlx5 has any knowledge that a fork has occurred, so the call to vrb_open_device_by_name() that the child process makes from fi_domain/vrb_domain is guaranteed to set domain->verbs to the same ibv_context that was returned in the parent. And since that ibv_context contains mlx5 pages that had ibv_dontfork_range() applied to them by the parent, the child's attempt to create an SRQ is going to segfault.

In rdma, a fork (perhaps via a pthread_atfork() call) should cause the child to get its own mm_root, separate from the parent's, so that a fresh ibv_context is returned by vrb_open_device_by_name() in the child process rather than the one created by the parent. That fresh ibv_context would contain pages that are in the child's memory, and creation of an SRQ would work. The problem is that ibv_fork_init() checks whether mm_root is non-zero and, if so, just returns; it doesn't create a fresh mm_root. So calling ibv_fork_init() in the child immediately after the fork doesn't help: a new mm_root won't be created, and the parent's mm_root will be used.

Kevan, please post out-annotated.txt to linux-rdma mailing list. I don't know the code but I have a suspicion that mlx5_alloc_drec needs to allocate a new doorbell page for the child process.

Looks like this is the kernel patch you need. Try adding this to your MOFED install if the code is not there.

commit a0ffb4c12f7fa89163e228e6f27df09b46631db1
Author: Mark Zhang <markzhang@nvidia.com>
Date: Thu Jun 3 16:18:03 2021 +0300

RDMA/mlx5: Use different doorbell memory for different processes

In a fork scenario, the parent and child can have same virtual address and
also share the uverbs fd.  That causes to the list_for_each_entry search
return same doorbell physical page for all processes, even though that
page has been COW'd or copied.

This patch takes the mm_struct into consideration during search, to make
sure that VA's belonging to different processes are not intermixed.

Resolves the malfunction of uverbs after fork in some specific cases.

Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters")
Link: https://lore.kernel.org/r/feacc23fe0bc6e1088c6824d5583798745e72405.1622726212.git.leonro@nvidia.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

I have mlnx-ofa_kernel-23.10-OFED.23.10.0.5.5.1 installed, it already contains the patch you mention above, that's not the issue.

Jason Gunthorpe points out on the linux-rdma mailing list that a newer kernel is needed, one in which the DONT_FORK calls can be disabled from user space. If the DONT_FORK calls are disabled, then the segfault can't occur. He points to a commit in which ibv_is_fork_initialized() has been modified to check whether the kernel has enhanced fork support and, if so, to return IBV_FORK_UNNEEDED. Users of rdma are then supposed to use that result to avoid making ibv_fork_init() calls. If the ibv_fork_init() call is not made, then subsequent ibv_dontfork_range() calls become no-ops. The libfabric efa provider already has code that does this.
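
A hedged sketch of what that caller-side check looks like (this mirrors the pattern described above; it is not the efa provider's actual code):

#include <infiniband/verbs.h>

static int maybe_enable_fork_protection(void)
{
        /* Newer kernels copy-on-fork registered memory themselves; in that
         * case ibv_fork_init() (and the MADV_DONTFORK tracking it enables)
         * should be skipped entirely. */
        if (ibv_is_fork_initialized() == IBV_FORK_UNNEEDED)
                return 0;
        return ibv_fork_init();
}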

The problem however lies in ibv_get_device_list() in libibverbs/device.c in rdma. That routine is called by vrb_have_device() in libfabric prov/verbs/src/verbs_info.c. Routine ibv_get_device_list() has the following code:

        if (!initialized) {
                if (ibverbs_init())
                        goto out;
                initialized = true;
        }

so ibverbs_init() is called unconditionally on the first call to ibv_get_device_list(). ibverbs_init() unconditionally checks for environment variables RDMA_FORK_SAFE and/or IBV_FORK_SAFE, and if either exists it calls ibv_fork_init. That call guarantees that mm_root will be created, and subsequent calls to ibv_dontfork_range() will therefore use ibv_madvise_range() to set MADV_DONTFORK, and eventually the segfault will occur.

Routine ibverbs_init() needs to change to check ibv_is_fork_initialized() first, and if the result is IBV_FORK_UNNEEDED, then the code should not call ibv_fork_init() regardless of the setting of any environment variables.

Earlier in this thread I mentioned that the segfault occurred when I added IBV_FORK_SAFE to my test program; prior to that it did not segfault. Somewhere in the daos/mercury/libfabric stack one of those two environment variables must be getting set. But if ibverbs_init() gets fixed, then that won't matter.

I wrote a small C program that calls ibv_is_fork_initialized(), and the currently installed MOFED returns IBV_FORK_UNNEEDED, so sufficient fork support is already present; we just need a fix to ibverbs_init().
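
Something along these lines (a sketch of the kind of probe mentioned above, not the exact program I used; build with -libverbs):

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
        switch (ibv_is_fork_initialized()) {
        case IBV_FORK_DISABLED:
                puts("IBV_FORK_DISABLED");
                break;
        case IBV_FORK_ENABLED:
                puts("IBV_FORK_ENABLED");
                break;
        case IBV_FORK_UNNEEDED:
                puts("IBV_FORK_UNNEEDED");  /* kernel copy-on-fork support present */
                break;
        }
        return 0;
}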

So a hack fix to ibverbs_init() makes my test program pass, but the daos/mercury/libfabric path is still failing; I need to do a bit more debugging.

So mpi4py calls PMPI_Init(), which eventually makes UCX calls that ultimately call ibv_fork_init() without bothering to call ibv_is_fork_initialized() to see whether the ibv_fork_init() call should be avoided. I wonder how many other libraries also call ibv_fork_init() without checking ibv_is_fork_initialized() first.

(gdb) bt
#0  0x00007fc69cfac918 in nanosleep () from /lib64/libc.so.6
#1  0x00007fc69cfac81e in sleep () from /lib64/libc.so.6
#2  0x00007fc698f92962 in ibv_fork_init () at ../libibverbs/memory.c:136
#3  0x00007fc623775685 in uct_ib_md_open (component=<optimized out>,
    md_name=0x7ffd94519820 "mlx5_0", uct_md_config=0x5581bf785a00,
    md_p=0x7ffd945197e8) at base/ib_md.c:1047
#4  0x00007fc6996303db in uct_md_open (
    component=0x7fc6239de7a0 <uct_ib_component>,
    md_name=md_name@entry=0x7ffd94519820 "mlx5_0", config=<optimized out>,
    md_p=md_p@entry=0x5581bf780f50) at base/uct_md.c:61
#5  0x00007fc69987e4cc in ucp_fill_tl_md (tl_md=0x5581bf780f50,
    md_rsc=0x7ffd94519820, cmpt_index=4 '\004', context=0x5581bf77c990)
    at core/ucp_context.c:1306
#6  ucp_add_component_resources (context=context@entry=0x5581bf77c990,
    cmpt_index=cmpt_index@entry=4 '\004',
    avail_devices=avail_devices@entry=0x7ffd94519a20,
    avail_tls=avail_tls@entry=0x7ffd945199c0,
    dev_cfg_masks=dev_cfg_masks@entry=0x7ffd945199a0,
    tl_cfg_mask=tl_cfg_mask@entry=0x7ffd94519990, config=0x5581bf6b7850,
    aux_tls=0x7ffd945199f0) at core/ucp_context.c:1501
#7  0x00007fc69987eeac in ucp_fill_resources (
    context=context@entry=0x5581bf77c990, config=config@entry=0x5581bf6b7850)
    at core/ucp_context.c:1734
#8  0x00007fc699880179 in ucp_init_version (
    api_major_version=<optimized out>, api_minor_version=<optimized out>,
    params=0x7ffd94519bd0, config=0x5581bf6b7850,
    context_p=0x7fc623df2318 <ompi_pml_ucx+184>) at core/ucp_context.c:2179
#9  0x00007fc623bed39b in mca_pml_ucx_open ()
   from /usr/mpi/gcc/openmpi-4.1.7a1/lib64/openmpi/mca_pml_ucx.so
#10 0x00007fc68c2f0faf in mca_base_framework_components_open ()
   from /usr/mpi/gcc/openmpi-4.1.7a1/lib64/libopen-pal.so.40
#11 0x00007fc68c91dbf7 in mca_pml_base_open ()
   from /usr/mpi/gcc/openmpi-4.1.7a1/lib64/libmpi.so.40
#12 0x00007fc68c2faee1 in mca_base_framework_open ()
   from /usr/mpi/gcc/openmpi-4.1.7a1/lib64/libopen-pal.so.40
#13 0x00007fc68c927424 in ompi_mpi_init ()
   from /usr/mpi/gcc/openmpi-4.1.7a1/lib64/libmpi.so.40
#14 0x00007fc68c8bc861 in PMPI_Init ()
   from /usr/mpi/gcc/openmpi-4.1.7a1/lib64/libmpi.so.40
#15 0x00007fc68cc1fe1e in __pyx_pf_6mpi4py_3MPI_50Init (
    __pyx_self=<optimized out>) at src/mpi4py.MPI.c:165803
#16 __pyx_pw_6mpi4py_3MPI_51Init (__pyx_self=<optimized out>,
    __pyx_args=<optimized out>, __pyx_kwds=0x0) at src/mpi4py.MPI.c:34708
#17 0x00007fc69dfae9fb in cfunction_call () from /lib64/libpython3.9.so.1.0
#18 0x00007fc69dfdb4ac in _PyObject_MakeTpCall ()
   from /lib64/libpython3.9.so.1.0
#19 0x00007fc69e0573cf in _PyEval_EvalFrameDefault ()
   from /lib64/libpython3.9.so.1.0
#20 0x00007fc69dfd6693 in function_code_fastcall ()
   from /lib64/libpython3.9.so.1.0
...

I am going to try to move the ibv_is_fork_initialized() call directly into ibv_fork_init itself.

I made the following change to ibv_fork_init:

[root@delphi-029 libibverbs]# diff -C 5 memory.c.orig memory.c
*** memory.c.orig	2024-02-13 09:45:28.078997178 -0600
--- memory.c	2024-02-13 09:27:46.901699958 -0600
***************
*** 140,149 ****
--- 140,152 ----
  		huge_page_enabled = 1;

  	if (mm_root)
  		return 0;

+ 	if (ibv_is_fork_initialized() == IBV_FORK_UNNEEDED)
+ 		return 0;
+
  	if (too_late)
  		return EINVAL;

  	fprintf(stderr, "ibv_fork_init creating mm_root\n");
  	page_size = sysconf(_SC_PAGESIZE);

which prevents UCX from initializing mm_root. This code would work for ibverbs_init() as well. With this patch in place, the dlio_benchmark runs successfully.

Great hacking. :-)

I dug a little deeper into UCX. We use openmpi because that's what MOFED installs, and openmpi uses UCX by default for MPI inter-process communication. The experience may be a bit different with mpich; I don't know, I haven't tested that.

If you add the following parameter to the mpirun command line:

-x UCX_IB_FORK_INIT=no

then UCX will not call ibv_fork_init(). It will still complain, warning about a possible memory corruption problem that no longer occurs with kernels/MOFEDs that have the latest fork support.

[1707915479.044463] [delphi-029:97380:0]           ib_md.c:853  UCX  WARN  IB: ibv_fork_init() was disabled or failed, yet a fork() has been issued.
[1707917909.385108] [delphi-029:98997:0]           ib_md.c:854  UCX  WARN  IB: data corruption might occur when using registered memory.

but otherwise the benchmark will run. You have to also ensure that you do NOT pass either RDMA_FORK_SAFE or IBV_FORK_SAFE to mpirun, or libfabric will call ibv_fork_init() and you'll have the same segfault problem again.

Someone with more MPI skills than I have could probably figure out a way to suppress the use of UCX and use OFI or some other transport instead for inter-process communication; that would avoid the bogus warning messages above.

So there is a path to running the dlio_benchmark until such time as the linux-rdma team (hopefully) adds my patch to the next release of rdma-core. At that point all the FORK_SAFE stuff can fade into history...

Please close this issue if you believe there is no change needed in Libfabric.

From the OpenMPI docs - https://docs.open-mpi.org/en/main/tuning-apps/networking/ofi.html

mpirun --mca pml cm --mca mtl ofi --mca mtl_ofi_provider_include psm2 mpi_hello