prov/cxi: PR#9791 breaks build on LUMI
thomasgillis opened this issue
Hi all,
I am trying to build the cxi provider on LUMI. The update merged in #9791 breaks the build because the installed libcxi is too old.
I am using the main branch with the patch suggested in #9789:
CC prov/cxi/test/multinode/prov_cxi_test_multinode_test_barrier-test_barrier.o
In file included from prov/cxi/test/multinode/test_coll.c:29:
./prov/cxi/include/cxip.h: In function 'cxip_cmdq_empty':
In file included from prov/cxi/test/multinode/multinode_frmwk.c:67:
./prov/cxi/include/cxip.h: In function 'cxip_cmdq_empty':
./prov/cxi/include/cxip.h:2799:16: warning: implicit declaration of function 'cxi_cq_empty'; did you mean 'cxi_eq_empty'? [-Wimplicit-function-declaration]
2799 | return cxi_cq_empty(cmdq->dev_cmdq);
| ^~~~~~~~~~~~
| cxi_eq_empty
./prov/cxi/include/cxip.h:2799:16: warning: implicit declaration of function 'cxi_cq_empty'; did you mean 'cxi_eq_empty'? [-Wimplicit-function-declaration]
2799 | return cxi_cq_empty(cmdq->dev_cmdq);
| ^~~~~~~~~~~~
| cxi_eq_empty
In file included from prov/cxi/test/multinode/test_frmwk.c:28:
./prov/cxi/include/cxip.h: In function 'cxip_cmdq_empty':
./prov/cxi/include/cxip.h:2799:16: warning: implicit declaration of function 'cxi_cq_empty'; did you mean 'cxi_eq_empty'? [-Wimplicit-function-declaration]
2799 | return cxi_cq_empty(cmdq->dev_cmdq);
| ^~~~~~~~~~~~
| cxi_eq_empty
In file included from prov/cxi/test/multinode/multinode_frmwk.c:67:
./prov/cxi/include/cxip.h: In function 'cxip_cmdq_empty':
In file included from prov/cxi/test/multinode/test_barrier.c:51:
./prov/cxi/include/cxip.h: In function 'cxip_cmdq_empty':
./prov/cxi/include/cxip.h:2799:16: warning: implicit declaration of function 'cxi_cq_empty'; did you mean 'cxi_eq_empty'? [-Wimplicit-function-declaration]
2799 | return cxi_cq_empty(cmdq->dev_cmdq);
| ^~~~~~~~~~~~
| cxi_eq_empty
Here are the commands used:
module load PrgEnv-gnu-amd
module load libfabric/1.15.2.0
./autogen.sh
./configure --enable-cxi --with-rocr=${ROCM_PATH} --with-json=${HOME}/json-c-json-c-0.13.1-20180305 --prefix=$(pwd)/_inst
make install -j
and the versions of the relevant libraries:
rpm -qa | grep cxi
cray-libcxi-retry-handler-0.9-SSHOT2.0.2_20230428225319_d0f6cbe0189c.x86_64
cray-libcxi-devel-0.9-SSHOT2.0.2_20230428225319_d0f6cbe0189c.x86_64
cray-cxi-driver-devel-0.9-34.7__g22b90ec.SSHOT2.0.2.x86_64
cray-cxi-driver-kmp-cray_shasta_c-0.9_k5.14.21_150400.24.46_12.0.71-34.7__g22b90ec.SSHOT2.0.2.x86_64
cray-libcxi-dracut-0.9-SSHOT2.0.2_20230428225319_d0f6cbe0189c.x86_64
cray-libcxi-0.9-SSHOT2.0.2_20230428225319_d0f6cbe0189c.x86_64
cray-libcxi-utils-0.9-SSHOT2.0.2_20230428225319_d0f6cbe0189c.x86_64
cray-cxi-driver-udev-0.9-34.7__g22b90ec.SSHOT2.0.2.x86_64
I understand that open-sourcing cxi is tedious and that the versioning problem might not be resolved easily or quickly. This issue is intended to track the problems we currently face. In the meantime, I have reverted the changes; the branch is available here: https://github.com/thomasgillis/libfabric/tree/dev-cxi
With the PR reverted, the code compiles correctly on LUMI.
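For context on the implicit-declaration warning above: a common way to keep a header building against both older and newer versions of a library is to gate the newer call behind a configure-time feature macro. The sketch below only illustrates that pattern. Every name in it (`demo_cq`, `demo_cq_empty`, `HAVE_DEMO_CQ_EMPTY`) is hypothetical, the fallback semantics are assumed, and it is not how libfabric or libcxi actually handle this.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a device command queue; NOT the libcxi type. */
struct demo_cq {
    unsigned int rd_idx;  /* next entry the consumer will read */
    unsigned int wr_idx;  /* next entry the producer will write */
};

#ifdef HAVE_DEMO_CQ_EMPTY
/* Newer library: the helper exists, so only declare it here. */
bool demo_cq_empty(const struct demo_cq *cq);
#else
/* Older library: supply a local fallback (assumed semantics, for
 * illustration only). */
static inline bool demo_cq_empty(const struct demo_cq *cq)
{
    return cq->rd_idx == cq->wr_idx;
}
#endif

/* The caller is identical in both configurations. */
static inline bool demo_cmdq_empty(const struct demo_cq *cq)
{
    return demo_cq_empty(cq);
}
```

In this pattern the macro would be set by a configure check for the function, so an older installation takes the fallback branch instead of producing an implicit declaration.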
@thomasgillis Thanks for the fix. I am able to build libfabric with cxi using your branch, but my application fails at runtime with the following error (I am using Sandia OpenSHMEM with libfabric and cxi as the provider):
[0000] WARN: transport_ofi.c:1420: query_for_fabric
[0000] OFI transport did not find any valid fabric services (provider==cxi)
[0000] ERROR: init.c:466: shmem_internal_heap_postinit
[0000] Transport init failed (-61)
Can you suggest a solution?
It seems to be a provider selection issue in OpenSHMEM, so I am afraid I cannot help you here :)
I would reach out to them directly.
Copy/pasting my comment from #9793 (comment). We would really like to be able to build the cxi provider on our production Slingshot systems. I'm not totally sure how we get there from here, but we may be able to utilize ALCF resources for CI.
FWIW, I've reached out to folks at ALCF to see if there's anything that can be done to support, at minimum, build testing of cxi on the Polaris machine here at Argonne. Ideally, once cxi is able to build on a production system, CI could prevent further breaking changes from going in. @jswaro is that something that would be of interest?
On Perlmutter the configury does better than on systems with older sshot (pm has 2.1.2), but it still fails with complaints about __user in a cxi-related header file:
configure:35099: WARNING: cxi_prov_hw.h: present but cannot be compiled
configure:35099: WARNING: cxi_prov_hw.h: check for missing prerequisite headers?
configure:35099: WARNING: cxi_prov_hw.h: see the Autoconf documentation
configure:35099: WARNING: cxi_prov_hw.h: section "Present But Cannot Be Compiled"
configure:35099: WARNING: cxi_prov_hw.h: proceeding with the compiler's result
configure:35099: checking for cxi_prov_hw.h
configure:35099: result: no
configure:35108: checking uapi/misc/cxi.h usability
configure:35108: gcc -c -O2 -DNDEBUG -pipe -fvisibility=hidden -Wall -Wundef -Wpointer-arith conftest.c >&5
In file included from conftest.c:147:
/usr/include/uapi/misc/cxi.h:76:21: error: expected ':', ',', ';', '}' or '__attribute__' before '*' token
76 | void __user *resp;
| ^
/usr/include/uapi/misc/cxi.h:82:22: error: expected ':', ',', ';', '}' or '__attribute__' before '*' token
82 | void __user *resp;
| ^
/usr/include/uapi/misc/cxi.h:96:22: error: expected ':', ',', ';', '}' or '__attribute__' before '*' token
96 | void __user *resp;
| ^
/usr/include/uapi/misc/cxi.h:110:22: error: expected ':', ',', ';', '}' or '__attribute__' before '*' token
110 | void __user *resp;
| ^
/usr/include/uapi/misc/cxi.h:130:21: error: expected ':', ',', ';', '}' or '__attribute__' before '*' token
130 | void __user *resp;
| ^
/usr/include/uapi/misc/cxi.h:144:38: error: expected ':', ',', ';', '}' or '__attribute__' before '*' token
Is this what you also see, @raffenet?
Oh, I'm on main at 717ebc5.
I think @thomasgillis ran into this and ended up just adding
#define __user
somewhere to make that issue go away, because it's just a hint anyway.
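For anyone hitting the same parse error, here is a minimal sketch of that workaround. `__user` is a Linux kernel annotation meant for static checkers, so a plain userspace compiler does not recognize it. Defining it to nothing lets declarations like `void __user *resp;` parse. The struct and helper below (`demo_cmd`, `demo_cmd_has_resp`) are hypothetical stand-ins, not the real contents of uapi/misc/cxi.h, and where the define should live in the libfabric tree is left open.

```c
#include <stddef.h>

/*
 * Define __user away if nothing else has.
 * Assumption: no other header in the build gives __user a real meaning.
 */
#ifndef __user
#define __user
#endif

/* Hypothetical stand-in for a command struct in a uapi header. */
struct demo_cmd {
    unsigned int op;
    void __user *resp;  /* same member form that failed to parse above */
    size_t resp_len;
};

/* Returns nonzero when the command carries a response buffer. */
static inline int demo_cmd_has_resp(const struct demo_cmd *cmd)
{
    return cmd->resp != NULL && cmd->resp_len > 0;
}
```

The define has to be visible before the affected header is included. Passing `-D__user=` through CPPFLAGS at configure time might achieve the same thing without touching any source, but that is untested here.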