ofiwg / libfabric

Open Fabric Interfaces

Home Page: http://libfabric.org/

prov/cxi: PR#9791 breaks build on LUMI

thomasgillis opened this issue · comments

Hi all,

I am trying to build the cxi provider on LUMI. The update merged in #9791 breaks the build because the installed libcxi is too old.
I am using the main branch with the patch suggested in #9789 applied:

CC       prov/cxi/test/multinode/prov_cxi_test_multinode_test_barrier-test_barrier.o
In file included from prov/cxi/test/multinode/test_coll.c:29:
./prov/cxi/include/cxip.h: In function 'cxip_cmdq_empty':
In file included from prov/cxi/test/multinode/multinode_frmwk.c:67:
./prov/cxi/include/cxip.h: In function 'cxip_cmdq_empty':
./prov/cxi/include/cxip.h:2799:16: warning: implicit declaration of function 'cxi_cq_empty'; did you mean 'cxi_eq_empty'? [-Wimplicit-function-declaration]
 2799 |         return cxi_cq_empty(cmdq->dev_cmdq);
      |                ^~~~~~~~~~~~
      |                cxi_eq_empty
./prov/cxi/include/cxip.h:2799:16: warning: implicit declaration of function 'cxi_cq_empty'; did you mean 'cxi_eq_empty'? [-Wimplicit-function-declaration]
 2799 |         return cxi_cq_empty(cmdq->dev_cmdq);
      |                ^~~~~~~~~~~~
      |                cxi_eq_empty
In file included from prov/cxi/test/multinode/test_frmwk.c:28:
./prov/cxi/include/cxip.h: In function 'cxip_cmdq_empty':
./prov/cxi/include/cxip.h:2799:16: warning: implicit declaration of function 'cxi_cq_empty'; did you mean 'cxi_eq_empty'? [-Wimplicit-function-declaration]
 2799 |         return cxi_cq_empty(cmdq->dev_cmdq);
      |                ^~~~~~~~~~~~
      |                cxi_eq_empty
In file included from prov/cxi/test/multinode/multinode_frmwk.c:67:
./prov/cxi/include/cxip.h: In function 'cxip_cmdq_empty':
In file included from prov/cxi/test/multinode/test_barrier.c:51:
./prov/cxi/include/cxip.h: In function 'cxip_cmdq_empty':
./prov/cxi/include/cxip.h:2799:16: warning: implicit declaration of function 'cxi_cq_empty'; did you mean 'cxi_eq_empty'? [-Wimplicit-function-declaration]
 2799 |         return cxi_cq_empty(cmdq->dev_cmdq);
      |                ^~~~~~~~~~~~
      |                cxi_eq_empty

Here are the commands used:

module load PrgEnv-gnu-amd
module load libfabric/1.15.2.0
./autogen.sh
./configure --enable-cxi --with-rocr=${ROCM_PATH} --with-json=${HOME}/json-c-json-c-0.13.1-20180305 --prefix=$(pwd)/_inst
make install -j

and the versions of the relevant libs:

rpm -qa | grep cxi
cray-libcxi-retry-handler-0.9-SSHOT2.0.2_20230428225319_d0f6cbe0189c.x86_64
cray-libcxi-devel-0.9-SSHOT2.0.2_20230428225319_d0f6cbe0189c.x86_64
cray-cxi-driver-devel-0.9-34.7__g22b90ec.SSHOT2.0.2.x86_64
cray-cxi-driver-kmp-cray_shasta_c-0.9_k5.14.21_150400.24.46_12.0.71-34.7__g22b90ec.SSHOT2.0.2.x86_64
cray-libcxi-dracut-0.9-SSHOT2.0.2_20230428225319_d0f6cbe0189c.x86_64
cray-libcxi-0.9-SSHOT2.0.2_20230428225319_d0f6cbe0189c.x86_64
cray-libcxi-utils-0.9-SSHOT2.0.2_20230428225319_d0f6cbe0189c.x86_64
cray-cxi-driver-udev-0.9-34.7__g22b90ec.SSHOT2.0.2.x86_64

I understand that the effort of open-sourcing cxi is tedious and that the versioning problem might not be resolved easily or quickly. This issue is intended to track the problems we currently face. In the meantime, I have reverted the changes; the branch is available here: https://github.com/thomasgillis/libfabric/tree/dev-cxi
With the PR reverted, the code compiles correctly on LUMI.

@thomasgillis Thanks for the fix. I am able to build libfabric with cxi using your branch, but my application fails at runtime with the following error (I am using Sandia OpenSHMEM with libfabric and cxi as the provider):
[0000] WARN: transport_ofi.c:1420: query_for_fabric
[0000] OFI transport did not find any valid fabric services (provider==cxi)
[0000] ERROR: init.c:466: shmem_internal_heap_postinit
[0000] Transport init failed (-61)
Can you suggest a solution?

It seems to be a provider selection issue in openSHMEM, so I am afraid I cannot help you here :)
I would reach out to them directly.

Copy/pasting my comment from #9793 (comment). We would really like to be able to build the cxi provider on our production Slingshot systems. I'm not totally sure how we get there from here, but we may be able to utilize ALCF resources for CI.

FWIW, I've reached out to folks at ALCF to see if there's anything that can be done to support, at minimum, build testing of cxi on the Polaris machine here at Argonne. Ideally, once cxi is able to build on a production system, CI could prevent further breaking changes from going in. @jswaro is that something that would be of interest?

On Perlmutter the configury gets further than on systems with an older Slingshot release (Perlmutter has 2.1.2), but it still fails with complaints about __user in a cxi-related header file:

configure:35099: WARNING: cxi_prov_hw.h: present but cannot be compiled
configure:35099: WARNING: cxi_prov_hw.h:     check for missing prerequisite headers?
configure:35099: WARNING: cxi_prov_hw.h: see the Autoconf documentation
configure:35099: WARNING: cxi_prov_hw.h:     section "Present But Cannot Be Compiled"
configure:35099: WARNING: cxi_prov_hw.h: proceeding with the compiler's result
configure:35099: checking for cxi_prov_hw.h
configure:35099: result: no
configure:35108: checking uapi/misc/cxi.h usability
configure:35108: gcc -c -O2 -DNDEBUG -pipe -fvisibility=hidden -Wall -Wundef -Wpointer-arith    conftest.c >&5
In file included from conftest.c:147:
/usr/include/uapi/misc/cxi.h:76:21: error: expected ':', ',', ';', '}' or '__attribute__' before '*' token
   76 |         void __user *resp;
      |                     ^
/usr/include/uapi/misc/cxi.h:82:22: error: expected ':', ',', ';', '}' or '__attribute__' before '*' token
   82 |         void __user  *resp;
      |                      ^
/usr/include/uapi/misc/cxi.h:96:22: error: expected ':', ',', ';', '}' or '__attribute__' before '*' token
   96 |         void __user  *resp;
      |                      ^
/usr/include/uapi/misc/cxi.h:110:22: error: expected ':', ',', ';', '}' or '__attribute__' before '*' token
  110 |         void __user  *resp;
      |                      ^
/usr/include/uapi/misc/cxi.h:130:21: error: expected ':', ',', ';', '}' or '__attribute__' before '*' token
  130 |         void __user *resp;
      |                     ^
/usr/include/uapi/misc/cxi.h:144:38: error: expected ':', ',', ';', '}' or '__attribute__' before '*' token

Is this what you also see, @raffenet?

oh I'm on main at 717ebc5

is this what you also see @raffenet

I think @thomasgillis ran into this and ended up just adding

#define __user

somewhere to make that issue go away, because it's just a hint anyway.
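For anyone wanting to see why the empty define is harmless, here is a minimal sketch. It assumes the define is placed before the affected header is included. The struct `demo_cmd` and the helper `demo_cmd_resp` are hypothetical stand-ins, not taken from uapi/misc/cxi.h; they only mimic the `void __user *resp;` members from the configure log above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Workaround sketch: __user is a kernel-side annotation with no meaning
 * in userspace builds, so define it to nothing if nobody else has.
 * The guard avoids redefining it when a header already provides it.
 */
#ifndef __user
#define __user
#endif

/* Hypothetical stand-in for a struct using the annotation. */
struct demo_cmd {
	unsigned int op;
	void __user *resp;	/* expands to: void *resp; */
};

/* With the empty define, the member is an ordinary pointer. */
static_assert(sizeof(((struct demo_cmd *)0)->resp) == sizeof(void *),
	      "annotated member must be a plain pointer");

/* Hypothetical helper: returns the response pointer unchanged. */
static void *demo_cmd_resp(const struct demo_cmd *cmd)
{
	return cmd->resp;
}
```

Passing `-D__user=` through CPPFLAGS at configure time should have the same effect without touching any source file, assuming the build honors CPPFLAGS.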