ofiwg / libfabric

Open Fabric Interfaces

Home Page: http://libfabric.org/

prov/psm3: symbol collisions with either psm or psm2

hzhou opened this issue · comments

Describe the bug
We have been fighting these segfaults for a long time. Depending on the configuration and where the testing is done, we hit the following:

+ mpichversion

mpichversion:197053 terminated with signal 11 at PC=7fb96300fecc SP=7ffce054e3b8.  Backtrace:
/lib64/libc.so.6(cfree+0x1c)[0x7fb96300fecc]
/lib64/ld-linux-x86-64.so.2(+0x1003a)[0x7fb96816b03a]
/lib64/libc.so.6(+0x39c99)[0x7fb962fc3c99]
/lib64/libc.so.6(+0x39ce7)[0x7fb962fc3ce7]
/lib64/libc.so.6(__libc_start_main+0xfc)[0x7fb962fac50c]
mpichversion[0x400ee7]
MPICH Version:    	4.1a1
MPICH Release date:	Thu May 12 00:48:01 CDT 2022
MPICH Device:    	ch4:ofi
MPICH configure: 	--prefix=/var/lib/jenkins-slave/workspace/mpich-main-special-tests/compiler/gnu/jenkins_configure/noweak/label/centos64/netmod/ch4-ofi/mpich-main/_inst --with-device=ch4:ofi --with-libfabric=embedded --disable-mlx --disable-weak-symbols --enable-large-tests --with-wrapper-dl-type=rpath
MPICH CC: 	gcc -std=gnu99    -O2
MPICH CXX: 	g++   -O2
MPICH F77: 	gfortran   -O2
MPICH FC: 	gfortran   -O2
MPICH Custom Information: 	
Build step 'Run with timeout' marked build as failure

The backtrace shows the crash is inside an atexit handler in psm3, but I can't figure out why it segfaults (in one of the free calls).

Today, while experimenting with it manually, I hit this (the first line is a print I added):

$ ./cpi
psmi_verno_isinteroperable: verno=110, PSMI_VERNO_GET_MAJOR(verno)=1, PSM2_VERNO_MAJOR=3, compare=300, psmi_verno = 300
pmrs-gpu-240-02.cels.anl.gov.221137psmi_verno_isinteroperable() not updated for current version!

cpi:221137 terminated with signal 6 at PC=7f0f51adc387 SP=7ffed8123e28.  Backtrace:
/lib64/libc.so.6(gsignal+0x37)[0x7f0f51adc387]
/lib64/libc.so.6(abort+0x148)[0x7f0f51adda78]
/lib64/libpsm_infinipath.so.1(+0x19b4a)[0x7f0f50ec2b4a]
/lib64/libpsm_infinipath.so.1(+0x19fb1)[0x7f0f50ec2fb1]
/lib64/libpsm_infinipath.so.1(__psm_init+0x24a)[0x7f0f50ec97fa]
/home/zhouh/temp/mpich-main/_inst/lib/libpmpi.so.0(+0xb2b685)[0x7f0f5299f685]
/home/zhouh/temp/mpich-main/_inst/lib/libpmpi.so.0(fi_getinfo+0x203)[0x7f0f528e7c43]
/home/zhouh/temp/mpich-main/_inst/lib/libpmpi.so.0(MPIDI_OFI_find_provider+0x95)[0x7f0f52413675]
/home/zhouh/temp/mpich-main/_inst/lib/libpmpi.so.0(MPIDI_OFI_init_local+0x158)[0x7f0f523e6958]
/home/zhouh/temp/mpich-main/_inst/lib/libpmpi.so.0(MPID_Init+0x290)[0x7f0f52396d80]
/home/zhouh/temp/mpich-main/_inst/lib/libpmpi.so.0(MPII_Init_thread+0x1f1)[0x7f0f522f9fd1]
/home/zhouh/temp/mpich-main/_inst/lib/libpmpi.so.0(MPIR_Init_impl+0x56)[0x7f0f522faa76]
/home/zhouh/temp/mpich-main/_inst/lib/libmpi.so.0(MPI_Init+0x1e)[0x7f0f530d916e]
./cpi[0x400a19]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f0f51ac8555]
./cpi[0x4008f9]

psmi_verno_isinteroperable is called from __psm2_init in prov/psm3/psm3/psm.c, and for hours I couldn't understand how that failure could happen, until it hit me that the process is not actually running the __psm2_init in psm3, but must be running the one in libpsm2.
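One way to check for this kind of clash is to compare the exported dynamic symbols of the two libraries involved. The following is a minimal sketch, not part of libfabric or MPICH: it assumes GNU `nm` is on the PATH, and the function names (`defined_symbols`, `exported`, `collisions`) and the default prefixes are hypothetical choices for illustration. It only inspects the dynamic symbol table, so it will not see symbols that were hidden at link time.

```python
import subprocess
import sys


def defined_symbols(nm_output):
    """Parse `nm -D --defined-only` output into a set of symbol names."""
    symbols = set()
    for line in nm_output.splitlines():
        parts = line.split()
        # A defined symbol line looks like "<address> <type> <name>".
        if len(parts) != 3:
            continue
        # Drop a version suffix such as "@@PSM_1.0" if nm printed one.
        symbols.add(parts[2].split("@", 1)[0])
    return symbols


def exported(path):
    """Run nm on a shared library and return its exported symbol names."""
    result = subprocess.run(
        ["nm", "-D", "--defined-only", path],
        check=True, capture_output=True, text=True,
    )
    return defined_symbols(result.stdout)


def collisions(a, b, prefixes=("psm", "__psm")):
    """Return sorted names exported by both libraries that start with a prefix."""
    return sorted(name for name in a & b if name.startswith(prefixes))


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (script name is hypothetical): python3 find_collisions.py LIB_A LIB_B
    for name in collisions(exported(sys.argv[1]), exported(sys.argv[2])):
        print(name)
```

Run against the two libraries that appear in the backtrace (for example /lib64/libpsm_infinipath.so.1 and the libpmpi.so.0 that embeds libfabric), any name printed is a candidate for being resolved to the wrong library at run time.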

I understand there is code history and there is marketing, but do we have to keep the messy names inside the psm3 code? Can we rename all the symbols with a prefix such as psm3_ to avoid collisions?

Currently our only workaround is to configure with --disable-psm2 --disable-psm.

Can you try updating to the latest ofi/psm3? We renamed a number of symbols in the most recent release.

Sounds good! Do you have a commit hash/PR for the renaming updates?

Found the PR -- #7521

We have confirmed that upgrading to v1.15.0 fixed the issue (pmodels/mpich#6006).

Thank you for verifying.