armsve : non-multiple of NR for one or more datatypes
egaudry opened this issue
Running c9700f3 on an SVE-enabled ARM processor fails with:
libblis: frame/base/bli_gks.c (line 451):
libblis: Default NC is non-multiple of NR for one or more datatypes.
libblis: Aborting.
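For readers unfamiliar with the message: it reports that a cache blocksize (NC) is not a whole multiple of the matching register blocksize (NR). The sketch below only illustrates that arithmetic in shell. The numbers and the helper name `is_multiple` are hypothetical, not values read from BLIS, and the idea that NR changes with the hardware vector length is an assumption about why an SVE target might hit this.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) when $1 is a whole multiple of $2.
is_multiple() {
    [ $(( $1 % $2 )) -eq 0 ]
}

# Illustrative values only, NOT the blocksizes BLIS actually uses.
NC=4080
NR=32

if is_multiple "$NC" "$NR"; then
    NC_STATUS="ok"
else
    NC_STATUS="non-multiple"
fi

# One possible repair: round NC down to the nearest multiple of NR.
NC_ROUNDED=$(( NC - NC % NR ))

echo "NC=$NC NR=$NR status=$NC_STATUS rounded=$NC_ROUNDED"
```

With these made-up numbers, 4080 is not divisible by 32, so the status is "non-multiple" and the rounded-down value is 4064.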
Can you confirm that there is not a failure using the commit's parent ee9ff98?
@xrq-phys any idea about this?
I wonder whether you could try this artifact (the aarch64-linux one).
Perhaps it would help eliminate the compiler's influence.
@xrq-phys I tried this, but it seems some symbols are missing:
dsyr2k_ (https://github.com/JuliaBinaryWrappers/blis_jll.jl/releases/download/blis-v0.8.1%2B2/blis.v0.8.1.aarch64-linux-gnu.tar.gz).
If you let me know how you built it, I can try on my own and activate what I need.
Oops. I forgot there's symbol mangling there.
On current master, it should in fact be OK to build via:
configure armsve; make -j$NPROC
My apologies.
GCC 10.3.0 & GCC 11.
@fgvanzee is there a "verbose" mode that can be enabled at runtime to let us see what the actual values are?
Normally, when I want to confirm the active cache and register blocksizes at runtime, I look at the informational header output of the testsuite. However, that might not have helped in this instance because the runtime check triggering the abort() happens during library initialization, so running the testsuite would have given you the same error.
But it looks like this is moot now, yeah?
I built blis-master (and other older versions) with GCC 11.2 on CentOS 7.9, using the following:
BLIS_TARGET=armsve
CC=gcc
FC=gfortran
CFLAGS="-fPIC -pthread -D_GNU_SOURCE -O2"
FFLAGS="-pthread -fPIC -D_GNU_SOURCE -fallow-argument-mismatch -O2"
LDFLAGS="-fPIC -pthread -Wl,-rpath='\$$\$\ORIGIN/../lib' -Wl,-rpath='\$$\$\ORIGIN/../../lib' -Wl,-rpath='\$$\$\ORIGIN/../../../lib'"
./configure --prefix=./JUST_THERE --disable-static --enable-shared --enable-threading=openmp --enable-cblas --enable-blas $BLIS_TARGET && make -j4 && make install
@fgvanzee is right that the issue happens right from the start, upon initialization.
@xrq-phys if #615 is about enabling SVE support on non-A64FX processors, I will definitely need to try it.
In the future, if you ever need to see what the blocksizes are, you could try to comment out the runtime check causing the abort(), recompile, and then run the testsuite. It might bomb when running level-3 operations, but at least you'll get the info you want.
@devinamatthews I haven't checked the actual returned values, but #615 does fix the initialization issue. However, performance drops substantially (4x slower) compared to the thunderx2 build of BLIS (I built 4d83523+#615).
@fgvanzee thanks for the clarification, I'll try to remember this so I can provide more useful input when encountering this kind of issue in the future.
Thanks.
I'm unsure which chip you're using, but setting the following environment variables to your chip's actual values (approximate values are enough) might improve your performance results:
BLIS_SVE_W_L1 # L1 number of sets
BLIS_SVE_N_L1 # L1 associativity degree
BLIS_SVE_C_L1 # L1 cache line size in bytes
BLIS_SVE_W_L2 # L2 number of sets
BLIS_SVE_N_L2 # L2 associativity degree
BLIS_SVE_C_L2 # L2 cache line size in bytes
BLIS_SVE_W_L3 # any big value
BLIS_SVE_N_L3 # 4 is OK
BLIS_SVE_C_L3 # any big value
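As a rough sketch of how the values might be derived: for a set-associative cache, the number of sets is the total size divided by (associativity × line size). The helper `cache_sets` and the cache geometry below are illustrative assumptions, not values taken from this thread, so substitute your own chip's numbers. Only the variable names come from the list above. The L3 variables are left unset here.

```shell
#!/bin/sh
# Hypothetical helper: number of sets = size / (associativity * line size).
#   $1: total cache size in bytes
#   $2: associativity (number of ways)
#   $3: cache line size in bytes
cache_sets() {
    echo $(( $1 / ($2 * $3) ))
}

# Assumed geometry, for illustration only:
#   L1: 64 KiB, 4-way, 256-byte lines
#   L2: 8 MiB, 16-way, 256-byte lines
BLIS_SVE_W_L1=$(cache_sets $(( 64 * 1024 )) 4 256)
BLIS_SVE_N_L1=4
BLIS_SVE_C_L1=256
BLIS_SVE_W_L2=$(cache_sets $(( 8 * 1024 * 1024 )) 16 256)
BLIS_SVE_N_L2=16
BLIS_SVE_C_L2=256

# Export so that a program linked against BLIS and started from this shell sees them.
export BLIS_SVE_W_L1 BLIS_SVE_N_L1 BLIS_SVE_C_L1
export BLIS_SVE_W_L2 BLIS_SVE_N_L2 BLIS_SVE_C_L2

echo "L1 sets: $BLIS_SVE_W_L1, L2 sets: $BLIS_SVE_W_L2"
```

On Linux, the per-level geometry can usually be read from /sys/devices/system/cpu/cpu0/cache/index*/ (number_of_sets, ways_of_associativity, coherency_line_size), though not every kernel or platform populates those files.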
@xrq-phys can you please set defaults appropriate for whatever the most common SVE chip is? (Maybe this is just A64fx for now?) It would be nice to be able to grab these from the OS or hwloc but that is more work.
@xrq-phys NVM I saw the comment in the PR that A64fx is the default.