flame / blis

BLAS-like Library Instantiation Software Framework

Tests fail on POWER machines

ivan23kor opened this issue

  • testsuite-run-fast fails on POWER9 and POWER10 with error message:
    4440 Segmentation fault ./test_libblis.x -g ./testsuite/input.general.fast -o ./testsuite/input.operations.fast > output.testsuite
  • The POWER7 dgemm microkernel includes altivec.h, which defines the type bool. Because of that, the edge-case handling macros here and here pick up a conflicting bool from stdbool.h. A possible solution is to check for the POWER architecture in edge_case_macros and define _bool as int on POWER and as bool on other architectures.
  • The POWER10 build fails because the POWER10 microkernels use the old microkernel API without edge handling (addressed in #620). Also, the type nibble is defined in the sandbox but used for the i4 microkernel, so any POWER10 user has to configure BLIS with -s power10. Is that intentional?
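To make the bool conflict concrete, here is a minimal sketch of the kind of workaround the first bullet proposes. This is not actual BLIS code; edge_bool_t and needs_edge_handling are invented names, and the exact macro to test for POWER would need checking on real hardware:

```c
#include <assert.h>

/* Hypothetical sketch: on POWER, <altivec.h> claims 'bool' as a
   vector-context keyword, which conflicts with the 'bool' from
   <stdbool.h> used by the edge-case handling macros. One fix is to
   avoid the identifier 'bool' there entirely and substitute a
   neutral integer type. */
#if defined(_ARCH_PWR7) || defined(__ALTIVEC__)
typedef int edge_bool_t;   /* sidestep altivec.h's 'bool' keyword */
#else
#include <stdbool.h>
typedef bool edge_bool_t;  /* other platforms keep stdbool semantics */
#endif

/* Example predicate of the kind the edge-case macros compute:
   does a partial micro-tile remain at the matrix edge? */
static edge_bool_t needs_edge_handling( int m, int mr )
{
    return ( m % mr ) != 0;
}
```

The point is simply that the macros never need the name bool itself, only a truthy integer type, so the clash can be avoided without touching altivec.h.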

POWER9 uses slow reference implementations for sgemm, cgemm, and zgemm by default. Are there plans to support a fast sgemm microkernel for POWER9?

@ivan23kor can QEMU be used to test POWER 7/8/9/10? If so what are the necessary flags so that we can add Travis CI tests?

@devinamatthews unfortunately I don't know about QEMU as I am running on actual hardware.

Looks like this is happening on POWER9 after ee9ff98

This is the stack trace:
Program received signal SIGSEGV, Segmentation fault.
bli_sgemmtrsmbb_l_power9_ref (k=4, alpha=0x10, a1x=0x64, a11=0x7fffffffccf0, bx1=0x7ffff6ef10c0, b11=0x7ffff6ef1700, c11=0x7fffebd51098, rs_c=140737150003608, cs_c=270180544,
data=0x10, cntx=0x1) at ref_kernels/3/bb/bli_gemmtrsmbb_ref.c:138
138 INSERT_GENTFUNC_BASIC3( gemmtrsmbb_l, BLIS_CNAME_INFIX, BLIS_REF_SUFFIX, BLIS_TRSM_L_UKR )
(gdb) bt
#0 bli_sgemmtrsmbb_l_power9_ref (k=4, alpha=0x10, a1x=0x64, a11=0x7fffffffccf0, bx1=0x7ffff6ef10c0, b11=0x7ffff6ef1700, c11=0x7fffebd51098, rs_c=140737150003608,
cs_c=270180544, data=0x10, cntx=0x1) at ref_kernels/3/bb/bli_gemmtrsmbb_ref.c:138
#1 0x000000001010b2a8 in bli_sgemmtrsm_l_ukernel (m=4, n=16, k=100, alpha=0x7fffffffccf0, a1x=0x7ffff6ef10c0, a11=0x7ffff6ef1700, bx1=0x7fffebd51098, b11=0x7fffebd52998,
c11=0x101aa0c0, rs_c=16, cs_c=1, data=0x7fffffffc200, cntx=0x101a08c0) at frame/3/bli_l3_ukr_tapi.c:124
#2 0x000000001003da60 in bli_gemmtrsm_ukernel (alpha=, a1x=, a11=0x7fffffffc5d0, bx1=, b11=, c11=,
cntx=0x101a08c0) at frame/3/bli_l3_ukr_oapi.c:182
#3 0x000000001000bbec in libblis_test_gemmtrsm_ukr_impl (iface=, side=, alpha=, a1x=, a11=,
bx1=, b11=, c11=, cntx=0x101a08c0) at testsuite/src/test_gemmtrsm_ukr.c:435
#4 0x000000001000c534 in libblis_test_gemmtrsm_ukr_experiment (params=0x7fffffffe130, op=, iface=, dc_str=,
pc_str=, sc_str=, p_cur=, perf=0x7fffffffd040, resid=0x7fffffffd038) at testsuite/src/test_gemmtrsm_ukr.c:355
#5 0x0000000010017d94 in libblis_test_op_driver (tdata=0x101a3e00, params=0x7fffffffe130, op=, iface=, op_str=0x1014a810 "gemmtrsm_ukr",
p_types=, o_types=, thresh=0x10190638 , f_exp=0x1000c0d0 <libblis_test_gemmtrsm_ukr_experiment>) at testsuite/src/test_libblis.c:2200
#6 0x000000001000bb7c in libblis_test_gemmtrsm_ukr (tdata=0x101a3e00, params=0x7fffffffe130, op=0x7fffffffde30) at testsuite/src/test_gemmtrsm_ukr.c:151
#7 0x0000000010011b9c in libblis_test_level3_ukrs (tdata=0x101a3e00, params=0x7fffffffe130, ops=0x7fffffffd3d0) at testsuite/src/test_libblis.c:311
#8 0x0000000010011d94 in libblis_test_all_ops (tdata=0x101a3e00, params=0x7fffffffe130, ops=0x7fffffffd3d0) at testsuite/src/test_libblis.c:230
#9 0x0000000010011e00 in libblis_test_thread_entry (tdata_void=) at testsuite/src/test_libblis.c:115
#10 0x0000000010011f44 in libblis_test_thread_decorator (params=0x7fffffffe130, ops=0x7fffffffd3d0) at testsuite/src/test_libblis.c:176
#11 0x00000000100019e8 in main (argc=, argv=0x7fffffffe5d8) at testsuite/src/test_libblis.c:84

So there is no special trsm kernel in the power9 directory. I could see only specialized SGEMM and DGEMM kernels for POWER9.

Looks like this is happening on POWER9 after ee9ff98

Having done a git bisect on a POWER9 system I can confirm this too.

Further investigation points at a missing modification of ref_kernels/3/bb/bli_gemmtrsmbb_ref.c (the parameters m and n are not introduced)

After changing that file similar to ee9ff98#diff-512c6a50b6244efae7b93e599fb1b295377718d298cac465e3f83db3718c4b03 I see many failures of the kind

% blis_<dt><op>_<params>_<stor>      m     n     k   gflops   resid      result
blis_cgemm1m_nn_ccc                100   100   100     1.34   1.97e-08   PASS
blis_cgemm_nn_ccc                  100   100   100     3.64   1.09e-08   PASS

% blis_<dt><op>_<params>_<stor>      m     n     k   gflops   resid      result
blis_zgemm1m_nn_ccc                100   100   100    17.96   3.82e-03   FAILURE
blis_zgemm_nn_ccc                  100   100   100     4.16   2.25e-17   PASS

I.e., only the "z" datatype with the "*1m" method is affected.

Using the "generic" configuration works.

@fgvanzee the issue seems to be missing edge case handling in gemmtrsmbb_ref. Since you fixed the other trsm/gemmtrsm kernels can you fix this?

I'll take a look at it.

Uhh, am I missing something, @devinamatthews?

$ ls ref_kernels/3
bli_gemm_ref.c  bli_gemmsup_ref.c  bli_gemmtrsm_ref.c  bli_trsm_ref.c  old

It's not clear what I need to fix.

The bb kernels live in some strange place.

Oh wait, didn't we fold them into the conventional reference microkernels?

@devinamatthews Ah, I think I see the problem now. It's not that the gemmtrsm ukernels lack edge-case handling; it's that they haven't been updated according to the latest way that the bb functionality has been folded in. (By contrast, it appears that the gemm ukernel has been updated. So I'll use that as my guide.)

I'll try to work on this more tomorrow.

@fgvanzee gemmtrsm reference kernels have been updated:

	PASTEMAC(ch,bcastbbs_mxn) \
	( \
	  m, \
	  n, \
	  b11, rs_b, cs_b  \
	); \

etc. @Flamefire did you test the tip of the master branch (currently af3a41e)?

@devinamatthews Ah, yes, I had previously overlooked this:

    const inc_t rs_b   = packnr; \
    const inc_t cs_b   = bli_cntx_get_blksz_def_dt( dt, BLIS_BBN, cntx ); \

I successfully tested af3a41e on Power9LE

Do you know which commit fixed that?

The Power9/10 kernels use a "broadcast packing" format, which messed up a lot of the older code, or rather, required some bespoke code which interacted poorly with [cz]gemm1m. I rewrote all of the reference packing kernels in ae10d94 in part to streamline this issue (note that that commit has a bug which was later fixed in 667f201).
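For readers unfamiliar with the format: "broadcast packing" stores each element of B several consecutive times, so a plain vector load yields a register full of duplicated values without an explicit splat instruction. A minimal illustration of the idea follows; this is a simplified sketch, not the actual BLIS packm kernel, and the function name is invented:

```c
#include <assert.h>

/* Pack one k x nr micropanel of B, duplicating each element 'bb' times
   along the fastest-varying dimension of the packed buffer. With
   bb == 2 and double-precision data, a 128-bit vector load then sees
   { b, b } directly. */
static void pack_b_broadcast( int k, int nr, int bb,
                              const double* b, int rs_b, int cs_b,
                              double* p )
{
    for ( int i = 0; i < k; i++ )           /* rows of the panel        */
        for ( int j = 0; j < nr; j++ )      /* columns of the panel     */
            for ( int d = 0; d < bb; d++ )  /* duplicate bb times       */
                *p++ = b[ i*rs_b + j*cs_b ];
}
```

The catch discussed in this thread is that downstream code (reference gemmtrsm kernels, 1m logic, triangular packing) must all agree on this duplicated layout, which is exactly where the power9/power10 configurations went out of sync.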

I guess we can close the issue then?

Yes, thanks!

Thanks for fixing this. make check now shows only these failures on POWER9.
Filename: out.zblat3

******* FATAL ERROR - PARAMETER NUMBER 11 WAS CHANGED INCORRECTLY *******
******* ZHEMM FAILED ON CALL NUMBER:
274: ZHEMM ('R','U', 1, 1,( 1.0, .0), A, 2, B, 2,( .0, .0), C, 2) .

ZSYMM PASSED THE TESTS OF ERROR-EXITS

******* FATAL ERROR - PARAMETER NUMBER 11 WAS CHANGED INCORRECTLY *******
******* ZSYMM FAILED ON CALL NUMBER:
274: ZSYMM ('R','U', 1, 1,( 1.0, .0), A, 2, B, 2,( .0, .0), C, 2) .

@RajalakshmiSR Maybe try compiling with -fstack-protector-strong when using GCC > 7. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799 which may be the issue. But just a wild guess.

@RajalakshmiSR there might be a bug where data is written off the end of the output matrix. Please try the following test program:

#include "blis.h"
#include <stdio.h>
#include <string.h>

int main(int argc, char** argv)
{
    obj_t A, B, C, a, b, c;

    bli_obj_create( BLIS_DCOMPLEX, 2, 2, 1, 2, &A );
    bli_obj_create( BLIS_DCOMPLEX, 2, 2, 1, 2, &B );
    bli_obj_create( BLIS_DCOMPLEX, 2, 2, 1, 2, &C );

    bli_setm( &BLIS_ONE, &A );
    bli_setm( &BLIS_ONE, &B );
    bli_setm( &BLIS_ZERO, &C );

    bli_obj_create_with_attached_buffer( BLIS_DCOMPLEX, 1, 1, bli_obj_buffer( &A ), 1, 2, &a );
    bli_obj_create_with_attached_buffer( BLIS_DCOMPLEX, 1, 1, bli_obj_buffer( &B ), 1, 2, &b );
    bli_obj_create_with_attached_buffer( BLIS_DCOMPLEX, 1, 1, bli_obj_buffer( &C ), 1, 2, &c );
    bli_obj_set_struc( BLIS_HERMITIAN, &a );
    bli_obj_set_uplo( BLIS_UPPER, &a );

    bli_printm( "before:", &C, "%4.1f", "" );
    
    bli_hemm( BLIS_RIGHT, &BLIS_ONE, &a, &b, &BLIS_ZERO, &c );
    
    bli_printm( "after:", &C, "%4.1f", "" );

    return 0;
}

It should print:

before:
 0.0 +  0.0   0.0 +  0.0  
 0.0 +  0.0   0.0 +  0.0  

after:
 1.0 +  0.0   0.0 +  0.0  
 0.0 +  0.0   0.0 +  0.0  

Any change in the zero values in the "after" matrix would be a problem.

Not sure if this is relevant, but I just realized this morning that unless @devinamatthews's recent changes (vis-a-vis merging the bb packm functionality into a single reference kernel) extends bb support to triangular packing, any bb-utilizing subconfiguration needs to #define BLIS_DISABLE_TRMM_RIGHT, as is done for power9:

$ grep -R BLIS_DISABLE_TRMM_RIGHT config
./config/old/haswellbb/bli_family_haswell.h:#define BLIS_DISABLE_TRMM_RIGHT
./config/power9/bli_family_power9.h:#define BLIS_DISABLE_TRMM_RIGHT

This macro, as its name suggests, disables right-side trmm, expressing it instead in terms of left-side trmm. This forces the triangular matrix to always be on the left, so a column-preferential gemm microkernel can always rely on broadcast values being present in the right-hand matrix.

Like power9, power10 also seems to use explicit broadcasting of B during packing, but it fails to set the above macro. That could cause issues for trmm, and possibly non-deterministic behavior elsewhere.

And hemm/symm too right?

Yes, that's right. hemm/symm also require their own macros (because the Hermitian and symmetric packing formats presumably would likewise not support explicit broadcast). And so does trmm3 (for the same reasons as trmm).
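Assuming the power10 family header should mirror the existing power9 one, the fix being discussed would presumably be a fragment like the following in config/power10/bli_family_power10.h. These macro names are the ones documented in BLIS's configuration guide, but whether all four are needed for power10 is exactly the open question here, so treat this as a sketch, unverified on hardware:

```c
/* Sketch only: force right-sided cases of trmm/trmm3/hemm/symm to be
   recast as left-sided cases, so the broadcast-packed operand is
   always the non-structured matrix on the right. */
#define BLIS_DISABLE_TRMM_RIGHT
#define BLIS_DISABLE_TRMM3_RIGHT
#define BLIS_DISABLE_HEMM_RIGHT
#define BLIS_DISABLE_SYMM_RIGHT
```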

Also just noticed that power10's subconfig registers preferences for complex gemm ukernels without registering any actual complex ukernels. 🤔

void bli_cntx_init_power10( cntx_t* cntx )
{
    blksz_t blkszs[ BLIS_NUM_BLKSZS ];

    // Set default kernel blocksizes and functions.
    bli_cntx_init_power10_ref( cntx );

    // -------------------------------------------------------------------------

    // Update the context with optimized native gemm micro-kernels.
    bli_cntx_set_ukrs
    (
      cntx,

      // level-3
      BLIS_GEMM_UKR, BLIS_FLOAT,  bli_sgemm_power10_mma_8x16,
      BLIS_GEMM_UKR, BLIS_DOUBLE, bli_dgemm_power10_mma_8x8,

      BLIS_VA_END
    );

    // Update the context with storage preferences.
    bli_cntx_set_ukr_prefs
    (
      cntx,

      // level-3
      BLIS_GEMM_UKR_ROW_PREF,   BLIS_FLOAT,    TRUE,
      BLIS_GEMM_UKR_ROW_PREF,   BLIS_DOUBLE,   TRUE,
      BLIS_GEMM_UKR_ROW_PREF,   BLIS_SCOMPLEX, FALSE,
      BLIS_GEMM_UKR_ROW_PREF,   BLIS_DCOMPLEX, FALSE,
      BLIS_TRSM_L_UKR_ROW_PREF, BLIS_FLOAT,    FALSE,
      BLIS_TRSM_U_UKR_ROW_PREF, BLIS_FLOAT,    FALSE,
      BLIS_TRSM_L_UKR_ROW_PREF, BLIS_DOUBLE,   FALSE,
      BLIS_TRSM_U_UKR_ROW_PREF, BLIS_DOUBLE,   FALSE,
      BLIS_TRSM_L_UKR_ROW_PREF, BLIS_SCOMPLEX, FALSE,
      BLIS_TRSM_U_UKR_ROW_PREF, BLIS_SCOMPLEX, FALSE,
      BLIS_TRSM_L_UKR_ROW_PREF, BLIS_DCOMPLEX, FALSE,
      BLIS_TRSM_U_UKR_ROW_PREF, BLIS_DCOMPLEX, FALSE,

      BLIS_VA_END
    );

Could be harmless, since 1m method code should always read the preference of the real domain ukernel. But nonetheless I think those extra entries should be nixed unless there's something I'm missing.

EDIT: It would appear that the gemm1m virtual microkernel is doing The Right Thing:tm: by grabbing the preference of the real domain ukernel, so the above is likely not a correctness bug:

    PASTECH(chr,gemm_ukr_ft) \
                      rgemm_ukr = bli_cntx_get_ukr_dt( dt_r, BLIS_GEMM_UKR, cntx ); \
    const bool        col_pref  = bli_cntx_ukr_prefers_cols_dt( dt_r, BLIS_GEMM_UKR, cntx ); \

Although it's still confusing AF to anyone who isn't a core BLIS developer, and therefore should be cleaned up.

Another observation: power10 uses row-preferential ukernels (which use broadcast values of A, not B) but has broadcast-B values defined in config/power10/bli_kernel_defs_power10.h. 🤔

It's a wonder any of this power10 code works at all.

Okay, I'm going to consult with @nicholaiTukanov, the original author of the power10 subconfig and kernel set, to make sure I understand how things should be registered before I begin suggesting any bugfixes.

@RajalakshmiSR there might be a bug where data is written off the end of the output matrix. Please try the following test program:

It should print:

before:
 0.0 +  0.0   0.0 +  0.0  
 0.0 +  0.0   0.0 +  0.0  

after:
 1.0 +  0.0   0.0 +  0.0  
 0.0 +  0.0   0.0 +  0.0  

Any change in the zero values in the "after" matrix would be a problem.

Yes, I could see the same result.

@RajalakshmiSR Maybe try compiling with -fstack-protector-strong when using GCC > 7. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799 which may be the issue. But just a wild guess.

Still the same issue. Tried it like CFLAGS="-fstack-protector-strong" ./configure power9

Yes, I could see the same result.

@RajalakshmiSR strange, that is exactly the test that is supposed to be failing in zblat3. I might write a Fortran version just in case that makes a difference.

Here's a Fortran version. It can't get much more similar to the failing test.

program main

double complex A(2,2), B(2,2), C(2,2)
double complex alpha, beta

A(1,1) = (1.0,0.0)
A(2,1) = (1.0,0.0)
A(1,2) = (1.0,0.0)
A(2,2) = (1.0,0.0)

B(1,1) = (1.0,0.0)
B(2,1) = (1.0,0.0)
B(1,2) = (1.0,0.0)
B(2,2) = (1.0,0.0)

C(1,1) = (0.0,0.0)
C(2,1) = (0.0,0.0)
C(1,2) = (0.0,0.0)
C(2,2) = (0.0,0.0)

alpha = (1.0,0.0)
beta = (0.0,0.0)

WRITE(*,*)'before:'
WRITE(*,'(2(A,F4.1,A,F4.1,A,X))')(('(',REAL(C(I,J)),',',AIMAG(C(I,J)),')',J=1,2),I=1,2)
WRITE(*,*)

CALL ZHEMM('R','U',1,1,alpha,A,2,B,2,beta,C,2)

WRITE(*,*)'after:'
WRITE(*,'(2(A,F4.1,A,F4.1,A,X))')(('(',REAL(C(I,J)),',',AIMAG(C(I,J)),')',J=1,2),I=1,2)
WRITE(*,*)

END PROGRAM MAIN

@ivan23kor I fixed what was probably a power10 bug in 5b29893, although I can't be sure how it would have manifested. If you can, please give this commit a try. (We don't have easy access to any Power hardware.)

Sorry these initial questions weren't answered in a timely manner.

The POWER10 build fails because the POWER10 microkernels use the old microkernel API without edge handling (addressed in #620).

@ivan23kor Thank you for catching that, and submitting the PR fix.

Also, the type nibble is defined in the sandbox but used for the i4 microkernel, so any POWER10 user has to configure BLIS with -s power10. Is that intentional?

For now, yes. That code could be transitioned from being a sandbox to being an addon, but even then you'd have to explicitly request it via configure.

POWER9 uses slow reference implementations for sgemm, cgemm, and zgemm by default. Are there plans to support a fast sgemm microkernel for POWER9?

Not at this time. We don't have any Power expertise or hardware access in our core development group. The collaborator who wrote the power9 dgemm microkernel has moved on to other adventures, although he still stays involved in our community. Care to comment, @nicholaiTukanov?

We can submit a request in OSUOSL to access POWER9 systems for CI service.
https://osuosl.org/services/powerdev/
As of now, the only optimized kernel is dgemm for POWER9. There are no current plans to add optimized kernels for the other datatypes on POWER9.

Hey @ivan23kor, glad to see someone using and testing the POWER kernels 😄

POWER9 uses slow reference implementations for sgemm, cgemm, and zgemm by default. Are there plans to support a fast sgemm microkernel for POWER9?

Due to other commitments, I will not be able to work on these. However, if one was interested, they could take my dgemm code as a starting point for other real domain kernels.

Also, the type nibble is defined in the sandbox but used for the i4 microkernel, so any POWER10 user has to configure BLIS with -s power10. Is that intentional?

This is intentional, since BLIS doesn't support low-precision/integer datatypes. Hopefully, with all the changes @devinamatthews and @fgvanzee are making, this won't be the case for much longer.

Not sure if this is relevant, but I just realized this morning that unless @devinamatthews's recent changes (vis-a-vis merging the bb packm functionality into a single reference kernel) extends bb support to triangular packing, any bb-utilizing subconfiguration needs to #define BLIS_DISABLE_TRMM_RIGHT, as is done for power9:

$ grep -R BLIS_DISABLE_TRMM_RIGHT config
./config/old/haswellbb/bli_family_haswell.h:#define BLIS_DISABLE_TRMM_RIGHT
./config/power9/bli_family_power9.h:#define BLIS_DISABLE_TRMM_RIGHT

This macro, as its name suggests, disables right-side trmm, expressing it instead in terms of left-side trmm. This forces the triangular matrix to always be on the left, so a column-preferential gemm microkernel can always rely on broadcast values being present in the right-hand matrix.

Like power9, power10 also seems to use explicit broadcasting of B during packing, but it fails to set the above macro. That could cause issues for trmm, and possibly non-deterministic behavior elsewhere.

@fgvanzee
Can you please help me understand whether disabling right-side hemm and symm was done for performance improvement or for functional correctness? (in the case of power9)

Can you please help me understand whether disabling right-side hemm and symm was done for performance improvement or for functional correctness? (in the case of power9)

Functional correctness (although there are performance considerations lurking within).

We actually don't want to disable right-sided hemm and symm implementations. Having access to both left- and right-sided implementations allows us to efficiently handle mismatches between the storage of matrix C and the IO preference of the gemm microkernel -- via a transposition of the entire operation. (In those cases, in order to satisfy the microkernel preference, a left-sided case executes via the right-sided code path, or vice versa.)

But the right-sided hemm and symm code paths imply that "matrix B" (that is, the matrix on the right) is the one with Hermitian or symmetric structure. The power9 kernel requires the packed micropanels from "matrix B" (again, the matrix on the right) to be duplicated. And, for now, BLIS does not support packing a Hermitian/symmetric matrix to a buffer while also duplicating the elements.

So, what are the consequences of disabling right-sided hemm and symm? It means that all cases, both left and right, flow through the left-sided code path. (Notice that a right-sided case can be reexpressed as a left-sided case by transposing the entire operation -- the same technique we typically reserve for reconciling storage-of-C / IO-preference mismatches.) This means we can't always guarantee that the storage of C matches the IO preference of the microkernel, which means the microkernel's element-wise IO branch will sometimes execute (which is slower).
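The transposition trick can be written out explicitly for the symmetric case (the Hermitian case is identical with conjugate-transposes in place of transposes):

```latex
% Right-sided symm: the structured matrix B sits on the right.
C = \alpha A B + \beta C, \qquad B = B^{T}
% Transpose the entire operation:
\;\Longrightarrow\;
C^{T} = \alpha B^{T} A^{T} + \beta C^{T} = \alpha\, B\, A^{T} + \beta\, C^{T},
% which is a left-sided symm on C^T, with B now on the left.
```

So a right-sided case always has an equivalent left-sided formulation, at the cost of operating on the transposed view of C.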

One way to mitigate this performance issue would be to add branches to the power9 microkernel that perform in-register transposes when there is a mismatch, so that vector loads/stores could still be used to update C (rather than doing it element-by-element). But this in-register transpose is sometimes expensive, and will only lessen (never eliminate) the performance drag of calling the gemm microkernel on C with an unsatisfied IO preference.

Thanks @fgvanzee for the explanation.

With regards to the current hemm/symm failures (power9, BLAS tests): when a transpose is induced (i.e., when a Hermitian/symmetric matrix B multiplies A from the right), the row and column strides of C are also swapped, and the 1m optimisation's row and column stride calculations for C then go wrong. This happens when BLIS_DISABLE_HEMM_RIGHT (or BLIS_DISABLE_SYMM_RIGHT) is defined. I checked after removing these macros in the power9 case, and the tests pass.

The other option is to fix the 1m optimisation's row and column stride calculations for C, while retaining the BLIS_DISABLE_HEMM_RIGHT and BLIS_DISABLE_SYMM_RIGHT macros.

Please suggest the optimal path to take.

@nisanthmp Thanks for that update. I don't quite understand the issue you described with 1m and the row/column strides of C. Is this something I would be able to reproduce on an Intel system (where the microkernel does not need to duplicate/broadcast elements of B during packing)?

Stepping through a concrete example might also help.

@fgvanzee I'm trying to find an Intel based system, where I can reproduce the error. I need the 1m optimisations enabled while BLIS_DISABLE_HEMM_RIGHT defined. Is there a way to enable 1m optimisations on any Intel based system?

On power9 based system, I was able to reproduce the error with the code snippet, pasted below:

#include<stdio.h>
#include "blis.h"

#define DIM 1
#define LDM 2
int main() {

dcomplex a[LDM * DIM] = { {-0.412587, -10000000000.000000}, {-10000000000.000000, 10000000000.000000} };
dcomplex b[LDM * DIM] = { {0.336663, 0.050949}, {-10000000000.000000, 10000000000.000000} };
dcomplex c[LDM * DIM];
int I, J, M, N, lda, ldb, ldc;
dcomplex alpha, beta;

M = DIM;
N = M;
lda = 2;
ldb = 2;
ldc = 2;
alpha.real = 1.0; alpha.imag = 0.0;
beta.real = 0.0; beta.imag = 0.0;

printf("a = \n");
for ( I = 0; I < LDM; I ++ ) {
	for ( J = 0; J < N; J ++ ) {
		printf("%lf + i%lf\t", a[J * LDM + I].real, a[J * LDM + I].imag);
	}
	printf("\n");
}
printf("b = \n");
for ( I = 0; I < LDM; I ++ ) {
	for ( J = 0; J < N; J ++ ) {
		printf("%lf + i%lf\t", b[J * LDM + I].real, b[J * LDM + I].imag);
	}
	printf("\n");
}

zhemm_("R","U",&M,&N,&alpha,a,&lda,b,&ldb,&beta,c,&ldc);

printf("c = \n");
for ( I = 0; I < LDM; I ++ ) {
	for ( J = 0; J < N; J ++ ) {
		printf("%lf + i%lf\t", c[J * LDM + I].real, c[J * LDM + I].imag);
	}
	printf("\n");
}

return 0;
}

I will try to explain what I was trying to say with regards to 1m related optimisations and row and column strides of C:

When I run the above code, with 1m optimisations and BLIS_DISABLE_HEMM_RIGHT defined, I get the following output.

a =
-0.412587 + i-10000000000.000000
-10000000000.000000 + i10000000000.000000
b =
0.336663 + i0.050949
-10000000000.000000 + i10000000000.000000
Before 1m specific optimization:
dt_exec: 3, dt_c: 3, m: 1, n: 1, k: 1, pd_a: 6, ps_a: 12, pd_b: 6, ps_b: 12, rs_c: 2, cs_c: 1
After 1m specific optimization:
dt_exec: 2, dt_c: 2, m: 2, n: 1, k: 2, pd_a: 12, ps_a: 24, pd_b: 6, ps_b: 24, rs_c: 2, cs_c: 2
c =
-0.138903 + i0.000000
-0.021021 + i0.000000


The prints are in "frame/3/gemm/bli_gemm_ker_var2.c", around the 1m specific optimisations to call real domain macro-kernel instead of 1m virtual micro-kernel:

    // If 1m is being employed on a column- or row-stored matrix with a
    // real-valued beta, we can use the real domain macro-kernel, which
    // eliminates a little overhead associated with the 1m virtual
    // micro-kernel.
    // Only employ this optimization if the storage datatype of C is
    // equal to the execution/computation datatype.
#if 1
    printf("Before 1m specific optimization:\n");
    printf("dt_exec: %d, dt_c: %d, m: %d, n: %d, k: %d, pd_a: %d, ps_a: %d, pd_b: %d, ps_b: %d, rs_c: %d, cs_c: %d\n", dt_exec, dt_c, m, n, k, pd_a, ps_a, pd_b, ps_b, rs_c, cs_c);
    if ( bli_cntx_method( cntx ) == BLIS_1M )
    {
            bli_gemm_ind_recast_1m_params
            (
              &dt_exec,
              &dt_c,
              schema_a,
              c,
              &m, &n, &k,
              &pd_a, &ps_a,
              &pd_b, &ps_b,
              &rs_c, &cs_c
            );
            printf("After 1m specific optimization:\n");
            printf("dt_exec: %d, dt_c: %d, m: %d, n: %d, k: %d, pd_a: %d, ps_a: %d, pd_b: %d, ps_b: %d, rs_c: %d, cs_c: %d\n", dt_exec, dt_c, m, n, k, pd_a, ps_a, pd_b, ps_b, rs_c, cs_c);

    }
#endif

As you can see in the printed output, before calling "bli_gemm_ind_recast_1m_params()", row and column strides of C are rs_c: 2, cs_c: 1, and afterwards, they are rs_c: 2, cs_c: 2.

If the values of rs_c and cs_c were 1 and 4 respectively, the computation would produce correct results. That's why I suspect that the row and column stride calculations for C are going wrong.

Hope I was able to explain this properly.
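To make the units concrete, here is a toy version of the recast rule for the common column-stored case. This is an illustration of the idea only, not BLIS's actual bli_gemm_ind_recast_1m_params logic, and the function name is invented: a column-stored complex C with strides (rs_c, cs_c) = (1, ldc), measured in complex elements, becomes a real matrix with m and k doubled and strides (1, 2*ldc) measured in real elements.

```c
#include <assert.h>

/* Toy recast of 1m parameters for a column-stored C. Strides go in as
   complex-element units and come out as real-element units; m and k
   double because each complex value occupies two adjacent reals.
   Illustrative only -- not the BLIS implementation. */
static void recast_1m_col( int* m, int* k, int* rs_c, int* cs_c )
{
    *m    *= 2;   /* each complex row becomes two real rows            */
    *k    *= 2;   /* the inner (accumulation) dimension doubles too    */
    *rs_c  = 1;   /* still unit row stride, now in real elements       */
    *cs_c *= 2;   /* column stride doubles when counted in reals       */
}
```

With m = n = 1 and ldc = 2, this yields (rs_c, cs_c) = (1, 4) in real units, which are the values described above as producing correct results; the failing power9 run instead computed (2, 2) after the recast, consistent with the strides having been swapped by the induced transpose before the 1m recast was applied.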

@fgvanzee I could find an Intel system to reproduce this. The same test code listed above was run on a Xeon 8380 (Ice Lake) system, with BLIS_DISABLE_HEMM_RIGHT defined.

@nisanthmp Thank you for all of this information. I will start trying to reproduce on my end.

@nisanthmp I tried to use your code to reproduce the error as follows:

  1. I added #define BLIS_DISABLE_HEMM_RIGHT to config/haswell/bli_family_haswell.h.
  2. I configured BLIS with auto, which chose the haswell subconfig on my Intel desktop.
  3. make
  4. I compiled the test driver code you provided and ran it. It gave me this output, which I think is what it is supposed to give:
a = 
-0.412587 + i-10000000000.000000	
-10000000000.000000 + i10000000000.000000	
b = 
0.336663 + i0.050949	
-10000000000.000000 + i10000000000.000000	
Before 1m specific optimization:
dt_exec: 3, dt_c: 3, m: 1, n: 1, k: 1, pd_a: 6, ps_a: 6, pd_b: 4, ps_b: 8, rs_c: 2, cs_c: 1
After 1m specific optimization:
dt_exec: 2, dt_c: 2, m: 1, n: 2, k: 2, pd_a: 6, ps_a: 12, pd_b: 8, ps_b: 16, rs_c: 4, cs_c: 1
c = 
-0.138903 + i-0.021021	
0.000000 + i0.000000

Did I miss anything? Perhaps you could give further details on how you configured and built for your Intel Xeon 8380.

@fgvanzee I was also not able to reproduce the error on all Intel systems. Of the Intel systems I have access to, I could reproduce it only on the Ice Lake (Xeon 8380) one. The only difference in configuration was that I configured for "x86_64" instead of "auto". Other than that, everything looks the same.

I tried with "configure auto" also (on Xeon 8380). That way it chose "skx", so added "#define BLIS_DISABLE_HEMM_RIGHT" in "config/skx/bli_family_skx.h " and the error is still there.

I tried with "configure auto" also (on Xeon 8380). That way it chose "skx", so added "#define BLIS_DISABLE_HEMM_RIGHT" in "config/skx/bli_family_skx.h " and the error is still there.

@nisanthmp Ahhh, that helps (knowing the exact subconfig it chose)! Thanks for your reply. I'll take a closer look at skx and see if I can find anything out. (I might also have a SkylakeX system to test on, which, while not the same as Ice Lake should be comparable since it also uses the skx subconfig.)

EDIT: Could you confirm which version/commit of BLIS you are using?

@nisanthmp I was able to reproduce the error on my haswell-based system by:

  1. Switching to column-preferential microkernels in bli_cntx_init_haswell.c. (The skx subconfig uses kernels that prefer IO by columns.)
  2. Disabling registration of all c and z kernels and blocksizes in bli_cntx_init_haswell.c. (The skx subconfig only registers real-domain kernels/values.) This forces use of the 1m implementation for complex datatypes.
  3. Adding #define BLIS_DISABLE_HEMM_RIGHT to bli_family_haswell.h. (The skx subconfig does not disable any right-sided operations by default.)
  4. Running your test driver code.

I'll continue to investigate and get back to you! Thank you for your patience and diligent bug reports.

@nisanthmp #697 is the best I can do at fixing the problem. Basically, we have to avoid the 1m-specific optimization if BLIS_DISABLE_HEMM_RIGHT is defined (and if that results in the storage of C mismatching the gemm microkernel IO preference). Please confirm the fix on your end, and I'll merge it shortly thereafter.

Thanks for your feedback on this!

@fgvanzee I checked #697 on a power9 system and it's working fine, please merge. Thank you for prioritising this and fixing it!

@fgvanzee, Shall we please close this issue now?

Certainly. Thank you again for your patience and feedback on this issue, @nisanthmp!