flame / blis

We use BLIS in spaCy and, since updating to BLIS 0.9.0, have encountered access violations in our Windows CI. I tried to reproduce the issue on Linux and get out-of-bounds reads with this small program:

https://gist.github.com/danieldk/b4f275e715fb8131d6fe5435e9f29a98

% BLIS_ARCH_DEBUG=1 BLIS_ARCH_TYPE=3 valgrind --leak-check=no ./test
==52125== Memcheck, a memory error detector
==52125== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==52125== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==52125== Command: ./test
==52125== 
libblis: selecting sub-configuration 'haswell'.
==52125== Invalid read of size 4
==52125==    at 0x491BA65: bli_sgemmsup_rv_haswell_asm_3x2 (bli_gemmsup_rv_haswell_asm_sMx2.c:1452)
==52125==    by 0x4903413: bli_sgemmsup_rv_haswell_asm_3x16n (bli_gemmsup_rv_haswell_asm_s6x16n.c:3563)
==52125==    by 0x4904C18: bli_sgemmsup_rv_haswell_asm_6x16n (bli_gemmsup_rv_haswell_asm_s6x16n.c:146)
==52125==    by 0x5358E4E: bli_sgemmsup_ref_var1n (bli_l3_sup_var1n2m.c:704)
==52125==    by 0x53594F6: bli_gemmsup_ref_var1n (bli_l3_sup_var1n2m.c:210)
==52125==    by 0x534E923: bli_gemmsup_int (bli_l3_sup_int.c:228)
==52125==    by 0x53D4198: bli_l3_sup_thread_decorator (bli_l3_sup_decor_single.c:113)
==52125==    by 0x5351774: bli_gemmsup_ref (bli_l3_sup_ref.c:109)
==52125==    by 0x534E40D: bli_gemmsup (bli_l3_sup.c:122)
==52125==    by 0x534CE6E: bli_gemm_ex (bli_l3_oapi_ex.c:70)
==52125==    by 0x5392619: sgemm_ (bla_gemm.c:264)
==52125==    by 0x53A7BE1: cblas_sgemm (cblas_sgemm.c:108)
==52125==  Address 0x57740f8 is 0 bytes after a block of size 28,728 alloc'd
==52125==    at 0x484586F: malloc (vg_replace_malloc.c:381)
==52125==    by 0x40151A: main (test.c:16)

On the other hand, when using the penryn or sandybridge kernels, I don't get memory errors. The block that is read past is the one Valgrind reports as allocated at main (test.c:16), which is the C (output) matrix.

BLIS was built with the following options:

./configure --enable-cblas --enable-blas --enable-debug=opt --export-shared=all --prefix=$PWD/out x86_64

gcc version:

% gcc --version
gcc (GCC) 12.1.1 20220507 (Red Hat 12.1.1-1)

@danieldk what BLIS version/commit are you using? Also, can you please post a minimal reproducer?

@danieldk what BLIS version/commit are you using?

Version 0.9.0 (14c86f6), but master (I tested 5677289) has the same issue.

Also, can you please post a minimal reproducer?

I am not sure I can come up with something more minimal than the linked gist?

https://gist.github.com/danieldk/b4f275e715fb8131d6fe5435e9f29a98

Missed that, sorry.

It looks like the issue is with the vector load in the fringe-case kernels, where the load tries to access memory out of bounds. I am able to reproduce the out-of-bounds read and will provide a fix.

Thanks, @BhaskarNallani.

Hi @fgvanzee,
Here is the fix:

-vfmadd231ps(mem(rcx, 0*32), xmm3, xmm4)
+vmovsd(mem(rcx), xmm5)
+vfmadd231ps(xmm5, xmm3, xmm4)

-vfmadd231ps(mem(rcx, 0*32), xmm3, xmm6)
+vmovsd(mem(rcx), xmm5)
+vfmadd231ps(xmm5, xmm3, xmm6)

-vfmadd231ps(mem(rcx, 0*32), xmm3, xmm8)
+vmovsd(mem(rcx), xmm5)
+vfmadd231ps(xmm5, xmm3, xmm8)

When C is loaded and scaled by beta, the vfmadd231ps instruction takes its operand directly from C memory. With xmm registers this is a 128-bit access, which reaches beyond the 2 float values (64 bits) that are actually valid there. Other fringe cases may need the same correction, especially those where the n dimension is less than 4 in sgemm or less than 2 in dgemm.
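As a rough scalar model of the fix above (not the BLIS kernel itself), the sketch below shows why the narrower load matters. `fringe_update_2` is a hypothetical helper name, `scale` stands for the broadcast scalar held in xmm3, and `acc` stands for the four-lane xmm accumulator. Only the two valid floats of the C row are copied into a zeroed temporary, which mirrors the `vmovsd` load into a spare register, and the multiply-add then works on registers only.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical scalar model of the fixed fringe update for a row of C that
 * holds only 2 valid floats. acc models the 4-lane xmm accumulator and
 * scale models the broadcast scalar held in xmm3. */
static void fringe_update_2(float *c_row, float scale, const float acc[4])
{
    float tmp[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    float out[4];

    assert(c_row != NULL);

    /* 64-bit load: touch exactly 2 floats of C (the vmovsd step). A direct
     * 128-bit memory operand would read 4 floats here, 2 of them past the
     * valid part of the row. */
    memcpy(tmp, c_row, 2 * sizeof(float));

    /* Register-only multiply-add over all 4 lanes: out = tmp * scale + acc. */
    for (int i = 0; i < 4; ++i)
        out[i] = tmp[i] * scale + acc[i];

    /* 64-bit store: write back only the 2 valid floats. */
    memcpy(c_row, out, 2 * sizeof(float));
}
```

The upper two lanes of the temporary are zero, so they contribute nothing that is ever stored. The cost of the fix is one extra register and one extra instruction per fringe row.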

https://github.com/flame/blis/blob/master/kernels/haswell/3/sup/s6x16/bli_gemmsup_rv_haswell_asm_sMx2.c
Line numbers: 1686, 1691 and 1696

I will check the other files in sup for similar bugs and update them as well.

@danieldk Sorry it took so long to apply the fix to this bug. It fell off my radar, but I stumbled upon it again yesterday and went ahead and applied the fix to all functions in the file that Bhaskar identified above. Thanks for the bug report, and please let us know if you have any further issues, comments, or concerns.

I'll merge the fix into master soon.

@BhaskarNallani I've applied your fix to all six functions in bli_gemmsup_rv_haswell_asm_sMx2.c. If you found any other issues, either within that file or in neighboring files, please let me know.

Sorry it took so long to apply the fix to this bug.

No worries, thanks to you and @BhaskarNallani for fixing this! For the last spaCy release, we have reverted to BLIS 0.7, but we'll update to 0.9 now to get all the nice improvements since 0.7.

We performed a more thorough sanity check by permuting the matrix sizes in the code Daniël posted above, and we came across another, potentially similar OOB read in a different Haswell kernel:

==274135== Invalid read of size 4
==274135==    at 0x15C8D0: bli_sgemmsup_rv_haswell_asm_6x2m (bli_gemmsup_rv_haswell_asm_s6x16m.c:4157)
==274135==    by 0x15D8A8: bli_sgemmsup_rv_haswell_asm_6x16m (bli_gemmsup_rv_haswell_asm_s6x16m.c:162)
==274135==    by 0x9CF01D: bli_sgemmsup_ref_var2m (bli_l3_sup_var1n2m.c:1322)
==274135==    by 0x9D56DE: bli_gemmsup_ref_var2m (bli_l3_sup_var1n2m.c:858)
==274135==    by 0x9CD24D: bli_gemmsup_int (bli_l3_sup_int.c:217)
==274135==    by 0x9ABBDC: bli_l3_sup_thread_decorator (bli_l3_sup_decor_single.c:113)
==274135==    by 0x9A0EE1: bli_gemmsup_ref (bli_l3_sup_ref.c:109)
==274135==    by 0x122AF7: bli_gemmsup (bli_l3_sup.c:122)
==274135==    by 0x121B44: bli_gemm_ex (bli_l3_oapi_ex.c:70)
==274135==    by 0x10E626: sgemm_ (bla_gemm.c:264)
==274135==    by 0x10DBD9: cblas_sgemm (cblas_sgemm.c:108)
==274135==    by 0x10D6A4: main (test.c:26)
==274135==  Address 0x4bf10e8 is 0 bytes after a block of size 744 alloc'd
==274135==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==274135==    by 0x10D667: main (test.c:24)
==274135==

This read can be consistently reproduced with the following size parameters: M=6, K=512, N=31. BLIS was built with the same configuration as before.
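The reported block size is consistent with the C matrix for these dimensions: 6 x 31 single-precision floats occupy 6 * 31 * 4 = 744 bytes. The sketch below computes how far a load of a given width runs past the end of such a buffer. It assumes a densely packed row-major C, which the report does not state, and `bytes_past_end` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes by which a load of load_bytes starting at element (row, col) runs
 * past the end of a dense row-major m x n float matrix; 0 if it fits.
 * Hypothetical helper for illustration; assumes a row stride of exactly n. */
static size_t bytes_past_end(size_t m, size_t n, size_t row, size_t col,
                             size_t load_bytes)
{
    assert(row < m && col < n);

    size_t buf_bytes = m * n * sizeof(float);
    size_t load_end = (row * n + col) * sizeof(float) + load_bytes;

    return load_end > buf_bytes ? load_end - buf_bytes : 0;
}
```

Taking the last two columns of the last row purely as an illustration, `bytes_past_end(6, 31, 5, 29, 16)` gives 8 for a 128-bit load, while the 64-bit load `bytes_past_end(6, 31, 5, 29, 8)` gives 0. On earlier rows the same wide load stays inside the allocation, which may be why only some positions are flagged.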

EDIT: The following patch fixes it.

--- a/kernels/haswell/3/sup/bli_gemmsup_rv_haswell_asm_s6x16m.c
+++ b/kernels/haswell/3/sup/bli_gemmsup_rv_haswell_asm_s6x16m.c
@@ -4502,7 +4502,8 @@ void bli_sgemmsup_rv_haswell_asm_6x2m
 	add(rdi, rcx)
 	
 	
-	vfmadd231ps(mem(rcx, 0*32), xmm3, xmm14)
+	vmovsd(mem(rcx, 0*32), xmm0)
+	vfmadd231ps(xmm0, xmm3, xmm14)
 	vmovsd(xmm14, mem(rcx, 0*32))
 	//add(rdi, rcx)

Thank you, @shadeMe. I will look into this.

@shadeMe Please give 4dde947 a try.

I ran another sweep test, since we are still encountering issues in Windows CI. I am not sure whether this is caused by the ASAN tooling, but the following error is triggered only for some input sizes:

% ./blis_test
[...]
m: 2, k: 1, n: 4
AddressSanitizer:DEADLYSIGNAL
=================================================================
==223255==ERROR: AddressSanitizer: SEGV on unknown address 0xfffffffffffffff4 (pc 0x7fdfc456af5c bp 0x00000000001c sp 0xfffffffffffffff4 T0)
==223255==The signal is caused by a READ memory access.
AddressSanitizer:DEADLYSIGNAL
AddressSanitizer: nested bug in the same thread, aborting.

It somehow seems that the stack pointer got corrupted? More info from LLDB:

(lldb) bt
* thread #1, name = 'blis_test', stop reason = signal SIGSEGV: invalid address (fault address: 0xfffffffffffffff4)
  * frame #0: 0x00007ffff6becf5c libblis.so.4`bli_sgemmsup_rv_haswell_asm_2x4(conja=<unavailable>, conjb=<unavailable>, m0=<unavailable>, n0=<unavailable>, k0=<unavailable>, alpha=<unavailable>, a=<unavailable>, rs_a0=<unavailable>, cs_a0=<unavailable>, b=<unavailable>, rs_b0=<unavailable>, cs_b0=<unavailable>, beta=<unavailable>, c=<unavailable>, rs_c0=<unavailable>, cs_c0=<unavailable>, data=<unavailable>, cntx=<unavailable>) at bli_gemmsup_rv_haswell_asm_sMx4.c:2313:1
(lldb) disassemble
[...]
    0x7ffff6becf33 <+1347>: movq   $0x0, (%rax)
    0x7ffff6becf3a <+1354>: vmovups %ymm0, 0x9(%rax)
    0x7ffff6becf3f <+1359>: movq   $0x0, 0x29(%rax)
    0x7ffff6becf47 <+1367>: movl   $0x0, 0x31(%rax)
    0x7ffff6becf4e <+1374>: movw   $0x0, 0x35(%rax)
    0x7ffff6becf54 <+1380>: movb   $0x0, 0x37(%rax)
    0x7ffff6becf58 <+1384>: leaq   -0x28(%rbp), %rsp
->  0x7ffff6becf5c <+1388>: popq   %rbx
    0x7ffff6becf5d <+1389>: popq   %r12
    0x7ffff6becf5f <+1391>: popq   %r13
    0x7ffff6becf61 <+1393>: popq   %r14
    0x7ffff6becf63 <+1395>: popq   %r15
    0x7ffff6becf65 <+1397>: popq   %rbp
    0x7ffff6becf66 <+1398>: vzeroupper 
    0x7ffff6becf69 <+1401>: retq   
[...]
(lldb) register read
General Purpose Registers:
       rax = 0x000010007fff77ec
       rbx = 0x0000000000000014
       rcx = 0x0000603000006170
       rdx = 0x00006020000041b0
       rdi = 0x0000000000000010
       rsi = 0x0000000000000004
       rbp = 0x000000000000001c
       rsp = 0xfffffffffffffff4
        r8 = 0x0000000000000004
        r9 = 0x0000000000000004
       r10 = 0x0000000000000010
       r11 = 0x00007ffff6bec9f0  libblis.so.4`bli_sgemmsup_rv_haswell_asm_2x4 at bli_gemmsup_rv_haswell_asm_sMx4.c:1965
       r12 = 0x00007fffffffc2e0
       r13 = 0x000000000000000c
       r14 = 0x0000000000000000
       r15 = 0x0000000000000014
       rip = 0x00007ffff6becf5c  libblis.so.4`bli_sgemmsup_rv_haswell_asm_2x4 + 1388 at bli_gemmsup_rv_haswell_asm_sMx4.c:2313:1
    rflags = 0x0000000000010202
        cs = 0x0000000000000033
        fs = 0x0000000000000000
        gs = 0x0000000000000000
        ss = 0x000000000000002b
        ds = 0x0000000000000000
        es = 0x0000000000000000

Never mind, this is probably more related to ASAN. This was with clang; gcc 12.1.1 refuses to compile with ASAN:

In file included from kernels/haswell/3/sup/bli_gemmsup_rd_haswell_asm_s6x16m.c:39:
kernels/haswell/3/sup/bli_gemmsup_rd_haswell_asm_s6x16m.c: In function 'bli_sgemmsup_rd_haswell_asm_6x12m':
./frame/include/bli_x86_asm_macros.h:102:21: error: 'asm' operand has impossible constraints
  102 | #define BEGIN_ASM() __asm__ volatile (
      |                     ^~~~~~~
./frame/include/bli_x86_asm_macros.h:153:21: note: in expansion of macro 'BEGIN_ASM'
  153 | #define begin_asm() BEGIN_ASM()
      |                     ^~~~~~~~~
kernels/haswell/3/sup/bli_gemmsup_rd_haswell_asm_s6x16m.c:801:9: note: in expansion of macro 'begin_asm'
  801 |         begin_asm()
      |         ^~~~~~~~~
compilation terminated due to -Wfatal-errors.

@danieldk So is this not something we need to look into for now?

I have never gotten ASAN to work correctly in the microkernels, but valgrind is another alternative.

@danieldk So is this not something we need to look into for now?

No, sorry for the noise. I think we are still sometimes hitting access violation errors in our Windows CI, though, so @shadeMe is now doing a sweep of many different matrix sizes with Valgrind.

After merging the changes in 4dde947, a sweep with the following matrix sizes passed without any errors: M <= 20; N <= 1024; K <= 1024. The access violation we were seeing in our Windows CI also appears to be fixed 👍
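A sweep of this kind could be structured roughly as follows. This is a sketch, not the script that was actually used: `gemm_fn`, `gemm_ref`, `run_one` and `sweep` are hypothetical names, and in a real run the function under test would be a thin wrapper around cblas_sgemm, executed under Valgrind. Each buffer is allocated at exactly its matrix size so that any read past the end lands outside the allocation, and beta is non-zero so that the routine has to read C as well as write it.

```c
#include <assert.h>
#include <stdlib.h>

/* Routine under test: C = A * B + beta * C, all matrices dense row-major. */
typedef void (*gemm_fn)(int m, int n, int k, float beta,
                        const float *a, const float *b, float *c);

/* Naive reference implementation used for comparison. */
static void gemm_ref(int m, int n, int k, float beta,
                     const float *a, const float *b, float *c)
{
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int p = 0; p < k; ++p)
                sum += a[i * k + p] * b[p * n + j];
            c[i * n + j] = sum + beta * c[i * n + j];
        }
    }
}

/* Returns 0 if fn matches the reference for one size, 1 on a mismatch,
 * and -1 if an allocation failed. */
static int run_one(gemm_fn fn, int m, int n, int k)
{
    size_t na = (size_t)m * k, nb = (size_t)k * n, nc = (size_t)m * n;
    /* Exact-size allocations: no padding that could hide an overrun. */
    float *a = malloc(na * sizeof *a);
    float *b = malloc(nb * sizeof *b);
    float *c = malloc(nc * sizeof *c);
    float *ref = malloc(nc * sizeof *ref);
    int status = 0;

    if (!a || !b || !c || !ref) {
        status = -1;
    } else {
        /* Small integer values keep every sum exact in single precision. */
        for (size_t i = 0; i < na; ++i) a[i] = (float)(i % 7) - 3.0f;
        for (size_t i = 0; i < nb; ++i) b[i] = (float)(i % 5) - 2.0f;
        for (size_t i = 0; i < nc; ++i) c[i] = ref[i] = (float)(i % 3);

        fn(m, n, k, 2.0f, a, b, c);
        gemm_ref(m, n, k, 2.0f, a, b, ref);

        for (size_t i = 0; i < nc; ++i) {
            float diff = c[i] - ref[i];
            if (diff < 0.0f) diff = -diff;
            if (diff > 1e-3f) status = 1;
        }
    }
    free(a); free(b); free(c); free(ref);
    return status;
}

/* Tries every (m, n, k) from 1 up to the given limits. Returns the number
 * of mismatching size triples, or -1 if an allocation failed. */
static long sweep(gemm_fn fn, int max_m, int max_n, int max_k)
{
    long failures = 0;
    for (int m = 1; m <= max_m; ++m)
        for (int n = 1; n <= max_n; ++n)
            for (int k = 1; k <= max_k; ++k) {
                int status = run_one(fn, m, n, k);
                if (status < 0) return -1;
                failures += status;
            }
    return failures;
}
```

The comparison against the reference only catches wrong results. An out-of-bounds read that returns unused data leaves the output correct, so the memory checker remains the part that actually detects the bug discussed here.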

Thanks again for merging the fixes, @fgvanzee!

You are welcome, @shadeMe! Thank you for your feedback.

Unfortunately, some Thinc/spaCy users have encountered segfaults, even with all the patches applied. It seems that there are still kernels that read beyond the end of the array, e.g.

blis/kernels/haswell/3/sup/s6x16/bli_gemmsup_rv_haswell_asm_sMx6.c

Line 3254 in b861c71

vfmadd231ps(mem(rcx, 0*32), ymm3, ymm4)

Maybe this issue can be reopened?

But that function is already disabled under #if 0

Ah, right, we'll try to see if we can reproduce the error outside CI and pinpoint it to a kernel.

Out-of-bounds read when using Haswell kernels