DrTimothyAldenDavis / SuiteSparse

The official SuiteSparse library: a suite of sparse matrix algorithms authored or co-authored by Tim Davis, Texas A&M University.

Home Page: https://people.engr.tamu.edu/davis/suitesparse.html

Linking against Intel MKL BLAS recommended; OpenBLAS can cause 100x slowdown

For version 5.4, SuiteSparse_config.mk recommended OpenBLAS. For version 5.6, Intel MKL is now recommended, and the config file looks like it wants to link against Intel OpenMP.

TL;DR: is the CHOLMOD performance degradation related to the threading library (pthreads vs. GNU OpenMP vs. Intel OpenMP)? Can you explain a little more (or just link to a discussion of the issue)? Does this issue affect version 5.4 when linked against OpenBLAS/pthreads?

Some background about why I'm curious about which threading library to use:

I am a "causal" user of suitesparse mostly as it can be used by other packages such as Sundials, numpy, scikit-umfpack, sparseqr, scikit-sparse, etc...

I've built my own SuiteSparse shared libraries (and the downstream packages mentioned above) on Linux systems (mostly Ubuntu) for several years now: one set of libraries linked against OpenBLAS/pthreads and one linked against MKL/GNU OpenMP. Mostly I stick with OpenBLAS/pthreads, as I get a sufficient performance boost without the headaches of trying to understand which threading library to use (or how).

However, after noting the (new) recommendation for SuiteSparse 5.6 and subsequently trying to work out what to do about it, I've run across discussions like this or this, as well as the Intel MKL linking recommendations here.

The hobgoblin in my little mind is now screaming at me to be foolishly consistent and link SuiteSparse and all downstream packages that depend on it against either GNU OpenMP or Intel OpenMP - but not both (mixed), and certainly not mixed with pthreads.

Is this consistent with your understanding about threading library choice with respect to Suitesparse?

Thank you for responding. I understand that it is broken with respect to CHOLMOD; I was hoping to learn a little more about why, and whether it affects my choice of threading library.

That said, I also understand that OpenBLAS, MKL, threading libraries, and my choice of downstream libraries to use with SuiteSparse are not really your concern (and I'm certainly able to figure it out for myself), so I will close this.

Thank you for your efforts on SuiteSparse and for making it available as open source.

I'd like to keep this as an open issue. SuiteSparse has to play nicely with other libraries, and I don't want its performance to suffer if the wrong mix of libraries is used. At the least, I need to figure out how to navigate the linking process with multiple threading libraries, and provide some documentation on how end users should link the various libraries.

@DrTimothyAldenDavis I think this should be documented in a more visible place. I found this thread by chance and as suggested I recompiled SuiteSparse with MKL, only to find a significant speed improvement. Thanks!

I added a comment to this in the top-level README.md file, in bold. This is a recent discovery. I'm hoping to track down the source of this slowdown, and figure out a workaround. It may be a bug in OpenBLAS, or perhaps it's a library build issue. In the meantime, use the Intel MKL.

One thing to try is turning off multithreading in OpenBLAS for small sizes, with a build flag like GEMM_MULTITHREAD_THRESHOLD=50. Does that fix the issue?

Can we improve the title? This issue is difficult to spot. I was searching for "performance slowdown with multiple threads".

I don’t know if this is an option, but it seems that BLIS is offering what is needed to remedy this situation:

https://github.com/flame/blis/blob/master/docs/Multithreading.md

They implement a new BLAS-like interface that allows the number of threads to be specified for each call individually. They also seem to offer a BLAS compatibility layer.
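
(A minimal sketch of that per-call idea, based on the BLIS Multithreading.md document linked above; bli_rntm_set_num_threads() and bli_gemm_ex() are part of the BLIS expert API, while the function name and the choice of GEMM here are just illustrative:)

    // Sketch only: per-call thread control via the BLIS expert API.
    // Assumes a BLIS build with multithreading enabled; alpha, a, b,
    // beta, c are ordinary BLIS obj_t scalars/matrices set up elsewhere.
    #include "blis.h"

    void gemm_with_nt_threads (obj_t *alpha, obj_t *a, obj_t *b,
                               obj_t *beta, obj_t *c, dim_t nt)
    {
        rntm_t rntm = BLIS_RNTM_INITIALIZER ;
        bli_rntm_set_num_threads (nt, &rntm) ;  // applies to this call only
        bli_gemm_ex (alpha, a, b, beta, c, NULL, &rntm) ;  // NULL = default context
    }

A caller could then use one thread for tiny operands and the full complement for large ones, without touching any global state.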

I am considering packaging SuiteSparse with pkgsrc, to be used in Octave and SciPy. We have had an aging prototype package for version 4.0.2 so far, which I used for building software for my users. I do not know if anyone would be affected by this bug; I am not using Octave or SciPy myself.

Do you have a simple test case that I can run to check whether our system (16 cores per node) is affected by the OpenBLAS bug, using a current build of OpenBLAS with pthreads or OpenMP?

As I have no experience with SuiteSparse, I have to guess … but wouldn't any call to CHOLMOD on some generated test data show the performance degradation when comparing differing BLAS implementations (or even just differing OpenBLAS or OpenMP thread counts)? I'd be happy with some code snippet that triggers the right paths for a manual comparison.
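
(For what it's worth, a self-contained sketch along those lines might look like the following. The tridiagonal test matrix, its size, and the forced supernodal setting are my own arbitrary choices to push work through the BLAS; the CHOLMOD calls themselves are the ordinary public API:)

    // Sketch: time a CHOLMOD supernodal factorization of a generated
    // SPD matrix.  Compare runs with OPENBLAS_NUM_THREADS=1 against
    // runs with the default thread count.  Link with -lcholmod and
    // the BLAS under test.
    #include <stdio.h>
    #include <time.h>
    #include "cholmod.h"

    static double now (void)    // wall-clock seconds (POSIX)
    {
        struct timespec t ;
        clock_gettime (CLOCK_MONOTONIC, &t) ;
        return ((double) t.tv_sec + 1e-9 * (double) t.tv_nsec) ;
    }

    int main (void)
    {
        int n = 100000 ;        // arbitrary size; tune to your machine
        cholmod_common c ;
        cholmod_start (&c) ;
        c.supernodal = CHOLMOD_SUPERNODAL ;  // force the BLAS-based path

        // tridiagonal SPD matrix (shifted 1D Laplacian), lower part only
        cholmod_triplet *T = cholmod_allocate_triplet (n, n, 2*n, -1,
            CHOLMOD_REAL, &c) ;
        int *Ti = T->i, *Tj = T->j ;
        double *Tx = T->x ;
        size_t k = 0 ;
        for (int i = 0 ; i < n ; i++)
        {
            Ti [k] = i ; Tj [k] = i ; Tx [k++] = 2.1 ;          // diagonal
            if (i > 0) { Ti [k] = i ; Tj [k] = i-1 ; Tx [k++] = -1 ; }
        }
        T->nnz = k ;
        cholmod_sparse *A = cholmod_triplet_to_sparse (T, k, &c) ;

        double t0 = now () ;
        cholmod_factor *L = cholmod_analyze (A, &c) ;
        cholmod_factorize (A, L, &c) ;
        printf ("analyze+factorize: %g s\n", now () - t0) ;

        cholmod_free_factor (&L, &c) ;
        cholmod_free_sparse (&A, &c) ;
        cholmod_free_triplet (&T, &c) ;
        cholmod_finish (&c) ;
        return (0) ;
    }

If the pathology is present, the single-threaded OpenBLAS run should be dramatically faster than the multi-threaded one on a matrix this sparse.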

It is disheartening to see that this issue is somewhat known both on the OpenBLAS side and here, but nobody has had the time to address it yet. I wonder how prevalent the issue actually is.

Also, did this only start to happen with certain versions of SuiteSparse? Or just when CPU core counts became high enough for people to notice the threading issues with excessive locking?

That would be really great! I think once we have a self-contained program that shows the issue (not inside some Python binding, Octave, etc.), it is much more likely that the OpenBLAS people will fix it. So far it is something they generally have on the radar, but have no grip on.

Nudge from the other side … OpenMathLib/OpenBLAS#1886

I have recently identified a historical oversight where a memory buffer is allocated although it is never used […] speed difference can reach a factor of ten, and some of these functions (mostly level 2 BLAS) appear to be on the code path of cholmod. Old versions of OpenBLAS until about three years ago did not employ any locking around buffer allocations so were much faster but wildly thread-unsafe. This may be the origin of the advice in the suitesparse code.

Does that fit the occurrence of the issue with SuiteSparse? Three years ago?

Would running the cholmod_demo application with one of the provided test cases from its Matrix directory be expected to reproduce the problem, if it still exists? Or do you recall it requiring particular conditions or matrix sizes not represented there?

We got the OpenBLAS people interested, but so far we have only seen things like a 20% performance difference compared to MKL.

Are we sure this 100-fold slowdown thread-deadlock thing wasn't some bad interaction of OpenMP/pthreads with too many threads?

We need a test case that demonstrates this. Don't we have some SuiteSparse user who would be affected? Is there a forum where you could put out a call? For a recommendation as strong as "stick to MKL", there needs to be evidence. You know, science and all that ;-)

Hi! I followed this thread and remembered that I posted a hopefully related issue once in the Ubuntu bug tracker:
https://bugs.launchpad.net/ubuntu/+source/suitesparse/+bug/1843277

There is a small test program attached. I just tested it on a Debian 12 machine with libopenblas-dev installed and had a run time factor of about 6 on a 12-thread/6-core CPU. To emphasize: In parallel it takes 6 times longer than sequential!

Maybe this test helps to find the problem? If I remove libopenblas-dev (I guess libatlas-base-dev is used then), the run time is equal for both parallel and sequential (aside from 400% vs. 100% CPU load).

@pjaap thank you very much. Your test case does indeed display pathological behaviour not seen in my earlier CHOLMOD tests with very small and very large matrices. (Now to find where and why threads are spinning instead of doing actual work...)

...though the absurd slowdown only occurs when a non-OpenMP build of OpenBLAS is paired with SuiteSparse, probably just leading to oversubscription. When both projects are built for OpenMP, it appears to be just a matter of finding the sweet spot between using all cores and only one. (With the test case on a 6-core Haswell with multithreading disabled, running on all cores takes twice as long as the optimum of running on only two, while a single thread is 20 percent slower. The non-OpenMP OpenBLAS, however, is ninety times slower on all cores compared to a single one, and does not benefit from multithreading at all.)

Thanks for the test case. I'll give it a try. I've also revised the title of this issue.

I have encountered the same issue in a completely reproducible way. I used cholmod_gpu_stats() to display how much time was spent in each function. When CHOLMOD 4.0.3 was linked against MKL, I got this:

    CHOLMOD GPU/CPU statistics:
    SYRK  CPU calls          856 time   1.6010e-03
          GPU calls            0 time   0.0000e+00
    GEMM  CPU calls          633 time   9.2912e-04
          GPU calls            0 time   0.0000e+00
    POTRF CPU calls          235 time   4.7159e-04
          GPU calls            0 time   0.0000e+00
    TRSM  CPU calls          223 time   5.6744e-04
          GPU calls            0 time   0.0000e+00
    time in the BLAS: CPU   3.5691e-03 GPU   0.0000e+00 total:   3.5691e-03
    assembly time   0.0000e+00    0.0000e+00

When running with the Debian version of SuiteSparse, which uses CHOLMOD 3.0.14 and OpenBLAS:

    CHOLMOD GPU/CPU statistics:
    SYRK  CPU calls          830 time   3.6409e-03
          GPU calls            0 time   0.0000e+00
    GEMM  CPU calls          610 time   1.3313e-03
          GPU calls            0 time   0.0000e+00
    POTRF CPU calls          232 time   1.8505e-03
          GPU calls            0 time   0.0000e+00
    TRSM  CPU calls          220 time   1.3902e-01
          GPU calls            0 time   0.0000e+00
    time in the BLAS: CPU   1.4584e-01 GPU   0.0000e+00 total:   1.4584e-01
    assembly time   0.0000e+00    0.0000e+00

The TRSM calls are indeed 100x slower. Is there a way to check at runtime which BLAS implementation is being used? Is there a way I could code into my software a check for the presence of this problem, so that I could inform users that they should recompile and link CHOLMOD against a different BLAS implementation? The main issue is that this happens with the default Debian version of SuiteSparse.
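
(One way to probe this at runtime, sketched under my own assumptions rather than anything SuiteSparse provides: OpenBLAS exports the vendor-specific function openblas_get_config(), so checking for that symbol with dlsym() reveals whether OpenBLAS is the BLAS that got linked:)

    // Sketch: detect at runtime whether the linked BLAS is OpenBLAS by
    // probing for its vendor-specific entry point.  On glibc, define
    // _GNU_SOURCE for RTLD_DEFAULT and link with -ldl.
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <dlfcn.h>

    int main (void)
    {
        typedef char *(*config_fn) (void) ;
        config_fn get_config =
            (config_fn) dlsym (RTLD_DEFAULT, "openblas_get_config") ;
        if (get_config != NULL)
        {
            // returns a build-description string, e.g.
            // "OpenBLAS 0.3.21 DYNAMIC_ARCH NO_AFFINITY ..."
            printf ("BLAS looks like OpenBLAS: %s\n", get_config ()) ;
        }
        else
        {
            printf ("openblas_get_config not found; probably not OpenBLAS\n") ;
        }
        return (0) ;
    }

This only identifies the library, of course; it does not measure whether the slowdown itself is present.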

Same reproducer as in the Ubuntu ticket, or something different? I really, really want to fix this on the OpenBLAS side. (With the original reproducer, at least the current version realizes that it should run only one thread for the TRSM call, but the overhead from having all the other threads sitting around asking for work still makes it 5x slower. This looks to be a design flaw going back to the earliest GotoBLAS. I'm experimenting with actually calling openblas_set_num_threads(1) internally, which, by the way, could already be called from SuiteSparse-enabled programs as well if the workload is known to be small.)
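
(A minimal sketch of doing that from the application side; openblas_set_num_threads() and openblas_get_num_threads() are real OpenBLAS entry points, while the helper function and its 10000-row cutoff are purely illustrative:)

    // Sketch: throttle OpenBLAS around a small factorization, then
    // restore the previous thread count.  The cutoff is an arbitrary
    // illustrative threshold, not a recommended value.
    #include "cholmod.h"

    extern void openblas_set_num_threads (int num_threads) ;
    extern int  openblas_get_num_threads (void) ;

    void factorize_small (cholmod_sparse *A, cholmod_common *c)
    {
        int saved = openblas_get_num_threads () ;
        if (A->nrow < 10000)
        {
            openblas_set_num_threads (1) ;  // avoid thread-spinning overhead
        }
        cholmod_factor *L = cholmod_analyze (A, c) ;
        cholmod_factorize (A, L, c) ;
        cholmod_free_factor (&L, c) ;
        openblas_set_num_threads (saved) ;
    }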

@martin-frbg, this is not the same reproducer. Somehow it seems that all of my matrices are affected by this issue. And I had not understood before that CHOLMOD 4 limits the number of threads for the TRSM calls.

For other users bumping into this issue: CHOLMOD 4 now has two new variables in the cholmod_common structure:

    double chunk ;      // chunksize for computing # of threads to use.
                        // Given nwork work to do, # of threads is
                        // max (1, min (floor (work / chunk), nthreads_max))
    int nthreads_max ;  // max # of threads to use in CHOLMOD.  Defaults to
                        // SUITESPARSE_OPENMP_MAX_THREADS.

These are initialized by cholmod_start() as:

    // -------------------------------------------------------------------------
    // OpenMP initializations
    // -------------------------------------------------------------------------

    Common->chunk = 128000 ;
    Common->nthreads_max = SUITESPARSE_OPENMP_MAX_THREADS ;

They work together with the new function cholmod_nthreads():

//==============================================================================
//=== openmp support ===========================================================
//==============================================================================

static inline int cholmod_nthreads  // returns # of OpenMP threads to use
(
    double work,                    // total work to do
    cholmod_common *Common
)
{ 
    #ifdef _OPENMP
    double chunk = Common->chunk ;  // give each thread at least this much work
    int nthreads_max = Common->nthreads_max ;   // max # of threads to use
    if (nthreads_max <= 0)
    {
        nthreads_max = SUITESPARSE_OPENMP_MAX_THREADS ;
    }
    work  = MAX (work, 1) ;
    chunk = MAX (chunk, 1) ;
    int64_t nthreads = (int64_t) floor (work / chunk) ;
    nthreads = MIN (nthreads, nthreads_max) ;
    nthreads = MAX (nthreads, 1) ;
    return ((int) nthreads) ;
    #else
    return (1) ;
    #endif
}

That is now used in t_cholmod_super_numeric.c for the supernodal numeric factorization, and the number of threads is then set via:

#pragma omp parallel for num_threads(nthreads)

I am guessing here that CHOLMOD 4 is trying to limit the number of threads for small matrices, to avoid the slowdown issue. For some reason, though, this infrastructure is not present in t_cholmod_super_solve.c.
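
(For anyone who wants to experiment with these knobs, a minimal sketch based on the fields quoted above; the particular values are arbitrary examples, not recommendations:)

    // Sketch: override CHOLMOD 4's threading heuristics via the
    // cholmod_common fields documented above.  Values are examples only.
    #include "cholmod.h"

    cholmod_common c ;
    cholmod_start (&c) ;
    c.chunk = 64000 ;     // halve the default work-per-thread chunk
    c.nthreads_max = 4 ;  // never use more than 4 threads inside CHOLMOD
    // ... then cholmod_analyze / cholmod_factorize as usual ...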

So maybe it would be enough to reach out to @svillemot, the maintainer of the suitesparse Debian package, and see if he can update it to a more recent version as the current version is still based on CHOLMOD 3.

That may be a different part of the problem, though not directly related to OpenBLAS. I'm currently testing with the latest release of SuiteSparse built from source, so I'm not affected by any delays in getting new versions of either package into Debian. (By the way, the OpenBLAS provided by Debian is possibly a bit older as well.)

Debian sid/unstable currently has CHOLMOD 3.0.14 from SuiteSparse 5.12.0. For the time being I cannot update it, because Debian is frozen (preparing for the release of Debian “Bookworm” 12).
Regarding OpenBLAS, Debian sid/unstable has version 0.3.21, which is up-to-date.

This issue is likely fixed by OpenMathLib/OpenBLAS#4441, which fixes OpenMathLib/OpenBLAS#2265 and OpenMathLib/OpenBLAS#2846. The fix will appear in OpenBLAS 0.3.27.