DrTimothyAldenDavis / SuiteSparse

The official SuiteSparse library: a suite of sparse matrix algorithms authored or co-authored by Tim Davis, Texas A&M University.

Home Page: https://people.engr.tamu.edu/davis/suitesparse.html

Linking against Intel MKL BLAS recommended; OpenBLAS can cause 100x slowdown

For version 5.4, SuiteSparse_config.mk recommended OpenBLAS. For version 5.6, Intel MKL is now recommended, and the config file looks like it wants to link against Intel OpenMP.

TL;DR: is the CHOLMOD performance degradation related to the threading library (pthreads vs. GNU OpenMP vs. Intel OpenMP)? Can you explain a little more (or just link to a discussion of the issue)? Does this issue affect version 5.4 when linked against OpenBLAS/pthreads?

Some background about why I'm curious about which threading library to use:

I am a "causal" user of suitesparse mostly as it can be used by other packages such as Sundials, numpy, scikit-umfpack, sparseqr, scikit-sparse, etc...

I've built my own SuiteSparse shared libraries (and the downstream packages mentioned above) on Linux systems (mostly Ubuntu) for several years now: one set of libraries linked against OpenBLAS/pthreads and one linked against MKL/GNU OpenMP. Mostly I stick with OpenBLAS/pthreads, as I get a sufficient performance boost without the headaches of trying to understand which threading library to use (or how).

However, after noting the (new) recommendation for SuiteSparse 5.6 and subsequently trying to work out what to do about it, I've run across discussions like this or this, as well as the Intel MKL linking recommendations here.

The hobgoblin in my little mind is now screaming at me to be foolishly consistent and link SuiteSparse and all downstream packages that depend on it against either GNU OpenMP or Intel OpenMP - but not both (mixed), and certainly not mixed with pthreads.

Is this consistent with your understanding about threading library choice with respect to Suitesparse?

Thank you for responding. I understand that it is broken with respect to CHOLMOD; I was hoping to learn a little more about why, and whether it affects my choice of threading library.

That said, I also understand that OpenBLAS, MKL, threading libraries, and my choice of downstream libraries to use with SuiteSparse are not really your concern (and I'm certainly able to figure it out for myself), so I will close this.

Thank you for your efforts on SuiteSparse and for making it available as open source.

I'd like to keep this as an open issue. SuiteSparse has to play nicely with other libraries, and I don't want its performance to suffer if the wrong mix of libraries is used. At the least, I need to figure out how to navigate the linking process with multiple threading libraries, and provide some documentation on how end users should link the various libraries.

@DrTimothyAldenDavis I think this should be documented in a more visible place. I found this thread by chance and as suggested I recompiled SuiteSparse with MKL, only to find a significant speed improvement. Thanks!

I added a comment to this in the top-level README.md file, in bold. This is a recent discovery. I'm hoping to track down the source of this slowdown, and figure out a workaround. It may be a bug in OpenBLAS, or perhaps it's a library build issue. In the meantime, use the Intel MKL.

One thing to try is turning off multithreading in OpenBLAS for small sizes, with a build flag like GEMM_MULTITHREAD_THRESHOLD=50. Does that fix the issue?

Can we improve the title? This issue is difficult to spot. I was searching for "performance slowdown with multiple threads".

I don’t know if this is an option, but it seems that BLIS is offering what is needed to remedy this situation:

https://github.com/flame/blis/blob/master/docs/Multithreading.md

They implement a new BLAS-like interface that allows the number of threads to be specified for each call individually. They also seem to offer a BLAS compatibility layer.
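
(A minimal sketch of that per-call idea, based on the BLIS Multithreading.md document linked above; bli_rntm_set_num_threads() and bli_gemm_ex() are part of the BLIS expert API, while the function name and the choice of GEMM here are just illustrative:)

    // Sketch only: per-call thread control via the BLIS expert API.
    // Assumes a BLIS build with multithreading enabled; alpha, a, b,
    // beta, c are ordinary BLIS obj_t scalars/matrices set up elsewhere.
    #include "blis.h"

    void gemm_with_nt_threads (obj_t *alpha, obj_t *a, obj_t *b,
                               obj_t *beta, obj_t *c, dim_t nt)
    {
        rntm_t rntm = BLIS_RNTM_INITIALIZER ;
        bli_rntm_set_num_threads (nt, &rntm) ;  // applies to this call only
        bli_gemm_ex (alpha, a, b, beta, c, NULL, &rntm) ;  // NULL = default context
    }

A caller could then use one thread for tiny operands and the full complement for large ones, without touching any global state.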

I am considering packaging SuiteSparse with pkgsrc, to be used in Octave and SciPy. We have had an aging prototype package for version 4.0.2 so far, which I used for building software for my users. I do not know if anyone would be affected by this bug; I am not using Octave or SciPy myself.

Do you have a simple test case that I can run to check whether our system (16 cores per node) is affected by the OpenBLAS bug, using a current build of OpenBLAS with pthreads or OpenMP?

As I have no experience with SuiteSparse, I have to guess … but wouldn't any call to CHOLMOD on some generated test data show the performance degradation when comparing differing BLAS implementations (or even just differing OpenBLAS or OpenMP thread counts)? I'd be happy with some code snippet that triggers the right paths for a manual comparison.
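
(For what it's worth, a self-contained sketch along those lines might look like the following. The tridiagonal test matrix, its size, and the forced supernodal setting are my own arbitrary choices to push work through the BLAS; the CHOLMOD calls themselves are the ordinary public API:)

    // Sketch: time a CHOLMOD supernodal factorization of a generated
    // SPD matrix.  Compare runs with OPENBLAS_NUM_THREADS=1 against
    // runs with the default thread count.  Link with -lcholmod and
    // the BLAS under test.
    #include <stdio.h>
    #include <time.h>
    #include "cholmod.h"

    static double now (void)    // wall-clock seconds (POSIX)
    {
        struct timespec t ;
        clock_gettime (CLOCK_MONOTONIC, &t) ;
        return ((double) t.tv_sec + 1e-9 * (double) t.tv_nsec) ;
    }

    int main (void)
    {
        int n = 100000 ;        // arbitrary size; tune to your machine
        cholmod_common c ;
        cholmod_start (&c) ;
        c.supernodal = CHOLMOD_SUPERNODAL ;  // force the BLAS-based path

        // tridiagonal SPD matrix (shifted 1D Laplacian), lower part only
        cholmod_triplet *T = cholmod_allocate_triplet (n, n, 2*n, -1,
            CHOLMOD_REAL, &c) ;
        int *Ti = T->i, *Tj = T->j ;
        double *Tx = T->x ;
        size_t k = 0 ;
        for (int i = 0 ; i < n ; i++)
        {
            Ti [k] = i ; Tj [k] = i ; Tx [k++] = 2.1 ;          // diagonal
            if (i > 0) { Ti [k] = i ; Tj [k] = i-1 ; Tx [k++] = -1 ; }
        }
        T->nnz = k ;
        cholmod_sparse *A = cholmod_triplet_to_sparse (T, k, &c) ;

        double t0 = now () ;
        cholmod_factor *L = cholmod_analyze (A, &c) ;
        cholmod_factorize (A, L, &c) ;
        printf ("analyze+factorize: %g s\n", now () - t0) ;

        cholmod_free_factor (&L, &c) ;
        cholmod_free_sparse (&A, &c) ;
        cholmod_free_triplet (&T, &c) ;
        cholmod_finish (&c) ;
        return (0) ;
    }

If the pathology is present, the single-threaded OpenBLAS run should be dramatically faster than the multi-threaded one on a matrix this sparse.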

It is disheartening to see that this issue is somewhat known both on the OpenBLAS side and here, but nobody has had the time to address it yet. I wonder how prevalent the issue actually is.

Also, did this only start to happen with certain versions of SuiteSparse? Or just when CPU core counts became high enough for people to notice the threading issues with excessive locking?

That would be really great! I think once we have a self-contained program that shows the issue (not inside some Python binding, Octave, etc.), it is much more likely that the OpenBLAS people will fix it. So far it is something they generally have on the radar, but have no grip on.

Nudge from the other side … OpenMathLib/OpenBLAS#1886

I have recently identified a historical oversight where a memory buffer is allocated although it is never used […] speed difference can reach a factor of ten, and some of these functions (mostly level 2 BLAS) appear to be on the code path of cholmod. Old versions of OpenBLAS until about three years ago did not employ any locking around buffer allocations so were much faster but wildly thread-unsafe. This may be the origin of the advice in the suitesparse code.

Does that fit the occurrence of the issue with SuiteSparse? Three years ago?

Would running the cholmod_demo application with one of the provided test cases from its Matrix directory be expected to reproduce the problem, if it still exists? Or do you recall it requiring particular conditions or matrix sizes not represented there?

We got the OpenBLAS people interested, but so far we have only seen things like a 20% performance difference compared to MKL.

Are we sure this 100-fold slowdown thread-deadlock thing wasn't some bad interaction of OpenMP/pthreads with too many threads?

We need a test case that demonstrates this. Don't we have some SuiteSparse user who would be affected? Is there a forum where you could put out a call? For a recommendation as strong as "stick to MKL", there needs to be evidence. You know, science and all that ;-)

Hi! I followed this thread and remembered that I posted a hopefully related issue once in the Ubuntu bug tracker:
https://bugs.launchpad.net/ubuntu/+source/suitesparse/+bug/1843277

There is a small test program attached. I just tested it on a Debian 12 machine with libopenblas-dev installed and had a run time factor of about 6 on a 12-thread/6-core CPU. To emphasize: In parallel it takes 6 times longer than sequential!

Maybe this test helps to find the problem? If I remove libopenblas-dev (I guess libatlas-base-dev is used then), the run time is equal for both parallel and sequential (aside from 400% vs. 100% CPU load).

@pjaap thank you very much. Your test case does indeed display pathological behaviour not seen in my earlier CHOLMOD tests with very small and very large matrices. (Now to find where and why threads are spinning instead of doing actual work...)

...though the absurd slowdown only occurs when a non-OpenMP build of OpenBLAS is paired with SuiteSparse, probably just leading to oversubscription. When both projects are built for OpenMP, it appears to be just a matter of finding the sweet spot between using all cores and only one. (With the test case on a 6-core Haswell with multithreading disabled, running on all cores takes twice as long as the optimum of running on only two, while a single thread is 20 percent slower. The non-OpenMP OpenBLAS, however, is ninety times slower on all cores compared to a single one, and does not benefit from multithreading at all.)

Thanks for the test case. I'll give it a try. I've also revised the title of this issue.

I have encountered the same issue in a completely reproducible way. I used cholmod_gpu_stats() to display how much time was spent in each function. When CHOLMOD 4.0.3 was linked against MKL, I got this:

    CHOLMOD GPU/CPU statistics:
    SYRK  CPU calls          856 time   1.6010e-03
          GPU calls            0 time   0.0000e+00
    GEMM  CPU calls          633 time   9.2912e-04
          GPU calls            0 time   0.0000e+00
    POTRF CPU calls          235 time   4.7159e-04
          GPU calls            0 time   0.0000e+00
    TRSM  CPU calls          223 time   5.6744e-04
          GPU calls            0 time   0.0000e+00
    time in the BLAS: CPU   3.5691e-03 GPU   0.0000e+00 total:   3.5691e-03
    assembly time   0.0000e+00    0.0000e+00

When running with the Debian version of SuiteSparse, which uses CHOLMOD 3.0.14 and OpenBLAS:

    CHOLMOD GPU/CPU statistics:
    SYRK  CPU calls          830 time   3.6409e-03
          GPU calls            0 time   0.0000e+00
    GEMM  CPU calls          610 time   1.3313e-03
          GPU calls            0 time   0.0000e+00
    POTRF CPU calls          232 time   1.8505e-03
          GPU calls            0 time   0.0000e+00
    TRSM  CPU calls          220 time   1.3902e-01
          GPU calls            0 time   0.0000e+00
    time in the BLAS: CPU   1.4584e-01 GPU   0.0000e+00 total:   1.4584e-01
    assembly time   0.0000e+00    0.0000e+00

The TRSM calls are indeed 100x slower. Is there a way to check at runtime which BLAS implementation is being used? Is there a way I could code into my software a check for the presence of this problem, so that I could inform users that they should recompile and link CHOLMOD against a different BLAS implementation? The main issue is that this happens with the default Debian version of SuiteSparse.
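
(One way to probe this at runtime, sketched under my own assumptions rather than anything SuiteSparse provides: OpenBLAS exports the vendor-specific function openblas_get_config(), so checking for that symbol with dlsym() reveals whether OpenBLAS is the BLAS that got linked:)

    // Sketch: detect at runtime whether the linked BLAS is OpenBLAS by
    // probing for its vendor-specific entry point.  On glibc, define
    // _GNU_SOURCE for RTLD_DEFAULT and link with -ldl.
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <dlfcn.h>

    int main (void)
    {
        typedef char *(*config_fn) (void) ;
        config_fn get_config =
            (config_fn) dlsym (RTLD_DEFAULT, "openblas_get_config") ;
        if (get_config != NULL)
        {
            // returns a build-description string, e.g.
            // "OpenBLAS 0.3.21 DYNAMIC_ARCH NO_AFFINITY ..."
            printf ("BLAS looks like OpenBLAS: %s\n", get_config ()) ;
        }
        else
        {
            printf ("openblas_get_config not found; probably not OpenBLAS\n") ;
        }
        return (0) ;
    }

This only identifies the library, of course; it does not measure whether the slowdown itself is present.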

Same reproducer as in the Ubuntu ticket, or something different? I really, really want to fix this on the OpenBLAS side. (With the original reproducer, at least the current version realizes that it should run only one thread for the TRSM call, but the overhead from having all the other threads sitting around asking for work still makes it 5x slower. This looks to be a design flaw going back to the earliest GotoBLAS. I'm experimenting with actually calling openblas_set_num_threads(1) internally, which, by the way, could already be called from SuiteSparse-enabled programs as well if the workload is known to be small.)
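
(A minimal sketch of doing that from the application side; openblas_set_num_threads() and openblas_get_num_threads() are real OpenBLAS entry points, while the helper function and its 10000-row cutoff are purely illustrative:)

    // Sketch: throttle OpenBLAS around a small factorization, then
    // restore the previous thread count.  The cutoff is an arbitrary
    // illustrative threshold, not a recommended value.
    #include "cholmod.h"

    extern void openblas_set_num_threads (int num_threads) ;
    extern int  openblas_get_num_threads (void) ;

    void factorize_small (cholmod_sparse *A, cholmod_common *c)
    {
        int saved = openblas_get_num_threads () ;
        if (A->nrow < 10000)
        {
            openblas_set_num_threads (1) ;  // avoid thread-spinning overhead
        }
        cholmod_factor *L = cholmod_analyze (A, c) ;
        cholmod_factorize (A, L, c) ;
        cholmod_free_factor (&L, c) ;
        openblas_set_num_threads (saved) ;
    }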

@martin-frbg, this is not the same reproducer. Somehow it seems that all of my matrices are affected by this issue. And I had not understood before that CHOLMOD 4 limits the number of threads for the TRSM calls.

For other users bumping into this issue: CHOLMOD 4 now has two new variables in the cholmod_common structure:

    double chunk ;      // chunksize for computing # of threads to use.
                        // Given nwork work to do, # of threads is
                        // max (1, min (floor (work / chunk), nthreads_max))
    int nthreads_max ;  // max # of threads to use in CHOLMOD.  Defaults to
                        // SUITESPARSE_OPENMP_MAX_THREADS.

These are initialized by cholmod_start() as:

    // -------------------------------------------------------------------------
    // OpenMP initializations
    // -------------------------------------------------------------------------

    Common->chunk = 128000 ;
    Common->nthreads_max = SUITESPARSE_OPENMP_MAX_THREADS ;

They work together with the new function cholmod_nthreads():

//==============================================================================
//=== openmp support ===========================================================
//==============================================================================

static inline int cholmod_nthreads  // returns # of OpenMP threads to use
(
    double work,                    // total work to do
    cholmod_common *Common
)
{ 
    #ifdef _OPENMP
    double chunk = Common->chunk ;  // give each thread at least this much work
    int nthreads_max = Common->nthreads_max ;   // max # of threads to use
    if (nthreads_max <= 0)
    {
        nthreads_max = SUITESPARSE_OPENMP_MAX_THREADS ;
    }
    work  = MAX (work, 1) ;
    chunk = MAX (chunk, 1) ;
    int64_t nthreads = (int64_t) floor (work / chunk) ;
    nthreads = MIN (nthreads, nthreads_max) ;
    nthreads = MAX (nthreads, 1) ;
    return ((int) nthreads) ;
    #else
    return (1) ;
    #endif
}

That is now used in t_cholmod_super_numeric.c for the supernodal numeric factorization, and the number of threads is then set via:

#pragma omp parallel for num_threads(nthreads)

I am guessing here that CHOLMOD 4 is trying to limit the number of threads for small matrices, to avoid the slowdown issue. For some reason, though, this infrastructure is not present in t_cholmod_super_solve.c.
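
(For anyone who wants to experiment with these knobs, a minimal sketch based on the fields quoted above; the particular values are arbitrary examples, not recommendations:)

    // Sketch: override CHOLMOD 4's threading heuristics via the
    // cholmod_common fields documented above.  Values are examples only.
    #include "cholmod.h"

    cholmod_common c ;
    cholmod_start (&c) ;
    c.chunk = 64000 ;     // halve the default work-per-thread chunk
    c.nthreads_max = 4 ;  // never use more than 4 threads inside CHOLMOD
    // ... then cholmod_analyze / cholmod_factorize as usual ...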

So maybe it would be enough to reach out to @svillemot, the maintainer of the suitesparse Debian package, and see if he can update it to a more recent version as the current version is still based on CHOLMOD 3.

That may be a different part of the problem, though not directly related to OpenBLAS. I'm currently testing with the latest release of SuiteSparse built from source, so I'm not affected by any delays in getting new versions of either package into Debian. (By the way, the OpenBLAS provided by Debian is possibly a bit older as well.)

Debian sid/unstable currently has CHOLMOD 3.0.14 from SuiteSparse 5.12.0. For the time being I cannot update it, because Debian is frozen (preparing for the release of Debian “Bookworm” 12).
Regarding OpenBLAS, Debian sid/unstable has version 0.3.21, which is up-to-date.

This issue is likely fixed by OpenMathLib/OpenBLAS#4441, which fixes OpenMathLib/OpenBLAS#2265 and OpenMathLib/OpenBLAS#2846. The fix will appear in OpenBLAS 0.3.27.