KarypisLab / METIS

METIS - Serial Graph Partitioning and Fill-reducing Matrix Ordering

Large 2000x performance regression from 5.1 to 5.2

FreddieWitherden opened this issue

Using METIS 5.1:

pyfr -p partition 8 -ebalanced -pmetis inc-cylinder.pyfrm foo/
 • Combine mesh parts (0.02s)
 • Construct graph (0.00s)
 • Partition graph (0.01s)
 • Renumber vertices (0.03s)
 • Repartition mesh (0.01s)
 • Write mesh (0.01s)

where the partitioning and renumbering steps (both of which call METIS_PartGraphRecursive) complete almost immediately. By contrast, using METIS 5.2.1:

pyfr -p partition 8 -ebalanced -pmetis inc-cylinder.pyfrm foo/
 • Combine mesh parts (0.01s)
 • Construct graph (0.00s)
 • Partition graph (17.33s)
 • Renumber vertices (7.22s)
 • Repartition mesh (0.01s)
 • Write mesh (0.01s)

where we can see a huge slowdown (roughly 2000x) in the partition graph step, which makes a single call to METIS_PartGraphRecursive. The inputs are identical in both cases. The slowdown also reproduces with METIS_PartGraphKway, and on both Linux (x86-64) and macOS (AArch64).

This occurs with all of our grids/meshes. Profiling 5.2.1 with perf record we find:

    17.05%  pyfr      libmetis.so.0                                      [.] libmetis__FM_Mc2WayCutRefine
     8.88%  pyfr      libmetis.so.0                                      [.] libmetis__CreateCoarseGraph
     8.03%  pyfr      libmetis.so.0                                      [.] libmetis__FM_2WayCutRefine
     7.93%  pyfr      libmetis.so.0                                      [.] libmetis__rpqInsert
     5.24%  pyfr      libmetis.so.0                                      [.] libmetis__rpqUpdate
     5.12%  pyfr      libc.so.6                                          [.] random
     4.21%  pyfr      libmetis.so.0                                      [.] libmetis__Compute2WayPartitionParams
     4.21%  pyfr      libmetis.so.0                                      [.] libmetis__rpqGetTop
     4.21%  pyfr      libmetis.so.0                                      [.] libmetis__Match_SHEM
     4.01%  pyfr      libmetis.so.0                                      [.] libmetis__SelectQueue
     2.79%  pyfr      libmetis.so.0                                      [.] libmetis__iset
     2.77%  pyfr      libmetis.so.0                                      [.] libmetis__Project2WayPartition
     2.20%  pyfr      libmetis.so.0                                      [.] libmetis__Match_RM
     1.99%  pyfr      libmetis.so.0                                      [.] libmetis__ComputeLoadImbalanceDiffVec
     1.93%  pyfr      libmetis.so.0                                      [.] libmetis__McGeneral2WayBalance
     1.85%  pyfr      libmetis.so.0                                      [.] libmetis__iaxpy
     1.34%  pyfr      libmetis.so.0                                      [.] libmetis__rpqDelete
     1.22%  pyfr      libmetis.so.0                                      [.] libmetis__BucketSortKeysInc

whereas with 5.1 (good) we find:

    10.18%  pyfr      libopenblas64_p-r0-15028c96.3.21.so                [.] blas_thread_server
     9.73%  pyfr      [unknown]                                          [k] 0xffffffff900001a2
     9.30%  pyfr      libc.so.6                                          [.] __sched_yield
     8.22%  pyfr      libpython3.11.so.1.0                               [.] _PyEval_EvalFrameDefault
     1.00%  pyfr      libpython3.11.so.1.0                               [.] 0x0000000000192fb0
     0.96%  pyfr      libpython3.11.so.1.0                               [.] 0x00000000001949c0
     0.80%  pyfr      libmetis.so.0                                      [.] libmetis__FM_Mc2WayCutRefine
     0.59%  pyfr      libpython3.11.so.1.0                               [.] _PyType_Lookup
     0.57%  pyfr      libmetis.so.0                                      [.] libmetis__rpqInsert

where METIS is just a rounding error in the runtime.

Can you share the graphs in Metis format to reproduce this locally?

So I sat down and bisected the git revisions and found the culprit was:

5ba1580

which breaks the ABI. Without recompilation, any application built against METIS 5.1 will pass an incorrect options array to 5.2, because every option after METIS_OPTION_DBGLVL has moved up by at least one index.

I'll put together a PR later which gives these enum options explicit values so such breakage can be avoided in the future as/when new options are added.
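
One possible shape for such a change is sketched below. This is a hypothetical layout, not the contents of any released metis.h: it assumes the options that existed in 5.1.0 keep the indices they had there, and that the newer options are appended after them with values chosen here purely for illustration. The command-line-only codes are left out.

```c
#include <assert.h>  /* static_assert (C11) */

/* Hypothetical sketch, NOT the shipped metis.h. Each options[] index is
   written out explicitly, so adding an option later cannot silently
   renumber the existing ones. */
typedef enum {
  /* Options present in 5.1.0, pinned to their 5.1.0 indices. */
  METIS_OPTION_PTYPE     = 0,
  METIS_OPTION_OBJTYPE   = 1,
  METIS_OPTION_CTYPE     = 2,
  METIS_OPTION_IPTYPE    = 3,
  METIS_OPTION_RTYPE     = 4,
  METIS_OPTION_DBGLVL    = 5,
  METIS_OPTION_NITER     = 6,
  METIS_OPTION_NCUTS     = 7,
  METIS_OPTION_SEED      = 8,
  METIS_OPTION_NO2HOP    = 9,
  METIS_OPTION_MINCONN   = 10,
  METIS_OPTION_CONTIG    = 11,
  METIS_OPTION_COMPRESS  = 12,
  METIS_OPTION_CCORDER   = 13,
  METIS_OPTION_PFACTOR   = 14,
  METIS_OPTION_NSEPS     = 15,
  METIS_OPTION_UFACTOR   = 16,
  METIS_OPTION_NUMBERING = 17,

  /* Options added after 5.1.0, appended at the end.
     These particular values are an assumption made for this sketch. */
  METIS_OPTION_NIPARTS   = 18,
  METIS_OPTION_ONDISK    = 19,
  METIS_OPTION_DROPEDGES = 20
} moptions_et;

/* Compile-time guard: the build fails if a pinned index ever moves. */
static_assert(METIS_OPTION_NUMBERING == 17,
              "options[] index of METIS_OPTION_NUMBERING must not move");
```

The guard at the end is optional, but it turns an accidental renumbering into a build failure instead of a silent runtime change.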

This reordering of options broke using METIS 5.2.1 from MUMPS for me.
They have code like

  MUMPS_INT ncon, edgecut, options[40];
  ierr=METIS_SetDefaultOptions(options);
  options[0]  = 0;
  /* Use 1-based fortran numbering */
  options[17] = 1;
  ncon        = 1;
  ierr = METIS_PartGraphKway(n, &ncon, iptr, jcn,
                             NULL, NULL, NULL,
                             k, NULL, NULL, options,
                             &edgecut, part);

and METIS wrote a lot of complaints about the graph to the log, followed by a crash.
Of course, it's not good that the MUMPS developers assumed that METIS_OPTION_NUMBERING will always be 17 (they even include metis.h), but it seems this could easily have been avoided on the METIS side too (or could still be fixed in 5.2.2).
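
To see what a hard-coded options[17] actually addresses, here is a small self-contained sketch. The two name tables are copied from the moptions_et listings for 5.1.0 and for 5.1.1/5.2.1 quoted in this thread (with the METIS_OPTION_ prefix dropped and the command-line-only codes omitted); the table and helper names are made up for this example and are not part of METIS.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Option names in options[] order for METIS 5.1.0. */
static const char *const layout_510[] = {
  "PTYPE", "OBJTYPE", "CTYPE", "IPTYPE", "RTYPE", "DBGLVL",
  "NITER", "NCUTS", "SEED", "NO2HOP",
  "MINCONN", "CONTIG", "COMPRESS", "CCORDER", "PFACTOR", "NSEPS",
  "UFACTOR", "NUMBERING"
};

/* Option names in options[] order for METIS 5.1.1 and 5.2.1. */
static const char *const layout_521[] = {
  "PTYPE", "OBJTYPE", "CTYPE", "IPTYPE", "RTYPE", "DBGLVL",
  "NIPARTS", "NITER", "NCUTS", "SEED", "NO2HOP", "ONDISK",
  "MINCONN", "CONTIG", "COMPRESS", "CCORDER", "PFACTOR", "NSEPS",
  "UFACTOR", "NUMBERING", "DROPEDGES"
};

enum {
  N510 = sizeof layout_510 / sizeof layout_510[0],
  N521 = sizeof layout_521 / sizeof layout_521[0]
};

/* Name of the option stored at options[idx], or "?" if idx is out of range. */
static const char *option_at(const char *const *layout, int n, int idx) {
  return (idx >= 0 && idx < n) ? layout[idx] : "?";
}

/* Index of the named option in a layout, or -1 if the layout lacks it. */
static int index_of(const char *const *layout, int n, const char *name) {
  for (int i = 0; i < n; i++) {
    if (strcmp(layout[i], name) == 0) {
      return i;
    }
  }
  return -1;
}
```

With these tables, option_at(layout_510, N510, 17) is "NUMBERING" while option_at(layout_521, N521, 17) is "NSEPS", and index_of(layout_521, N521, "NUMBERING") is 19. So against the newer library, options[17] = 1 sets NSEPS and leaves METIS_OPTION_NUMBERING at whatever METIS_SetDefaultOptions chose, which would be consistent with the complaints about the graph. Writing options[METIS_OPTION_NUMBERING] = 1 and recompiling against the installed metis.h avoids the problem.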

If I change

@@ -271,12 +271,10 @@ typedef enum {
   METIS_OPTION_IPTYPE,
   METIS_OPTION_RTYPE,
   METIS_OPTION_DBGLVL,
-  METIS_OPTION_NIPARTS,
   METIS_OPTION_NITER,
   METIS_OPTION_NCUTS,
   METIS_OPTION_SEED,
   METIS_OPTION_NO2HOP,
-  METIS_OPTION_ONDISK,
   METIS_OPTION_MINCONN,
   METIS_OPTION_CONTIG,
   METIS_OPTION_COMPRESS,
@@ -285,6 +283,8 @@ typedef enum {
   METIS_OPTION_NSEPS,
   METIS_OPTION_UFACTOR,
   METIS_OPTION_NUMBERING,
+  METIS_OPTION_NIPARTS,
+  METIS_OPTION_ONDISK,
   METIS_OPTION_DROPEDGES,
 
   /* Used for command-line parameter purposes */

MUMPS works fine again. (I would have complained there if they had a public issue tracker :))

I am maintaining the conda-forge build of METIS, so when the dust settles here let me know and I can bump the version and/or add a patch.

Just for the sake of completeness, that commit (5ba1580) is also included in METIS 5.1.1, so even an application built with METIS 5.1.0 will already give wrong results when used at runtime with METIS 5.1.1.

For reference, this is moptions_et in METIS 5.1.0:

/*! Options codes (i.e., options[]) */
typedef enum {
  METIS_OPTION_PTYPE,
  METIS_OPTION_OBJTYPE,
  METIS_OPTION_CTYPE,
  METIS_OPTION_IPTYPE,
  METIS_OPTION_RTYPE,
  METIS_OPTION_DBGLVL,
  METIS_OPTION_NITER,
  METIS_OPTION_NCUTS,
  METIS_OPTION_SEED,
  METIS_OPTION_NO2HOP,
  METIS_OPTION_MINCONN,
  METIS_OPTION_CONTIG,
  METIS_OPTION_COMPRESS,
  METIS_OPTION_CCORDER,
  METIS_OPTION_PFACTOR,
  METIS_OPTION_NSEPS,
  METIS_OPTION_UFACTOR,
  METIS_OPTION_NUMBERING,

  /* Used for command-line parameter purposes */
  METIS_OPTION_HELP,
  METIS_OPTION_TPWGTS,
  METIS_OPTION_NCOMMON,
  METIS_OPTION_NOOUTPUT,
  METIS_OPTION_BALANCE,
  METIS_OPTION_GTYPE,
  METIS_OPTION_UBVEC
} moptions_et;

and this is in METIS 5.1.1 and 5.2.1:

/*! Options codes (i.e., options[]) */
typedef enum {
  METIS_OPTION_PTYPE,
  METIS_OPTION_OBJTYPE,
  METIS_OPTION_CTYPE,
  METIS_OPTION_IPTYPE,
  METIS_OPTION_RTYPE,
  METIS_OPTION_DBGLVL,
  METIS_OPTION_NIPARTS,
  METIS_OPTION_NITER,
  METIS_OPTION_NCUTS,
  METIS_OPTION_SEED,
  METIS_OPTION_NO2HOP,
  METIS_OPTION_ONDISK,
  METIS_OPTION_MINCONN,
  METIS_OPTION_CONTIG,
  METIS_OPTION_COMPRESS,
  METIS_OPTION_CCORDER,
  METIS_OPTION_PFACTOR,
  METIS_OPTION_NSEPS,
  METIS_OPTION_UFACTOR,
  METIS_OPTION_NUMBERING,
  METIS_OPTION_DROPEDGES,

  /* Used for command-line parameter purposes */
  METIS_OPTION_HELP,
  METIS_OPTION_TPWGTS,
  METIS_OPTION_NCOMMON,
  METIS_OPTION_NOOUTPUT,
  METIS_OPTION_BALANCE,
  METIS_OPTION_GTYPE,
  METIS_OPTION_UBVEC
} moptions_et;

the diff is:

--- 5.1.0
+++ 5.1.1
@@ -6,10 +6,12 @@
   METIS_OPTION_IPTYPE,
   METIS_OPTION_RTYPE,
   METIS_OPTION_DBGLVL,
+  METIS_OPTION_NIPARTS,
   METIS_OPTION_NITER,
   METIS_OPTION_NCUTS,
   METIS_OPTION_SEED,
   METIS_OPTION_NO2HOP,
+  METIS_OPTION_ONDISK,
   METIS_OPTION_MINCONN,
   METIS_OPTION_CONTIG,
   METIS_OPTION_COMPRESS,
@@ -18,6 +20,7 @@
   METIS_OPTION_NSEPS,
   METIS_OPTION_UFACTOR,
   METIS_OPTION_NUMBERING,
+  METIS_OPTION_DROPEDGES,
 
   /* Used for command-line parameter purposes */
   METIS_OPTION_HELP,
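
Read as an index mapping, the diff above leaves PTYPE through DBGLVL in place, moves NITER through NO2HOP up by one, and moves MINCONN through NUMBERING up by two. The sketch below turns that into a translation table for an options array filled in with the 5.1.0 indices. The names new_index and remap_options are invented for this example; it covers only the enum entries shown above rather than the full array METIS expects, and the value for slots that did not exist in 5.1.0 is left to the caller (the usage below assumes -1 is the "use the default" marker, which should be checked against the METIS version in use).

```c
#include <assert.h>

enum { NOPTS_510 = 18, NOPTS_521 = 21 };

/* new_index[i] is the 5.1.1/5.2.1 slot of the option that lived at
   options[i] in 5.1.0, derived from the two enum listings above. */
static const int new_index[NOPTS_510] = {
  0, 1, 2, 3, 4, 5,               /* PTYPE..DBGLVL: unchanged            */
  7, 8, 9, 10,                    /* NITER..NO2HOP: +1 (NIPARTS at 6)    */
  12, 13, 14, 15, 16, 17, 18, 19  /* MINCONN..NUMBERING: +2 (ONDISK at 11) */
};

/* Copy a 5.1.0-layout options array into the newer layout. Slots that
   did not exist in 5.1.0 (NIPARTS, ONDISK, DROPEDGES) are set to `unset`. */
static void remap_options(const int *old_opts, int *new_opts, int unset) {
  for (int i = 0; i < NOPTS_521; i++) {
    new_opts[i] = unset;
  }
  for (int i = 0; i < NOPTS_510; i++) {
    new_opts[new_index[i]] = old_opts[i];
  }
}
```

A caller would use it as remap_options(old_opts, new_opts, -1). This is only a stopgap for code that cannot be rebuilt; recompiling against the installed metis.h and using the symbolic names is the cleaner fix.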

For anyone interested, a patch that seems to make MUMPS 5.2.1 work with METIS 5.1.1 and 5.2.1 (at the cost of breaking compatibility with METIS 5.1.0) is available at https://github.com/conda-forge/mumps-feedstock/blob/c524cb3c71686bee59d9b12df5d9d6ce20782ce4/recipe/mumps_support_only_metis_5_1_1.patch