LLNL / serac

Serac is a high order nonlinear thermomechanical simulation code

Large performance deficit for AMG due to DOF ordering by nodes

zatkins-work opened this issue

Using the DOF ordering Ordering::byNODES severely degrades the performance of the algebraic multigrid (AMG) preconditioners. Depending on the problem, this ordering can increase the number of CG iterations (and thus preconditioner applications) by 3-5x, often nearly doubling the wall clock time as well.
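For readers unfamiliar with the two layouts, here is a minimal sketch of how a (node, component) pair maps to a flat vector index under each ordering. The helper names `index_by_nodes` and `index_by_vdim` are hypothetical and are not MFEM API; they only illustrate the usual definitions, where byNODES stores all x components, then all y, then all z, and byVDIM stores the components of each node contiguously.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers (not MFEM API) illustrating the two layouts.

// byNODES: XXX... YYY... ZZZ...
// All values of component 0 come first, then all values of component 1, etc.
inline std::size_t index_by_nodes(std::size_t ndofs, std::size_t vdim,
                                  std::size_t dof, std::size_t comp) {
  assert(dof < ndofs && comp < vdim);
  return dof + ndofs * comp;
}

// byVDIM: XYZ XYZ XYZ ...
// The vdim components belonging to one node are stored next to each other.
inline std::size_t index_by_vdim(std::size_t ndofs, std::size_t vdim,
                                 std::size_t dof, std::size_t comp) {
  assert(dof < ndofs && comp < vdim);
  return comp + vdim * dof;
}
```

With 4 nodes and 3 components, the y component of node 2 sits at index 6 under byNODES and at index 7 under byVDIM.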

Comparing the number of CG iterations required for tests/solid_nonlinear_solve with the HypreAMG preconditioner:

With Ordering::byNODES (current):

Newton iteration   0 : ||r|| =   5.57193e-06
real energy =  -1.18445e-06, model energy =  -1.19326e-06, cg iter =     352, next tr size =       10, accepting = 1
Newton iteration   1 : ||r|| =      5.29e-05, ||r||/||r_0|| =       9.49401
real energy =  -3.90205e-09, model energy =  -3.90209e-09, cg iter =      14, next tr size =       10, accepting = 1
Newton iteration   2 : ||r|| =   4.06843e-08, ||r||/||r_0|| =    0.00730165
real energy =  -1.48502e-12, model energy =  -1.48501e-12, cg iter =     164, next tr size =       10, accepting = 1
Newton iteration   3 : ||r|| =   1.96807e-09, ||r||/||r_0|| =   0.000353211
[       OK ] SolidMechanics.nonlinear_solve (36576 ms)

With Ordering::byVDIM (changed):

Newton iteration   0 : ||r|| =   5.57193e-06
real energy =  -1.18445e-06, model energy =  -1.19326e-06, cg iter =      57, next tr size =       10, accepting = 1
Newton iteration   1 : ||r|| =   5.29008e-05, ||r||/||r_0|| =       9.49416
real energy =  -3.90352e-09, model energy =  -3.90354e-09, cg iter =      17, next tr size =       10, accepting = 1
Newton iteration   2 : ||r|| =   5.26987e-08, ||r||/||r_0|| =    0.00945789
real energy =  -1.02211e-13, model energy =  -1.02211e-13, cg iter =      27, next tr size =       10, accepting = 1
Newton iteration   3 : ||r|| =   1.64388e-09, ||r||/||r_0|| =   0.000295029
[       OK ] SolidMechanics.nonlinear_solve (21694 ms)
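To compare runs without adding up the numbers by hand, a small helper can total the `cg iter` fields of a log. This is only a sketch: the function name `total_cg_iterations` is made up for illustration, and it assumes the log format shown above, where each count follows the text `cg iter =`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper: sum every integer that follows "cg iter =" in a log.
long total_cg_iterations(const std::string& log) {
  const std::string marker = "cg iter =";
  long total = 0;
  std::size_t pos = 0;
  while ((pos = log.find(marker, pos)) != std::string::npos) {
    pos += marker.size();
    const char* start = log.c_str() + pos;
    char* end = nullptr;
    // strtol skips leading spaces and stops at the first non-digit (the comma).
    const long n = std::strtol(start, &end, 10);
    assert(end != start && "marker not followed by a number");
    total += n;
  }
  return total;
}
```

Applied to the two HypreAMG logs above, this gives 530 total CG iterations for byNODES (352 + 14 + 164) and 101 for byVDIM (57 + 17 + 27), roughly a 5.2x reduction, while the wall clock time drops from 36576 ms to 21694 ms, about 1.7x.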

Jacobi performance for comparison (note that with the proper ordering, AMG is faster!):

With Ordering::byNODES (current):

Newton iteration   0 : ||r|| =   5.57193e-06 
real energy =  -1.18445e-06, model energy =  -1.19326e-06, cg iter =    1423, next tr size =       10, accepting = 1
Newton iteration   1 : ||r|| =   5.29004e-05, ||r||/||r_0|| =       9.49409
real energy =  -3.90207e-09, model energy =  -3.90212e-09, cg iter =      83, next tr size =       10, accepting = 1
Newton iteration   2 : ||r|| =   5.16122e-08, ||r||/||r_0|| =     0.0092629
real energy =  -1.49589e-12, model energy =  -1.49588e-12, cg iter =     981, next tr size =       10, accepting = 1
Newton iteration   3 : ||r|| =   1.88543e-09, ||r||/||r_0|| =    0.00033838
[       OK ] SolidMechanics.nonlinear_solve (24073 ms)

With Ordering::byVDIM (changed):

Newton iteration   0 : ||r|| =   5.57193e-06
real energy =  -1.18445e-06, model energy =  -1.19326e-06, cg iter =    1423, next tr size =       10, accepting = 1
Newton iteration   1 : ||r|| =   5.29004e-05, ||r||/||r_0|| =        9.4941
real energy =  -3.90208e-09, model energy =  -3.90212e-09, cg iter =      83, next tr size =       10, accepting = 1
Newton iteration   2 : ||r|| =   5.16123e-08, ||r||/||r_0|| =    0.00926291
real energy =  -1.49589e-12, model energy =  -1.49588e-12, cg iter =     981, next tr size =       10, accepting = 1
Newton iteration   3 : ||r|| =   1.87743e-09, ||r||/||r_0|| =   0.000336944
[       OK ] SolidMechanics.nonlinear_solve (23486 ms)

@tupek2 This may be part of why AMG always ends up slower than Jacobi

Thanks for looking at this. My understanding was that Hypre originally supported only one DOF ordering option, but that a couple of years ago support was added for both (either in Hypre directly, or by inserting an extra permutation in mfem). The fact that one option is almost an order of magnitude slower than the other is certainly surprising to me. I'll ask some mfem developers for clarification on what might be the cause.

I think that Ordering::byVDIM is actually preferred by Hypre now -- MFEM constructs a permutation matrix if Ordering::byNODES is used.
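To make the permutation idea concrete, here is a rough sketch of what reordering a byNODES vector into byVDIM layout involves. This is not MFEM's or Hypre's actual implementation; the function names are hypothetical and the code only illustrates the index shuffle that such a permutation has to perform.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: perm[i] is the byVDIM index of the entry that
// lives at byNODES index i.
std::vector<std::size_t> nodes_to_vdim_permutation(std::size_t ndofs,
                                                   std::size_t vdim) {
  std::vector<std::size_t> perm(ndofs * vdim);
  for (std::size_t comp = 0; comp < vdim; ++comp) {
    for (std::size_t dof = 0; dof < ndofs; ++dof) {
      perm[dof + ndofs * comp] = comp + vdim * dof;
    }
  }
  return perm;
}

// Hypothetical helper: copy a byNODES vector into byVDIM layout.
std::vector<double> reorder_nodes_to_vdim(const std::vector<double>& by_nodes,
                                          std::size_t ndofs,
                                          std::size_t vdim) {
  assert(by_nodes.size() == ndofs * vdim);
  const std::vector<std::size_t> perm = nodes_to_vdim_permutation(ndofs, vdim);
  std::vector<double> by_vdim(by_nodes.size());
  for (std::size_t i = 0; i < by_nodes.size(); ++i) {
    by_vdim[perm[i]] = by_nodes[i];
  }
  return by_vdim;
}
```

For two nodes with three components, a byNODES vector {x0, x1, y0, y1, z0, z1} becomes {x0, y0, z0, x1, y1, z1}. Any auxiliary vectors handed to the preconditioner would need the same reordering to stay consistent with the matrix.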

I assume that the slowdown is a bug due to the permutation being wrong/not properly applied when constructing the near-null space of the operator.

This seems like a pretty significant development. If byVDIM is best practice now, LiDO may need to update its examples, tests and documentation.

@samuelpmishLLNL Here are the results without changing the amg_prec->SetSystemsOptions line (except for the boolean order_bynodes argument):

With byNODES:

Newton iteration   0 : ||r|| =   5.57193e-06
real energy =  -1.18445e-06, model energy =  -1.19326e-06, cg iter =     588, next tr size =       10, accepting = 1
Newton iteration   1 : ||r|| =   5.28999e-05, ||r||/||r_0|| =       9.49401
real energy =  -3.90206e-09, model energy =   -3.9021e-09, cg iter =      20, next tr size =       10, accepting = 1
Newton iteration   2 : ||r|| =   4.72796e-08, ||r||/||r_0|| =    0.00848532
real energy =   -1.4618e-12, model energy =  -1.46179e-12, cg iter =     275, next tr size =       10, accepting = 1
Newton iteration   3 : ||r|| =   1.93402e-09, ||r||/||r_0|| =   0.000347101
[       OK ] SolidMechanics.nonlinear_solve (29099 ms)

With byVDIM:

Newton iteration   0 : ||r|| =   5.57193e-06
real energy =  -1.18445e-06, model energy =  -1.19326e-06, cg iter =      42, next tr size =       10, accepting = 1
Newton iteration   1 : ||r|| =   5.29002e-05, ||r||/||r_0|| =       9.49405
real energy =  -3.90347e-09, model energy =   -3.9035e-09, cg iter =      13, next tr size =       10, accepting = 1
Newton iteration   2 : ||r|| =   5.15543e-08, ||r||/||r_0|| =     0.0092525
real energy =   -4.5598e-14, model energy =   -4.5598e-14, cg iter =      16, next tr size =       10, accepting = 1
Newton iteration   3 : ||r|| =   1.33928e-09, ||r||/||r_0|| =   0.000240362
[       OK ] SolidMechanics.nonlinear_solve (20307 ms)

Note that this configuration is faster overall because the preconditioner is cheaper to construct, but it takes far more CG iterations in the byNODES case.