ginkgo-project / ginkgo

Numerical linear algebra software package

Home Page: https://ginkgo-project.github.io/

NaN Residuals with CUDA GMRES+ParILUT

iontcheva opened this issue · comments

Hello,

I have been testing ParILUT with GMRES with linear systems extracted from my application.

I am seeing NaN residuals with the CUDA executor on an H100. The exact same code with the OMP and Reference executors converges in 14 GMRES iterations without issues.

The matrix of the linear system is in the 500K range, and due to the upload file-size limit I have split it into 7 parts:
cis.mtx.gz.part-aa
cis.mtx.gz.part-ab
cis.mtx.gz.part-ac
cis.mtx.gz.part-ad
cis.mtx.gz.part-ae
cis.mtx.gz.part-af
cis.mtx.gz.part-ag

To merge these files into the actual matrix you can use:
cat cis.mtx.gz.part* > cis.mtx.gz
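As a sanity check, the reassembly can be verified end-to-end on synthetic data (the file names below are placeholders mimicking the uploaded parts; the real parts are cis.mtx.gz.part-aa through -ag):

```shell
# Build a sample gzip archive, split it like the uploaded parts,
# then verify that concatenating the parts restores it byte-for-byte.
set -e
tmp=$(mktemp -d) && cd "$tmp"
printf 'sample matrix data\n' | gzip > cis.mtx.gz
split -b 8 cis.mtx.gz cis.mtx.gz.part-   # creates part-aa, part-ab, ...
cat cis.mtx.gz.part-* > merged.mtx.gz    # same command as above
cmp cis.mtx.gz merged.mtx.gz && echo "parts merge cleanly"
gunzip -c merged.mtx.gz                  # prints: sample matrix data
```

The shell glob `part-*` sorts the suffixes lexically, so the parts are concatenated in the right order automatically.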

The rhs file is relatively small:
cis_rhs.mtx.gz

I will upload the files in a few submissions that follow.

See the attached snapshots showing the NaNs with the CUDA executor and the corresponding runs with the OMP and Reference executors.
CUDA_exec_1
CUDA_exec_2
OMP_exec
Reference_exec

Part 1 of the matrix
cis.mtx.gz.part-aa.gz

I have been debugging the issue and tracked it down to:
ginkgo/common/cuda_hip/factorization/par_ilut_spgeam_kernels.hpp.inc, line 131, where lu_cur_val is NaN or Inf.

I am using a slightly modified version of ilu-preconditioned-solver-example.cpp for my testing.

These are the parameters that I am using:
auto par_ilu_fact =
    gko::factorization::ParIlut<ValueType, IndexType>::build()
        .with_iterations(10u)
        .with_fill_in_limit(2.0)
        .on(exec);

const RealValueType reduction_factor{1e-7};
auto ilu_gmres_factory =
    gmres::build()
        .with_criteria(gko::stop::Iteration::build().with_max_iters(100u),
                       gko::stop::ResidualNorm<ValueType>::build()
                           .with_reduction_factor(reduction_factor))
        .with_generated_preconditioner(ilu_preconditioner)
        .on(exec);
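For context, here is a sketch of the glue between these two factories, roughly following ilu-preconditioned-solver-example.cpp (this assumes A, b, x are already on exec; the default Ilu preconditioner template parameters and variable names are assumptions, not verbatim example code):

```cpp
// Generate the approximate LU factors, wrap them in an ILU
// preconditioner, and hand that to the GMRES factory (sketch only).
auto factors = gko::share(par_ilu_fact->generate(A));
auto ilu_preconditioner = gko::share(
    gko::preconditioner::Ilu<>::build().on(exec)->generate(factors));
auto solver = ilu_gmres_factory->generate(A);
solver->apply(b, x);
```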

Please let me know whether you can reproduce the issue.

Hi @iontcheva, the ParILUT can be quite unstable, especially across different executors, since it only computes an approximation of the ILU factorization. How close the approximation is usually depends on the .with_iterations parameter. So, as a first step, maybe try to increase that parameter.

I think what's likely happening here is that while we guard against NaNs/Infs in our asynchronous sweep, they may still come up in the other operations, e.g. from an overflowing value in SpGEMM (which is where the lu_val comes from). Without looking at the specific problem, I'm not sure we can do much about this; the preconditioner may just not work on certain problems.

Hi @MarcelKoch, @upsj,

I did try many options for .with_iterations, including the value 20, which is quite high, but the issue is not resolved.

I think there is a bug in the implementation of the CUDA version.

The OMP and Reference versions work perfectly fine, as I have shown above, so I think the ParILUT algorithm itself is not the problem.

Regarding the comment from @upsj: I added a check if (!is_finite(lu_val) || is_nan(lu_val)) in tri_spgeam_init
after line 118 in ginkgo/common/cuda_hip/factorization/par_ilut_spgeam_kernels.hpp.inc, where lu_val is loaded:

auto lu_val =
    checked_load(lu_vals, lu_begin + lane, lu_end, zero());

and it does not seem to get triggered, which I think should mean that the values computed by the cuSPARSE SpGEMM are fine.

What gets triggered is a similar check on the value lu_cur_val after line 131.

Did you manage to assemble the matrix that I sent and try the specific example? I do not think one can say much without looking at the specific example anyway.

If you can reproduce the issue on your side and resolve it, I think it would make Ginkgo's CUDA backend much more useful for many applications.

One cannot solve anything harder (and all the cases in real applications are of that kind) with a simple ILU(0)-type preconditioner like ParILU; one needs more advanced preconditioners like ParILUT.

I am not sure whether this is relevant, but just an observation:

After your fix with the atomic load_relaxed and store_relaxed in the sweep a few weeks ago, I was able to get some of my smaller examples to work with Ginkgo CUDA GMRES+ParILUT, which was not the case before: I was getting NaNs on all of my examples.

When using the Ginkgo static library, it crashes at the generation step:
auto par_ilu_fact = gko::factorization::ParIlut
here -> auto par_ilu = gko::share(par_ilu_fact->generate(A))

terminate called after throwing an instance of 'gko::CusparseError'
what(): /tools/ginkgo/ginkgo-git/cuda/base/cusparse_bindings.hpp:524: spgemm_work_estimation: Unknown error
Abort (core dumped)

@uboats Are you also running this on an H100? I think our cuSPARSE exceptions might be a bit outdated. In any case, I think this error might be due to insufficient GPU memory for the factorization. I had the same issue on my old personal GPU.

@MarcelKoch yes, an H100 (80GB).
For a smaller one (a matrix of dimension 357k), the error is: Unrecoverable CUDA error on device 0 in deallocate:69: cudaErrorIllegalAddress: an illegal memory access was encountered

I will try alg 3 for SpGEMM and see.

Tried alg2 and it works, so it is a GPU memory issue: alg1 needs too much memory.
I am not sure whether it is the cuSPARSE SpGEMM that cannot return the correct error code or Ginkgo that does not translate it correctly; the reported error is just "Unknown error".

For ParILUT, can we have one more parameter to choose the SpGEMM algorithm?
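Something along these lines, as an API sketch only (both the spgemm_algorithm enum and the with_spgemm_algorithm builder parameter are invented here for illustration; no such option exists in Ginkgo as of this thread):

```cpp
// Hypothetical API: let the user pick the cuSPARSE SpGEMM algorithm
// to trade peak memory for speed. Neither the enum nor the builder
// parameter below exists in Ginkgo today.
enum class spgemm_algorithm { alg1, alg2, alg3 };

auto par_ilut_fact =
    gko::factorization::ParIlut<ValueType, IndexType>::build()
        .with_iterations(10u)
        .with_fill_in_limit(2.0)
        .with_spgemm_algorithm(spgemm_algorithm::alg2)  // hypothetical
        .on(exec);
```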