NLAFET / SpLLT

Sparse Cholesky solver implemented with a runtime system

Wrong signature for spllt_solve() in drivers/spllt_omp_bench

learning-chip opened this issue

Hi @cayrols @flipflapflop, thanks for releasing this code. I read the paper "Parallelization of the solve phase in a task-based Cholesky solver using a sequential task flow model" and want to reproduce its results. However, there seem to be errors in the driver routines. Below are my attempts and the steps to reproduce them.

Problem description

Compiling drivers/spllt_omp_bench.F90 gives the following error:

[ 96%] Building Fortran object CMakeFiles/spllt_omp_bench.dir/drivers/spllt_omp_bench.F90.o
/opt/SpLLT/drivers/spllt_omp_bench.F90:296:39:

  296 |     call spllt_compute_solve_dep(fkeep)
      |                                       1
Error: Missing actual argument for argument 'stat' at (1)
/opt/SpLLT/drivers/spllt_omp_bench.F90:333:55:

  333 |         workspace=workspace, task_manager=task_manager)
      |                                                       1
Error: There is no specific subroutine for the generic 'spllt_solve' at (1)
/opt/SpLLT/drivers/spllt_omp_bench.F90:355:55:

  355 |         workspace=workspace, task_manager=task_manager)
      |                                                       1
Error: There is no specific subroutine for the generic 'spllt_solve' at (1)
make[3]: *** [CMakeFiles/spllt_omp_bench.dir/build.make:63: CMakeFiles/spllt_omp_bench.dir/drivers/spllt_omp_bench.F90.o] Error 1
make[2]: *** [CMakeFiles/Makefile2:102: CMakeFiles/spllt_omp_bench.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:109: CMakeFiles/spllt_omp_bench.dir/rule] Error 2
make: *** [Makefile:118: spllt_omp_bench] Error 2

Attempted fix

The errors above are caused by incorrect call signatures for spllt_compute_solve_dep() and spllt_solve(). The first error is easily fixed by changing call spllt_compute_solve_dep(fkeep) to call spllt_compute_solve_dep(fkeep, stat = st).
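The first compiler message is what gfortran reports when a call leaves out a dummy argument that is not declared optional. The following is a minimal sketch of that situation, not SpLLT code: the module demo_dep_mod, the routine demo_compute_dep, and its nblk argument are hypothetical stand-ins chosen only to show the calling pattern.

```fortran
module demo_dep_mod
  implicit none
contains
  ! Hypothetical stand-in for a routine with a required status argument.
  ! It is NOT the SpLLT implementation; only the calling pattern matters.
  subroutine demo_compute_dep(nblk, stat)
    integer, intent(in)  :: nblk   ! hypothetical input
    integer, intent(out) :: stat   ! required: not declared optional
    stat = 0
    if (nblk < 0) stat = -1        ! report invalid input through stat
  end subroutine demo_compute_dep
end module demo_dep_mod

program demo_stat
  use demo_dep_mod
  implicit none
  integer :: st

  ! call demo_compute_dep(4)        ! rejected: missing actual argument for 'stat'
  call demo_compute_dep(4, stat=st) ! accepted: keyword form, as in the fix above
  if (st /= 0) error stop "unexpected status"
  print '(a,i0)', 'stat = ', st
end program demo_stat
```

Uncommenting the first call should reproduce the same class of error as in the build log, assuming the real routine also declares stat without the optional attribute.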

The second error is caused by these invalid subroutine calls:

      call spllt_solve(fkeep, options, order, nrhs, sol_computed, info, job=1, &
        workspace=workspace, task_manager=task_manager)
...
      call spllt_solve(fkeep, options, order, nrhs, sol_computed, info, job=2, &
        workspace=workspace, task_manager=task_manager)

spllt_solve() is defined in src/spllt_solve_mod.F90 as:

   interface spllt_solve
      module procedure spllt_solve_one_double
      module procedure spllt_solve_mult_double
      module procedure spllt_solve_mult_double_worker
   end interface
...
subroutine spllt_solve_one_double(fkeep, options, x, job, info)
...
subroutine spllt_solve_mult_double(fkeep, options, nrhs, x, job, info)
...
subroutine spllt_solve_mult_double_worker(fkeep, options, nrhs, x, &
    job, task_manager, info)

Only spllt_solve_mult_double_worker() takes a task_manager argument, and none of the three takes a workspace argument. None of them takes the order argument (the matrix permutation) either, although spllt_omp_bench.F90 passes it.
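To illustrate why the compiler reports "There is no specific subroutine for the generic", here is a small self-contained sketch of generic resolution. It does not use SpLLT: demo_solve, demo_solve_one, and demo_solve_mult are hypothetical names with simplified argument lists that only mimic the shape of the interface quoted above. A call resolves only if some specific procedure matches every actual argument by type, kind, rank, and keyword name.

```fortran
module demo_solve_mod
  implicit none
  private
  public :: demo_solve

  ! Hypothetical generic with simplified argument lists; it only mimics
  ! the shape of the spllt_solve interface quoted above.
  interface demo_solve
     module procedure demo_solve_one
     module procedure demo_solve_mult
  end interface demo_solve

contains

  subroutine demo_solve_one(x, job, info)
    double precision, intent(inout) :: x(:)
    integer, intent(in)  :: job
    integer, intent(out) :: info
    x = x * job
    info = 1          ! marks which specific ran
  end subroutine demo_solve_one

  subroutine demo_solve_mult(nrhs, x, job, info)
    integer, intent(in)  :: nrhs
    double precision, intent(inout) :: x(:,:)
    integer, intent(in)  :: job
    integer, intent(out) :: info
    x(:, 1:nrhs) = x(:, 1:nrhs) * job
    info = 2          ! marks which specific ran
  end subroutine demo_solve_mult

end module demo_solve_mod

program demo_generic
  use demo_solve_mod
  implicit none
  double precision :: x1(3), x2(3,2)
  integer :: info

  x1 = 1.0d0
  x2 = 1.0d0

  call demo_solve(x1, 1, info)              ! rank-1 x: demo_solve_one
  if (info /= 1) error stop "expected demo_solve_one"

  call demo_solve(2, x2, job=2, info=info)  ! rank-2 x: demo_solve_mult
  if (info /= 2) error stop "expected demo_solve_mult"

  ! The next call names a dummy argument that no specific declares, so the
  ! compiler would reject it because no specific subroutine matches:
  ! call demo_solve(2, x2, job=2, info=info, workspace=x1)

  print '(a,i0)', 'last specific called: ', info
end program demo_generic
```

The commented-out call fails for the same reason as the driver's calls: a keyword such as workspace that no specific declares rules out every candidate, regardless of how well the other arguments match.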

Another program, test/test_solve_phasis.F90, compiles without errors, so its call signature should be correct:

call spllt_solve(fkeep, options, nrhs, sol_computed, &
1, task_manager, info)
call spllt_solve(fkeep, options, nrhs, sol_computed, &
2, task_manager, info)

I therefore changed the problematic calls in drivers/spllt_omp_bench.F90 to:

call spllt_solve_mult_double_worker(fkeep, options, nrhs, sol_computed, 1, task_manager, info)
...
call spllt_solve_mult_double_worker(fkeep, options, nrhs, sol_computed, 2, task_manager, info)

Error after fix

Now spllt_omp_bench compiles successfully, but it fails at run time with an out-of-bounds array access:

Matrix file                  = matrix.rb
Matrix format                = csc
Number of CPUs               =    1
Block size                   =   16
Supernode amalgamation nemin =   32
Reading...
ok
 [analysis][prune_tree] nth:            1
[>] [spllt_stf_factorize]   setup and activate nodes time:  7.000E-03 s
[>] [spllt_stf_factorize] task insert time:  7.000E-03 s
Allocation of a workspace of size   4.13E+05
 #Subtree :          249
At line 961 of file /opt/SpLLT/src/spllt_solve_dep_mod.F90
Fortran runtime error: Index '1' of dimension 1 of array 'fkeep%sbc' above upper bound of 0

Error termination. Backtrace:
#0  0x7feaafed1d21 in ???
#1  0x7feaafed2869 in ???
#2  0x7feaafed2ee6 in ???
#3  0x55a53008b148 in __spllt_solve_dep_mod_MOD_fwd_update_dependency
	at /opt/SpLLT/src/spllt_solve_dep_mod.F90:961
#4  0x55a53008bd27 in __spllt_solve_dep_mod_MOD_spllt_compute_blk_solve_dep
	at /opt/SpLLT/src/spllt_solve_dep_mod.F90:248
#5  0x55a53008dc70 in __spllt_solve_dep_mod_MOD_spllt_compute_solve_dep
	at /opt/SpLLT/src/spllt_solve_dep_mod.F90:271
#6  0x55a530071020 in MAIN__._omp_fn.1
	at /opt/SpLLT/drivers/spllt_omp_bench.F90:409
#7  0x7feaafd3878d in ???
#8  0x7feaafa6c608 in ???
#9  0x7feaafc30132 in ???
#10  0xffffffffffffffff in ???
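The runtime message says that index 1 exceeded an upper bound of 0, which means fkeep%sbc had no valid elements when fwd_update_dependency indexed it. The sketch below shows the same condition in isolation and one way to guard against it. It is only a diagnostic pattern under the assumption that the array is empty at that point. The local array sbc and the index blk are hypothetical stand-ins, and the sketch does not explain why the real array is empty.

```fortran
program demo_bounds
  implicit none
  ! Hypothetical stand-in for a block array that ended up empty.
  integer, allocatable :: sbc(:)
  integer :: blk

  allocate(sbc(0))                 ! zero-length: valid, but has no elements
  blk = 1

  print '(a,i0,a,i0)', 'size = ', size(sbc), ', ubound = ', ubound(sbc, 1)

  ! Guard the access instead of indexing unconditionally.
  if (blk >= lbound(sbc, 1) .and. blk <= ubound(sbc, 1)) then
     print '(a,i0)', 'sbc(blk) = ', sbc(blk)
  else
     print '(a,i0,a)', 'index ', blk, ' is outside sbc; nothing to update'
  end if
end program demo_bounds
```

Replacing the guarded branch with a bare sbc(blk) and compiling with gfortran -fcheck=bounds should produce the same kind of "above upper bound of 0" error as in the log above.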

Reproducible Dockerfile

To ease reproducibility, here is a Dockerfile that reproduces the compile error:

FROM ubuntu:20.04

RUN apt-get update \
    && DEBIAN_FRONTEND=noninteractive apt-get install -y \
    git wget vim \
    gcc g++ gfortran \
    libblas-dev liblapack-dev \
    libnuma-dev \
    libhwloc-dev \
    libmetis-dev \
    libudev-dev \
    make cmake \
    autoconf pkgconf \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /opt

# Build Spral
RUN git clone https://github.com/ralna/spral.git \
    && cd spral \
    && ./autogen.sh \
    && CC=gcc CXX=g++ FC=gfortran ./configure \
        --prefix=/opt/spral_install \
        --disable-openmp --disable-gpu \
        --with-metis="-L/usr/lib/x86_64-linux-gnu -lmetis" \
    && make \
    && make install \
    && cp *.mod /opt/spral_install/include/

# Build SpLLT
RUN git clone https://github.com/NLAFET/SpLLT \
    && cd SpLLT \
    && mkdir -p build/build_omp \
    && cd build/build_omp \
    && mkdir log \
    && CC=gcc CXX=g++ FC=gfortran cmake \
        -DRUNTIME=OMP \
        -DSPRAL_LIB=/opt/spral_install/lib \
        -DSPRAL_INC=/opt/spral_install/include \
        -DMETIS_LIB=/usr/lib/x86_64-linux-gnu \
        -DMETIS_INC=/usr/include \
        ../.. 2>&1 | tee log/cmake_spllt_omp.log \
    && make spllt 2>&1 | tee log/make_spllt.log
# `make` or `make all` leads to error at `spllt_omp_bench`

WORKDIR /opt/SpLLT/build/build_omp
RUN make test_solve_phasis 2>&1 | tee log/make_test.log

# prepare test matrix
RUN mkdir /opt/data \
    && cd /opt/data \
    && wget https://suitesparse-collection-website.herokuapp.com/RB/Schmid/thermal1.tar.gz \
    && tar zxvf thermal1.tar.gz

# run test program, succeeds
RUN ln -s /opt/data/thermal1/thermal1.rb matrix.rb \
    && ./test_solve_phasis 2>&1 | tee log/run_test.log

# Build SpLLT driver, get compile error
RUN make spllt_omp_bench 2>&1 | tee log/make_driver.log

Run

docker build -t spllt_debug .
docker run --rm -it spllt_debug

The logs are then available under /opt/SpLLT/build/build_omp/log inside the container.