xiaoyeli / superlu

Supernodal sparse direct solver. https://portal.nersc.gov/project/sparse/superlu/

bug report: dereferencing of uninitialized pointers in SuperLU_DIST v7.1.0

peter-zajac opened this issue · comments

Hello, I have been trying to include SuperLU_DIST v7.1.0 as a solver in our FEM package, and I stumbled across a problem where the function "dReDistribute_A" tries to dereference several uninitialized/unallocated arrays under certain circumstances.

First of all, I have compiled SuperLU_DIST v7.1.0 with Visual Studio under Windows without ParMETIS or LAPACK or any other third-party library. Although I have not tried to compile SuperLU under Linux yet, I assume that we would get segfaults there, too.

I have managed to boil the problem down to a standalone "minimal" reproduction case with 4 MPI processes. The source code is based on your first "pddrive.c" example, so simply replacing that one source file with the contents of the following file, compiling the example and starting it with 4 MPI processes should trigger the error:
pddrive.txt

Note 1: GitHub wouldn't let me upload files with the extension 'c', so I have uploaded it as a 'txt' file here.
Note 2: The code must be run with 4 MPI processes, because the matrix is hardwired into the code.

The function at fault is dReDistribute_A from SRC/pddistribute.c and there are 2 sections of code which are misbehaving:

Starting at line 139 we have:

    /* Allocate space for storing the triplets after redistribution. */
    if ( k ) { /* count can be zero. */
        if ( !(ia = intMalloc_dist(2*k)) )
            ABORT("Malloc fails for ia[].");
        if ( !(aij = doubleMalloc_dist(k)) )
            ABORT("Malloc fails for aij[].");
    }
    ja = ia + k;

In the last line, ja = ia + k;, the variable ia is read even though k might be zero, in which case ia has never been allocated or assigned. This issue can be fixed easily by moving the last statement into the body of the if statement above. I have to admit that this is not directly a memory error, because ia is not dereferenced; however, the code still crashes under Windows because of an uninitialized variable read.
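A minimal sketch of this first fix, under some assumptions: plain malloc() stands in for intMalloc_dist() and doubleMalloc_dist(), int stands in for SuperLU's index type, and the helper name alloc_triplets is hypothetical (it is not part of SuperLU_DIST). The point is only that every pointer gets a defined value up front and that ja is derived from ia solely after ia has been allocated:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the triplet allocation in dReDistribute_A.
 * Plain malloc() replaces the SuperLU_DIST allocators so the sketch
 * compiles on its own. Returns 0 on success, -1 if an allocation fails. */
static int alloc_triplets(size_t k, int **ia, int **ja, double **aij)
{
    assert(ia != NULL && ja != NULL && aij != NULL);

    /* Give every output a defined value before any branch is taken. */
    *ia = NULL;
    *ja = NULL;
    *aij = NULL;

    if (k) { /* count can be zero */
        *ia = malloc(2 * k * sizeof **ia);
        *aij = malloc(k * sizeof **aij);
        if (*ia == NULL || *aij == NULL) {
            free(*ia);  /* free(NULL) is a no-op */
            free(*aij);
            *ia = NULL;
            *aij = NULL;
            return -1;
        }
        *ja = *ia + k; /* computed only once ia is a valid allocation */
    }
    return 0;
}
```

With k equal to zero, all three pointers come back as NULL, so a later free() or NULL check behaves predictably instead of reading an indeterminate value.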

The next two problems are inside the for loop starting at line 171, but let's first take a look at an if clause starting at line 156:

      if ( SendCnt ) { /* count can be zero */
          if ( !(index = intMalloc_dist(2*SendCnt)) )
              ABORT("Malloc fails for index[].");
          if ( !(nzval = doubleMalloc_dist(SendCnt)) )
              ABORT("Malloc fails for nzval[].");
      }

As we see, neither index nor nzval is allocated if the total send count is zero. Now, starting at line 171, we have:

      for (i = 0, j = 0, p = 0; p < procs; ++p) {
          if ( p != iam ) {
              ia_send[p] = &index[i];
              i += 2 * nnzToSend[p]; /* ia/ja indices alternate */
              aij_send[p] = &nzval[j];
              j += nnzToSend[p];
          }
      }

Here, both index and nzval are always accessed, even if there is nothing to send and therefore neither array has been allocated. Again, both errors can be fixed easily by first checking whether there is anything to send, i.e.:

      for (i = 0, j = 0, p = 0; p < procs; ++p) {
          if ( p != iam ) {
              if(nnzToSend[p] > 0) ia_send[p] = &index[i];
              i += 2 * nnzToSend[p]; /* ia/ja indices alternate */
              if(nnzToSend[p] > 0) aij_send[p] = &nzval[j];
              j += nnzToSend[p];
          }
      }

These three changes seem to fix the bugs; however, I would suggest that one of the developers take a closer look at the rest of the dReDistribute_A function to make sure that there are no other potential uninitialized pointer dereferences.
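One defensive variant of the send-pointer loop, sketched as a standalone function: every slot of ia_send and aij_send is set to NULL first, and the shared buffers are only touched when there is something to send. The function name set_send_pointers and the use of int for the index type are assumptions made for illustration; this is not the code that ships in SuperLU_DIST.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper mirroring the loop in dReDistribute_A: carve one
 * slice per remote process out of the shared send buffers. index and
 * nzval may be NULL when the total send count is zero; in that case
 * every slice pointer simply stays NULL. */
static void set_send_pointers(int procs, int iam, const int *nnzToSend,
                              int *index, double *nzval,
                              int **ia_send, double **aij_send)
{
    size_t i = 0, j = 0;

    for (int p = 0; p < procs; ++p) {
        /* Defined value for every slot, including p == iam. */
        ia_send[p] = NULL;
        aij_send[p] = NULL;

        if (p != iam && nnzToSend[p] > 0) {
            assert(index != NULL && nzval != NULL);
            ia_send[p] = &index[i];
            aij_send[p] = &nzval[j];
            i += 2 * (size_t)nnzToSend[p]; /* ia/ja indices alternate */
            j += (size_t)nnzToSend[p];
        }
    }
}
```

Skipping the offset update when nnzToSend[p] is zero gives the same offsets as the original loop, since adding zero changes nothing.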

Thanks for the detailed diagnosis and example. I cannot reproduce the errors on my macOS machine, but your logic is correct, so I made the changes. I tagged a v7.1.1 bug-fix version; please try it out.

Hello, thank you for your quick fix, the problem is solved now with v7.1.1.

Also, I have been thinking about the problem (from an academic point of view), and I have to admit that my original analysis was not quite correct: there are no pointer dereferences, because even expressions like ia_send[p] = &index[i]; boil down to simple pointer arithmetic. That is the reason why the code did not cause any segfaults on Linux systems. So it seems that Microsoft has simply improved the memory-checking routines in their runtime libraries, because at the crash, the error message simply complained about uninitialized memory being read.
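The pointer-arithmetic point can be checked in isolation. In C, &base[i] is evaluated as base + i without accessing the element, so the expression only reads the pointer variable itself. The small helper below (the name same_address is made up for this sketch) compares the two forms on a valid buffer. Reading an uninitialized pointer variable is still a read of an indeterminate value, which a debug runtime may report; MSVC's /RTCu run-time check is one mechanism of that kind, though whether it was the check that fired here is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 when &base[i] and base + i name the same address.
 * Neither expression loads base[i]; both only compute an address
 * from the value stored in the pointer variable base. */
static int same_address(int *base, size_t i)
{
    assert(base != NULL);
    return &base[i] == base + i;
}
```

Calling same_address(buf, i) on any valid buffer, for any i up to one past its last element, returns 1.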

In any case, the problem is solved, so we can close this issue ticket now.