eth-cscs / SpFFT

Sparse 3D FFT library with MPI, OpenMP, CUDA and ROCm support

Using SpFFT inside of an OpenMP region

MRedies opened this issue

I am trying to use SpFFT inside of an OpenMP region and I keep getting segfaults. I modified one of your examples to show the issue. I just want to do the FFT in parallel over several independent frequency regions:

program main
    use iso_c_binding
    use spfft
    implicit none
    integer :: i, j, k, counter
    integer, parameter :: dimX = 2
    integer, parameter :: dimY = 2
    integer, parameter :: dimZ = 2
    integer, parameter :: maxNumLocalZColumns = dimX * dimY
    integer, parameter :: processingUnit = 1
    integer, parameter :: maxNumThreads = -1
    type(c_ptr) :: grid = c_null_ptr
    type(c_ptr) :: transform = c_null_ptr
    integer :: errorCode = 0
    integer, dimension(dimX * dimY * dimZ * 3):: indices = 0
    complex(C_DOUBLE_COMPLEX), dimension(dimX * dimY * dimZ, 1000):: frequencyElements
    complex(C_DOUBLE_COMPLEX), pointer :: spaceDomain(:,:,:)
    type(c_ptr) :: realValuesPtr


    counter = 0
    do k = 1, dimZ
        do j = 1, dimY
           do i = 1, dimX
             frequencyElements(counter + 1,:) = cmplx(counter, -counter)
             indices(counter * 3 + 1) = i - 1
             indices(counter * 3 + 2) = j - 1
             indices(counter * 3 + 3) = k - 1
             counter = counter + 1
            end do
        end do
    end do

    ! print input
    ! print *, "Input:"
    ! do i = 1, size(frequencyElements)
    !      print *, frequencyElements(i)
    ! end do


    ! create grid and transform

    !$OMP PARALLEL  default(none) &
    !$OMP private(i, errorcode, grid,realValuesPtr, transform)&
    !$OMP shared(indices, frequencyElements)
    errorCode = spfft_grid_create(grid, dimX, dimY, dimZ, maxNumLocalZColumns, processingUnit, maxNumThreads);
    if (errorCode /= SPFFT_SUCCESS) error stop
    errorCode = spfft_transform_create(transform, grid, processingUnit, 0, dimX, dimY, dimZ, dimZ,&
        size(frequencyElements,1), SPFFT_INDEX_TRIPLETS, indices)
    if (errorCode /= SPFFT_SUCCESS) error stop

    ! grid can be safely destroyed after creating all required transforms
    errorCode = spfft_grid_destroy(grid)
    if (errorCode /= SPFFT_SUCCESS) error stop

    ! set space domain array to use memory allocated by the library
    errorCode = spfft_transform_get_space_domain(transform, processingUnit, realValuesPtr)
    if (errorCode /= SPFFT_SUCCESS) error stop

    ! transform backward
    !$OMP DO
    do i=1,1000
        errorCode = spfft_transform_backward(transform, frequencyElements(:,i), processingUnit)
        if (errorCode /= SPFFT_SUCCESS) error stop
    enddo
    !$OMP end do
    !$OMP end PARALLEL
end

If I run it I get a segfault and a core dump. This is the backtrace of the core:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f45cc137859 in __GI_abort () at abort.c:79
#2  0x00007f45cc1a23ee in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f45cc2cc285 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x00007f45cc1aa47c in malloc_printerr (str=str@entry=0x7f45cc2ce278 "malloc_consolidate(): invalid chunk size") at malloc.c:5347
#4  0x00007f45cc1aac58 in malloc_consolidate (av=av@entry=0x7f45a0000020) at malloc.c:4477
#5  0x00007f45cc1ace03 in _int_malloc (av=av@entry=0x7f45a0000020, bytes=bytes@entry=9328) at malloc.c:3699
#6  0x00007f45cc1adc5f in _int_memalign (av=av@entry=0x7f45a0000020, alignment=alignment@entry=32, bytes=bytes@entry=9248) at malloc.c:4684
#7  0x00007f45cc1b051c in _mid_memalign (address=<optimized out>, bytes=9248, alignment=32) at malloc.c:3312
#8  __GI___libc_memalign (alignment=<optimized out>, bytes=9248) at malloc.c:3261
#9  0x00007f45cbf2f3b9 in fftw_malloc_plain () from /lib/x86_64-linux-gnu/libfftw3.so.3
#10 0x00007f45cbf30b4f in ?? () from /lib/x86_64-linux-gnu/libfftw3.so.3
#11 0x00007f45cbf3a3a8 in fftw_kdft_register () from /lib/x86_64-linux-gnu/libfftw3.so.3
#12 0x00007f45cbf33460 in fftw_solvtab_exec () from /lib/x86_64-linux-gnu/libfftw3.so.3
#13 0x00007f45cbf36a3f in fftw_dft_conf_standard () from /lib/x86_64-linux-gnu/libfftw3.so.3
#14 0x00007f45cc00191d in fftw_configure_planner () from /lib/x86_64-linux-gnu/libfftw3.so.3
#15 0x00007f45cc005280 in fftw_the_planner () from /lib/x86_64-linux-gnu/libfftw3.so.3
#16 0x00007f45cc0016ae in fftw_mkapiplan () from /lib/x86_64-linux-gnu/libfftw3.so.3
#17 0x00007f45cc004e07 in fftw_plan_many_dft () from /lib/x86_64-linux-gnu/libfftw3.so.3
#18 0x00007f45cc65a89c in spfft::FFTWPlan<double>::FFTWPlan (this=0x7f45a0001120, input=0x7f45a0001000, output=0x7f45a0001000, size=2, istride=1, ostride=1, idist=2, odist=2, howmany=1, sign=1) at /home/matthias/libraries/SpFFT/src/fft/fftw_plan_1d.hpp:80
#19 0x00007f45cc662fe1 in __gnu_cxx::new_allocator<spfft::FFTWPlan<double> >::construct<spfft::FFTWPlan<double>, std::complex<double>*, std::complex<double>*, unsigned long long const&, unsigned long long const&, unsigned long long const&, unsigned long long const&, unsigned long long const&, unsigned long long const&, int&> (this=0x7f45a0000d88, __p=0x7f45a0001120) at /usr/include/c++/9/ext/new_allocator.h:147
#20 0x00007f45cc6615b0 in std::allocator_traits<std::allocator<spfft::FFTWPlan<double> > >::construct<spfft::FFTWPlan<double>, std::complex<double>*, std::complex<double>*, unsigned long long const&, unsigned long long const&, unsigned long long const&, unsigned long long const&, unsigned long long const&, unsigned long long const&, int&> (__a=..., __p=0x7f45a0001120) at /usr/include/c++/9/bits/alloc_traits.h:484
#21 0x00007f45cc65fc22 in std::vector<spfft::FFTWPlan<double>, std::allocator<spfft::FFTWPlan<double> > >::emplace_back<std::complex<double>*, std::complex<double>*, unsigned long long const&, unsigned long long const&, unsigned long long const&, unsigned long long const&, unsigned long long const&, unsigned long long const&, int&> (this=0x7f45a0000d88) at /usr/include/c++/9/bits/vector.tcc:115
#22 0x00007f45cc65d561 in spfft::Transform1DPlanesHost<double>::Transform1DPlanesHost (this=0x7f45a0000d80, inputData=..., outputData=..., transposeInputData=false, transposeOutputData=false, sign=1, maxNumThreads=12)
    at /home/matthias/libraries/SpFFT/src/fft/transform_1d_host.hpp:111
#23 0x00007f45cc65b38c in spfft::ExecutionHost<double>::ExecutionHost (this=0x7f45a0000f00, numThreads=12, param=std::shared_ptr<class spfft::Parameters> (use count 4, weak count 0) = {...}, array1=..., array2=...)
    at /home/matthias/libraries/SpFFT/src/execution/execution_host.cpp:77
#24 0x00007f45cc668cba in spfft::TransformInternal<double>::TransformInternal (this=0x7f45a0000ec0, executionUnit=SPFFT_PU_HOST, grid=std::shared_ptr<class spfft::GridInternal<double>> (empty) = {...}, param=std::shared_ptr<class spfft::Parameters> (empty) = {...})
    at /home/matthias/libraries/SpFFT/src/spfft/transform_internal.cpp:91
#25 0x00007f45cc66656c in spfft::Transform::Transform (this=0x7f45a0000c00, grid=std::shared_ptr<class spfft::GridInternal<double>> (use count 2, weak count 0) = {...}, processingUnit=SPFFT_PU_HOST, transformType=SPFFT_TRANS_C2C, dimX=2, dimY=2, dimZ=2,
    localZLength=2, numLocalElements=8, indexFormat=SPFFT_INDEX_TRIPLETS, indices=0x55fb923ec040 <indices>) at /home/matthias/libraries/SpFFT/src/spfft/transform.cpp:63
#26 0x00007f45cc66ab35 in spfft::Grid::create_transform (this=0x7f45a0000b60, processingUnit=SPFFT_PU_HOST, transformType=SPFFT_TRANS_C2C, dimX=2, dimY=2, dimZ=2, localZLength=2, numLocalElements=8, indexFormat=SPFFT_INDEX_TRIPLETS, indices=0x55fb923ec040 <indices>)
    at /home/matthias/libraries/SpFFT/src/spfft/grid.cpp:60
#27 0x00007f45cc666a70 in spfft_transform_create (transform=0x7f45cb344e10, grid=0x7f45a0000b60, processingUnit=SPFFT_PU_HOST, transformType=SPFFT_TRANS_C2C, dimX=2, dimY=2, dimZ=2, localZLength=2, numLocalElements=8, indexFormat=SPFFT_INDEX_TRIPLETS,
    indices=0x55fb923ec040 <indices>) at /home/matthias/libraries/SpFFT/src/spfft/transform.cpp:132
#28 0x000055fb923e941f in MAIN__::MAIN__._omp_fn.0 () at test.f90:49
#29 0x00007f45cc31e77e in ?? () from /lib/x86_64-linux-gnu/libgomp.so.1
#30 0x00007f45cbb55609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#31 0x00007f45cc234103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

If I run valgrind --leak-check=full ./a.out on the code I get:

==106275== Memcheck, a memory error detector
==106275== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==106275== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==106275== Command: ./a.out
==106275==
==106275==
==106275== HEAP SUMMARY:
==106275==     in use at exit: 228,008 bytes in 3,672 blocks
==106275==   total heap usage: 8,379 allocs, 4,707 frees, 3,761,160 bytes allocated
==106275==
==106275== 3,520 bytes in 11 blocks are possibly lost in loss record 150 of 164
==106275==    at 0x483DD99: calloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==106275==    by 0x40149CA: allocate_dtv (dl-tls.c:286)
==106275==    by 0x40149CA: _dl_allocate_tls (dl-tls.c:532)
==106275==    by 0x536A322: allocate_stack (allocatestack.c:622)
==106275==    by 0x536A322: pthread_create@@GLIBC_2.2.5 (pthread_create.c:660)
==106275==    by 0x4BA3DDA: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==106275==    by 0x4B9B8E0: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==106275==    by 0x109358: MAIN__ (test.f90:45)
==106275==    by 0x109391: main (test.f90:3)
==106275==
==106275== 8,488 (16 direct, 8,472 indirect) bytes in 1 blocks are definitely lost in loss record 161 of 164
==106275==    at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==106275==    by 0x488EA33: spfft_transform_create (transform.cpp:132)
==106275==    by 0x10941E: MAIN__._omp_fn.0 (test.f90:49)
==106275==    by 0x4B9B8E5: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==106275==    by 0x109358: MAIN__ (test.f90:45)
==106275==    by 0x109391: main (test.f90:3)
==106275==
==106275== 93,192 (176 direct, 93,016 indirect) bytes in 11 blocks are definitely lost in loss record 164 of 164
==106275==    at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==106275==    by 0x488EA33: spfft_transform_create (transform.cpp:132)
==106275==    by 0x10941E: MAIN__._omp_fn.0 (test.f90:49)
==106275==    by 0x4BA377D: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==106275==    by 0x5369608: start_thread (pthread_create.c:477)
==106275==    by 0x4CED102: clone (clone.S:95)
==106275==
==106275== LEAK SUMMARY:
==106275==    definitely lost: 192 bytes in 12 blocks
==106275==    indirectly lost: 101,488 bytes in 2,281 blocks
==106275==      possibly lost: 3,520 bytes in 11 blocks
==106275==    still reachable: 122,808 bytes in 1,368 blocks
==106275==         suppressed: 0 bytes in 0 blocks
==106275== Reachable blocks (those to which a pointer was found) are not shown.
==106275== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==106275==
==106275== For lists of detected and suppressed errors, rerun with: -s
==106275== ERROR SUMMARY: 3 errors from 3 contexts (suppressed: 0 from 0)

Thank you for reporting this.
Creation of a grid and transform is currently not thread-safe, for two reasons: FFTW plan creation is not thread-safe, and MPI communicators are duplicated for each grid, which requires a fixed order of execution. I will add some notes about thread-safety to the documentation to clarify.
Therefore, I suspect adding a critical section !$omp critical ... !$omp end critical around the creation of the grid and transform handles will solve the issue in your example (since no MPI is used here).
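
A minimal sketch of that suggestion, adapted from the example in the question. It assumes a host-only build without MPI, sets maxNumThreads to 1, and uses a named critical section (the name spfft_handles is arbitrary). Serializing spfft_transform_destroy as well is an extra precaution added here, not something confirmed in this thread.

```fortran
program critical_creation_sketch
    use iso_c_binding
    use spfft
    implicit none
    integer(c_int), parameter :: dimX = 2, dimY = 2, dimZ = 2
    integer(c_int), parameter :: numElements = dimX * dimY * dimZ
    integer(c_int), parameter :: processingUnit = 1  ! host, as in the example above
    integer(c_int), parameter :: maxNumThreads = 1   ! one thread per grid
    integer, parameter :: numBatches = 1000
    type(c_ptr) :: grid, transform
    integer(c_int) :: errorCode
    integer(c_int) :: indices(numElements * 3)
    integer :: i, j, k, counter
    complex(C_DOUBLE_COMPLEX) :: frequencyElements(numElements, numBatches)

    ! fill index triplets and input values as in the original example
    counter = 0
    do k = 1, dimZ
        do j = 1, dimY
            do i = 1, dimX
                frequencyElements(counter + 1, :) = cmplx(counter, -counter, kind=C_DOUBLE_COMPLEX)
                indices(counter * 3 + 1) = i - 1
                indices(counter * 3 + 2) = j - 1
                indices(counter * 3 + 3) = k - 1
                counter = counter + 1
            end do
        end do
    end do

    !$omp parallel default(none) &
    !$omp private(i, errorCode, grid, transform) &
    !$omp shared(indices, frequencyElements)

    ! serialize handle creation: only one thread at a time runs this block
    !$omp critical (spfft_handles)
    errorCode = spfft_grid_create(grid, dimX, dimY, dimZ, dimX * dimY, &
        processingUnit, maxNumThreads)
    if (errorCode /= SPFFT_SUCCESS) error stop
    errorCode = spfft_transform_create(transform, grid, processingUnit, 0, &
        dimX, dimY, dimZ, dimZ, numElements, SPFFT_INDEX_TRIPLETS, indices)
    if (errorCode /= SPFFT_SUCCESS) error stop
    errorCode = spfft_grid_destroy(grid)
    if (errorCode /= SPFFT_SUCCESS) error stop
    !$omp end critical (spfft_handles)

    ! each thread executes its own transform on its share of the batches
    !$omp do
    do i = 1, numBatches
        errorCode = spfft_transform_backward(transform, frequencyElements(:, i), processingUnit)
        if (errorCode /= SPFFT_SUCCESS) error stop
    end do
    !$omp end do

    ! assumption: destruction is serialized too, to be on the safe side
    !$omp critical (spfft_handles)
    errorCode = spfft_transform_destroy(transform)
    if (errorCode /= SPFFT_SUCCESS) error stop
    !$omp end critical (spfft_handles)
    !$omp end parallel
end program critical_creation_sketch
```

Destroying the transform at the end also addresses the "definitely lost" blocks that valgrind attributes to spfft_transform_create in the report above, since the original example never releases its transform handles.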

Some general advice on using SpFFT with multiple threads:

  • Create handles before entering a parallel section and store them in an array. This gives a better chance of reuse and, if you plan to use MPI, automatically ensures a fixed order of grid handle creation (so that internally duplicated MPI communicators are matched correctly between ranks).
  • Set the maximum number of threads per grid to 1 to avoid oversubscribing the CPU.
  • If using MPI, ensure thread support is set to MPI_THREAD_MULTIPLE.
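
The first two points could look roughly like the following sketch, again without MPI. It creates one grid and one transform per OpenMP thread serially, keeps both in arrays, and lets each thread pick its own transform inside the parallel region. The array names grids and transforms and the program name are illustrative. That concurrent execution on distinct handles is safe is inferred from the advice above rather than stated by the library documentation.

```fortran
program handle_array_sketch
    use iso_c_binding
    use omp_lib
    use spfft
    implicit none
    integer(c_int), parameter :: dimX = 2, dimY = 2, dimZ = 2
    integer(c_int), parameter :: numElements = dimX * dimY * dimZ
    integer(c_int), parameter :: processingUnit = 1  ! host
    integer, parameter :: numBatches = 1000
    type(c_ptr), allocatable :: grids(:), transforms(:)
    integer(c_int) :: errorCode
    integer(c_int) :: indices(numElements * 3)
    integer :: i, j, k, t, counter, numThreads
    complex(C_DOUBLE_COMPLEX) :: frequencyElements(numElements, numBatches)

    ! fill index triplets and input values as in the original example
    counter = 0
    do k = 1, dimZ
        do j = 1, dimY
            do i = 1, dimX
                frequencyElements(counter + 1, :) = cmplx(counter, -counter, kind=C_DOUBLE_COMPLEX)
                indices(counter * 3 + 1) = i - 1
                indices(counter * 3 + 2) = j - 1
                indices(counter * 3 + 3) = k - 1
                counter = counter + 1
            end do
        end do
    end do

    ! serial setup: one grid and one transform per thread, in a fixed order
    numThreads = omp_get_max_threads()
    allocate(grids(numThreads), transforms(numThreads))
    do t = 1, numThreads
        ! last argument is maxNumThreads = 1 to avoid oversubscription
        errorCode = spfft_grid_create(grids(t), dimX, dimY, dimZ, dimX * dimY, &
            processingUnit, 1)
        if (errorCode /= SPFFT_SUCCESS) error stop
        errorCode = spfft_transform_create(transforms(t), grids(t), processingUnit, 0, &
            dimX, dimY, dimZ, dimZ, numElements, SPFFT_INDEX_TRIPLETS, indices)
        if (errorCode /= SPFFT_SUCCESS) error stop
    end do

    !$omp parallel default(none) &
    !$omp private(i, t, errorCode) &
    !$omp shared(transforms, frequencyElements)
    t = omp_get_thread_num() + 1  ! index of this thread's own handle
    !$omp do
    do i = 1, numBatches
        errorCode = spfft_transform_backward(transforms(t), frequencyElements(:, i), processingUnit)
        if (errorCode /= SPFFT_SUCCESS) error stop
    end do
    !$omp end do
    !$omp end parallel

    ! serial cleanup after the parallel region
    do t = 1, numThreads
        errorCode = spfft_transform_destroy(transforms(t))
        if (errorCode /= SPFFT_SUCCESS) error stop
        errorCode = spfft_grid_destroy(grids(t))
        if (errorCode /= SPFFT_SUCCESS) error stop
    end do
end program handle_array_sketch
```

Because all handles are created outside the parallel region, no critical section is needed, and the handles can be reused across several parallel regions.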

Thank you for the quick response. I just tested it in my example code and it seemed to work.

Just to clarify:

> Create handles before entering a parallel section and store them in an array.

You mean I should create an array of handles, so that each OpenMP thread has its own handle?

Glad to hear it works.

> You mean I should create an array of handles, so that each OpenMP thread has its own handle?

Yes, you would have to create and store one unique grid and transform handle per thread. The important part is to create a new grid for each thread and not to create multiple transform handles from the same grid handle.
If you don't plan on reusing grid handles for creating transforms of different sizes, it is enough to store only the transform handles and destroy each grid immediately. The resources associated with a grid (memory, communicators) are released only after all transform handles on the grid have also been destroyed (through internal reference counting).
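
As a rough illustration of the second variant, the hypothetical helper below (the module and subroutine names are made up for this sketch) creates a fresh grid, builds one transform on it, and destroys the grid handle right away, so the caller only has to keep the transform handle. It must still be called serially, for example before the parallel region. The three index triplets and the count of four handles are arbitrary illustrative values.

```fortran
module transform_helper
    use iso_c_binding
    use spfft
    implicit none
contains
    ! Hypothetical helper: one new grid per transform, grid handle released at once.
    ! Not thread-safe; call it outside of parallel regions.
    subroutine create_standalone_transform(transform, dimX, dimY, dimZ, indices)
        type(c_ptr), intent(out) :: transform
        integer(c_int), intent(in) :: dimX, dimY, dimZ
        integer(c_int), intent(in) :: indices(:)  ! index triplets (x, y, z)
        integer(c_int), parameter :: processingUnit = 1  ! host
        type(c_ptr) :: grid
        integer(c_int) :: errorCode, numElements

        numElements = int(size(indices) / 3, c_int)
        ! last argument is maxNumThreads = 1
        errorCode = spfft_grid_create(grid, dimX, dimY, dimZ, dimX * dimY, &
            processingUnit, 1)
        if (errorCode /= SPFFT_SUCCESS) error stop
        errorCode = spfft_transform_create(transform, grid, processingUnit, 0, &
            dimX, dimY, dimZ, dimZ, numElements, SPFFT_INDEX_TRIPLETS, indices)
        if (errorCode /= SPFFT_SUCCESS) error stop
        ! the grid's resources stay alive until the transform is destroyed
        errorCode = spfft_grid_destroy(grid)
        if (errorCode /= SPFFT_SUCCESS) error stop
    end subroutine create_standalone_transform
end module transform_helper

program standalone_transform_sketch
    use iso_c_binding
    use spfft
    use transform_helper
    implicit none
    integer, parameter :: numHandles = 4  ! e.g. one per OpenMP thread
    integer(c_int), parameter :: n = 2
    ! three sparse frequency indices inside a 2 x 2 x 2 grid
    integer(c_int) :: indices(9) = [0, 0, 0, 1, 0, 0, 0, 1, 0]
    type(c_ptr) :: transforms(numHandles)
    integer(c_int) :: errorCode
    integer :: t

    do t = 1, numHandles
        call create_standalone_transform(transforms(t), n, n, n, indices)
    end do

    ! ... use transforms(t) from thread t inside a parallel region ...

    do t = 1, numHandles
        errorCode = spfft_transform_destroy(transforms(t))
        if (errorCode /= SPFFT_SUCCESS) error stop
    end do
end program standalone_transform_sketch
```

With this layout the only cleanup obligation left is one spfft_transform_destroy call per stored handle.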