moderngpu / moderngpu

Patterns and behaviors for GPU computing

Home Page: http://moderngpu.github.io/moderngpu


mgpu::mergesort: illegal memory access for >= 1500000000 keys

maltenbergert opened this issue.

Context

We're benchmarking the performance of mgpu::mergesort (and other GPU sorting algorithms). In a nutshell, we generate the data on the host, copy it onto the device, initialize a stream, and measure the pure sort duration.

Example

```cpp
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>  // thrust::is_sorted
#include <src/moderngpu/kernel_mergesort.hxx>

#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <algorithm>
#include <chrono>
#include <string>

int main(int argc, char* argv[]) {
  if (argc < 2) {
    std::cerr << "usage: " << argv[0] << " <num_elements>\n";
    return 1;
  }
  const size_t num_elements = std::stoull(argv[1]);

  // Select the GPU before allocating any device memory.
  cudaSetDevice(0);

  // Generate random keys on the host and copy them to the device.
  thrust::host_vector<int> host_elements(num_elements);
  std::generate(host_elements.begin(), host_elements.end(), rand);
  thrust::device_vector<int> elements = host_elements;
  cudaDeviceSynchronize();

  cudaStream_t stream;
  cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
  // Build the context outside the timed region so only the sort is measured.
  mgpu::standard_context_t context(false, stream);

  auto t1 = std::chrono::high_resolution_clock::now();
  // mgpu::mergesort takes its element count as an int.
  mgpu::mergesort(thrust::raw_pointer_cast(elements.data()),
                  static_cast<int>(num_elements), mgpu::less_t<int>(), context);
  cudaStreamSynchronize(stream);
  std::chrono::duration<double> t2 = std::chrono::high_resolution_clock::now() - t1;

  std::cout << num_elements << "," << std::fixed << std::setprecision(9) << t2.count() << "\n";

  if (!thrust::is_sorted(elements.begin(), elements.end())) {
    std::cout << "Error: Invalid sort order.\n";
  }

  cudaStreamDestroy(stream);
  return 0;
}
```
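In the moderngpu 2.x headers, the keys-only `mgpu::mergesort` overload appears to declare its element count as an `int` (treat this as an assumption and confirm it in `kernel_mergesort.hxx`), so the `size_t` read from the command line is narrowed at the call. The sketch below uses a hypothetical helper, `parse_count`, which rejects counts that would not survive that conversion. Note that 1500000000 is still below `INT_MAX` (2147483647), so this guard does not by itself explain the failure reported here; it only prevents a silent wraparound for larger inputs.

```cpp
#include <cctype>
#include <cerrno>
#include <climits>
#include <cstdlib>

// Hypothetical helper: parse a non-negative decimal element count and accept it
// only if it fits in the int count parameter assumed for mgpu::mergesort.
// Returns true and writes the value to *out on success.
bool parse_count(const char* text, int* out) {
  // Require a leading digit: strtoull would otherwise accept "-1" and wrap it.
  if (text == nullptr || !std::isdigit(static_cast<unsigned char>(text[0]))) {
    return false;
  }
  errno = 0;
  char* end = nullptr;
  unsigned long long value = std::strtoull(text, &end, 10);
  if (errno == ERANGE || *end != '\0') {
    return false;  // overflowed unsigned long long, or trailing non-digits
  }
  if (value > static_cast<unsigned long long>(INT_MAX)) {
    return false;  // would not fit in an int count
  }
  *out = static_cast<int>(value);
  return true;
}
```

In `main`, the call would replace `std::stoull(argv[1])` with something like `int n; if (argc < 2 || !parse_count(argv[1], &n)) return 1;`, after which `n` can be passed to `mgpu::mergesort` without any implicit narrowing.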

We compile the example with nvcc -O3 -std=c++11 --expt-extended-lambda -o mgpu_sort mgpu_sort.cu and run it with ./mgpu_sort <num_elements> on two different platforms.

  • IBM AC922: 4x NVIDIA Tesla V100 SXM2 32 GB, CUDA 11.2
  • NVIDIA DGX A100: 8x NVIDIA A100 SXM4 40 GB, CUDA 11.0

Error

On both platforms, for num_elements >= 1500000000, we get the following error:

terminate called after throwing an instance of 'mgpu::cuda_exception_t'
  what():  an illegal memory access was encountered

Can somebody help?

Hi @maltenbergert,

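I think the space complexity of the implemented mergesort is O(n log n). If so, sorting 1500000000 4-byte int keys may need about 1500000000 * log2(1500000000) * 4 / 1024 / 1024 / 1024 ≈ 170 GiB of memory, well over 100 GiB and far more than the 32 GB (V100) or 40 GB (A100) available on a single GPU.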
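The arithmetic behind this estimate can be reproduced with a short sketch. It assumes a base-2 logarithm and 4-byte `int` keys, and reports GiB (1024^3 bytes). For contrast only, it also computes a hypothetical linear-space layout that holds the input plus one n-element scratch buffer, a common ping-pong scheme for mergesort; that second figure is an assumption for comparison, not a statement about what moderngpu actually allocates.

```cpp
#include <cmath>

// Bytes per GiB (1024^3).
const double kBytesPerGiB = 1024.0 * 1024.0 * 1024.0;

// The reply's estimate: n * log2(n) * key_bytes, in GiB.
double nlogn_estimate_gib(double n, double key_bytes) {
  return n * std::log2(n) * key_bytes / kBytesPerGiB;
}

// Hypothetical linear-space comparison: input buffer plus one scratch
// buffer of the same size, in GiB.
double linear_estimate_gib(double n, double key_bytes) {
  return 2.0 * n * key_bytes / kBytesPerGiB;
}
```

For n = 1500000000 and 4-byte keys, the first function gives roughly 170 GiB, which exceeds the per-GPU memory of both platforms listed in the issue, while the second gives roughly 11 GiB, which would fit on either.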