mgpu::mergesort: illegal memory access for >= 1500000000 keys
maltenbergert opened this issue
Context
We're benchmarking the performance of mgpu::mergesort (and other GPU sorting algorithms). In a nutshell, we generate the data on the host, copy it onto the device, initialize a stream, and measure the pure sort duration.
Example
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <src/moderngpu/kernel_mergesort.hxx>
#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <string>

int main(int argc, char* argv[]) {
    if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " <num_elements>\n";
        return 1;
    }
    const size_t num_elements = std::stoull(argv[1]);

    // Generate random keys on the host and copy them to the device.
    thrust::host_vector<int> host_elements(num_elements);
    std::generate(host_elements.begin(), host_elements.end(), std::rand);
    thrust::device_vector<int> elements = host_elements;

    cudaSetDevice(0);
    cudaDeviceSynchronize();

    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    // Timed region: context creation, the sort itself, and the stream sync.
    auto t1 = std::chrono::high_resolution_clock::now();
    mgpu::standard_context_t context(false, stream);
    mgpu::mergesort(static_cast<int*>(thrust::raw_pointer_cast(elements.data())), num_elements, mgpu::less_t<int>(), context);
    cudaStreamSynchronize(stream);
    std::chrono::duration<double> t2 = std::chrono::high_resolution_clock::now() - t1;

    // Output format: <num_elements>,<seconds>
    std::cout << num_elements << "," << std::fixed << std::setprecision(9) << t2.count() << "\n";

    // Verify the result on the device.
    if (!thrust::is_sorted(elements.begin(), elements.end())) {
        std::cout << "Error: Invalid sort order.\n";
    }
    return 0;
}
We compile the example with
nvcc -O3 -std=c++11 --expt-extended-lambda -o mgpu_sort mgpu_sort.cu
and run it with
./mgpu_sort <num_elements>
on two different platforms:
- IBM AC922: 4x NVIDIA Tesla V100 SXM2 32 GB, CUDA 11.2
- NVIDIA DGX A100: 8x NVIDIA A100 SXM4 40 GB, CUDA 11.0
Error
On both platforms, for num_elements >= 1500000000, we get the following error:
terminate called after throwing an instance of 'mgpu::cuda_exception_t'
what(): an illegal memory access was encountered
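For scale, the arithmetic below is a rough sketch (not part of the original benchmark) of how much device memory the keys themselves occupy at the failing size. The helper names keys_gib and fits_on_device are hypothetical, and the device capacities are assumed to be the nominal 32 GB and 40 GB listed above.

```cpp
#include <cassert>
#include <cstdint>

// Bytes per GiB (2^30).
constexpr double kGiB = 1024.0 * 1024.0 * 1024.0;

// Size in GiB of `count` keys of `key_bytes` bytes each.
// The product is computed in 64 bits so the byte count does not wrap.
// (Hypothetical helper, for illustration only.)
double keys_gib(std::uint64_t count, std::uint64_t key_bytes) {
    assert(key_bytes > 0);
    return static_cast<double>(count * key_bytes) / kGiB;
}

// True if a buffer of `gib` GiB is smaller than a device with
// `device_gb` decimal gigabytes of nominal memory.
// (Hypothetical helper; ignores memory used by the driver and runtime.)
bool fits_on_device(double gib, double device_gb) {
    return gib * kGiB < device_gb * 1e9;
}
```

With 1500000000 four-byte keys, keys_gib(1500000000, 4) is about 5.59 GiB, so the input alone, or the input plus one equally sized scratch buffer, is well below either GPU's nominal capacity. Whether mgpu::mergesort allocates more than that is not established here.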
Can somebody help?
Hi @maltenbergert,
I think the space complexity of the implemented mergesort is O(n log n). If so, sorting 1500000000 4-byte keys may need 1500000000 * log(1500000000) * 4 / 1024 / 1024 / 1024 > 100 GiB of memory, more than the 32 GB or 40 GB available on either GPU.
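The estimate above can be checked numerically. This is only a sketch of the reply's arithmetic, taking its O(n log n) assumption at face value (the thread does not confirm it). The function name nlogn_gib is hypothetical, and since the reply does not say which logarithm it means, the caller picks base 2 or the natural log.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Bytes per GiB (2^30).
constexpr double kGiB = 1024.0 * 1024.0 * 1024.0;

// Evaluates n * log(n) * key_bytes and converts the result to GiB.
// `base2` selects log2; otherwise the natural logarithm is used.
// (Hypothetical helper that mirrors the reply's formula; it is not a
// measurement of what mgpu::mergesort actually allocates.)
double nlogn_gib(std::uint64_t n, std::uint64_t key_bytes, bool base2) {
    assert(n > 0);
    const double x = static_cast<double>(n);
    const double lg = base2 ? std::log2(x) : std::log(x);
    return x * lg * static_cast<double>(key_bytes) / kGiB;
}
```

For n = 1500000000 and 4-byte keys this gives roughly 170 GiB with the base-2 logarithm and roughly 118 GiB with the natural logarithm, so the "> 100G" figure holds under either reading of the formula.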