celerity / ndzip

A High-Throughput Parallel Lossless Compressor for Scientific Data

cuda-memcheck: initcheck and racecheck failed

annymao opened this issue

Hi,
I ran cuda-memcheck on this project. It reported initcheck errors while compressing data and racecheck warnings while decompressing data. I tested on a Tesla V100-SXM2 (32 GB) and an A100-SXM4 (40 GB) with CUDA 11.2. I'm wondering whether these are false alarms. Below are examples of the errors and warnings, along with the changes that made them go away.

The initcheck error goes away after adding CHECKED_CUDA_CALL(cudaMemset, _memory, 0, size * sizeof(T)); to cuda_bits.cuh (lines 196 and 216).
Here is an example of the initcheck error:

========= Uninitialized __global__ memory read of size 4
=========     at 0x000000e0 in void ndzip::detail::gpu_cuda::hierarchical_inclusive_scan_reduce<unsigned int, ndzip::detail::gpu_cuda::plus<unsigned int>>(unsigned int*, unsigned int, unsigned int)
=========     by thread (64,0,0) in block (0,0,0)
=========     Address 0x7fffcfe20d00
=========     Device Frame:void ndzip::detail::gpu_cuda::hierarchical_inclusive_scan_reduce<unsigned int, ndzip::detail::gpu_cuda::plus<unsigned int>>(unsigned int*, unsigned int, unsigned int) (void ndzip::detail::gpu_cuda::hierarchical_inclusive_scan_reduce<unsigned int, ndzip::detail::gpu_cuda::plus<unsigned int>>(unsigned int*, unsigned int, unsigned int) : 0xe0)
=========     Saved host backtrace up to driver entry point 
=========     Host Frame:/lib64/libcuda.so.1 (cuLaunchKernel + 0x2b8) [0x222dc8]
=========     Host Frame:${prefix_path}/ndzip/build/libndzip-cuda.so [0x277bb]
=========     Host Frame:${prefix_path}/ndzip/build/libndzip-cuda.so [0x74270]
=========     Host Frame:${prefix_path}/ndzip/build/libndzip-cuda.so (_ZN5ndzip6detail8gpu_cuda27hierarchical_inclusive_scanIjNS1_4plusIjEEEEvPT_RSt6vectorINS1_11cuda_bufferIS5_EESaIS9_EEjT0_P11CUstream_st + 0x1ed) [0xc91d]
=========     Host Frame:${prefix_path}/ndzip/build/libndzip-cuda.so (_ZN5ndzip4cuda14compress_asyncIfLj1EEEvNS_5sliceIKT_XT0_EEEPvPmRNS0_25compressor_scratch_memoryIS3_XT0_EEEP11CUstream_st + 0x14b) [0xcc6b]
=========     Host Frame:${prefix_path}/ndzip/build/libndzip-cuda.so (_ZNK5ndzip12cuda_encoderIfLj1EE8compressERKNS_5sliceIKfLj1EEEPvPNSt6chrono8durationImSt5ratioILl1ELl1000000000EEEE + 0x3ee) [0xefae]
=========     Host Frame:./build/compress [0x11741]
=========     Host Frame:./build/compress [0x12451]
=========     Host Frame:./build/compress [0xb1b6]
=========     Host Frame:./build/compress [0xab27]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x22555]
=========     Host Frame:./build/compress [0xaec3]
=========

Here is the code change:

--- ndzip/src/ndzip/cuda_bits.cuh	
+++ ndzip/src/ndzip/cuda_bits.cuh	
@@ -193,6 +193,7 @@
 
     explicit cuda_buffer(index_type size) : _size(size) {
         CHECKED_CUDA_CALL(cudaMalloc, &_memory, size * sizeof(T));
+        CHECKED_CUDA_CALL(cudaMemset, _memory, 0, size * sizeof(T)); // <--- ADD
     }
 
     cuda_buffer(cuda_buffer &&other) noexcept {
@@ -212,6 +213,7 @@
     void allocate(index_type size) {
         reset();
         CHECKED_CUDA_CALL(cudaMalloc, &_memory, size * sizeof(T));
+        CHECKED_CUDA_CALL(cudaMemset, _memory, 0, size * sizeof(T)); // <--- ADD
         _size = size;
     }
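
To show the effect of this change in isolation, here is a minimal standalone sketch. It is not ndzip code: the kernel sum_buffer and the buffer size are hypothetical, and it only assumes that a kernel reads a cudaMalloc'd buffer before anything has written to it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (not from ndzip): reads every element of a device buffer.
__global__ void sum_buffer(const unsigned *buf, unsigned n, unsigned *result) {
    unsigned acc = 0;
    for (unsigned i = 0; i < n; ++i) acc += buf[i];
    *result = acc;
}

int main() {
    const unsigned n = 256;
    unsigned *buf = nullptr, *result = nullptr;
    cudaMalloc(&buf, n * sizeof(unsigned));
    cudaMalloc(&result, sizeof(unsigned));

    // cudaMalloc does not initialize memory. Zeroing the buffer here is the
    // same kind of change as the cudaMemset added to cuda_buffer above.
    cudaMemset(buf, 0, n * sizeof(unsigned));

    sum_buffer<<<1, 1>>>(buf, n, result);

    unsigned host_result = 0;
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(&host_result, result, sizeof(unsigned),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("sum = %u\n", host_result);

    cudaFree(buf);
    cudaFree(result);
    return 0;
}
```

Running the binary under cuda-memcheck --tool initcheck with the cudaMemset line removed would be expected to produce a report similar to the one above. Note that zeroing the whole allocation silences the report without showing whether the values read were ever used.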
 

The racecheck warnings go away after adding __syncwarp() in cuda_encoder.inl after line 355. I think the warnings occur because some threads might reach line 323 in the second iteration while other threads are still at line 345 in the first iteration, after the __syncwarp() on line 326. Is that right?
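
To make the suspected hazard concrete, here is a minimal standalone sketch of the pattern I have in mind. It is not ndzip code: the kernel rotate_rows and its shared staging array are hypothetical, and it only assumes one warp that reuses a shared-memory array across loop iterations.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr unsigned warp_size = 32;

// Hypothetical kernel (not from ndzip): a single warp stages one row at a time
// in shared memory, and each lane then reads the slot written by its neighbor.
__global__ void rotate_rows(const unsigned *in, unsigned *out, unsigned rows) {
    __shared__ unsigned stage[warp_size];
    const unsigned lane = threadIdx.x;
    for (unsigned r = 0; r < rows; ++r) {
        stage[lane] = in[r * warp_size + lane];  // each lane writes its own slot
        __syncwarp();  // all writes must be visible before any lane reads
        out[r * warp_size + lane] = stage[(lane + 1) % warp_size];
        __syncwarp();  // without this, a fast lane could overwrite its slot for
                       // row r + 1 while a slower lane still reads it for row r
    }
}

int main() {
    const unsigned rows = 4, n = rows * warp_size;
    unsigned *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in, n * sizeof(unsigned));
    cudaMallocManaged(&out, n * sizeof(unsigned));
    for (unsigned i = 0; i < n; ++i) { in[i] = i; out[i] = 0; }

    rotate_rows<<<1, warp_size>>>(in, out, rows);
    if (cudaDeviceSynchronize() != cudaSuccess) {
        std::fprintf(stderr, "kernel failed\n");
        return 1;
    }

    // Each output element should hold the value of the next lane in its row.
    unsigned mismatches = 0;
    for (unsigned r = 0; r < rows; ++r)
        for (unsigned lane = 0; lane < warp_size; ++lane)
            if (out[r * warp_size + lane] != r * warp_size + (lane + 1) % warp_size)
                ++mismatches;
    std::printf("mismatches: %u\n", mismatches);

    cudaFree(in);
    cudaFree(out);
    return mismatches != 0;
}
```

The first __syncwarp() orders the writes before the reads within one iteration. The second one orders the reads of iteration r before the writes of iteration r + 1, which is the role I believe the added __syncwarp() plays in cuda_encoder.inl. If the second barrier is removed, I would expect cuda-memcheck --tool racecheck to flag this sketch in the same way.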

Here is an example of the racecheck warning:

========= WARN: Race reported between Read access at 0x00000f20 in void ndzip::detail::gpu_cuda::decompress_block<ndzip::detail::profile<float, unsigned int=1>>(floatbits_type const *, ndzip::slice<ndzip::detail::gpu_cuda::decompress_block<ndzip::detail::profile<float, unsigned int=1>::data_type>, __scope__(dimensions)>)
=========     and Write access at 0x00000b30 in void ndzip::detail::gpu_cuda::decompress_block<ndzip::detail::profile<float, unsigned int=1>>(floatbits_type const *, ndzip::slice<ndzip::detail::gpu_cuda::decompress_block<ndzip::detail::profile<float, unsigned int=1>::data_type>, __scope__(dimensions)>) [21864 hazards]
=========

Here is the code change:

--- ndzip/src/ndzip/cuda_encoder.inl
+++ ndzip/src/ndzip/cuda_encoder.inl
@@ -353,6 +353,7 @@
                 __builtin_memcpy(&row_bits, row, sizeof row_bits);
                 hc.store(item, row_bits);
             }
+            __syncwarp(); // <--- ADD
         } else {
             // TODO duplication of the `item` calculation above. The term can be simplified!
             for (index_type w = 0; w < warps_per_col_chunk; ++w) {

Thanks!