NVIDIA / cub

[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Unresolved extern function 'cudaLaunchDevice' error while using NVCC 11.x and cub 2.10 with -G

lilohuang opened this issue · comments

Hi @senior-zero @allisonvacanti ,

I hope you are doing well. Today, I encountered a CUDA compilation failure, specifically an "Unresolved extern function 'cudaLaunchDevice'," while using CUDA 11.6.2 SDK in conjunction with thrust/cub 2.1.0 or trunk versions. This issue arose after enabling the device code debugging information (-G). It is worth noting that there are no problems with the default thrust/cub version included in CUDA SDK 11.6.2. Furthermore, I have observed that this particular compiler error does not exist in the CUDA 12.x compiler.

AFAIK, thrust/cub 2.1.0 should be compatible with CUDA 11.0 and above (i.e., https://github.com/NVIDIA/cub#supported-compilers). I have also uploaded a minimal bug reproducer to https://cuda.godbolt.org/z/7j39EToxh. Could you please take a look at it and provide guidance on how to resolve the bug using CUDA SDK 11.6.2 and thrust/cub 2.1.0?

#include <thrust/device_vector.h>
#include <cub/cub.cuh>

static_assert(THRUST_VERSION >= 200100, "thrust version must be >= 2.10");

int main() {
   constexpr const auto num_items = 10;
   thrust::device_vector<int> d_in(num_items, 1);
   thrust::device_vector<int> d_out(num_items);
   size_t temp_storage_bytes = 0;
   cub::DeviceScan::ExclusiveSum(
      nullptr,
      temp_storage_bytes,
      thrust::raw_pointer_cast(d_in.data()),
      thrust::raw_pointer_cast(d_out.data()),
      num_items);
   thrust::device_vector<char> d_temp_storage(temp_storage_bytes);
   cub::DeviceScan::ExclusiveSum(
      thrust::raw_pointer_cast(d_temp_storage.data()),
      temp_storage_bytes,
      thrust::raw_pointer_cast(d_in.data()),
      thrust::raw_pointer_cast(d_out.data()),
      num_items);
   for (auto value : d_out) {
      std::cout << value << std::endl;
   }

   return 0;
}

Hello @lilohuang and thank you for reporting the issue! A bit simpler reproducer with the same behavior:

#include <nv/target>

template <class T> __device__ int device(T);

template <class T>
__device__ int doit_device(T val) {
  return device(val);
}

template <class T>
__host__ int doit_host(T val) {
  return 0;
}

template <class T>
int doit(T val) 
{
  NV_IF_TARGET(NV_IS_HOST,
                (return doit_host(val);),
                (return doit_device(val);));
}

int main() {
  return doit(42);
}

It seems that we have to guard the doit_device in Thrust to be something along the following lines:

template<class K, class... Args>
cudaError_t __device__
doit_device(K k, Args const&... args) const
{
  #ifdef THRUST_RDC_ENABLED
  const size_t size = argument_pack_size(0,args...);
  void *param_buffer = cudaGetParameterBuffer(64,size);
  fill_arguments((char*)param_buffer, 0, args...);
  return launch_device(k, param_buffer);
  #endif
}

Awesome! Thank you @senior-zero. It worked! 💯 Please close this issue after merging NVIDIA/thrust#1939