NVIDIA / MatX

An efficient C++17 GPU numerical computing library with Python-like syntax

Home Page: https://nvidia.github.io/MatX

[QST] Cannot call device executor using host compiler

turbotage opened this issue · comments

I have a small project in which I try to use MatX. Right now I basically just create

```cpp
auto coords = matx::make_tensor<float, 1>({nupts});
auto randCoordsOp = matx::random<hasty::cu_f32>(coords.Shape(), matx::UNIFORM);
(coords[0] = -3.141592f + 2*3.141592f*randCoordsOp).run();
```

At `.run()` I hit the following assert:

```cpp
MATX_ASSERT_STR(false, matxInvalidParameter, "Cannot call device executor using host compiler");
```

I am a bit confused by this. `__CUDACC__` must not be defined, then. What am I doing wrong? I see that you don't recommend clang (which I am using); could it be due to this? I also see there is a pull request for clang support: #485

Hi, are you trying to run this code on the CPU or the GPU? Can you paste the compilation line?

Clang support for the host compiler should work. There's a separate request to have host device compilation working, but that isn't ready yet and isn't typically what people want.

I see, I'm building with CMake:

```cmake
cmake_minimum_required(VERSION 3.28)

project("HastyCuCompute" VERSION 0.1)

include(cmake/CPM.cmake)

set(CMAKE_CXX_STANDARD 23)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

set(CMAKE_EXPORT_COMPILE_COMMANDS ON)

add_executable(HastyCuCompute "lib/main.cpp")

set_property(TARGET HastyCuCompute PROPERTY CXX_STANDARD 23)
set_property(TARGET HastyCuCompute PROPERTY CMAKE_CXX_STANDARD_REQUIRED ON)

target_sources(HastyCuCompute
    PUBLIC FILE_SET CXX_MODULES FILES
    "lib/tensor.ixx"
    "lib/nufft.ixx"
)

set(CMAKE_VERBOSE_MAKEFILE ON)

find_package(CUDAToolkit REQUIRED)

find_package(matx CONFIG REQUIRED)
target_link_libraries(HastyCuCompute PRIVATE matx::matx)

target_link_libraries(HastyCuCompute PRIVATE CUDA::toolkit)
target_link_libraries(HastyCuCompute PRIVATE CUDA::nvrtc)
target_link_libraries(HastyCuCompute PRIVATE CUDA::cudart)

find_library(finufft REQUIRED
    NAMES
    libfinufft finufft
    HINTS
    "${finufft_ROOT}/lib"
)
find_library(cufinufft REQUIRED
    NAMES
    libcufinufft cufinufft
    HINTS
    "${finufft_ROOT}/lib"
)

message(${finufft})
message(${cufinufft})
#message(${finufft_INCLUDE_DIR})

target_link_libraries(HastyCuCompute PRIVATE ${finufft})
target_link_libraries(HastyCuCompute PRIVATE ${cufinufft})
target_include_directories(HastyCuCompute PRIVATE "${finufft_ROOT}/include")

set_property(TARGET HastyCuCompute PROPERTY CUDA_SEPARABLE_COMPILATION ON)
```

My presets are:

```json
"cacheVariables": {
    "CMAKE_INSTALL_PREFIX": "${sourceDir}/out/install/${presetName}",
    "CMAKE_C_COMPILER": "/usr/bin/clang-18",
    "CMAKE_CXX_COMPILER": "/usr/bin/clang++-18",
    "CMAKE_CUDA_ARCHITECTURES": "89",
    "CUDAToolkit_ROOT": "$env{CUDA_ROOT}/bin/",
    "finufft_ROOT": "$env{MY_INSTALL_PATH}/finufft/",
    "CMAKE_PREFIX_PATH": {
        "type": "FILEPATH",
        "value": "$env{MY_INSTALL_PATH}/MatX/lib/cmake"
    },
    "CMAKE_TOOLCHAIN_FILE": {
        "type": "FILEPATH",
        "value": "$env{VCPKG_ROOT}/scripts/buildsystems/vcpkg.cmake"
    }
}
```

Hi @turbotage, in general that error means your code was trying to call a device (GPU) executor with a host compiler. It's equivalent to trying to use the CUDA `<<<>>>` syntax with the host compiler, which is not supported. It's also possible there's a bug if you're not on the latest version, since we recently fixed something related to that.
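If the intent is to run on the GPU, the likely cause is that `lib/main.cpp` is compiled as plain host C++, so `__CUDACC__` is never defined. One possible fix is sketched below, assuming the CMakeLists.txt quoted above; note also that the presets select clang-18 as the compiler, and MatX's documentation recommends nvcc for device code, so `CMAKE_CUDA_COMPILER` may need to point at nvcc as well.

```cmake
# Sketch: enable the CUDA language so CMake drives a CUDA compiler
# for device code (this defines __CUDACC__ for CUDA sources).
project("HastyCuCompute" VERSION 0.1 LANGUAGES CXX CUDA)

# Compile the translation unit that calls .run() as CUDA source
# (alternatively, rename it to main.cu).
set_source_files_properties("lib/main.cpp" PROPERTIES LANGUAGE CUDA)
```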

From your code above I noticed two things:

  1. You're using random(), which currently is only implemented on the device using cuRAND. If you're trying to get this to work in CPU code, can you open a feature request to add CPU support for random()?
  2. You use run() with no arguments, which defaults to the CUDA executor (i.e., device code). If you're trying to run on the host, you need a host executor.

Let me know what your intent is and we can figure it out.

I realize now that I perhaps missed the intent of this library. I thought operations could be "recorded" in host code, then fused into device kernels and executed on the device. But I see now that the device executor requires the compilation to be CUDA-guided (`__CUDACC__`) and the HostExecutor doesn't run on the device. Are there any plans to support this in the future?

Hi @turbotage, you can think of these expressions as a single syntax that will allow you to fuse the expression into either one host function or one device kernel. As an example:

```cpp
(tov0 = angle(tiv0)).run(exec);
```

On this line we pass in `exec`, which in this test uses both a host and a CUDA/device executor separately. In both cases the expression is fused into a single function or kernel, regardless of how long or complicated the expression is.

So I think it already does what you want, but right now not all of the functions are supported on the host; specifically, complex operations like fft and matmul. Host support for those is coming soon.

@turbotage does that last comment help? In summary I think the library does what you want, but possibly not in the way you expected.

Closing. Reopen if this is still an issue.