utcs-scea / ava

Automatic virtualization of (general) accelerators.

Home Page: https://ava.yuhc.me/

API remoting using the `onnx_dump` or `cudart` spec fails because the CUDA program on the guest calls the API `cuGetExportTable`, which is not implemented.

Abhishekghosh1998 opened this issue

A note: I don't know exactly whether this issue should be categorized as a bug; my setup steps might be wrong as well. If it is the latter, please guide me accordingly.

Description of the situation
When I try API remoting for a simple CUDA program compiled with nvcc, the program fails during the remoting step.
On the guestlib side, the CUDA program hits the cuGetExportTable function; the guest program aborts, and the API server on the host exits.

$ cat toy.cu
/************************************************************toy.cu********************************************************/
#include <cuda.h>
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#define BLOCK_SIZE 128

__global__
void do_something(float* d_array)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    d_array[idx]*=100;
}
int main()
{
    long N= 1<<7;
    float *arr = (float*) malloc(N*sizeof(float));
    long i;
    for (i=1;i<=N;i++)
        arr[i-1]=i;
    
    float *d_array;
    cudaError_t ret;
    
    ret = cudaMalloc(&d_array, N*sizeof(float));
    printf("Return value of cudaMalloc = %d\n", ret);
    
    if(ret != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s\n", cudaGetErrorString(ret));
        exit(1);
    }

    ret = cudaMemcpy(d_array, arr, N*sizeof(float), cudaMemcpyHostToDevice);
    printf("Return value of cudaMemcpy = %d\n", ret);

    if(ret != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s\n", cudaGetErrorString(ret));
        exit(1);
    }

    int num_blocks= (N+BLOCK_SIZE-1)/BLOCK_SIZE;
    do_something<<<num_blocks, BLOCK_SIZE>>>(d_array);

    ret = cudaMemcpy(arr, d_array, N*sizeof(float), cudaMemcpyDeviceToHost);
    printf("Return value of cudaMemcpy = %d\n", ret);

    int j;
    for(i=0;i<N;)
    {
        for(j=0;j<8;j++)
                printf("%.0f\t", arr[i++]);
        printf("\n");
    }
    cudaFree(d_array);
    return 0;
}

$ nvcc -o toy toy.cu

On the guest side:

$ ./toy
To check the state of AvA remoting progress, use `tail -f /tmp/fileKVRWBb`
Connect target API server (10.192.34.20:4000) at 10.192.34.20:4000
<000> <thread=7fc752c7aa00> cuDriverGetVersion(driverVersion=ptr 0x0000000000a0 = {10010,...}) -> 0
<001> <thread=7fc752c7aa00> cuInit() -> 0
toy: /media/hdd/abhishek/ava_verbose_new/cava/cudart_nw/cudart_nw_guestlib.cpp:28216: cuGetExportTable: Assertion `Unsupported API function: cuGetExportTable' failed.
Aborted (core dumped)
$

On the host side (the output might differ a bit from usual, because I added a few print statements for better understandability):

./install/bin/demo_manager --worker_path install/onnx_dump/bin/worker
Manager Service listening on ::3333
Receive connection from 172.17.0.2:44782
[from 172.17.0.2:44782] Request 1 GPUs
Spawn API server at 0.0.0.0:4000 (cmdline="CUDA_VISIBLE_DEVICES=0 AVA_CHANNEL=TCP 4000")
worker.cpp::init_worker
worker.cpp::__handle_command_onnx_dump_init
[worker#4000] To check the state of AvA remoting progress, use `tail -f /tmp/filevIxn6v`
[4000] Waiting for guestlib connection
[4000] Accept guestlib with API_ID=a
worker.cpp::__handle_command_onnx_dump
worker.cpp::__wrapper_cuDriverGetVersion
worker.cpp::__handle_command_onnx_dump
worker.cpp::__wrapper_cuInit
return value of __wrapper_cuInit: 0
[pid=69137] API server at ::4000 has exit (waitpid=-1)

To Reproduce
I'll go ahead and describe how I set up AvA.
First, I installed NVIDIA driver 418.226.00 using the NVIDIA-Linux-x86_64-418.226.00.run from the NVIDIA website.
Second, I installed CUDA Toolkit 10.1 using the cuda_10.1.168_418.67_linux.run from the NVIDIA website.
Third, I installed cuDNN 7.6.3.30 using the following files:

libcudnn7_7.6.3.30-1+cuda10.1_amd64.deb      
libcudnn7-doc_7.6.3.30-1+cuda10.1_amd64.deb
libcudnn7-dev_7.6.3.30-1+cuda10.1_amd64.deb
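
For reference, a minimal sketch of installing these packages with dpkg (assuming the .deb files are in the current directory; the exact commands I used may have differed):

$ sudo dpkg -i libcudnn7_7.6.3.30-1+cuda10.1_amd64.deb
$ sudo dpkg -i libcudnn7-dev_7.6.3.30-1+cuda10.1_amd64.deb
$ sudo dpkg -i libcudnn7-doc_7.6.3.30-1+cuda10.1_amd64.deb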

Next, I forked the AvA repository and modified ava/guestlib/cmd_channel_socket_tcp.cpp to connect to my host using its IP address.

And then did the following:

$ cd ava
$ ./generate -s onnx_dump
$ cd ..
$ mkdir build
$ cd build
$ cmake ../ava
$ ccmake . # and then selected the options for onnx_dump and demo manager
$ make -j72
$ make install
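
As a quick sanity check that the build and install produced the pieces used later (a sketch; paths are relative to the build directory):

$ ls -lh install/bin/demo_manager
$ ls -lh install/onnx_dump/bin/worker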

Then I used a CUDA 10.1 Docker image (the one provided in this repository under tools/docker, slightly modified to work around the CUDA apt-key issue during apt update).
I bind-mounted my build directory into the Docker container, copied libguestlib.so from the build directory to /usr/lib/x86_64-linux-gnu inside the container, and modified the library symlinks accordingly (a sketch of the commands follows the listing below):

x86_64-linux-gnu$ ls -lh libcu*
lrwxrwxrwx 1 root root   17 Feb 25  2019 libcublasLt.so -> libcublasLt.so.10
lrwxrwxrwx 1 root root   14 Sep  4 05:33 libcublasLt.so.10 -> libguestlib.so
-rw-r--r-- 1 root root  36M Feb 25  2019 libcublasLt.so.10.1.0.105
-rw-r--r-- 1 root root  23M Feb 25  2019 libcublasLt_static.a
lrwxrwxrwx 1 root root   15 Feb 25  2019 libcublas.so -> libcublas.so.10
lrwxrwxrwx 1 root root   14 Sep  4 05:33 libcublas.so.10 -> libguestlib.so
-rw-r--r-- 1 root root  75M Feb 25  2019 libcublas.so.10.1.0.105
-rw-r--r-- 1 root root  87M Feb 25  2019 libcublas_static.a
lrwxrwxrwx 1 root root   14 Sep  4 05:33 libcudart.so.10 -> libguestlib.so
lrwxrwxrwx 1 root root   14 Sep  4 05:33 libcudart.so.10.1 -> libguestlib.so
lrwxrwxrwx 1 root root   14 Sep  4 05:33 libcuda.so -> libguestlib.so
lrwxrwxrwx 1 root root   14 Sep  4 05:33 libcuda.so.1 -> libguestlib.so
-rwxr-xr-x 1 root root  16M Sep  2 13:03 libcuda.so.418.226.00
lrwxrwxrwx 1 root root   29 Mar  7  2019 libcudnn.so -> /etc/alternatives/libcudnn_so
lrwxrwxrwx 1 root root   14 Sep  4 05:33 libcudnn.so.7 -> libguestlib.so
-rw-r--r-- 1 root root 382M Feb 15  2019 libcudnn.so.7.5.0
lrwxrwxrwx 1 root root   32 Mar  7  2019 libcudnn_static.a -> /etc/alternatives/libcudnn_stlib
-rw-r--r-- 1 root root 351M Feb 15  2019 libcudnn_static_v7.a
lrwxrwxrwx 1 root root   14 Sep  4 05:33 libcufft.so.10 -> libguestlib.so
lrwxrwxrwx 1 root root   23 Apr  6  2018 libcupsfilters.so.1 -> libcupsfilters.so.1.0.0
-rw-r--r-- 1 root root 211K Apr  6  2018 libcupsfilters.so.1.0.0
-rw-r--r-- 1 root root  34K Dec 12  2018 libcupsimage.so.2
-rw-r--r-- 1 root root 558K Dec 12  2018 libcups.so.2
lrwxrwxrwx 1 root root   14 Sep  4 05:33 libcurand.so.10 -> libguestlib.so
lrwxrwxrwx 1 root root   19 Jan 29  2019 libcurl-gnutls.so.3 -> libcurl-gnutls.so.4
lrwxrwxrwx 1 root root   23 Jan 29  2019 libcurl-gnutls.so.4 -> libcurl-gnutls.so.4.5.0
-rw-r--r-- 1 root root 499K Jan 29  2019 libcurl-gnutls.so.4.5.0
lrwxrwxrwx 1 root root   16 Jan 29  2019 libcurl.so.4 -> libcurl.so.4.5.0
-rw-r--r-- 1 root root 507K Jan 29  2019 libcurl.so.4.5.0
lrwxrwxrwx 1 root root   12 May 23  2018 libcurses.a -> libncurses.a
lrwxrwxrwx 1 root root   13 May 23  2018 libcurses.so -> libncurses.so
lrwxrwxrwx 1 root root   14 Sep  4 05:34 libcusolver.so.10 -> libguestlib.so
lrwxrwxrwx 1 root root   14 Sep  4 05:34 libcusparse.so.10 -> libguestlib.so
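
A rough sketch of the bind mount and symlink rewiring described above (the image tag, the /ava-build mount point, and the location of libguestlib.so within the build tree are illustrative, not my exact commands; the set of libcu* symlinks matches the listing above):

$ docker run -it --rm -v $HOME/build:/ava-build <cuda-10.1-image> bash
# inside the container:
$ cp /ava-build/install/lib/libguestlib.so /usr/lib/x86_64-linux-gnu/
$ cd /usr/lib/x86_64-linux-gnu
$ ln -sf libguestlib.so libcuda.so
$ ln -sf libguestlib.so libcuda.so.1
$ ln -sf libguestlib.so libcudart.so.10
$ ln -sf libguestlib.so libcudart.so.10.1
$ ln -sf libguestlib.so libcudnn.so.7
$ ln -sf libguestlib.so libcublas.so.10
$ ln -sf libguestlib.so libcublasLt.so.10
$ ln -sf libguestlib.so libcufft.so.10
$ ln -sf libguestlib.so libcurand.so.10
$ ln -sf libguestlib.so libcusolver.so.10
$ ln -sf libguestlib.so libcusparse.so.10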

Next, I added the guest config inside the Docker container:

$ cat /etc/ava/guest.conf 
channel = "TCP";
manager_address = "10.192.34.20:3333";
gpu_memory = [1024L];
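
Before launching anything, it can help to confirm that the container can actually reach the manager address in guest.conf (a quick sketch; requires nc inside the container):

$ nc -vz 10.192.34.20 3333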

Then I launched the manager on the host as follows:

build$ ./install/bin/demo_manager --worker_path install/onnx_dump/bin/worker
Manager Service listening on ::3333

On the guest, I then run the toy CUDA program, but it fails as described earlier.

Expected behavior
The cudadrv API remoting works fine for the Rodinia benchmarks shared here: https://github.com/utcs-scea/ava-benchmarks/tree/master/rodinia/cuda
But neither the onnx_dump nor the cudart spec works.

Environment:

  • OS: Ubuntu 18.04.6 LTS x86_64
  • Python version: 3.6.9
  • GCC version: 7.5.0
  • Kernel: 5.4.0-150-generic
  • Host: SYS-7049GP-TRT 0123456789
  • CPU: Intel Xeon Gold 6140 (72) @ 3.700GHz
  • GPU: NVIDIA Tesla P40
  • NVIDIA Driver Version: 418.226.00
  • CUDA Version: 10.1

Hi, sorry for the late reply. This is a known issue: we don't support cuGetExportTable, because it's a non-public API and we don't yet understand its interface. cuGetExportTable is usually called by some cuDNN and cuBLAS APIs, IIRC, but it looks odd to me that your sample program also triggers the call.

@yuhc Thanks for your reply.

I figured out the issue that I was facing with cuGetExportTable. It was actually caused by my setup configuration. I was compiling the program with nvcc simply as follows:

nvcc -o toy toy.cu

The above command statically links libcudart into the executable. As a result, interception happened at the CUDA driver API level (i.e., at the calls the runtime API makes to the driver API). So I changed the nvcc invocation accordingly:

nvcc -o toy toy.cu --cudart shared
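
To confirm that the executable now links the CUDA runtime dynamically (and therefore picks up the guestlib stub inside the container), something like ldd can be used (a sketch; the resolved paths depend on the setup):

$ ldd ./toy | grep -E 'libcudart|libcuda'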

Some runtime API functions apparently call cuGetExportTable internally. If the interception is done at the runtime API level, this does not pose a problem, because the cuGetExportTable call is then executed on the host, and only the result of the runtime API call that triggered it is sent back to the guest.

I would like to thank @hfingler for his guidance; his hint that the issue had something to do with the setup gave me the confidence to narrow down which part to tweak.