API remoting using the `onnx_dump` or `cudart` spec fails because the CUDA program on the guest calls the API `cuGetExportTable`, which is not implemented.
Abhishekghosh1998 opened this issue
A note: I don't know exactly whether this issue should be categorized as a bug; my setup steps might be wrong as well. If it is the latter, please guide me accordingly.
Description of the situation
When I try API remoting for a simple CUDA program compiled with nvcc, the program fails during the remoting step. On the guestlib side, the CUDA program hits the cuGetExportTable function; the guest program aborts, and the API server on the host exits.
$ cat toy.cu
/************************************************************toy.cu********************************************************/
#include <cuda.h>
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#define BLOCK_SIZE 128
__global__
void do_something(float* d_array)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    d_array[idx] *= 100;
}

int main()
{
    long N = 1 << 7;
    float *arr = (float*) malloc(N * sizeof(float));
    long i;
    for (i = 1; i <= N; i++)
        arr[i - 1] = i;

    float *d_array;
    cudaError_t ret;

    ret = cudaMalloc(&d_array, N * sizeof(float));
    printf("Return value of cudaMalloc = %d\n", ret);
    if (ret != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s\n", cudaGetErrorString(ret));
        exit(1);
    }

    ret = cudaMemcpy(d_array, arr, N * sizeof(float), cudaMemcpyHostToDevice);
    printf("Return value of cudaMemcpy = %d\n", ret);
    if (ret != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s\n", cudaGetErrorString(ret));
        exit(1);
    }

    int num_blocks = (N + BLOCK_SIZE - 1) / BLOCK_SIZE;
    do_something<<<num_blocks, BLOCK_SIZE>>>(d_array);

    ret = cudaMemcpy(arr, d_array, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("Return value of cudaMemcpy = %d\n", ret);

    int j;
    for (i = 0; i < N;)
    {
        for (j = 0; j < 8; j++)
            printf("%.0f\t", arr[i++]);
        printf("\n");
    }

    cudaFree(d_array);
    return 0;
}
$ nvcc -o toy toy.cu
On the guest side:
$ ./toy
To check the state of AvA remoting progress, use `tail -f /tmp/fileKVRWBb`
Connect target API server (10.192.34.20:4000) at 10.192.34.20:4000
<000> <thread=7fc752c7aa00> cuDriverGetVersion(driverVersion=ptr 0x0000000000a0 = {10010,...}) -> 0
<001> <thread=7fc752c7aa00> cuInit() -> 0
toy: /media/hdd/abhishek/ava_verbose_new/cava/cudart_nw/cudart_nw_guestlib.cpp:28216: cuGetExportTable: Assertion `Unsupported API function: cuGetExportTable' failed.
Aborted (core dumped)
$
On the host side (the output may differ a bit from usual, because I added a few print statements for clarity):
./install/bin/demo_manager --worker_path install/onnx_dump/bin/worker
Manager Service listening on ::3333
Receive connection from 172.17.0.2:44782
[from 172.17.0.2:44782] Request 1 GPUs
Spawn API server at 0.0.0.0:4000 (cmdline="CUDA_VISIBLE_DEVICES=0 AVA_CHANNEL=TCP 4000")
worker.cpp::init_worker
worker.cpp::__handle_command_onnx_dump_init
[worker#4000] To check the state of AvA remoting progress, use `tail -f /tmp/filevIxn6v`
[4000] Waiting for guestlib connection
[4000] Accept guestlib with API_ID=a
worker.cpp::__handle_command_onnx_dump
worker.cpp::__wrapper_cuDriverGetVersion
worker.cpp::__handle_command_onnx_dump
worker.cpp::__wrapper_cuInit
return value of __wrapper_cuInit: 0
[pid=69137] API server at ::4000 has exit (waitpid=-1)
To Reproduce
I'll go ahead and describe how I set up AvA.
First, I installed NVIDIA driver 418.226.00 using NVIDIA-Linux-x86_64-418.226.00.run from the NVIDIA website.
Second, I installed CUDA Toolkit 10.1 using cuda_10.1.168_418.67_linux.run from the NVIDIA website.
Third, I installed cuDNN 7.6.3.30 using the following files (installed with dpkg, as sketched below):
libcudnn7_7.6.3.30-1+cuda10.1_amd64.deb
libcudnn7-doc_7.6.3.30-1+cuda10.1_amd64.deb
libcudnn7-dev_7.6.3.30-1+cuda10.1_amd64.deb
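A minimal sketch of that install step, assuming the .deb files are in the current directory:
$ sudo dpkg -i libcudnn7_7.6.3.30-1+cuda10.1_amd64.deb \
               libcudnn7-dev_7.6.3.30-1+cuda10.1_amd64.deb \
               libcudnn7-doc_7.6.3.30-1+cuda10.1_amd64.deb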
Next, I forked the AvA repository.
I modified ava/guestlib/cmd_channel_socket_tcp.cpp to connect to my host using its IP address, and then did the following:
$ cd ava
$ ./generate -s onnx_dump
$ cd ..
$ mkdir build
$ cd build
$ cmake ../ava
$ ccmake . # and then selected the options for onnx_dump and demo manager
$ make -j72
$ make install
Then I used a CUDA 10.1 Docker image (the one provided in this repository under tools/docker, slightly modified to fix the CUDA apt-key issue during apt update).
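(The apt-key problem referred to here is, I assume, the rotated CUDA repository signing key; a typical fix, not necessarily the exact change made, is to fetch NVIDIA's new key before apt update:)
$ apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub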
I bind-mounted my build directory into the Docker container, copied libguestlib.so from the build directory to /usr/lib/x86_64-linux-gnu in the container, and modified the library symlinks accordingly:
x86_64-linux-gnu$ ls -lh libcu*
lrwxrwxrwx 1 root root 17 Feb 25 2019 libcublasLt.so -> libcublasLt.so.10
lrwxrwxrwx 1 root root 14 Sep 4 05:33 libcublasLt.so.10 -> libguestlib.so
-rw-r--r-- 1 root root 36M Feb 25 2019 libcublasLt.so.10.1.0.105
-rw-r--r-- 1 root root 23M Feb 25 2019 libcublasLt_static.a
lrwxrwxrwx 1 root root 15 Feb 25 2019 libcublas.so -> libcublas.so.10
lrwxrwxrwx 1 root root 14 Sep 4 05:33 libcublas.so.10 -> libguestlib.so
-rw-r--r-- 1 root root 75M Feb 25 2019 libcublas.so.10.1.0.105
-rw-r--r-- 1 root root 87M Feb 25 2019 libcublas_static.a
lrwxrwxrwx 1 root root 14 Sep 4 05:33 libcudart.so.10 -> libguestlib.so
lrwxrwxrwx 1 root root 14 Sep 4 05:33 libcudart.so.10.1 -> libguestlib.so
lrwxrwxrwx 1 root root 14 Sep 4 05:33 libcuda.so -> libguestlib.so
lrwxrwxrwx 1 root root 14 Sep 4 05:33 libcuda.so.1 -> libguestlib.so
-rwxr-xr-x 1 root root 16M Sep 2 13:03 libcuda.so.418.226.00
lrwxrwxrwx 1 root root 29 Mar 7 2019 libcudnn.so -> /etc/alternatives/libcudnn_so
lrwxrwxrwx 1 root root 14 Sep 4 05:33 libcudnn.so.7 -> libguestlib.so
-rw-r--r-- 1 root root 382M Feb 15 2019 libcudnn.so.7.5.0
lrwxrwxrwx 1 root root 32 Mar 7 2019 libcudnn_static.a -> /etc/alternatives/libcudnn_stlib
-rw-r--r-- 1 root root 351M Feb 15 2019 libcudnn_static_v7.a
lrwxrwxrwx 1 root root 14 Sep 4 05:33 libcufft.so.10 -> libguestlib.so
lrwxrwxrwx 1 root root 23 Apr 6 2018 libcupsfilters.so.1 -> libcupsfilters.so.1.0.0
-rw-r--r-- 1 root root 211K Apr 6 2018 libcupsfilters.so.1.0.0
-rw-r--r-- 1 root root 34K Dec 12 2018 libcupsimage.so.2
-rw-r--r-- 1 root root 558K Dec 12 2018 libcups.so.2
lrwxrwxrwx 1 root root 14 Sep 4 05:33 libcurand.so.10 -> libguestlib.so
lrwxrwxrwx 1 root root 19 Jan 29 2019 libcurl-gnutls.so.3 -> libcurl-gnutls.so.4
lrwxrwxrwx 1 root root 23 Jan 29 2019 libcurl-gnutls.so.4 -> libcurl-gnutls.so.4.5.0
-rw-r--r-- 1 root root 499K Jan 29 2019 libcurl-gnutls.so.4.5.0
lrwxrwxrwx 1 root root 16 Jan 29 2019 libcurl.so.4 -> libcurl.so.4.5.0
-rw-r--r-- 1 root root 507K Jan 29 2019 libcurl.so.4.5.0
lrwxrwxrwx 1 root root 12 May 23 2018 libcurses.a -> libncurses.a
lrwxrwxrwx 1 root root 13 May 23 2018 libcurses.so -> libncurses.so
lrwxrwxrwx 1 root root 14 Sep 4 05:34 libcusolver.so.10 -> libguestlib.so
lrwxrwxrwx 1 root root 14 Sep 4 05:34 libcusparse.so.10 -> libguestlib.so
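The links pointing at libguestlib.so above were set up by hand; a minimal sketch of the kind of commands involved (the exact set of libraries to redirect depends on which APIs the spec forwards):
$ cd /usr/lib/x86_64-linux-gnu
$ ln -sf libguestlib.so libcuda.so.1
$ ln -sf libguestlib.so libcudart.so.10.1
$ ln -sf libguestlib.so libcudnn.so.7
$ ln -sf libguestlib.so libcublas.so.10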
I added the guest config in the Docker container as:
$ cat /etc/ava/guest.conf
channel = "TCP";
manager_address = "10.192.34.20:3333";
gpu_memory = [1024L];
Then I tried to launch the manager on the host as follows:
build$ ./install/bin/demo_manager --worker_path install/onnx_dump/bin/worker
Manager Service listening on ::3333
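As an extra sanity check on top of the steps above (assuming netcat is installed in the container), the manager port can be probed from the guest before running the program:
$ nc -vz 10.192.34.20 3333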
On the guest, I then try to run the toy CUDA program, but it fails as described earlier.
Expected behavior
The cudadrv API remoting works fine for the Rodinia benchmarks shared here: https://github.com/utcs-scea/ava-benchmarks/tree/master/rodinia/cuda. But neither onnx_dump nor cudart works.
Environment:
- OS: Ubuntu 18.04.6 LTS x86_64
- Python version: 3.6.9
- GCC version: 7.5.0
- Kernel: 5.4.0-150-generic
- Host: SYS-7049GP-TRT 0123456789
- CPU: Intel Xeon Gold 6140 (72) @ 3.700GHz
- GPU: NVIDIA Tesla P40
- NVIDIA Driver Version: 418.226.00
- CUDA Version: 10.1
Hi, sorry for the late reply. This is a known issue. We haven't supported cuGetExportTable because it's a non-public API and we don't yet understand its syntax. cuGetExportTable is usually called by some cuDNN and cuBLAS APIs, IIRC, but it looks odd to me that your sample program also triggers the call.
@yuhc Thanks for your reply.
I figured out the issue I was facing with cuGetExportTable. It was actually caused by my setup. I was compiling the programs with nvcc simply as follows:
nvcc -o toy toy.cu
The above command statically links libcudart into the executable. As a result, the interception happened at the CUDA driver API level (on the calls the runtime API makes to the driver API). So I changed the linking option accordingly:
nvcc -o toy toy.cu --cudart shared
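To confirm which runtime actually got linked (an extra check, assuming ldd is available in the container), the binary's dynamic dependencies can be inspected:
$ ldd toy | grep cudart
With --cudart shared, libcudart.so.10.1 shows up as a dynamic dependency (and in the guest it resolves through the libguestlib.so symlink); with the default static link, no libcudart entry appears.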
Some runtime API functions internally call cuGetExportTable. If the interception is done at the runtime API level, this does not pose a problem, because the cuGetExportTable call is then executed on the host, and only the result of the runtime API call that triggered it is sent back to the guest.
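One way to see the interception boundary move (again an extra check on my part) is to list the binary's undefined dynamic symbols:
$ nm -D toy | grep ' U cuda'
With --cudart shared, runtime API symbols such as cudaMalloc and cudaMemcpy are undefined in the binary and resolve against libcudart.so.10.1 (i.e. libguestlib.so in the guest), so remoting happens at the runtime API level; with the static runtime, those symbols are baked into the binary, and only the driver API calls it makes internally, including cuGetExportTable, reach libguestlib.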
I would like to thank @hfingler for his guidance that the issue had something to do with my setup; it gave me the confidence to pinpoint the area to tweak.