juniorprincewang / virtio-cuda-module

cuda-supported qemu front-driver and test case


virtio-cuda-module

This is the para-virtualized front-end driver of CUDA-supported QEMU, together with its test cases.

The wrapped user-space runtime library in the VM (guest OS) provides CUDA runtime access and interfaces for memory allocation and CUDA commands, and passes these commands to the driver.

The front-end driver is responsible for memory management and data transfer: it parses the ioctl commands issued by the custom library and passes them on through the control channel.

Installation

Prerequisites

Our experiment environment is as follows:

Host

  • Ubuntu 16.04.5 LTS (kernel v4.15.0-29-generic x86_64)
  • cuda-9.1
  • PATH
echo 'export PATH=$PATH:/usr/local/cuda/bin' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib:/usr/local/cuda/lib64' >> ~/.bashrc
source ~/.bashrc

sudo bash -c "echo /usr/local/cuda/lib64/ > /etc/ld.so.conf.d/cuda.conf"
sudo ldconfig
  • Install required packages
sudo apt-get install -y pkg-config bridge-utils uml-utilities zlib1g-dev libglib2.0-dev autoconf \
    automake libtool libsdl1.2-dev libsasl2-dev libcurl4-openssl-dev libaio-dev libvde-dev libspice-server-dev

Guest

  • Ubuntu 16.04 x86_64 image (guest OS)
  • cuda-9.1 toolkit

How to install

Host

Guest

  1. Clone this repo.
    [to do]

A CUDA sample in guest OS

In the guest OS, nvcc compiles sources containing host/device code and standard CUDA runtime API calls. Unlike on a native machine, CUDA programs in the guest VM must be compiled with the nvcc flag "--cudart=shared", so that the CUDA runtime is linked dynamically as a shared library.
This allows the wrapped library to intercept dynamic memory allocations in the CPU code as well as CUDA runtime API calls.
After installing qCUdriver and qCUlibrary in the guest OS, modify the internal flags in the Makefile as below:

# internal flags
NVCCFLAGS   := -m${TARGET_SIZE} --cudart=shared      

Finally, run make and execute the binary without changing any source code, either by using LD_PRELOAD or by changing LD_LIBRARY_PATH.

LD_PRELOAD=/path/to/libvcuda.so ./vectorAdd
  • Benchmarking vectorAdd

A command-line benchmarking tool, hyperfine, is recommended.
To run a benchmark, you can simply call hyperfine '<command>'. For example:

hyperfine 'LD_PRELOAD=/path/to/libvcuda.so ./vectorAdd'

By default, hyperfine performs at least 10 benchmarking runs. To change this, use the -m/--min-runs or -M/--max-runs options.

Supported APIs

CUDA Runtime API

The current version implements the necessary CUDA runtime APIs, listed below:

| Classification | Supported CUDA runtime API |
| --- | --- |
| Memory Management | cudaMalloc, cudaMemset, cudaMemcpy, cudaMemcpyAsync, cudaFree, cudaMemGetInfo, cudaMemcpyToSymbol, cudaMemcpyFromSymbol |
| Device Management | cudaGetDevice, cudaGetDeviceCount, cudaSetDevice, cudaSetDeviceFlags, cudaGetDeviceProperties, cudaDeviceSynchronize, cudaDeviceReset |
| Stream Management | cudaStreamCreate, cudaStreamCreateWithFlags, cudaStreamDestroy, cudaStreamSynchronize, cudaStreamWaitEvent |
| Event Management | cudaEventCreate, cudaEventCreateWithFlags, cudaEventRecord, cudaEventSynchronize, cudaEventElapsedTime, cudaEventDestroy, cudaEventQuery |
| Error Handling | cudaGetLastError, cudaGetErrorString |
| Zero-copy | cudaHostRegister, ~~cudaHostGetDevicePointer~~, cudaHostUnregister, cudaHostAlloc, cudaMallocHost, cudaFreeHost, cudaSetDeviceFlags |
| Thread Management | cudaThreadSynchronize |
| Module & Execution Control | __cudaRegisterFatBinary, __cudaUnregisterFatBinary, __cudaRegisterFunction, __cudaRegisterVar, cudaConfigureCall, cudaSetupArgument, cudaLaunch |

CUBLAS API & CURAND API

To support Caffe, we also implement CUBLAS and CURAND APIs in libcudart.so.

| Classification | Supported API |
| --- | --- |
| CUBLAS API | cublasCreate, cublasDestroy, cublasSetVector, cublasGetVector, cublasSetStream, cublasGetStream, cublasSasum, cublasDasum, cublasScopy, cublasDcopy, cublasSdot, cublasDdot, cublasSaxpy, cublasDaxpy, cublasSscal, cublasDscal, cublasSgemv, cublasDgemv, cublasSgemm, cublasDgemm, cublasSetMatrix, cublasGetMatrix |
| CURAND API | curandCreateGenerator, curandCreateGeneratorHost, curandGenerate, curandGenerateUniform, curandGenerateUniformDouble, curandGenerateNormal, curandGenerateNormalDouble, curandDestroyGenerator, curandSetGeneratorOffset, curandSetPseudoRandomGeneratorSeed |

Acknowledgements

Last but not least, thanks to qcuda for the inspiration.
The message channels are built on [chan](https://github.com/tylertreat/chan), a pure C implementation of Go channels.
