Cuda_practice

Note

- The first cudaMalloc call is very slow, so the implementation issues a dummy cudaMalloc up front to absorb that one-time cost.
- cudaMemcpy from device to host is much slower than from host to device (about a 10x difference).
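
Both observations can be checked with a small timing program. The following is a minimal sketch, not the code used in this project: the `CHECK` macro, the helpers `timedMallocMs` and `timedCopyMs`, and the 64 MiB buffer size are all assumptions made for illustration. The slow first call is commonly attributed to the CUDA runtime creating its context lazily on the first API call, which is why a dummy allocation moves that cost out of the measured path. The copy timings depend on the hardware and on whether the host buffer is pageable or pinned, so the ratio it prints may differ from the 10x noted above.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original notes): abort on any CUDA error.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            std::fprintf(stderr, "%s failed: %s\n", #call,         \
                         cudaGetErrorString(err_));                \
            std::exit(EXIT_FAILURE);                               \
        }                                                          \
    } while (0)

// Wall-clock time of a single cudaMalloc call, in milliseconds.
// The buffer is freed afterwards, outside the timed region.
static double timedMallocMs(size_t bytes) {
    void *p = nullptr;
    auto t0 = std::chrono::steady_clock::now();
    CHECK(cudaMalloc(&p, bytes));
    auto t1 = std::chrono::steady_clock::now();
    CHECK(cudaFree(p));
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Time of one blocking cudaMemcpy, measured with CUDA events, in milliseconds.
static float timedCopyMs(void *dst, const void *src, size_t bytes,
                         cudaMemcpyKind kind) {
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaDeviceSynchronize());  // keep earlier GPU work out of the timing
    CHECK(cudaEventRecord(start));
    CHECK(cudaMemcpy(dst, src, bytes, kind));
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return ms;
}

int main() {
    // The first call plays the role of the dummy cudaMalloc; the second
    // shows the cost of an allocation once the runtime is initialized.
    std::printf("first cudaMalloc:  %.3f ms\n", timedMallocMs(4));
    std::printf("second cudaMalloc: %.3f ms\n", timedMallocMs(4));

    const size_t bytes = 64u << 20;  // 64 MiB, an arbitrary test size
    void *host = std::malloc(bytes); // pageable host memory
    if (host == nullptr) return EXIT_FAILURE;
    std::memset(host, 0, bytes);     // touch the pages before timing copies

    void *dev = nullptr;
    CHECK(cudaMalloc(&dev, bytes));

    float h2d = timedCopyMs(dev, host, bytes, cudaMemcpyHostToDevice);
    float d2h = timedCopyMs(host, dev, bytes, cudaMemcpyDeviceToHost);
    std::printf("host to device: %.3f ms\n", h2d);
    std::printf("device to host: %.3f ms\n", d2h);

    CHECK(cudaFree(dev));
    std::free(host);
    return 0;
}
```

This should build with something like `nvcc -o timing timing.cu` (the file name is arbitrary). Calling `cudaDeviceSynchronize` before the start event keeps a still-running kernel from being counted as copy time, which is one way a device-to-host copy can appear slower than it is, since a blocking cudaMemcpy waits for earlier work on the device to finish.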