cublas backend of MatMul does not work with stream parallelism

Question

roastduck opened this issue 4 months ago

We should run cuBLAS calls in the appropriate stream, which further requires creating a separate cuBLAS handle for each stream. Since we cache the cuBLAS handle in GPUContext, we should make that cache work across multiple streams.
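One possible shape for such a cache, as a minimal sketch rather than the actual GPUContext code: a map from stream to handle that creates a handle the first time a stream is seen. The class name `PerStreamHandleCache` and the `create`/`destroy` callbacks are hypothetical and only for illustration. The template is kept generic so it compiles without the CUDA toolkit; with cuBLAS, `Stream` would be `cudaStream_t` and `Handle` would be `cublasHandle_t`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <unordered_map>
#include <utility>

// Hypothetical helper (not an existing API): caches one handle per stream.
// Assumption: Stream is hashable with std::hash, which holds for pointer
// types such as cudaStream_t.
template <class Stream, class Handle>
class PerStreamHandleCache {
  public:
    using Create = std::function<Handle(Stream)>;
    using Destroy = std::function<void(Handle)>;

    PerStreamHandleCache(Create create, Destroy destroy)
        : create_(std::move(create)), destroy_(std::move(destroy)) {
        assert(create_ && destroy_); // Both callbacks must be callable
    }

    // The cache owns its handles, so copying it would double-destroy them
    PerStreamHandleCache(const PerStreamHandleCache &) = delete;
    PerStreamHandleCache &operator=(const PerStreamHandleCache &) = delete;

    ~PerStreamHandleCache() {
        for (auto &kv : handles_)
            destroy_(kv.second);
    }

    // Return the handle bound to `stream`, creating it on first use
    Handle get(Stream stream) {
        std::lock_guard<std::mutex> guard(mutex_);
        auto it = handles_.find(stream);
        if (it == handles_.end())
            it = handles_.emplace(stream, create_(stream)).first;
        return it->second;
    }

    // Number of handles created so far
    std::size_t size() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return handles_.size();
    }

  private:
    Create create_;
    Destroy destroy_;
    mutable std::mutex mutex_; // Guards the map only, not use of the handles
    std::unordered_map<Stream, Handle> handles_;
};
```

For cuBLAS, the `create` callback would presumably call `cublasCreate(&handle)` followed by `cublasSetStream(handle, stream)`, and `destroy` would call `cublasDestroy(handle)`, checking the returned `cublasStatus_t` in each case. MatMul would then look up the handle with `cache.get(stream)` for the stream it is launched on, instead of using a single cached handle.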