How to run a random circuit in a batched way?

Question

zipeilee opened this issue 10 months ago · comments

I want to run a random unitary circuit over many instances (say 1000) and return the average value, like this:

reg = zero_state(1)
mean([expect(Z, reg |> dispatch!(Rx,:random)) for _ in 1:1000])

But I want to run it in a batched way. I know Yao has BatchedArrayReg, but I don't know how to apply a different random circuit to each instance of a batched register. I tried:

reg = zero_state(1,nbatch=1000)
expect(Z, reg|>dispatch!(Rx,:random))

But this does not seem to work: it picks a single random instance and applies it to all 1000 batch entries. What is the correct way?

Jinguo Liu · Answer 1 · Wed Nov 08 2023 12:47:23 GMT+0800 (China Standard Time)

Unfortunately, there is no easy way to do that. You need to copy reg for each sample, because |> changes the state in place.

julia> sum([expect(Z, copy(reg) |> dispatch!(Rx(0),rand()*2π)) for _ in 1:1000])/1000
0.027340815055604813 + 0.0im
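As a sanity check on the number above, here is a minimal sketch in plain Julia (no Yao needed). It relies on the fact that Rx(θ) maps |0> to cos(θ/2)|0> - i sin(θ/2)|1>, so the expectation of Z is cos(θ). The helper name expect_z_after_rx is made up for this illustration and is not part of Yao.

```julia
using Statistics

# Rx(θ)|0> = cos(θ/2)|0> - i sin(θ/2)|1>, hence <Z> = cos(θ/2)^2 - sin(θ/2)^2 = cos(θ).
# `expect_z_after_rx` is a hypothetical helper for this sketch, not a Yao function.
expect_z_after_rx(θ) = cos(θ / 2)^2 - sin(θ / 2)^2

θs = 2π .* rand(1000)                  # same angle distribution as rand()*2π above
sample_mean = mean(expect_z_after_rx.(θs))
println(sample_mean)                   # fluctuates around 0 from run to run
```

With angles drawn uniformly from [0, 2π), the exact average of cos(θ) is 0, and with 1000 samples the sample mean typically deviates by a few hundredths, which is consistent with the value printed above.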

zipeilee · Answer 2 · Wed Nov 08 2023 13:40:18 GMT+0800 (China Standard Time)

Actually, I was hoping to use the GPU's parallelism for batch processing, and this approach does not seem to make good use of it.

Jinguo Liu · Answer 3 · Wed Nov 08 2023 19:34:10 GMT+0800 (China Standard Time)

I see. In your case, I would suggest writing a new kernel, since this feature is not supported by Yao yet.

1. Define a new gate type with batched parameters.
2. Dispatch the gate to the correct instruct! function. The current single-parameter rotation gate calls into this implementation:
   https://github.com/QuantumBFS/CuYao.jl/blob/05f365f8f8e49fa2787df50a6e2226f508c94d80/src/instructs.jl#L19
3. Implement a new CUDA kernel (see the hints below). It should not be too difficult if you know CUDA programming.

Hints for rewriting this instruct! method

instruct!(::Val{2}, state::DenseCuVecOrMat, U0::AbstractMatrix, locs::NTuple{M, Int}, clocs::NTuple{C, Int}, cvals::NTuple{C, Int})

Val{2} means the method is for qubits rather than general qudits.
state is a vector or matrix that serves as the register storage.
U0 is the gate matrix. In your case, you need to pass a rank-3 tensor instead, where each batch entry stores its own 2x2 matrix, and each CUDA thread should use the matrix that belongs to its batch entry.
locs are the locations the gate acts on. For a single-qubit gate, it should contain only one element.
clocs and cvals should be empty tuples in the absence of control bits.
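
To make the rank-3 tensor idea concrete, here is a CPU reference sketch in plain Julia. It is not the CuYao kernel and not part of Yao's API: the function names batched_gate1! and rx_batch are made up for this illustration, and it assumes the state is stored as a 2^n x nbatch matrix with qubit 1 as the least significant bit. A reference like this could be used to check the output of a real CUDA kernel.

```julia
# CPU reference for a batched single-qubit gate (plain Julia, no Yao or CUDA).
# Assumptions: `state` is 2^n × nbatch (one column per batch entry), qubit 1 is
# the least significant bit, and U[:, :, b] is the 2×2 gate for batch entry b.
# `batched_gate1!` and `rx_batch` are hypothetical names, not Yao functions.
function batched_gate1!(state::AbstractMatrix, U::AbstractArray{<:Number,3}, loc::Int)
    nbatch = size(state, 2)
    size(U) == (2, 2, nbatch) || throw(DimensionMismatch("U must be 2×2×nbatch"))
    step = 1 << (loc - 1)                  # index distance between the paired amplitudes
    for b in 1:nbatch                      # on a GPU, each (pair, b) would go to one thread
        for i in 0:size(state, 1)-1
            (i & step) == 0 || continue    # visit every amplitude pair exactly once
            i0, i1 = i + 1, i + step + 1   # 1-based indices with bit `loc` = 0 and 1
            a0, a1 = state[i0, b], state[i1, b]
            state[i0, b] = U[1, 1, b] * a0 + U[1, 2, b] * a1
            state[i1, b] = U[2, 1, b] * a0 + U[2, 2, b] * a1
        end
    end
    return state
end

# Stack one Rx(θ) matrix per batch entry into a 2×2×nbatch tensor.
function rx_batch(θs)
    U = Array{ComplexF64}(undef, 2, 2, length(θs))
    for (b, θ) in enumerate(θs)
        c, s = cos(θ / 2), sin(θ / 2)
        U[:, :, b] = [c -im*s; -im*s c]
    end
    return U
end

nbatch = 1000
θs = 2π .* rand(nbatch)                    # a different random angle per batch entry
state = zeros(ComplexF64, 2, nbatch)
state[1, :] .= 1                           # every batch entry starts in |0>
batched_gate1!(state, rx_batch(θs), 1)

# <Z> for each batch entry, then the batch average
zs = abs2.(state[1, :]) .- abs2.(state[2, :])
println(sum(zs) / nbatch)
```

Each batch entry here sees its own rotation angle, so zs[b] should match cos(θs[b]). In a CUDA version, the two nested loops would be replaced by a thread index that is decoded into the pair index and the batch index.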

Please feel free to ask if you encounter any issue.

zipeilee · Answer 4 · Thu Nov 09 2023 18:48:44 GMT+0800 (China Standard Time)

Thanks for your patience and guidance! Actually, I need to compute a chain block made of layers of such single-qubit gates and layers of two-qubit gates acting on many qubits. This is good advice, and I will try it.