ggerganov / ggml

Tensor library for machine learning

[2GPU] Memcpy2D of matrixXmatrix -- src size (and form)

kotee4ko opened this issue · comments

Hello, Mr. @ggerganov, thank you for the awesome project.

ggml_cuda_op_mul_mat() {
 ...
                // copy dst to host or other device if necessary
                if (!dst_on_device) {
                    void * dst_off_device;
                    cudaMemcpyKind kind;
                    if (dst->backend == GGML_BACKEND_CPU) {
                        dst_off_device = dst->data;
                        kind = cudaMemcpyDeviceToHost;
                    } else if (dst->backend == GGML_BACKEND_GPU) {
                        dst_off_device = dst_extra->data_device[g_main_device];
                        kind = cudaMemcpyDeviceToDevice;
                    } else {
                        GGML_ASSERT(false);
                    }
                    if (split) {
                        // src0 = weight matrix is saved as a transposed matrix for better memory layout.
                        // dst is NOT transposed.
                        // The outputs of matrix matrix multiplications can therefore NOT simply be concatenated for >1 GPU.
                        // Instead they need to be copied to the correct slice in ne0 = dst row index.
                        // If dst is a vector with ne0 == 1 then you don't have to do this but it still produces correct results.
                        float * dhf_dst_i = (float *) ((char *) dst_off_device + i02*nb2 + i03*nb3);
                        GGML_ASSERT(dst->nb[1] == ne0*sizeof(float));
                        dhf_dst_i += src1_col_0*ne0 + row_low[id];
                        CUDA_CHECK(cudaMemcpy2DAsync(dhf_dst_i, ne0*sizeof(float), dst_dd_i, row_diff*sizeof(float),
                                                    row_diff*sizeof(float), src1_ncols, kind, stream));
                    } else {
                        float * dhf_dst_i = (float *) ((char *) dst_off_device + i02*nb2 + i03*nb3);
                        GGML_ASSERT(dst->nb[1] == ne0*sizeof(float));
                        dhf_dst_i += src1_col_0*ne0;
                        CUDA_CHECK(cudaMemcpyAsync(dhf_dst_i, dst_dd_i, src1_ncols*ne0*sizeof(float), kind, stream));
                    }
                }

                // add event for the main device to wait on until other device is done
                if (split && (id != g_main_device || is != 0)) {
                    CUDA_CHECK(cudaEventRecord(src0_extra->events[id][is], stream));
                }
            }
   ...
   
 }
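
For reference, the split-path cudaMemcpy2DAsync call above behaves like the following per-row loop (a sketch using the same variable names as the excerpt, not actual ggml code). It shows why spitch is row_diff*sizeof(float) (the partial result on this GPU is densely packed, row_diff floats per src1 column) while dpitch is ne0*sizeof(float) (each copied row must land inside a full-width dst row, at the slice starting at row_low[id]):

// Semantically equivalent to:
// cudaMemcpy2DAsync(dhf_dst_i, ne0*sizeof(float),
//                   dst_dd_i, row_diff*sizeof(float),
//                   row_diff*sizeof(float), src1_ncols, kind, stream)
for (int64_t row = 0; row < src1_ncols; ++row) {
    CUDA_CHECK(cudaMemcpyAsync(
        (char *) dhf_dst_i + row*ne0*sizeof(float),           // advance dst by dpitch
        (const char *) dst_dd_i + row*row_diff*sizeof(float), // advance src by spitch
        row_diff*sizeof(float),                               // copy `width` bytes per row
        kind, stream));
}

So the "height" is src1_ncols (the number of src1 columns processed in this pass), and each "row" of the transfer is one column's row_diff-element slice. No transpose happens in the copy itself; it is only a strided placement into the right dst slice.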
 

I'm trying to understand how this code should work.

What I have clarified so far:

  1. When split is true, the call to cudaMemcpy2DAsync() results in a SIGSEGV, due to a nullptr dereference deep inside libhipamd64.so.
  2. The total src size (WxH) is half of src1_ncols*ne0*sizeof(float) when running on two GPUs.
  3. The dst pointer, memory, memory size, and memory contents (checked under rocgdb at runtime) seem to be OK.
  4. src mem + (WxH - 1) points to the last usable byte of the weights matrix, which seems OK too,
     where WxH is row_diff*sizeof(float)*src1_ncols.
  5. If I change the 2D copy to a classic async memcpy, the crash is gone, but the results are wrong.

What I am asking for hints on, or an explanation of, please:

  1. How and why should dpitch and spitch be formed?
  2. Why is the src pitch so large?
  3. Why is the src height so small?
  4. Do (2) and (3) perform the un-transposing of the result?
  5. Why does it crash?
  6. WxH is OK, but WxH*pitch is out of bounds?

Thank you 💯

From the cudaMemcpy2DAsync documentation:

[in] | width  | Width of matrix transfer (columns in bytes)
[in] | height | Height of matrix transfer (rows)

Hey, it seems I found a bug.

CUDA_CHECK(hipMemPtrGetInfo(dst_dd_i, &srcSize)); says that the src allocation size is 0x9a00 in my case.
row_diff*sizeof(float)*src1_ncols == 0x9200 in my case,
but srcPitch is 0x2480.

If I understand correctly, to finish the Memcpy2D we need at least M = (N / (Y / 2)) * Y.
The output from gdb confirms this:

pwndbg> hex SRC
+0000 0x7ff60d413000  c0 1b 52 bf  98 e9 b8 3f  aa 2c d7 bf  38 59 3e bf  │..R.│...?│.,..│8Y>.│
+0010 0x7ff60d413010  76 71 8d bd  49 1c 58 bf  ab 2e 26 bf  97 ad 12 3f  │vq..│I.X.│..&.│...?│
+0020 0x7ff60d413020  4b 2f 32 3f  17 30 2b 3e  3e 97 03 c0  b4 9f de bf  │K/2?│.0+>│>...│....│
+0030 0x7ff60d413030  70 5c 26 be  da 86 3d be  8e 6d 5f 3f  b0 2a 09 3f  │p\&.│..=.│.m_?│.*.?│
pwndbg> hex SRC+0x91f0
+0000 0x7ff60d41c1f0  e6 9c 32 3e  76 3a 41 be  0c 26 5b 3c  78 ba 3b 3e  │..2>│v:A.│.&[<│x.;>│
+0010 0x7ff60d41c200  be be be be  be be be be  be be be be  be be be be  │....│....│....│....│
...
+9820 0x7ff60d41d000  00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00  │....│....│....│....│
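
For reference, the general rule for a pitched copy is that source row i starts at byte offset i*spitch, so the source allocation must cover at least spitch*(height - 1) + width bytes, and width must not exceed spitch. A hypothetical helper expressing that bound (not ggml code):

#include <assert.h>
#include <stddef.h>

// Hypothetical helper: the minimum source allocation a 2D copy touches.
// The last of `height` rows starts at spitch*(height - 1) and is `width`
// bytes long; the gap between width and spitch on each row is skipped.
static size_t memcpy2d_src_min_bytes(size_t spitch, size_t width, size_t height) {
    assert(width <= spitch);                          // required by cudaMemcpy2D
    return height == 0 ? 0 : spitch*(height - 1) + width;
}

With the numbers above (spitch == width == 0x2480), this bound collapses to spitch*height, i.e. 0x9200, which would fit inside the reported 0x9a00 allocation, so by this rule the source region itself should be in bounds.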

@ggerganov, what do you think about this?

And there is more, about P2P memory access, but I'm not sure yet.

I'm afraid it will be difficult for me to help here, because I don't have a multi-GPU system to test with and I am not very familiar with this code. In general, multi-GPU support in ggml is in a poor state and probably some things don't work and can be improved.

If you think you've found a bug, please provide a proposed fix and steps to reproduce.

I'm trying to do this.

At the moment I can't understand why we can use the result of src0 x src1 as-is on a single GPU, but not on several GPUs.

For hipBLAS we can't use Memcpy2D to copy from GPU 1 to GPU 0, because Memcpy2D doesn't support P2P memory access. We have to copy by hand in a loop using DtoD memcpys... at least for GPUs which don't support P2P access. A sketch of that fallback is below.
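
This is one way such a manual fallback could look (an illustrative sketch, not ggml code; main_dev and worker_dev are hypothetical device indices, and the pitched variables are reused from the excerpt above). cudaMemcpyPeerAsync (hipMemcpyPeerAsync on ROCm) takes explicit source and destination device IDs, so it works even where a single 2D copy cannot perform the peer transfer:

// Copy src1_ncols pitched rows from the worker GPU to the main GPU,
// one row_diff-sized row at a time.
for (int64_t row = 0; row < src1_ncols; ++row) {
    CUDA_CHECK(cudaMemcpyPeerAsync(
        (char *) dhf_dst_i + row*ne0*sizeof(float), main_dev,           // dst slice, main GPU
        (const char *) dst_dd_i + row*row_diff*sizeof(float), worker_dev, // src rows, worker GPU
        row_diff*sizeof(float), stream));
}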

As for a twin-GPU setup, I'm lucky to have access to a system with that hardware at work.

Upd: I implemented a loop to copy the data from src to dst with the correct pitches, using DtoD copies.
This avoids the crash, but the results are incorrect (the model just generates nonsense).

Could somebody explain what we expect to be in dst after the call to ggml_cuda_op_mul_mat?

Upd2:
dst_on_device is true when running on a single GPU, so this if-branch is simply skipped and everything works great.
split is true as well when running on a single GPU.
Both src0 and src1 are non-transposed when running on a single GPU or on multiple GPUs.