bwd_bypass kernel is empty on AMD HW

Question

bwd_bypass kernel is empty on AMD HW

mattsinc opened this issue 6 years ago · comments

I've been looking at the disassembly of the bwd_bypass kernel when run with MIOpen on AMD HW, and I'm noticing that the Bwd kernel for this run is empty:

(literally the entire program)
MIOpenNeuronBwd:
s_endpgm

I've checked the resultant code on multiple generations of AMD HW and multiple versions of MIOpen, and all of them are generating similar kernels. Thus, I was wondering if this is intentional? Is it really expected that the Bwd pass does nothing?

Thanks,
Matt

Matt Sinclair · Answer 1 · Tue May 08 2018 04:00:45 GMT+0800 (China Standard Time)

Wanted to ping you all about this again.

Matt

Shi Dong · Answer 2 · Thu May 10 2018 21:31:30 GMT+0800 (China Standard Time)

@mattsinc Sorry that I just saw your message. The bypass layer is just moving data from one place to another. It is originally used as a template to develop new layers. So it is doing nothing.
Hope this helps.

Shi

Matt Sinclair · Answer 3 · Fri May 11 2018 05:26:35 GMT+0800 (China Standard Time)

Hi @doody1986,

Thanks for getting back to me. I don't know if I quite understand your response though. I went through the OpenCL kernel that is being used to generate this kernel (MIOpen/src/kernels/MIOpenNeuron.cl). It turns out that every single #if and #elif will evaluate to false given what the environment variables are set to for bwd_bypass. Given this, only the following lines of code will actually be used in the compilation:

    int x = get_global_id(0);
    _FLOAT_ bot_diff_data[MLO_READ_UNIT];
    _FLOAT_ top_diff_data[MLO_READ_UNIT];
    _FLOAT_ bot_data[MLO_READ_UNIT];
    _FLOAT_ top_data[MLO_READ_UNIT];

    for (int i = 0; i < MLO_READ_UNIT; ++i)
    {
           {
                top_diff_dat[i] = top_diff[dyOffset + x * MLO_READ_UNIT + i];
                bot_dat[i]      = bot_data[xOffset + x * MLO_READ_UNIT + i];
                top_dat[i]      = top_data[yOffset + x * MLO_READ_UNIT + i];
           }
    }
    for (int i = 0; i < MLO_READ_UNIT; ++i)
    {
         {
               bot_diff[dxOffset + x * MLO_READ_UNIT + i] = bot_diff_dat[i];
         }
    }

All of the above code is updating input args to the kernel based on the arrays in the kernel or setting the local arrays to values from the input args. However, these arrays aren't even used after being set so my guess is that the compiler optimizes out everything in the kernel; which explains why it's empty.

But your response indicates that that it should be moving data from one place to another -- if the code is being optimized out, that wouldn't happen. Are you expecting that the arrays will be outputted? Or are you truly expecting that the kernel won't do anything?

Thanks,
Matt

Matt Sinclair · Answer 4 · Tue May 15 2018 00:10:09 GMT+0800 (China Standard Time)

@doody1986, wanted to ping you again about the above.

Shi Dong · Answer 5 · Tue May 15 2018 08:56:25 GMT+0800 (China Standard Time)

@mattsinc Thanks for pointing this out. Really appreciate! I went back to the code and figured it out. For the bypass layer when MIOpen is enabled, I use the activation function of MIOpen with a special flag indicating bypassing. You can refer to the link here for more information https://rocmsoftwareplatform.github.io/MIOpen/doc/html/activation.html#miopenactivationmode-t
The code you present must be corresponding to this part.

What I explained before is corresponding to the Nvidia implementation rather than the one for AMD, so it was not correct. Sorry about that. But when cuDNN is enabled, the copy actually happens.
I was expecting this layer to bypass the input data without any computations involved. The way AMD does seems like a bug, but if we think of it as an in-place bypass, it is exactly the behavior - doing nothing.

May I ask why this is bothering you?
Thanks

Shi

Matt Sinclair · Answer 6 · Tue May 15 2018 12:41:44 GMT+0800 (China Standard Time)

Hi @doody1986,

Thanks for getting back to me! I had noticed this because the profiler was telling me the kernel used 0 vector registers ... which seemed implausible. So I was trying to figure out how that was possible. The above explanation is helpful in that regard.

So is there a next step from here? Is it something wrong on your end? Do I need to open a ticket with the MIOpen folks? Is it just an optimization on the part of the MIOpen folks?

Thanks,
Matt

Shi Dong · Answer 7 · Wed May 16 2018 00:56:41 GMT+0800 (China Standard Time)

@mattsinc

Since I am just calling this API from MIOpen, there isn't too much things I can do in DNNMark and I am not quite familiar with their implementation detail for now.
If you think this part might affect your own work, then opening a ticket might be a good option. This is only my personal suggestion though. :)

Shi

Matt Sinclair · Answer 8 · Wed May 16 2018 02:02:57 GMT+0800 (China Standard Time)

@doody1986, thanks. It doesn't really affect my work, but it does bother me that there are known problems with unknown solutions. Anyways, will close.

Matt

Matt Sinclair · Answer 9 · Thu Jun 07 2018 07:46:53 GMT+0800 (China Standard Time)

FYI: ROCm/MIOpen#42