DropPath Implementation
IsmaelElsharkawi opened this issue · comments
Hi,
I had two questions about the implementation of drop path:
- Why is it done per sample? As far as I understand from https://arxiv.org/pdf/1603.09382.pdf, the whole layer is either kept or dropped for the entire batch with probability p_l, so why is it applied per sample here?
- What is the div(keep_prob) used for? I can't see it in the paper's equations either; can you please clarify the reason behind it?
@IsmaelElsharkawi This sort of question is more appropriate as a discussion. Stochastic depth here is per sample, not per batch; I believe the paper says 'independently per sample' somewhere.
The rescale follows the (somewhat convoluted) eq (5) and its explanation: because only a fraction of samples' activations participate in the output, the survivors need to be rescaled by 1 / keep_prob to keep the expected activation unchanged.
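For reference, a minimal sketch of per-sample drop path with the keep_prob rescale (written with numpy for self-containedness; the function name and rng argument are illustrative, not the project's actual API):

```python
import numpy as np

def drop_path(x, drop_prob=0.0, training=True, rng=None):
    """Per-sample stochastic depth (illustrative sketch).

    Each sample in the batch is independently zeroed with probability
    drop_prob; surviving samples are divided by keep_prob so the
    expected value of the output matches the input.
    """
    if drop_prob == 0.0 or not training:
        return x
    rng = rng if rng is not None else np.random.default_rng()
    keep_prob = 1.0 - drop_prob
    # One Bernoulli draw per sample, broadcast over all remaining dims.
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    mask = rng.random(shape) < keep_prob
    # Rescale survivors by 1 / keep_prob (the div(keep_prob) in question).
    return x * mask / keep_prob
```

With drop_prob = 0.5, each sample is either zeroed out entirely or scaled by 2, so on average the activations are unchanged and no rescaling is needed at inference time.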
Thanks a lot for your explanation, and sorry for that, I'll continue this in a discussion thread.