ljvmiranda921 / comments.ljvmiranda921.github.io

Blog comments for my personal blog: ljvmiranda921.github.io


Implementing a two-layer neural network from scratch

ljvmiranda921 opened this issue

Comment written by Vinod Kumar on 07/20/2017 07:16:51

Thanks for the posts. Your posts have been very helpful.

Could you please help me clear up these confusions?

1. In these expressions:

grads['W2'] = np.dot(a1.T, dscores)

grads['W1'] = np.dot(X.T, dhidden)

Why is there a transpose (a1.T and X.T)? Is it to make the shapes suitable for the dot product?

2. Why are we dividing dscores by the number of examples?
dscores /= N

Comment written by Lj Miranda on 07/20/2017 16:28:35

Hi Vinod! Glad my posts helped!

As for your questions:

1. Yes, you are correct. We want our matrices to share the same "inner size" so that the dot product doesn't break. Try checking the shapes of the matrices in your implementation and make sure that $(a, b) \cdot (b, c) = (a, c)$ holds (see the sketch after my second answer below).

2. Recall that we are computing the loss with the following equation:
$$
L = \frac{1}{N} \sum_{i} L_{i} + \text{regularization term}
$$

The derivative of the softmax loss for a single example doesn't include the 1/N factor yet, which is why we divide dscores by N after computing it.
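Here's a minimal sketch tying both answers together (the array sizes and labels are made up; the variable names follow the code in the post):

import numpy as np

N, D, H, C = 5, 4, 10, 3                   # examples, input dim, hidden dim, classes (toy sizes)
X = np.random.randn(N, D)
W2 = np.random.randn(H, C)
a1 = np.maximum(0, np.random.randn(N, H))  # hidden-layer activations
y = np.array([0, 2, 1, 1, 0])              # made-up labels

scores = np.dot(a1, W2)                    # (N, C)
probs = np.exp(scores) / np.sum(np.exp(scores), axis=1, keepdims=True)

dscores = probs.copy()
dscores[range(N), y] -= 1                  # dL_i/df = p - 1 at the correct class
dscores /= N                               # the 1/N from averaging the loss over N examples

# The transposes line up the inner sizes: (H, N) . (N, C) -> (H, C), the shape of W2
gW2 = np.dot(a1.T, dscores)                # (10, 3)
dhidden = np.dot(dscores, W2.T)            # (N, H), gradient flowing back to the hidden layer
gW1 = np.dot(X.T, dhidden)                 # (D, N) . (N, H) -> (4, 10), the shape of W1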

Comment written by Vinod Kumar on 07/20/2017 20:55:02

Oh right! I was only looking at the derivative of the loss for a single example (L_i) instead of the total loss.

Thanks for the clarification :)

Comment written by chris on 07/22/2017 13:58:43

I might just jump in here since my question is basically one step before Vinod's questions:

Why are we doing this:

dscores[range(N),y] -= 1

instead of:

dscores[range(N),y] = -1

Why are we subtracting one from the correct class score instead of setting it to -1?
From the CS231n lecture slides I assumed we were basically doing df/df at the very beginning of backprop, so in this particular case it would be -(dL_i/dL_i), wouldn't it?

Comment written by Lj Miranda on 07/22/2017 14:54:43

Hi Chris,

When we're doing backprop, one of the first steps is to compute the gradient of the loss L_i with respect to the outputs f_k. Intuitively, we're "measuring" the change in the loss with respect to the output of the network, that is, dL_i / df_k. It turns out that the answer is simply

(dL_i / df_k) = p_k - 1(y_i = k), where p_k is the softmax output (the probabilities stored in dscores in our code) and the indicator 1(y_i = k) is 1 only for the correct class. So the gradient is p_k everywhere, minus 1 only at the correct class, which is exactly why we subtract 1 instead of assigning -1.

So dscores[range(N), y] -= 1 is computing exactly (dL_i / df_k).

This can actually be seen in the lecture notes (http://cs231n.github.io/neu...).

For the actual derivation, they set the softmax output as p (with p_k = exp(f_k) / sum_j exp(f_j)) so that L_i = -log(p_k) for the correct class k. Then, via the chain rule:

dL_i / df_k = (dL_i / dp_k) * (dp_k / df_k)

The first term is just the derivative of the log() function. The second term is the derivative of the softmax function itself. The math for the second term is a bit involved (and I don't know how to typeset LaTeX in these Disqus comments, so it may look messy), so I'll refer you to this link instead: http://eli.thegreenplace.ne...
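If a numerical check is easier to follow than the algebra, here is a minimal sketch (all values made up) comparing the analytic gradient p_k - 1(y = k) against a centered finite-difference estimate:

import numpy as np

f = np.array([1.0, 2.0, 0.5])            # made-up scores for a single example
y = 1                                    # index of the correct class

def loss(f):
    p = np.exp(f) / np.sum(np.exp(f))    # softmax probabilities
    return -np.log(p[y])                 # L_i = -log(p_y)

p = np.exp(f) / np.sum(np.exp(f))
analytic = p.copy()
analytic[y] -= 1                         # dL_i/df_k = p_k - 1(k == y)

h = 1e-5
numeric = np.array([(loss(f + h * np.eye(3)[k]) - loss(f - h * np.eye(3)[k])) / (2 * h)
                    for k in range(3)])
print(np.allclose(analytic, numeric))    # True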

Hope this explanation helped!

Comment written by chris on 07/23/2017 21:25:51

Hi!

Yes, thanks a lot for the detailed explanation. I'm trying to map the calculations to the circuits visualized on the slides, but I'm still struggling with which gates are needed and which can be discarded or folded together with others.

Sorry to bother you, but could you briefly explain this step?

dhidden = np.dot(dscores, W2.T)

Why are we using W2? What is the formula behind it?

Thanks a lot again!!

Comment written by Matan Levy on 08/25/2017 10:29:33

Lj Miranda, thanks for the detailed explanations!

I have two small questions though:
1) When computing the regularization loss, should we also consider the bias vectors?
2) For some reason I don't get the same loss as in the instructions (both with my code and when testing yours). Do you have any idea what I am doing wrong?

Comment written by Lj Miranda on 08/25/2017 11:07:26

Hi Matan!

1. For regularization, remember that we are using L2 regularization; you can find the equation in the CS231n course notes (http://cs231n.github.io/neu...).

The intuition behind regularization is that we want our weights/parameters to be as "sparse" or as "simple" as possible. This means we want to reduce complexity in our model and prevent it from overfitting. So if we have a dense weight matrix, we get a higher regularization loss.

Suppose we have a model that looks like this:

y = X0 + (W1 * X1) + (W2 * X2 ^ 2) + (W3 * X3 ^ 3) + (W4 * X4 ^ 4)

If we have a dense weight vector, that is, W = [1.0, 0.8, 0.74, 1.0], then most of the Xs are "turned on." The problem is that such a model has a large tendency to overfit, and we don't want that. Now, if we optimize with a regularization loss that constrains our weights to be close to 0, then we might end up with a weight vector that looks like

W = [0.5, 0.001, 0.8, 0.00002]

The effects of W2 and W4 are lessened, and hopefully we arrive at a simpler model. The good thing about simpler models is that they tend to generalize better than overfit ones, and if they generalize better, they score higher on data they have not seen.

So regularization aims to increase the bias of our model by penalizing the weights. The bias vectors have nothing to do with it, and we don't include them in the computation.
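A minimal sketch of that loss term (the matrix sizes are toy values; the 0.5 factor is one common convention, discussed further down the thread):

import numpy as np

W1 = np.random.randn(4, 10)    # toy weight matrices
W2 = np.random.randn(10, 3)
reg = 0.1                      # made-up regularization strength

# L2 penalty over the weight matrices only; the biases b1 and b2 are left out
reg_loss = 0.5 * reg * (np.sum(W1 * W1) + np.sum(W2 * W2))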

2. As for #2, what are you getting? Hmmm, interesting. I will check on that again.

Comment written by Lj Miranda on 08/25/2017 11:09:08

Correction: what actually happens when we regularize is that we try to avoid peaks in our weight vectors and aim for a more diffuse weight vector.

Comment written by Matan Levy on 08/25/2017 14:11:19

Thank you!

1) I now understand better why we use the regularization loss.
I thought we should also include the bias, since in both the SVM and Softmax classifiers we applied the regularization loss to the bias as well (there, the bias came in the form of an extra feature appended to the input).

2) I am not sure why, but I just had to omit the 0.5 factor from the regularization loss.

Comment written by earnshae on 12/12/2018 03:11:17

In order to get the same answers you have above, I had to use the following regularization expressions for W2 and W1:

grads['W2'] += 2 * reg * W2
grads['W1'] += 2 * reg * W1

I think the code you have posted above may assume the 0.5 per-layer regularization factor in the loss, which must be dropped (yielding the factor of 2 in the gradient) to get the correct answer.

This correlates with this student's work: https://github.com/rahul199...
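In other words, the gradient expression has to match whichever convention the loss uses. A minimal sketch of the two conventions (toy values):

import numpy as np

W = np.random.randn(10, 3)
reg = 0.1

# Convention A: the loss carries a 0.5 factor, which cancels the 2 from the square
loss_a = 0.5 * reg * np.sum(W * W)
grad_a = reg * W

# Convention B: no 0.5 factor in the loss, so the gradient keeps the factor of 2
loss_b = reg * np.sum(W * W)
grad_b = 2 * reg * W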