ljvmiranda921 / comments.ljvmiranda921.github.io

Blog comments for my personal blog: ljvmiranda921.github.io

Understanding softmax and the negative log-likelihood

ljvmiranda921 opened this issue · comments

Comment written by Lj Miranda on 12/28/2017 15:46:28

Hi! Would you like to point it out?

Comment written by jus1802 on 12/28/2017 17:48:23

Yeah, I think it should be $\exp{f_{y_i}}$.

Comment written by Lj Miranda on 01/29/2018 23:57:10

Fixed!

Comment written by Lj Miranda on 03/04/2018 10:29:37

Hi, here I'm only taking the y_i=k case, i.e., whenever the output matches the required class.
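
For reference, writing out both cases of the gradient of $L_i$ with respect to the scores $f_k$ (with $p$ the softmax of the scores $f$, as in the post; the $y_i \neq k$ case comes up again in later comments):

$$ \dfrac{\partial L_i}{\partial f_k} = \begin{cases} p_k - 1 & \text{if } k = y_i \\ p_k & \text{if } k \neq y_i \end{cases} $$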

Comment written by Antonio Gutierrez on 05/07/2018 17:28:11

Shouldn't the derivative of this:

$$ L_i = -\log(p_{y_i}) $$


Be this:

$$ \dfrac{\partial L_i}{\partial p_k} = -\dfrac{1}{p_k \ln 10} $$

That is, isn't the derivative missing the $\ln 10$ in the denominator?

Edit: After reading the Wikipedia article on log-likelihood, I realized that the "log" referred to in the formulas is actually the natural logarithm, i.e. $\ln$, so the derivative in the article is indeed correct.

Any reason why the literature chooses to use "log" instead of "ln"?
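
Spelling out the two readings of "log" for the term in question:

$$ \dfrac{\partial}{\partial p_k}\left(-\ln p_k\right) = -\dfrac{1}{p_k}, \qquad \dfrac{\partial}{\partial p_k}\left(-\log_{10} p_k\right) = -\dfrac{1}{p_k \ln 10} $$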

Comment written by Lj Miranda on 05/25/2018 06:19:58

Hi Antonio,

Really sorry for the very late reply. I think I messed up my Disqus notification settings and this didn't show up. Yes, you are correct: we usually read an unadorned log as the natural logarithm right away.

> Any reason why the literature chooses to use "log" instead of "ln"?

Not a trivial question, to be honest. I think the common answer is that it's more convenient to work with natural logs (https://en.wikipedia.org/wi..., but I'm not sure if that answer satisfies you. Interesting question though, let me dig into it this weekend.

Comment written by Arpit Jain on 12/16/2018 03:43:21

Nice article. But I think the values of the Negative Log-Likelihood loss function are incorrect for the first (cat) and third (dog) images.

Comment written by Lj Miranda on 12/16/2018 06:44:02

What values are you getting?

Comment written by Arpit Jain on 12/16/2018 06:50:52

Hey! My bad. I made a calculation mistake: I took log instead of ln.

Comment written by kathylewisinmon on 12/25/2018 03:41:30

lj_miranda why

Comment written by bitjoy on 04/30/2019 14:39:05

Thanks for your good explanation, but I'm still confused about why log_softmax is used along with NLL in the PyTorch MNIST tutorial (https://github.com/pytorch/.... What's the advantage of log_softmax over softmax?
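
As a point of reference, a minimal sketch (made-up tensors, not from the MNIST tutorial) of the combination the question refers to: applying `log_softmax` and then the NLL loss gives the same value as `cross_entropy` on the raw scores.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(5, 10)             # made-up batch of scores for 10 classes
target = torch.randint(0, 10, (5,))     # made-up class labels

loss_a = F.nll_loss(F.log_softmax(logits, dim=1), target)
loss_b = F.cross_entropy(logits, target)   # does log_softmax + nll_loss internally
print(torch.allclose(loss_a, loss_b))      # True
```

The usual argument for working in log space is numerical stability: `log_softmax` can use the log-sum-exp trick instead of exponentiating large scores and then taking the log.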

Comment written by 강민성 on 08/24/2019 10:38:54

Thank you for the great article. I easily understood the concept thanks to you!

Comment written by athrun200 on 12/12/2019 07:22:52

May I know, after differentiating the negative log-likelihood with respect to the softmax layer, how we are going to use this result?

If we differentiate the loss function with respect to the weights, it tells us how to update the weights. Then what does differentiating the negative log-likelihood with respect to the softmax layer tell us?
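
One way to see where that result goes, as a minimal sketch (a hypothetical single linear layer with made-up shapes, not the post's code): the gradient with respect to the scores, $p - \text{onehot}(y)$, is exactly what the chain rule multiplies by to get the gradients of the weights and biases.

```python
import numpy as np

# Hypothetical setup: one input, a single linear layer, softmax + NLL on top.
rng = np.random.default_rng(0)
x = rng.normal(size=3)                  # one input with 3 features
W = rng.normal(size=(4, 3))             # 4 classes
b = np.zeros(4)
y = 2                                   # observed class index

f = W @ x + b                           # scores
p = np.exp(f - f.max()); p /= p.sum()   # softmax (shifted for stability)

dL_df = p.copy()
dL_df[y] -= 1.0                         # gradient of -log(p_y) w.r.t. the scores f
dL_dW = np.outer(dL_df, x)              # chain rule: gradient w.r.t. the weights
dL_db = dL_df                           # ...and w.r.t. the biases

W -= 0.1 * dL_dW                        # one gradient-descent step
b -= 0.1 * dL_db
```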

Comment written by Božidar Mitrović on 05/02/2020 17:05:06

What's the meaning of $f_k$? I still don't understand the difference between $p_k$ and $f_k$.

Comment written by Pablo Casas on 08/17/2020 04:01:03

Outstanding work. Thanks a lot. 🥇

Comment written by Enzo Ampil on 08/31/2020 07:29:50

Great blog post LJ! Was quickly reviewing NLL loss and your post was the first result from the search :)

Comment written by InsideAIML on 10/28/2020 06:17:23

Keep up the great work! I read a few blog posts on this site and I believe your website is really interesting and has loads of good info. Lovely blog! I really enjoyed reading this article. Keep it up!

Comment written by Gaurav on 10/29/2020 13:25:45

Great article! One doubt: shouldn't the sum of the softmax outputs be 1?
0.71 + 0.26 + 0.04 != 1

Comment written by Chassson on 10/31/2020 00:10:19

Awesome article! Thanks!

Comment written by Lj Miranda on 10/31/2020 04:43:41

Oops, yeah, you're right. I think I rounded some values, which is why they don't sum up correctly. Thanks for catching that!
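
A quick way to see the rounding effect (made-up scores, chosen so the rounded probabilities match the values quoted above; not the post's numbers):

```python
import numpy as np

scores = np.array([2.0, 0.985, -0.92])         # made-up scores
p = np.exp(scores) / np.exp(scores).sum()
print(p, p.sum())                              # exact probabilities sum to 1 (up to float error)
print(np.round(p, 2), np.round(p, 2).sum())    # [0.71 0.26 0.04] -> sums to 1.01 after rounding
```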

Thank you so much for your great explanations! I have been trying to understand cross-entropy; softmax was clearer to me, but NLL was less clear. Your nice diagrams and fun explanations (e.g., lower-probability softmax values generating potentially infinite unhappiness) made me laugh so much and, more importantly, helped me understand and remember it. Thank you so much for this article, brilliant!

Hi @pranath, thanks so much! I am glad the explanations helped you out. Hehe, being funny and informative is the goal! Stay safe and have a good day!

You mentioned in another comment, "I'm only taking the y_i=k case, i.e., whenever the output matches the required class." The derived $(p_k - 1)$ applies to those cases and is backpropagated. Is the derivative with respect to every other neuron (one that doesn't match the required class) equal to 0, since $\dfrac{\partial L_i}{\partial p_k} = 0$ when $y_i \neq k$?
Does that mean backpropagation only happens in the output-class neurons (for a given input) of the softmax layer, and all the other neurons don't send back any gradient updates?

@digital-carver, Hi, I haven't checked everything carefully yet, but I am responding in the hope that someone else can confirm or deny. I am very interested in your question. I believe the short answer is "yes". If we are trying to minimize the loss based on only a single training example S, and the loss is defined as the negative log-likelihood, then only one output node has an effect on the loss.

To see this, consider a concrete example. Say there are 10 classes. In the notation of the article above, the model's current parameters state that the likelihood of our training sample S being in class 1 is $p_1$. The likelihood of S being in class 2 is $p_2$. And so on up to $p_{10}$. Now suppose that S is actually in class 4. This means the likelihood of our observed outcome is $p_4$. To improve our model based only on this observation, we need to increase $p_4$ by changing our parameters. When we change the parameters, the other probabilities $p_i$ will change, but we don't care about that. If our sole objective is to maximize the likelihood of this one observation, we will just alter the parameters to increase $p_4$.

In practice, we prefer to minimize $-\log(p_4)$ rather than maximizing $p_4$, but that turns out to be for reasons of numerical stability and elegance of the formulas. Philosophically speaking, we are still trying to increase $p_4$ and we are happy to ignore the other $p_i$ values.

In a couple of other places I have seen this described via cross entropy. The true output distribution is one-hot encoded as $Y=(0,0,0,1,0,0,0,0,0,0)$. The cross entropy between $Y$ and $p$ is $\sum_{c=1}^{10} -Y_c \log(p_c)$, and this is what we want to minimize. But all elements of the sum are zero except for the one corresponding to the class of our training sample, so the thing we are trying to minimize simplifies to $-\log(p_4)$ as before.

You could also write down a more complete likelihood function for any $Y$ as follows: the likelihood is $\prod_{c=1}^{10} {p_c}^{Y_c}$. Taking the negative log likelihood of this likelihood function and substituting in the individual values for $Y_c$ again gives the same result.

Can anyone confirm if these formulations are correct? Thanks! And also many thanks to @ljvmiranda921 for the original article.
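
A quick numeric check of that cross-entropy formulation (made-up probabilities; class 4 is the observed class, as in the example above):

```python
import numpy as np

# Made-up model probabilities for the 10 classes (they sum to 1).
p = np.array([0.02, 0.05, 0.08, 0.40, 0.10, 0.05, 0.10, 0.08, 0.07, 0.05])
Y = np.zeros(10); Y[3] = 1.0           # one-hot target: sample S is in class 4

cross_entropy = -(Y * np.log(p)).sum()
print(cross_entropy, -np.log(p[3]))    # equal: only the observed class contributes
```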

I studied this more carefully. I believe the original article is correct but incomplete: it is important to also calculate the derivative with respect to the scores $f_i$ for the other classes, $i \neq k$, not just for the value $k$ that corresponds to the observed class. I made some very detailed softmax derivative notes to make sure I understood it. Final answer: the derivative for the observed class $k$ is $p_k - 1$; for the other values of $i$ the derivative is $p_i$.
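
A quick numerical check of that final answer (made-up scores; the derivatives are taken with respect to the scores $f_i$ that feed the softmax):

```python
import numpy as np

def nll(f, k):
    """Negative log-likelihood of class k under softmax(f)."""
    p = np.exp(f - f.max())
    p /= p.sum()
    return -np.log(p[k])

f = np.array([1.0, -0.5, 2.0, 0.3])    # made-up scores
k = 2                                   # observed class
p = np.exp(f - f.max()); p /= p.sum()

analytic = p.copy()
analytic[k] -= 1.0                      # p_k - 1 for the observed class, p_i elsewhere

eps = 1e-6
numeric = np.array([
    (nll(f + eps * np.eye(4)[i], k) - nll(f - eps * np.eye(4)[i], k)) / (2 * eps)
    for i in range(4)
])
print(np.allclose(analytic, numeric, atol=1e-5))   # True
```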