donalee / DeepMCDD

Multi-class Data Description for Out-of-distribution Detection

Formula and loss function

yanchenyuxia opened this issue · comments

Hello, sir. I don't quite understand the code from your paper. I understand that the output of the model is D(x), but I don't understand this line:

scores = torch.sum((out - self.centers)**2, dim=2) / 2 / torch.exp(2 * F.relu(self.logsigmas)) + self.latent_size * F.relu(self.logsigmas)

I can understand the meaning of Equation (3), but I don't understand Equation (5) in the paper, and I can't see where it is reflected in the code. I also don't know what pull_loss means.
Please help me out. Thank you, sir!
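For reference, the quoted line can be reproduced in a standalone sketch. Everything here is hypothetical (the shapes, the random tensors, and the variable names out, centers, logsigmas stand in for the repo's module attributes); it only illustrates how the broadcasting computes one distance per (sample, class) pair.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: a batch of 4 representations f(x), 3 classes,
# latent size 8. `out` has a singleton class dimension so it broadcasts
# against the per-class center vectors.
batch, num_classes, latent_size = 4, 3, 8
out = torch.randn(batch, 1, latent_size)         # f(x)
centers = torch.randn(num_classes, latent_size)  # mu_k, one row per class
logsigmas = torch.randn(num_classes)             # log sigma_k, one per class

# D_k(x) = ||f(x) - mu_k||^2 / (2 sigma_k^2) + d * log sigma_k,
# with F.relu clamping log sigma_k to be non-negative (so sigma_k >= 1).
scores = (torch.sum((out - centers) ** 2, dim=2)
          / 2 / torch.exp(2 * F.relu(logsigmas))
          + latent_size * F.relu(logsigmas))
print(scores.shape)  # torch.Size([4, 3]): one distance per (sample, class)
```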

  • As you mentioned, the output of our network (i.e., scores) is D(x)=[D_1(x), ..., D_K(x)]. This distance function is defined in Equation (4), and you will find that the implementation of the score is equivalent to the definition.
  • Our final objective in Equation (5) consists of two parts:
  1. D_y(x), implemented as the pull loss; this term pulls each representation f(x) toward its class center vector.
  2. The log posterior, implemented as the push loss; this term pushes each representation away from the other center vectors to make the classes separable for classification.

For more details, please refer to Section 3.2. Thank you!
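The two parts of the objective can be sketched together on toy tensors. This is not the repo's code: the shapes, the random dists, and the weight 0.1 (standing in for args.reg_lambda) are all made up, but the names pull_loss and push_loss mirror the ones discussed in this thread.

```python
import torch
import torch.nn as nn

# Hypothetical toy setup: 4 samples, 3 classes; dists holds D_k(x)
# for every class k of every sample.
torch.manual_seed(0)
batch, num_classes = 4, 3
dists = torch.rand(batch, num_classes)
labels = torch.tensor([0, 2, 1, 0])

# Pull loss: D_y(x), the distance from each representation to its OWN
# class center, selected with a one-hot label mask.
label_mask = torch.nn.functional.one_hot(labels, num_classes).float()
pull_loss = torch.mean(torch.sum(label_mask * dists, dim=1))

# Push loss: negative log posterior over classes. Using -dists as logits,
# cross-entropy pushes each representation away from the other centers.
push_loss = nn.CrossEntropyLoss()(-dists, labels)

# v-weighted sum as in Equation (5); 0.1 is a placeholder for args.reg_lambda.
loss = 0.1 * pull_loss + push_loss
```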

Thank you very much! I think I understand. My understanding is: the push loss is the cross-entropy in the code, which corresponds to the latter term inside the brackets of Equation (5) in the paper. I can only interpret the 1/v in the formula as the y_i weight in front of the log in the cross-entropy. Also, Equation (5) in the paper has a minus sign, but in the code it becomes a plus sign. Why?
loss = args.reg_lambda * pull_loss + push_loss
As you mentioned in Section 3.2, I can understand the posterior probability. I also see that you use the KL divergence to keep the class-conditional distribution close to the Gaussian distribution. But I do not understand this sentence: "Finally, we complete our objective by combining this KL term with the posterior term using the ν-weighted sum in order to control the effect of the regularization".
One last question: the KL divergence simplifies to D_y(x) inside the last parentheses. Where does that show up in the code? Is it scores?
My sincere thanks to you again!

My understanding is: the push loss is the cross-entropy in the code, which corresponds to the latter term inside the brackets of Equation (5) in the paper.

Yes, that's right.

I can only interpret the 1/v in the formula as the y_i weight in front of the log in the cross-entropy. In the paper, Equation (5) has a minus sign, but in the code it becomes a plus sign. Why?

In our code, loss = args.reg_lambda * pull_loss + push_loss, we multiply v (reg_lambda) by the KL term (pull_loss), rather than multiplying 1/v by the cross-entropy term (push_loss). This implementation seemed more intuitive, and we empirically found that it facilitates the optimization. In addition, we do not need an explicit minus sign, because torch.nn.CrossEntropyLoss already returns the "negative" log-softmax value for the target class (i.e., y_i), by the definition of the cross-entropy.
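The sign convention is easy to check in isolation. The toy logits below are made up; the point is only that PyTorch's cross-entropy already contains the minus sign:

```python
import torch
import torch.nn.functional as F

# cross_entropy returns the NEGATIVE log-softmax of the target class,
# so no explicit minus sign is needed in front of push_loss.
logits = torch.tensor([[2.0, 0.5, -1.0]])
target = torch.tensor([0])

ce = F.cross_entropy(logits, target)
neg_log_softmax = -F.log_softmax(logits, dim=1)[0, 0]
print(torch.allclose(ce, neg_log_softmax))  # True
```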

But I do not understand this sentence: "Finally, we complete our objective by combining this KL term with the posterior term using the ν-weighted sum in order to control the effect of the regularization". One last question: the KL divergence simplifies to D_y(x) inside the last parentheses. Where does that show up in the code? Is it scores?

The sentence means that our final objective, Equation (5) in Section 3.1, is the v-weighted sum of the KL term and the negative log posterior probability (Section 3.2). As you pointed out, the KL term simplifies to D_y(x) inside the last parentheses. This part is implemented in pull_loss, which is computed by torch.mul(label_mask, dists): since we only need to minimize the distance between a representation f(x) and its corresponding class center vector mu_y, we calculate pull_loss by imposing the label_mask on the dists tensor.
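On toy tensors (the values and labels below are invented purely for illustration), the masking step is equivalent to gathering D_y(x) directly:

```python
import torch

# Toy per-class distances D_k(x) for 3 samples and 4 classes.
dists = torch.tensor([[1.0, 2.0, 3.0, 4.0],
                      [5.0, 6.0, 7.0, 8.0],
                      [9.0, 1.5, 2.5, 3.5]])
labels = torch.tensor([2, 0, 3])

# The one-hot mask zeroes out every class but y, so multiplying and
# summing over the class dimension selects exactly D_y(x).
label_mask = torch.nn.functional.one_hot(labels, num_classes=4).float()
masked = torch.mul(label_mask, dists).sum(dim=1)

# Equivalent to gathering the target-class distance directly.
gathered = dists.gather(1, labels.unsqueeze(1)).squeeze(1)
print(masked)  # tensor([3.0000, 5.0000, 3.5000])
```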

I hope my answer will be helpful. Thanks!

Thank you very much, sir.
But while writing up a summary of the paper, I ran into another thing I don't quite understand, or maybe it is just the wording in the paper. The KL divergence ensures that the gap between the class-conditional distribution and the Gaussian distribution is minimal. The paper defines the distribution of each class as a Gaussian, and treats the probability of selecting each class as a Bernoulli distribution; D_k relates to the class-conditional probability.
In "we minimize the Kullback-Leibler (KL) divergence between the k-th empirical class-conditional distribution and the Gaussian distribution", what is the "k-th empirical class-conditional distribution"? How does it differ from the class-conditional distribution above? These two concepts make me a little dizzy.
I'm sorry to bother you so much! Thank you, sir!

The goal of DeepMCDD is to optimize the parameters so that the actual data representations f(x) from each class follow an isotropic Gaussian distribution. To this end, each Gaussian distribution (for class k) is modeled by its class mean (mu_k) and standard deviation (sigma_k), denoted by N(mu_k, sigma_k^2 I). However, the empirical distribution for class k (P_k), induced by the actual data representations from class k, may not match the k-th Gaussian distribution that we assume, N(mu_k, sigma_k^2 I). For this reason, we define the empirical distribution as the average of the Dirac delta functions over all representations of a target class, and then minimize the KL divergence KL(P_k || N(mu_k, sigma_k^2 I)). Thank you.
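A small numerical sanity check of the simplification discussed above (a toy sketch, not the repo's code; the representations, mean, and sigma below are all made up): the data-dependent part of KL(P_k || N(mu_k, sigma_k^2 I)) is the average negative log-density of the class-k representations under the Gaussian, and that quantity equals the average D_y(x) plus a class-independent constant, which is why the KL term appears in the code only as pull_loss.

```python
import math
import torch

# Toy class-k representations f(x) in a d-dimensional latent space.
torch.manual_seed(0)
d = 8
reps = torch.randn(100, d)
mu = reps.mean(dim=0)   # stand-in for mu_k
sigma = 1.5             # stand-in for sigma_k

# Average negative log-density under N(mu, sigma^2 I).
neg_logp = (torch.sum((reps - mu) ** 2, dim=1) / (2 * sigma ** 2)
            + d * math.log(sigma)
            + 0.5 * d * math.log(2 * math.pi)).mean()

# Average D_y(x), as the distance is defined in Equation (4).
dist_term = (torch.sum((reps - mu) ** 2, dim=1) / (2 * sigma ** 2)
             + d * math.log(sigma)).mean()

# The difference is a constant that does not depend on the parameters,
# so minimizing the KL term reduces to minimizing the average D_y(x).
const = 0.5 * d * math.log(2 * math.pi)
print(torch.allclose(neg_logp, dist_term + const))  # True
```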

Thank you so much. I have fully understood this paper and have no more doubts! And I have to say, this paper is very well written!

My pleasure :-)