yxgeee / MMT

[ICLR-2020] Mutual Mean-Teaching: Pseudo Label Refinery for Unsupervised Domain Adaptation on Person Re-identification.

Home Page: https://yxgeee.github.io/projects/mmt

Mismatch between the loss function in the paper and the corresponding code

yjh576 opened this issue

There are several lines of code in the function SoftTripletLoss:
triple_dist = torch.stack((dist_ap, dist_an), dim=1)
triple_dist = F.log_softmax(triple_dist, dim=1)
mat_dist_ref = euclidean_dist(emb2, emb2)
dist_ap_ref = torch.gather(mat_dist_ref, 1, ap_idx.view(N,1).expand(N,N))[:,0]
dist_an_ref = torch.gather(mat_dist_ref, 1, an_idx.view(N,1).expand(N,N))[:,0]
triple_dist_ref = torch.stack((dist_ap_ref, dist_an_ref), dim=1)
triple_dist_ref = F.softmax(triple_dist_ref, dim=1).detach()
loss = (- triple_dist_ref * triple_dist).mean(0).sum()
return loss
I think it should be:
triple_dist = torch.stack((dist_ap, dist_an), dim=1)
triple_dist = F.log_softmax(triple_dist, dim=1)
mat_dist_ref = euclidean_dist(emb2, emb2)
dist_ap_ref = torch.gather(mat_dist_ref, 1, ap_idx.view(N,1).expand(N,N))[:,0]
dist_an_ref = torch.gather(mat_dist_ref, 1, an_idx.view(N,1).expand(N,N))[:,0]
triple_dist_ref = torch.stack((dist_ap_ref, dist_an_ref), dim=1)
triple_dist_ref = F.softmax(triple_dist_ref, dim=1).detach()
# loss = (- triple_dist_ref * triple_dist).mean(0).sum()
loss = (- triple_dist_ref[:,0] * triple_dist[:,0]).mean()
return loss

Your code is: -log{exp(F(x_i)F(x_i,p)) / [exp(F(x_i)F(x_i,p)) + exp(F(x_i)F(x_i,n))]} - log{exp(F(x_i)F(x_i,n)) / [exp(F(x_i)F(x_i,p)) + exp(F(x_i)F(x_i,n))]}, which is not consistent with the loss in your paper.
My modified code is: -log{exp(F(x_i)F(x_i,p)) / [exp(F(x_i)F(x_i,p)) + exp(F(x_i)F(x_i,n))]}, which is consistent with your paper. However, the performance of my modified code is worse than your original code.
I can't understand this.
I'm looking forward to your reply!

Hi,

The code you mentioned is exactly consistent with the loss function Eq. (8) in our paper (https://openreview.net/pdf?id=rJlnOhVYPS). I guess you have mistaken Eq. (7) for our loss function, but in fact Eq. (7) only serves as a component of Eq. (8). Please check it again.

Our loss function is a binary cross-entropy loss with soft labels.
For example, the conventional binary cross-entropy loss with hard labels is -qlogp-(1-q)log(1-p), where q is either 0 or 1 and p is within [0,1]. In our loss function, q and p are both within [0,1].

Your modification could be seen as -qlogp, losing half of the regularization.
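
To make this concrete, here is a minimal numeric sketch (the distances are made-up toy values, not taken from the repository) checking that loss = (- triple_dist_ref * triple_dist).mean(0).sum() is exactly the batch-averaged soft binary cross-entropy -qlogp-(1-q)log(1-p):

import torch
import torch.nn.functional as F

# Hypothetical anchor-positive / anchor-negative distances for a batch of two.
dist_ap = torch.tensor([0.8, 1.2])      # current model
dist_an = torch.tensor([1.5, 1.0])
dist_ap_ref = torch.tensor([0.7, 1.3])  # mean-teacher model (soft-label source)
dist_an_ref = torch.tensor([1.6, 0.9])

# Columns: [log p, log(1-p)], where p is the softmax score of the positive column.
triple_dist = F.log_softmax(torch.stack((dist_ap, dist_an), dim=1), dim=1)
# Soft labels [q, 1-q] from the mean-teacher, detached as in the original code.
triple_dist_ref = F.softmax(torch.stack((dist_ap_ref, dist_an_ref), dim=1), dim=1).detach()

loss_code = (- triple_dist_ref * triple_dist).mean(0).sum()

# The same quantity written as a soft BCE with q, p taken from the positive column.
q = triple_dist_ref[:, 0]
p = triple_dist[:, 0].exp()
loss_bce = (- q * p.log() - (1 - q) * (1 - p).log()).mean()

print(torch.allclose(loss_code, loss_bce))  # True

The modification above keeps only the -q*log(p) term and drops -(1-q)*log(1-p), which is the half of the regularization mentioned here.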

Thank you for your quick reply.
Sorry, I made a mistake. You are right that the conventional binary cross-entropy loss with hard labels is -qlogp-(1-q)log(1-p). I understand it now. Thank you.

Your work is good. I also have a question.
I ran your code and found that the following code is very important for performance:
model_1.module.classifier.weight.data[:args.num_clusters].copy_(F.normalize(cluster_centers, dim=1).float().cuda())
model_2.module.classifier.weight.data[:args.num_clusters].copy_(F.normalize(cluster_centers, dim=1).float().cuda())
model_1_ema.module.classifier.weight.data[:args.num_clusters].copy_(F.normalize(cluster_centers, dim=1).float().cuda())
model_2_ema.module.classifier.weight.data[:args.num_clusters].copy_(F.normalize(cluster_centers, dim=1).float().cuda())

Why? What does the above code achieve? Removing it leads to worse performance than keeping it.

The loss function Eq. (7) is -qlogp-(1-q)log(1-p), where q is either 0 or 1 and p is within [0,1]. In fact, it reduces to -logp because q = 1. But I find the code is:
self.criterion_tri = SoftTripletLoss(margin=0.0).cuda()
triple_dist = torch.stack((dist_ap, dist_an), dim=1)
triple_dist = F.log_softmax(triple_dist, dim=1)
loss = (- self.margin * triple_dist[:,0] - (1 - self.margin) * triple_dist[:,1]).mean()
I think the above code is -log(1-p), which is not -logp. I can't understand it.
I'm looking forward to your reply!

Regarding the classifier weight initialization with the cluster centers: please refer to #16
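
For context, here is a minimal sketch of what the quoted initialization does (the sizes, cluster_centers, and the bare nn.Linear are hypothetical stand-ins, not the repository's actual models or training loop): the rows of the classifier weight for the current pseudo-label classes are overwritten with the L2-normalized cluster centroids, so the classifier's initial predictions are already aligned with the pseudo labels produced by clustering.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes: 5 pseudo-label clusters, 2048-d features.
num_clusters, feat_dim = 5, 2048
classifier = nn.Linear(feat_dim, num_clusters, bias=False)

# Stand-in for the centroids produced by clustering the target-domain features.
cluster_centers = torch.randn(num_clusters, feat_dim)

# Same pattern as the quoted lines: copy the L2-normalized centroids into the
# first num_clusters rows of the classifier weight.
classifier.weight.data[:num_clusters].copy_(F.normalize(cluster_centers, dim=1).float())

# A normalized feature now gets its largest logit from the centroid it is most
# aligned with, so the classifier starts out consistent with the cluster assignments
# instead of starting from arbitrary weights.
feature = F.normalize(torch.randn(1, feat_dim), dim=1)
print(classifier(feature).argmax(dim=1))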

When margin=0.0, loss = (- self.margin * triple_dist[:,0] - (1 - self.margin) * triple_dist[:,1]).mean() can be thought of as loss = (- triple_dist[:,1]).mean(). The value of triple_dist[:,1] is exactly the same as Eq. (7) in the paper. Please check.

loss = (- triple_dist[:,1]).mean() means the Euclidean distance between anchor and negative should be larger than the Euclidean distance between anchor and positive.

-qlogp-(1-q)log(1-p) is only a simplified formulation of the BCE loss. If you want to align this function with our Eq. (6), you should use q = 1 - self.margin and take p as the softmax score of the negative pair, so that log p = triple_dist[:,1] and log(1-p) = triple_dist[:,0].
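
A small numeric check of this mapping, using made-up distances (not values from the repository):

import torch
import torch.nn.functional as F

margin = 0.0  # as in SoftTripletLoss(margin=0.0)

# Hypothetical anchor-positive / anchor-negative Euclidean distances.
dist_ap = torch.tensor([0.8, 1.2])
dist_an = torch.tensor([1.5, 1.0])

triple_dist = F.log_softmax(torch.stack((dist_ap, dist_an), dim=1), dim=1)

# Hard-version loss as written in the code.
loss = (- margin * triple_dist[:, 0] - (1 - margin) * triple_dist[:, 1]).mean()

# With margin = 0 it reduces to (- triple_dist[:, 1]).mean(), i.e. it pushes
# exp(d_an) / (exp(d_ap) + exp(d_an)) towards 1, which requires d_an > d_ap.
print(torch.allclose(loss, (- triple_dist[:, 1]).mean()))  # True

# BCE view: q = 1 - margin, log p = triple_dist[:, 1], log(1-p) = triple_dist[:, 0].
q = 1 - margin
p = triple_dist[:, 1].exp()
loss_bce = (- q * p.log() - (1 - q) * (1 - p).log()).mean()
print(torch.allclose(loss, loss_bce))  # True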

I think that loss = (- triple_dist[:,1]).mean() is about the similarity between anchor and negative, and that this similarity should be smaller than the similarity between anchor and positive.

I think that aligning this function with your Eq. (6) should give loss = (- triple_dist[:,0]).mean(). We hope that the similarity between anchor and positive becomes larger. This is because hard_p means a large similarity between anchor and positive, according to the code. I have some confusion about it.
Please refer to this code:
sorted_mat_distance, positive_indices = torch.sort(mat_distance + (-9999999.) * (1 - mat_similarity), dim=1, descending=True)
hard_p = sorted_mat_distance[:, 0]
hard_p_indice = positive_indices[:, 0]
sorted_mat_distance, negative_indices = torch.sort(mat_distance + (9999999.) * (mat_similarity), dim=1, descending=False)
hard_n = sorted_mat_distance[:, 0]
hard_n_indice = negative_indices[:, 0]

Please note that we use Euclidean distance instead of cosine similarity in our code to measure the feature similarity (https://github.com/yxgeee/MMT/blob/master/mmt/loss/triplet.py#L78).
The Euclidean distance between anchor and negative should be larger than the Euclidean distance between anchor and positive.

Also in our paper, in Equation (7), we use the root of Euclidean distance, which is also called L2-norm distance.

Larger Euclidean distance indicates smaller similarity, and vice versa.
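
To make the distance convention concrete, here is a small self-contained run of the same mining logic with a made-up distance matrix (labels and values are hypothetical): hard_p comes out as the largest Euclidean distance among each anchor's positives (its least similar positive), and hard_n as the smallest distance among its negatives (its most similar negative).

import torch

# Toy batch of 4 samples: two identities, two samples each (hypothetical).
labels = torch.tensor([0, 0, 1, 1])
mat_similarity = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()  # 1 = same identity

# Made-up symmetric Euclidean distance matrix.
mat_distance = torch.tensor([
    [0.0, 0.9, 1.4, 2.0],
    [0.9, 0.0, 1.1, 1.8],
    [1.4, 1.1, 0.0, 0.7],
    [2.0, 1.8, 0.7, 0.0],
])

# Same mining logic as the quoted snippet.
sorted_pos, positive_indices = torch.sort(
    mat_distance + (-9999999.) * (1 - mat_similarity), dim=1, descending=True)
hard_p = sorted_pos[:, 0]   # farthest positive per anchor
sorted_neg, negative_indices = torch.sort(
    mat_distance + (9999999.) * mat_similarity, dim=1, descending=False)
hard_n = sorted_neg[:, 0]   # closest negative per anchor

print(hard_p)  # tensor([0.9000, 0.9000, 0.7000, 0.7000])
print(hard_n)  # tensor([1.4000, 1.1000, 1.1000, 1.8000])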

Thank you. I have some confusion about Equation (7) and Equation (2). I think they have the same function; that is why I thought so.

Sorry, I meant Equation (6) and Equation (2).

Yes, Eq. (2) and Eq. (6) have the same function. Eq. (6) is just a hard-version softmax-triplet loss, which is also supervised by a hard label 0/1. The CORE idea of our paper is Eq. (8), which is a soft-version softmax-triplet loss for supporting mean-teaching. We introduce Eq. (6), because the conventional hard-version triplet loss Eq. (2) does not have a soft-version variant to support mean-teaching.
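
As a compact summary of the hard/soft distinction discussed in this thread, a simplified sketch with toy distances (not the repository's implementation):

import torch
import torch.nn.functional as F

def softmax_triplet(dist_ap, dist_an):
    # Per-sample log softmax over the (positive, negative) distance pair.
    return F.log_softmax(torch.stack((dist_ap, dist_an), dim=1), dim=1)

# Toy distances from the current model and from the mean-teacher model.
log_probs = softmax_triplet(torch.tensor([0.8, 1.2]), torch.tensor([1.5, 1.0]))
log_probs_ref = softmax_triplet(torch.tensor([0.7, 1.3]), torch.tensor([1.6, 0.9]))

# Hard version (Eq. (6) style): the negative column is supervised by the hard label 1.
loss_hard = (- log_probs[:, 1]).mean()

# Soft version (Eq. (8) style): both columns are supervised by the mean-teacher's
# softmax scores, which is what enables mutual mean-teaching.
soft_labels = log_probs_ref.exp().detach()
loss_soft = (- soft_labels * log_probs).mean(0).sum()
print(loss_hard, loss_soft)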