VisionLearningGroup / SSDA_MME

Semi-supervised Domain Adaptation via Minimax Entropy

How does maximizing the entropy of unlabelled data w.r.t. the classifier work?

First, thanks for sharing your code!

Actually, I am a little confused about how maximizing the entropy of unlabelled data w.r.t. the classifier works, as described in the training objectives in Section 3.2.
First, you train F (the feature extractor) and C (the classifier) on labelled data by minimizing cross-entropy, and it is intuitive that the prototype of a class (say, class A) will then sit at the centre of that class's feature distribution.
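For concreteness, here is a minimal PyTorch-style sketch of that first step (names such as `CosineClassifier`, `F_net`, `x_l` are illustrative, not necessarily the repo's). With a normalized, temperature-scaled classifier of the kind the paper describes, each row of the classifier weight matrix behaves like a class prototype, and cross-entropy on labelled data pulls it toward the centre of that class's features:

```python
import torch
import torch.nn as nn
import torch.nn.functional as nnF

class CosineClassifier(nn.Module):
    """Classifier C on L2-normalized features; each weight row acts as a class prototype."""
    def __init__(self, feat_dim, num_classes, temperature=0.05):
        super().__init__()
        self.weight = nn.Parameter(0.01 * torch.randn(num_classes, feat_dim))
        self.temperature = temperature

    def forward(self, feat):
        feat = nnF.normalize(feat)                   # unit-length features
        proto = nnF.normalize(self.weight)           # unit-length prototypes
        return feat @ proto.t() / self.temperature   # scaled cosine-similarity logits

def supervised_step(F_net, C, x_l, y_l, optimizer):
    """Step 1: cross-entropy on a labelled mini-batch (x_l, y_l), updating both F and C."""
    loss = nnF.cross_entropy(C(F_net(x_l)), y_l)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```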

In the second step, you mention that maximizing the entropy of unlabelled data w.r.t. the classifier pushes all the prototypes (representative points) towards the feature distribution of the target domain.
That makes sense, but how can you ensure that the prototype of class A (initially at the centre of the source class A feature distribution) is pushed to the centre of the target class A feature distribution?
I ask because the figures in your paper show the class-specific prototype (initially at the centre of the source class A feature distribution) being pushed to the class-specific centre of the target-domain feature distribution.
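For reference, the entropy term in question, continuing the sketch above (names are again illustrative), is the average prediction entropy of the classifier on an unlabelled target batch; step two maximizes it w.r.t. C and minimizes it w.r.t. F:

```python
def entropy(F_net, C, x_u):
    """Average prediction entropy on an unlabelled target mini-batch x_u."""
    p = nnF.softmax(C(F_net(x_u)), dim=1)
    return -(p * torch.log(p + 1e-8)).sum(dim=1).mean()

# Step 2: maximize entropy(F_net, C, x_u) w.r.t. C's parameters,
#         minimize it w.r.t. F_net's parameters (the adversarial part).
```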

Or is class-specific updating not needed (or not achievable) at this stage, and the next step simply minimizes the entropy w.r.t. the feature extractor?

If so, do you think first minimizing the entropy w.r.t. the feature extractor and then maximizing it w.r.t. the classifier would be better? Or does the order not matter, since the minimax is implemented as an alternating update?

It is easy and intuitive to see that alternating minimax training can refine performance, but I am still confused about how the first entropy-maximization step works (pushing prototypes to class-specific target centres).

Thank you for your question.
We cannot ensure that the prototype of class A will be pushed to the centre of the target class A features, because most target examples are unlabeled. The figure is meant to illustrate the idea. And, as the performance of our method indicates, such ideal distribution alignment does not happen in many cases.
If we had many labeled examples, we could do class-specific updating, but we did not do that.
I think first maximizing the entropy w.r.t. the classifier and then minimizing it w.r.t. the feature extractor makes more sense. However, when implementing the method, we need to do an alternating update, so I think the order of the two operations does not matter much.
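To make the "alternating update" concrete, here is one way the minimax could be scheduled, reusing the `entropy` sketch above (a hypothetical sketch with made-up names, not necessarily how the repo implements it; in practice the sign flip can also be folded into a single backward pass with a gradient reversal layer, which is another reason the nominal order of the two operations matters little):

```python
def minimax_step(F_net, C, x_l, y_l, x_u, opt_F, opt_C, lam=0.1):
    """One alternating iteration: supervised step, then max-entropy on C, then min-entropy on F."""
    # (a) supervised cross-entropy on labelled data updates both F and C
    opt_F.zero_grad(); opt_C.zero_grad()
    nnF.cross_entropy(C(F_net(x_l)), y_l).backward()
    opt_F.step(); opt_C.step()

    # (b) maximize entropy w.r.t. C: descend on the negative entropy, stepping C only
    opt_C.zero_grad()
    (-lam * entropy(F_net, C, x_u)).backward()
    opt_C.step()

    # (c) minimize entropy w.r.t. F, stepping F only
    # (zero_grad clears the stale F gradients accumulated in step (b))
    opt_F.zero_grad()
    (lam * entropy(F_net, C, x_u)).backward()
    opt_F.step()
```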