yumingj / Talk-to-Edit

Code for Talk-to-Edit (ICCV2021). Paper: Talk-to-Edit: Fine-Grained Facial Editing via Dialog.

Home Page: https://www.mmlab-ntu.com/project/talkedit/

Question on attribute predictor (classifiers' outputs and meaning of `attributes_5.json` dictionary)

chi0tzp opened this issue

Hi,

I'm trying to use your pre-trained classifier for the five CelebA attributes that you use (Bangs, Eyeglasses, No_Beard, Smiling, Young). I'm building the model that you provide (the modified ResNet) using attributes_5.json and I load the weights given in eval_predictor.pth.tar.

As far as I can tell, for each of the above five attributes you have a classification head. For instance, classifier32Smiling, which has a linear layer with 6 outputs at the top. This is determined by the sub-dictionary

"32": {
            "name": "Smiling",
            "value":[0, 1, 2, 3, 4, 5],
            "idx_scale": 1,
            "idx_bias": 0
        }

found in attributes_5.json. The rest of the classifiers are built similarly. My question is: why do you use these value lists (i.e., "value": [0, 1, 2, 3, 4, 5])? What do those classes mean?
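For reference, this is roughly how I am interpreting the file when building the heads (a minimal sketch of my reading, not the repo's actual builder; the feature dimension and module layout are placeholders):

```python
import json

import torch.nn as nn

# Placeholder sketch: one classification head per attribute, with one
# output per entry in the "value" list -- hence 6 outputs for "Smiling".
with open("attributes_5.json") as f:
    attributes = json.load(f)

feature_dim = 512  # placeholder backbone feature dimension
heads = nn.ModuleDict({
    attr["name"]: nn.Linear(feature_dim, len(attr["value"]))
    for attr in attributes.values()
})
# e.g. heads["Smiling"] is a Linear(512, 6) over the classes [0, 1, 2, 3, 4, 5]
```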

I'd like to use this model to predict a score for each of the five attributes for a batch of images. Do you think this is possible?

As a side note, the function you use for post-processing the predictions, i.e., output_to_label, gives NaNs in many cases. In my case this is due to large prediction values, which make exp(.) overflow to Inf, so the softmax becomes NaN. Just a suggestion: you could shift the maximum prediction to zero before calculating the softmax.
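For instance, a generic numerically stable softmax (just an illustration of the shift, not the repo's output_to_label) would look like this:

```python
import torch

def stable_softmax(logits: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Shift the maximum logit to zero before exponentiating so that
    # exp() cannot overflow to Inf; the softmax result is unchanged.
    shifted = logits - logits.max(dim=dim, keepdim=True).values
    exp = shifted.exp()
    return exp / exp.sum(dim=dim, keepdim=True)
```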

Thank you!

Hi, thanks for your interest in our work!

For each attribute, we use 6 classes to represent the degrees, i.e., [0, 1, 2, 3, 4, 5]. You can refer to the definitions of these classes in the supplementary files. The annotations are from our CelebA-Dialog dataset.

The prediction model can be used on a batch of images, but you need to modify the code slightly. Currently, when processing the predictor's output, we assume there is only one image in the batch.
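Batched post-processing could look roughly like this (a sketch only; it assumes the predictor returns a dict of per-attribute logits of shape (B, 6), which may differ from the exact interface in this repo):

```python
import torch

# Rough sketch of batched prediction (placeholder code, not the exact
# repo interface): softmax over the 6 degree classes of each attribute head.
@torch.no_grad()
def predict_batch(predictor, images):
    predictor.eval()
    outputs = predictor(images)                    # images: (B, 3, H, W)
    return {name: torch.softmax(logits, dim=1)     # (B, 6) class probabilities
            for name, logits in outputs.items()}
```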

As for the NaN issue, we did not observe this problem in our case.

Hi @yumingj, thanks for the quick response. I just found out what the classes mean. While I have you: could you confirm that you feed the network full facial images (i.e., as produced by the GAN for some latent code) rather than cropped ones? Also, are the transformations just the following?

transforms.Compose([transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])

This could be the (numerical) reason why I get weird numbers.

Thanks for your time in any case!

The facial images forwarded to the prediction network should first be normalized to [0, 1] (if the images are generated by StyleGAN, their range is [-1, 1]), and then normalized using the statistics you mentioned above.

You can also find the processing function here: https://github.com/yumingj/Talk-to-Edit/blob/9d3023a05a7d99978e042c78f46798e55ae82c09/models/utils.py#L46:5
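Something along these lines should reproduce that preprocessing for a single StyleGAN output (a minimal sketch based on the description above, not the repo's exact utility; the clamp is an extra safety guard, and the mean/std are the values quoted earlier):

```python
import torch
from torchvision import transforms

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

def preprocess_stylegan_output(img: torch.Tensor) -> torch.Tensor:
    # img: (3, H, W) tensor in [-1, 1] as produced by StyleGAN
    img = (img + 1.0) / 2.0    # map [-1, 1] -> [0, 1]
    img = img.clamp(0.0, 1.0)  # guard against slight overshoot
    return normalize(img)      # then apply the mean/std normalization
```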

Thanks!

@yumingj Excellent! Thank you very much!