bryandlee / animegan2-pytorch

PyTorch implementation of AnimeGANv2


how to train another model like "Face Portrait v1"

ruanjiyang opened this issue

Could you let us know how to train another model like "Face Portrait v1"?

As far as I know, Animegan2 is not designed for facial style transfer, so I would really like to know the detailed steps for training a facial model with Animegan2.

Thanks very much!

Those weights are trained using a pix2pix method similar to this: https://github.com/justinpinkney/toonify

input(face) → Animegan2 generator → output <==> target(portrait), loss = LPIPS + L2 + GAN
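Roughly, such an objective could be sketched in PyTorch as follows. The discriminator D, the loss weights, and the non-saturating GAN loss form are placeholders, and the LPIPS term uses the lpips pip package; the actual training code may differ.

import torch.nn.functional as F
import lpips

lpips_fn = lpips.LPIPS(net="vgg")  # perceptual distance

def generator_loss(G, D, face, portrait, w_lpips=1.0, w_l2=1.0, w_gan=1.0):
    # face: aligned input photo, portrait: blended-StyleGAN target, both in [-1, 1]
    out = G(face)                                  # AnimeGAN2 generator output
    loss_lpips = lpips_fn(out, portrait).mean()    # LPIPS term
    loss_l2 = F.mse_loss(out, portrait)            # L2 term
    loss_gan = F.softplus(-D(out)).mean()          # GAN term (non-saturating form assumed)
    return w_lpips * loss_lpips + w_l2 * loss_l2 + w_gan * loss_gan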


Your model for Face Portrait v1 is stunning. Can you give any more details about it?

Hi, what kind of details are you looking for?


I want to know how you generate your target portrait images. Do you use the network-blending method from https://github.com/justinpinkney/toonify? I adopted that network-blending method to reproduce a Disney cartoon model: the facial region transfers to the Disney style well, but the background also changes drastically. How do you keep the background as close to the input as in your Face Portrait v1 model?

Thanks!

I used the same network blending method, but the implementation may differ.
Below is my own implementation for the official stylegan2 model:

from training.networks import Generator
from copy import deepcopy
import math


def gather_params(G: Generator) -> dict:
    params = dict(
        [(res, {}) for res in G.synthesis.block_resolutions] + [("mapping", {})]
    )
    # G param names look like mapping.xxx or synthesis.b128.xxx
    for n, p in sorted(list(G.named_buffers()) + list(G.named_parameters())):
        if n.startswith("mapping"):
            params["mapping"][n] = p
        else:
            res = int(n.split(".")[1][1:])
            params[res][n] = p
    return params


def blend_models(G_low: Generator, G_high: Generator, swap_layer: int, blend_width: float = 3) -> Generator:
    params_low = gather_params(G_low)
    params_high = gather_params(G_high)

    for layer_idx, res in enumerate(G_low.synthesis.block_resolutions):
        x = layer_idx - swap_layer

        if blend_width is not None:
            assert blend_width > 0
            # smooth blend: sigmoid ramp centered at swap_layer
            exponent = -x / blend_width
            y = 1 / (1 + math.exp(exponent))
        else:
            # hard swap: take G_high params only for blocks past swap_layer
            y = 1 if x > 0 else 0

        # y is the blend weight of G_high at this block (0 -> G_low, 1 -> G_high)
        for n, p in params_high[res].items():
            params_high[res][n] = params_high[res][n] * y + params_low[res][n] * (1 - y)

    state_dict = {}
    for _, p in params_high.items():
        state_dict.update(p)

    G_mix = deepcopy(G_high)
    G_mix.load_state_dict(state_dict)
    return G_mix

Inputs and targets for the pix2pix training are generated as follows:

G_blend = blend_models(G_low, G_high, swap_layer=swap_layer, blend_width=blend_width)

input  = G_low.synthesis(w, noise_mode="const")
target = G_blend.synthesis(w, noise_mode="const")
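These snippets assume a batch of latents w is already available; with the official stylegan2-ada-pytorch API, it could be sampled along these lines (the batch size and truncation value are placeholders):

import torch

z = torch.randn([8, G_low.z_dim], device="cuda")   # random z latents
w = G_low.mapping(z, None, truncation_psi=0.7)     # (8, num_ws, w_dim), ready for synthesis()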

The strength of the stylization depends on swap_layer and blend_width, so you can use multiple blended models to generate multiple target images (for example, strongly stylized target for the facial area and weakly stylized target for the background) and fuse them using segmentation masks.

G_blend_face = blend_models(G_low, G_high, swap_layer=swap_layer_face, blend_width=blend_width_face)
G_blend_bg = blend_models(G_low, G_high, swap_layer=swap_layer_bg, blend_width=blend_width_bg)

input  = G_low.synthesis(w, noise_mode="const")

target_face = G_blend_face.synthesis(w, noise_mode="const")
target_bg = G_blend_bg.synthesis(w, noise_mode="const")
target = target_face * mask + target_bg * (1 - mask)

Hope this helps!


Great job, and thanks for sharing!

I'd like to ask about some details:

  1. Since your face dataset is produced by fine-tuning StyleGAN, is the best way to use the pretrained 'Face Portrait v1' model to align faces in FFHQ mode?
  2. What is the input size for 'Face Portrait v1' during training, 1024?
  1. Yes, that's the reason for the ffhq-alignment in demo.ipynb
  2. It's 512
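Roughly, that preprocessing could look like the following, assuming the generator takes an FFHQ-aligned 512x512 RGB tensor scaled to [-1, 1] (the exact transform in demo.ipynb may differ):

from PIL import Image
import torchvision.transforms.functional as TF

def preprocess(aligned_face: Image.Image, size: int = 512):
    # aligned_face: FFHQ-aligned face crop; the [-1, 1] scaling is an assumption
    img = aligned_face.convert("RGB").resize((size, size), Image.LANCZOS)
    return (TF.to_tensor(img) * 2 - 1).unsqueeze(0)   # shape (1, 3, size, size)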

got it!
thanks~


Thanks for your helpful suggestions!

Now I've run into another question: how do you ensure expression similarity between the generated input and target? In my case, the facial expression generated by the blended models often differs from the one generated by the original FFHQ model. Do you freeze the mapping network parameters or some generator layers when finetuning? If so, would you mind sharing the details?

Thanks sincerely

@bryandlee I would be very grateful if I can get your reply ~

@Leocien That's a tricky part, but here are some tips that could help (a small sketch for points 1 and 3 follows after the list).

  1. Freezing does help. I freeze the mapping layer when finetuning for layer-swapping models.
  2. Use an attribute encoder to explicitly enforce attribute similarity between the original and translated images. This can be done in the stylegan finetuning stage or the pix2pix training stage.
  3. Use additional augmentations in the pix2pix training stage to help preserve low-level attributes. For example, I apply multiple corruptions to the source image for robustness, but apply the same color shift to both the source and target images to keep the colors consistent.
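A small sketch of points 1 and 3, assuming a stylegan2-ada-pytorch Generator (the actual finetuning and augmentation code may differ):

import torch
import torchvision.transforms as T

def freeze_mapping(G):
    # Point 1: freeze the mapping network so only the synthesis blocks are finetuned
    for p in G.mapping.parameters():
        p.requires_grad = False
    return [p for p in G.parameters() if p.requires_grad]  # params to hand to the optimizer

def shared_color_jitter(source, target, strength=0.2):
    # Point 3: torchvision samples the jitter factors once per call, so stacking
    # source and target applies exactly the same color shift to both images
    jitter = T.ColorJitter(strength, strength, strength, hue=0.05)
    pair = jitter(torch.stack([source, target]))   # (2, C, H, W)
    return pair[0], pair[1]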


Hi, when you train your pix2pix model, you use randomly generated StyleGAN data, i.e. fake data, and the trained model can then be applied to real pictures at test time. Do I understand that correctly?


Hi, @bryandlee! Thanks for your work!
I am trying to get qualitative results on my own synthesized dataset using your pipeline. I have a few questions about its second stage (training the pix2pixHD model):

  1. Did you replace both the generator and discriminator in the pix2pixHD pipeline with those from AnimeGANv2?
  2. Did you use the discriminator feature matching loss (G_GAN_Feat)?
  3. Did you use the VGG feature matching loss (G_VGG)?
  4. What weights did you use for your losses?
  5. Did you make any modifications to the default pix2pixHD pipeline (besides adding the two new losses, LPIPS and L2, and replacing G and D)?
  6. How large a synthesized dataset did you use?