rinongal / textual_inversion


can't reproduce the results

andorxornot opened this issue

commented

hi! I trained LDM with three images and the token "container":



training lasted a few hours and the loss jumps around, but I got exactly the same result as without training:

the config is loaded correctly. are there any logs besides the loss?

What text are you using for inference?
Unless you changed the config, the placeholder word for your concept is *, so your sentences should be of the form: "a photo of *" (and not "a photo of a container")

commented

yeah, I used the "a photo of *" prompt, but still got a generic container

Can you please:
(1) Post your full inference command?
(2) Check your logs folder images to see if the samples_scaled_gs images look like your input data?

commented

hm, \logs\images...\testtube\version_0\media is empty for me, there are no images

train:

python main.py --data_root ./images \
               --base ./configs/latent-diffusion/txt2img-1p4B-finetune.yaml \
               -t \
               -n run_01 \
               --actual_resume ./models/ldm/text2img-large/model.ckpt \
               --init_word container \
               --gpus 0

inference:

python scripts/txt2img.py --ddim_eta 0.0 \
                          --n_samples 3 \
                          --n_iter 2 \
                          --scale 10.0 \
                          --ddim_steps 50 \
                          --embedding_path ./logs/images2022-08-23T21-03-11_run_01/checkpoints/embeddings.pt \
                          --ckpt ./models/ldm/text2img-large/model.ckpt \
                          --prompt "a photo of *"

The images should be in your ./logs/images2022-08-23T21-03-11_run_01/images/ directory.
Either way, when you run txt2img, try to run with:
--embedding_path ./logs/images2022-08-23T21-03-11_run_01/checkpoints/embeddings_gs-5xxx.pt where 5xxx is whatever checkpoint you have there which is closest to 5k.
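For example, something like this should tell you whether any intermediate outputs were written at all (directory names are copied from your command above, so adjust them if your run folder is named differently):

# quick check that the run wrote images and checkpoints
ls -R ./logs/images2022-08-23T21-03-11_run_01/images/
ls ./logs/images2022-08-23T21-03-11_run_01/checkpoints/
# you should see samples_scaled_gs-* style images in the first listing and
# embeddings_gs-*.pt files in the second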

fyi - I've got it working and I'm very impressed - I'm interested to know how to boost the quality / dimensions of the output... I'll have to dig into the docs.
HOW TO

I trained on a folder of his (Gregory Crewdson's) photos with the init word "cinematic":
python main.py --base configs/latent-diffusion/txt2img-1p4B-finetune.yaml -t \
               --actual_resume ../stable-diffusion/models/ldm/text2img-large/model.ckpt \
               -n leavanny_attempt_one --gpus 0, \
               --data_root "/home/jp/Downloads/ImageAssistant_Batch_Image_Downloader/www.google.com/gregory_crewdson_-_Google_Search" \
               --init_word=cinematic
(I gave up at 10,000 training iterations.)

I can then prime it with

 photo of * 
 pixelart of * 
 watercolor of * 

python scripts/txt2img.py --ddim_eta 0.0 \
                          --n_samples 8 \
                          --n_iter 2 \
                          --scale 10.0 \
                          --ddim_steps 50 \
                          --embedding_path /home/jp/Documents/gitWorkspace/textual_inversion/logs/gregory_crewdson_-_Google_Search2022-08-24T23-09-43_leavanny_attempt_one/checkpoints/embeddings_gs-9999.pt \
                          --ckpt_path ../stable-diffusion/models/ldm/text2img-large/model.ckpt \
                          --prompt "pixelart of *"

(output images: a-photo-of-*, pixelart-of-*, watercolor-of-*)

@johndpope Glad to see some positive results 😄
Regarding quality / dimensions: I'm still working on the Stable Diffusion port which will probably help with that. At the moment inversion is working fairly well, but I'm having some trouble finding a 'sweet spot' where editing (by reusing * in new prompts) works as expected. It might require moving beyond just parameter changes.

As a temporary alternative, you should be able to just invert these results into the stable diffusion model and let it come up with new variations at a higher resolution (using just 'a photo of *').
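If you want to try that route, a rough sketch would be something like the following -- note that the SD config name and checkpoint path here are my assumptions about your local setup, not verified paths:

# sketch only: same inversion entry point, but pointed at a Stable Diffusion
# checkpoint and the SD finetune config (both paths are placeholders)
python main.py --base configs/stable-diffusion/v1-finetune.yaml -t \
               --actual_resume ./models/sd/sd-v1-4.ckpt \
               -n sd_run_01 --gpus 0, \
               --data_root ./my_training_images \
               --init_word cinematic
# then generate as before, with --prompt "a photo of *" and the new embeddings checkpoint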

Hi, when I train the embedding and run the generation command, I can obtain samples that share some high-level similarity with my training inputs, but they still look quite different in the details (far less similar than the demo images in the paper). Given that the reconstruction is perfect, is there a way to control the variation and make the generated samples look more similar to the inputs? Thanks!

@XavierXiao First of all, just to make sure, you're using the LDM version, yes?

If that's the case, then you have several options:

  1. Re-invert with a higher learning rate (e.g. edit the learning rate in the config to 1.0e-2; there's a rough sketch of this after the list). The higher the learning rate, the higher the image similarity after editing, but more prompts will fail to change the image at all.
  2. Try to re-invert with another seed (using the --seed argument). Unfortunately sometimes the optimization just falls into a bad spot.
  3. Try the same prompt engineering tricks you'd try with text. For example, use the placeholder several times ("a photo of * on the beach. A * on the beach").
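A rough sketch of options 1 and 2 combined -- the base_learning_rate key name is what the LDM-style configs use, so double-check the spelling against your own yaml, and the paths here are placeholders:

# option 1: copy the finetune config and raise the learning rate to 1.0e-2
cp configs/latent-diffusion/txt2img-1p4B-finetune.yaml txt2img-finetune-highlr.yaml
sed -i 's/base_learning_rate:.*/base_learning_rate: 1.0e-02/' txt2img-finetune-highlr.yaml

# option 2: re-run the inversion with the edited config and a different seed
python main.py --base txt2img-finetune-highlr.yaml -t \
               --actual_resume ./models/ldm/text2img-large/model.ckpt \
               -n run_highlr --gpus 0, \
               --data_root ./my_training_images \
               --init_word container \
               --seed 42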

Other than that, as we report in the paper, the results are typically 'best of 16'. There are certainly cases where only 3-4 images out of a batch of 16 were 'good'. And of course, as with all txt2img models, some prompts just don't work.
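For reference, getting a batch of 16 candidates to pick from is just a matter of --n_samples and --n_iter; this mirrors the working command earlier in the thread, with the embedding path as a placeholder:

EMB=./logs/your_run/checkpoints/embeddings_gs-5000.pt   # placeholder -- point at your own checkpoint
python scripts/txt2img.py --ddim_eta 0.0 --n_samples 8 --n_iter 2 --scale 10.0 --ddim_steps 50 \
                          --embedding_path "$EMB" \
                          --ckpt_path ../stable-diffusion/models/ldm/text2img-large/model.ckpt \
                          --prompt "a photo of *"
# 8 samples per iteration x 2 iterations = 16 images to choose the best from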

If you can show me some examples, I could maybe point you towards specific solutions.

Thanks! I am using the LDM version with the default settings from the readme. I will give the things you mentioned a try, especially the LR. Here are some examples: I am trying to invert some images from MVTec for industrial quality inspection, and I attached the inputs (some capsules) and generated samples at 5k steps. Does this look reasonable? The inputs have very little variation (they look very similar to each other); could that be the cause?

(attached images: inputs_gs-005000_e-000050_b-000000, samples_gs-005000_e-000050_b-000000)

The one on the right is more or less what I'd expect to get. If you're still having bad results during training, then seed changes etc. probably won't help. Either increase LR, or have a look at the output images and see if there's still progress, in which case you can probably just train for more time.

I'll try a run myself and see what I can get.

@andorxornot This is what I get with your data:

Training outputs (@5k):

(image: samples_scaled_gs-005000_e-000131_b-000022)

Watercolor painting of *:

(image: watercolor-painting-of-*)

A photo of * on the beach:

(image: a-photo-of-*-on-the-beach)

@XavierXiao I cropped out and trained on these 2 samples from your image:
(images: Picture1, Picture2)

Current outputs @4k steps with default parameters:

(image: samples_scaled_gs-004000_e-000160_b-000000)

If you're using the default parameters but only 1 GPU, the difference might be down to the LDM training script automatically scaling the LR by the number of GPUs and the batch size. In that case your effective LR is half of mine, which could explain the gap. Can you try training with double the LR and let me know if that improves things? If so, I might need to disable this scaling by default / add a warning to the readme.
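To spell out the scaling I mean (the base LR here is just a placeholder value -- the real number is whatever base_learning_rate your config sets):

# assumed behaviour of the LDM-style training script:
#   effective_lr = base_learning_rate * n_gpus * batch_size (* accumulate_grad_batches, if set)
python -c "base_lr, bs = 5.0e-3, 4; print('1 GPU :', base_lr * 1 * bs); print('2 GPUs:', base_lr * 2 * bs)"
# -> 0.02 on one GPU vs 0.04 on two, i.e. half the effective LR on a single GPU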

Thanks for the reply. I am using two GPUs, so that shouldn't be the issue. I tried a larger LR, but it is hard to say whether it brings improvements. I can obtain results similar to yours. Obviously the resulting images are less realistic than the trash container examples earlier in this thread, so maybe the input images are less familiar to the LDM model.

Some possibly unrelated things:

  1. I got the following warning after every epoch, is that expected?
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
  2. In the first epoch I got the following warning:
home/.conda/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/utilities/data.py:59: UserWarning: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 20. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.

I use the default config with bs=4, and I have 8 training images. Not sure what caused this.

  3. In one overnight run, it seems that if we don't manually kill the process, it will run for 1000 epochs, which is the PyTorch Lightning maximum. So the max_steps = 6100 setting is not taking effect?

@XavierXiao Warnings should both be fine.
max_step: It should be working. I'll look into it.
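If the tokenizers warning gets noisy, it can be silenced with the environment variable the warning itself suggests, and you can double-check which step limit the run is actually picking up:

# purely cosmetic: disables tokenizers parallelism and with it the fork warning
export TOKENIZERS_PARALLELISM=false

# see where (and whether) a max_steps limit is set in the configs you're using
grep -rn "max_steps" configs/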

commented

thanks for your tests! it seems that on my machine I had to raise the LR by a factor of ten

@andorxornot Well, if everything's working now, feel free to close the issue 😄 Otherwise let me know if you need more help

@rinongal I think I'm having a similar issue, but I'm not familiar with the learning rate format in the config, so I'm not sure how to increase it.

EDIT: I noticed I'm getting "RuntimeWarning: You are using LearningRateMonitor callback with models that have no learning rate schedulers. Please see documentation for configure_optimizers method." (rank_zero_warn) from PyTorch Lightning's lr_monitor.py.

Also, this is using the stable diffusion v1_finetune.yaml, and my samples_scaled images all just look like noise at, and well after, 5000 global steps. The loss is pretty much stuck at 1.0 or 0.99.

I'll create a new issue if need be.

@XodrocSO I think it might be worth a new issue, but when you open it could you please:

  1. Check the input and reconstruction images in your log directory to see that they look fine.
  2. Paste the config file you're using and let me know if you're using the official repo or some re-implementation and whether you changed anything else.
  3. Upload an example of your current samples_scaled results.

Hopefully that will be enough to get started on figuring out the problem :)

commented

@andorxornot Would it be convenient for you to share your images?

@XavierXiao Regarding the capsule example you posted above (inputs_gs-005000 / samples_gs-005000): how did the images you generated later turn out?