facebookresearch / DiT

Official PyTorch Implementation of "Scalable Diffusion Models with Transformers"

Evaluation errors

zen-d opened this issue · comments

commented

@wpeebles Hi, when I follow the instructions in #8 to run the evaluation for DiT-XL/2, the following error pops up:

tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.                                                              
(0) INVALID_ARGUMENT: activation input is not finite. : Tensor had NaN values                                                                          
[[{{node 2905231348_876199450/conv_2/CheckNumerics}}]]                                                                                 
         [[strided_slice_2/_5]]                                                                                                                 
  (1) INVALID_ARGUMENT: activation input is not finite. : Tensor had NaN values                                       
         [[{{node 2905231348_876199450/conv_2/CheckNumerics}}]]                                                                                 
0 successful operations.                                                
0 derived errors ignored.

Do you know how to solve that? Thanks.

Hi @zen-d. Hmm, I haven't run into this issue before, although I also haven't personally tried the script from #8. You might want to check whether the NaNs are in the arr_0 array of your saved .npz file or somewhere else. You shouldn't encounter any NaNs when sampling.
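
For reference, a quick NaN check on the samples file might look something like this (a minimal sketch; "samples.npz" is a placeholder path, and it assumes the samples are stored under the default arr_0 key):

import numpy as np

# Minimal sketch: inspect a saved sample batch for non-finite values.
# "samples.npz" is a placeholder path; arr_0 is assumed to be the key
# the samples were saved under.
data = np.load("samples.npz")
arr = data["arr_0"]

print("shape:", arr.shape, "dtype:", arr.dtype)
print("any NaN:", np.isnan(arr).any())
print("any Inf:", np.isinf(arr).any())
print("min / max:", arr.min(), arr.max())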

commented

@wpeebles Hi, I have checked the saved .npz using np.isnan().any(), which returns False, so it is strange to see that error message. BTW, would you consider officially releasing the evaluation code? That would make comparisons fairer and easier.

Sounds like the issue could be with your TensorFlow setup. Are you using TF 2.0+ on GPU? I think older versions aren't supported by ADM's evaluation repo. Are you using their requirements.txt file from here? To debug this, it might be a good idea to download one of ADM's hosted .npz files (e.g., their "ADM-G + ADM-U" stats, which you can find in their README) and see if you still run into any issues.
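
If you want a quick sanity check of the TF install itself, something like this rough sketch (not code from the DiT or ADM repos) should confirm the version and whether a GPU is actually visible:

import tensorflow as tf

# Rough sanity check of the TensorFlow environment (a sketch, not from the
# DiT/ADM repos): print the installed version and the GPUs TensorFlow can see.
print("TF version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))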

commented

@wpeebles Thanks for your detailed instructions!

Yes, I followed the original requirements.txt file from ADM's official repo. The installed TF version is 2.0+; specifically:

> conda list | grep tensorflow
tensorflow                2.8.1           cuda102py38h32e99bf_0    conda-forge
tensorflow-base           2.8.1           cuda102py38ha005362_0    conda-forge
tensorflow-estimator      2.8.1           cuda102py38h4357c17_0    conda-forge
tensorflow-gpu            2.8.1           cuda102py38hf05f184_0    conda-forge

In addition, I tried the admnet_guided_upsampled_imagenet256.npz file they provide and saw the same error.

Since ADM's stats give you the same error, the issue very likely lies in your TF environment or possibly your hardware. Are you running the exact command from the ADM repo?

python evaluator.py VIRTUAL_imagenet256_labeled.npz admnet_guided_upsampled_imagenet256.npz

Not sure if it's helpful, but here's my TF environment, which runs without any issues:

> conda list | grep tensorflow
tensorflow                2.11.0                   pypi_0    pypi
tensorflow-estimator      2.11.0                   pypi_0    pypi
tensorflow-io-gcs-filesystem 0.29.0                   pypi_0    pypi

You might want to try creating a new environment from scratch using TF's installation instructions. You could also try running some other GPU TensorFlow example snippets to make sure everything works; a rough smoke test is sketched below. Unfortunately, it's hard for me to give more debugging advice since the issue is outside of the DiT repo.
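
As a rough example of such a smoke test (a sketch, not code from the DiT or ADM repos), you could run a small convolution on the GPU and pass the result through tf.debugging.check_numerics, which is the same kind of check that fails inside the evaluator's Inception graph:

import tensorflow as tf

# Rough GPU smoke test (a sketch, not from the DiT/ADM repos): run a small
# convolution on the GPU and verify the output is finite, mirroring the
# CheckNumerics op that raises the error above.
with tf.device("/GPU:0"):
    x = tf.random.normal([1, 64, 64, 3])   # fake NHWC input batch
    w = tf.random.normal([3, 3, 3, 8])     # 3x3 conv filters, 3 -> 8 channels
    y = tf.nn.conv2d(x, w, strides=1, padding="SAME")
    y = tf.debugging.check_numerics(y, "conv output is not finite")
print("conv output mean:", float(tf.reduce_mean(y)))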