facebookresearch / DiT

Official PyTorch Implementation of "Scalable Diffusion Models with Transformers"

Evaluation errors

zen-d opened this issue · comments

commented

@wpeebles Hi, when I follow the instructions in #8 to run the evaluation for DiT-XL/2, the following error pops up:

tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.                                                              
(0) INVALID_ARGUMENT: activation input is not finite. : Tensor had NaN values                                                                          
[[{{node 2905231348_876199450/conv_2/CheckNumerics}}]]                                                                                 
         [[strided_slice_2/_5]]                                                                                                                 
  (1) INVALID_ARGUMENT: activation input is not finite. : Tensor had NaN values                                       
         [[{{node 2905231348_876199450/conv_2/CheckNumerics}}]]                                                                                 
0 successful operations.                                                
0 derived errors ignored.

Do you know how to solve that? Thanks.

Hi @zen-d. Hmm, I haven't run into this issue before, although I also haven't personally tried the script from #8. You might want to check whether the NaNs are in the arr_0 array of your saved .npz file or somewhere else. You shouldn't encounter any NaNs when sampling.
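
For reference, a quick NaN check on the samples file might look something like this (a minimal sketch; "samples.npz" is a placeholder path, and it assumes the samples are stored under the default arr_0 key):

import numpy as np

# Minimal sketch: inspect a saved sample batch for non-finite values.
# "samples.npz" is a placeholder path; arr_0 is assumed to be the key
# the samples were saved under.
data = np.load("samples.npz")
arr = data["arr_0"]

print("shape:", arr.shape, "dtype:", arr.dtype)
print("any NaN:", np.isnan(arr).any())
print("any Inf:", np.isinf(arr).any())
print("min / max:", arr.min(), arr.max())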

commented

@wpeebles Hi, I have checked the saved .npz using np.isnan().any(), which returns False, so it is strange to see that error message. BTW, would you consider officially releasing the evaluation code? That would make comparisons fairer and easier.

Sounds like the issue could be with your TensorFlow setup. Are you using TF 2.0+ on GPU? I think older versions aren't supported by ADM's evaluation repo. Are you using their requirements.txt file from here? To debug this, it might be a good idea to download one of ADM's hosted .npz files (e.g., their "ADM-G + ADM-U" stats, which you can find in their README) and see if you still run into any issues.
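
If you want a quick sanity check of the TF install itself, something like this rough sketch (not code from the DiT or ADM repos) should confirm the version and whether a GPU is actually visible:

import tensorflow as tf

# Rough sanity check of the TensorFlow environment (a sketch, not from the
# DiT/ADM repos): print the installed version and the GPUs TensorFlow can see.
print("TF version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))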

commented

@wpeebles Thanks for your detailed instructions!

Yes, I followed the original requirements.txt file from ADM's official repo. The installed TF version is 2.0+; specifically:

> conda list | grep tensorflow
tensorflow                2.8.1           cuda102py38h32e99bf_0    conda-forge
tensorflow-base           2.8.1           cuda102py38ha005362_0    conda-forge
tensorflow-estimator      2.8.1           cuda102py38h4357c17_0    conda-forge
tensorflow-gpu            2.8.1           cuda102py38hf05f184_0    conda-forge

In addition, I tried the admnet_guided_upsampled_imagenet256.npz file they provide and saw the same error.

Since ADM's stats give you the same error, the issue very likely lies in your TF environment or possibly your hardware. Are you running the exact command from the ADM repo?

python evaluator.py VIRTUAL_imagenet256_labeled.npz admnet_guided_upsampled_imagenet256.npz

Not sure if it's helpful, but here's my TF environment, which runs without any issues:

> conda list | grep tensorflow
tensorflow                2.11.0                   pypi_0    pypi
tensorflow-estimator      2.11.0                   pypi_0    pypi
tensorflow-io-gcs-filesystem 0.29.0                   pypi_0    pypi

You might want to try creating a new environment from scratch using TF's installation instructions. You could also try running some other GPU TensorFlow example snippets to make sure everything works; a rough smoke test is sketched below. Unfortunately, it's hard for me to give more debugging advice since the issue is outside of the DiT repo.
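
As a rough example of such a smoke test (a sketch, not code from the DiT or ADM repos), you could run a small convolution on the GPU and pass the result through tf.debugging.check_numerics, which is the same kind of check that fails inside the evaluator's Inception graph:

import tensorflow as tf

# Rough GPU smoke test (a sketch, not from the DiT/ADM repos): run a small
# convolution on the GPU and verify the output is finite, mirroring the
# CheckNumerics op that raises the error above.
with tf.device("/GPU:0"):
    x = tf.random.normal([1, 64, 64, 3])   # fake NHWC input batch
    w = tf.random.normal([3, 3, 3, 8])     # 3x3 conv filters, 3 -> 8 channels
    y = tf.nn.conv2d(x, w, strides=1, padding="SAME")
    y = tf.debugging.check_numerics(y, "conv output is not finite")
print("conv output mean:", float(tf.reduce_mean(y)))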