MeteoSwiss / ldcast

Latent diffusion for generative precipitation nowcasting

``eval_genforecast``: No output after "Sending batch ... "

p3jitnath opened this issue · comments

Hi!
Really sorry to keep piling on bug reports.
I tried the other scripts and they ran fine. For the ldcast model (with 1 GPU), the output gets stuck after "Sending batch ...". I checked nvidia-smi and htop, and there seems to be no activity. Could the issue be that the processes are unable to join?

https://github.com/MeteoSwiss/ldcast/blob/b2829aeec135d9bd8ac8dd59f4830ed009b90eb5/scripts/eval_genforecast.py#LL90C5-L90C31

Sorry for not being able to look into this yesterday, I had other commitments for the whole day.

python eval_genforecast.py works for me.

I assume from your description that you don't see messages saying "LDM ready at 0" etc? This will happen if there is an exception before the model is loaded. And unfortunately there is no clear traceback when the exception happens in one of the spawned processes (so I should add better error printouts going forward). But I think the most likely problem is that the model file is not found. The model file path is indicated by the argument weights_fn in the function create_evaluation_ensemble in eval_genforecast.py. Could you check that this file exists in the right location?
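A quick way to rule out the missing-file case is to check the path before any worker process starts, so the failure shows up in the main process with a normal traceback. This is only a minimal sketch: `check_weights_file` is a hypothetical helper, not part of ldcast, and it assumes `weights_fn` is a plain filesystem path.

```python
from pathlib import Path


def check_weights_file(weights_fn):
    """Return the resolved weights path, or fail loudly if it does not exist.

    Hypothetical helper (not part of ldcast); assumes weights_fn is a
    regular filesystem path.
    """
    path = Path(weights_fn).expanduser().resolve()
    if not path.is_file():
        # Raising in the main process gives a readable traceback,
        # unlike an exception raised inside a spawned worker.
        raise FileNotFoundError(f"Model weights not found: {path}")
    return path
```

Calling this on the value passed as weights_fn before the evaluation starts would turn a silent hang into an immediate error.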

If you do see the "LDM ready" messages, then the problem might be your GPU running out of memory. You might be able to deal with this by decreasing the batch size, for example:

$ python eval_genforecast.py --batch_size=1

Hi @jleinonen, thanks for the suggestions.
I did double check the path (with the forecast_demo file) and it seems to be working.

However, I still don't reach the "LDM ready at 0" stage for eval_genforecast. After "Sending batch x/y" there is no further output in the console. I decreased batch_size to 2 and num_samples to 8, but that did not help.

I also monitored the processes using htop, and the forked processes don't seem to get any CPU time. I added a debug print to check whether the forked process is executing, and it turned out that execution never entered the ldm_process block.

So you put a debug print at the beginning of ldm_process and it doesn't produce anything? Could you also put a print command just before the mp.spawn command that starts the ldm_process, to make sure that the execution gets there?
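One way to make such checks more informative is to print with `flush=True` (so buffered output is not lost if the process dies) and to print the traceback explicitly inside the worker. The following is a sketch under assumptions, not the ldcast code: `run_with_traceback` is a hypothetical wrapper, and how it would be wired into ldm_process depends on that function's actual signature, which is not shown here.

```python
import sys
import traceback


def run_with_traceback(worker, *args, **kwargs):
    """Call worker(*args, **kwargs), printing any exception before re-raising.

    Hypothetical debugging wrapper (not part of ldcast). Intended for code
    that runs inside a spawned process, where tracebacks can be hard to see.
    """
    # flush=True so the message appears even if the process exits abruptly.
    print(f"[{worker.__name__}] entered", flush=True)
    try:
        result = worker(*args, **kwargs)
    except Exception:
        # Print the full traceback from inside the child process.
        traceback.print_exc(file=sys.stderr)
        sys.stderr.flush()
        raise
    print(f"[{worker.__name__}] finished", flush=True)
    return result
```

A similar `print(..., flush=True)` immediately before the mp.spawn call would confirm whether the main process reaches that point at all.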

Sorry that this is getting difficult, since I'm not able to replicate the problem on my system...