jshilong / GPT4RoI

GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest


RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

LiWentomng opened this issue

Hello @jshilong, have you encountered this problem?

I have trained the model for both stages, and then merged the trained model with LLaMA as you described.
When I load the merged model for testing, the error below occurs.

Traceback (most recent call last):
  File "/hy/code/gpt4roi/train_net.py", line 326, in <module>
    launch(
  File "/hy/code/gpt4roi/detectron2/detectron2/engine/launch.py", line 84, in launch
    main_func(*args)
  File "/hy/code/gpt4roi/train_net.py", line 311, in main
    res = Trainer.test(cfg, model)
  File "/hy/code/gpt4roi/detectron2/detectron2/engine/defaults.py", line 617, in test
    results_i = inference_on_dataset(model, data_loader, evaluator)
  File "/hy/code/gpt4roi/detectron2/detectron2/evaluation/evaluator.py", line 158, in inference_on_dataset
    outputs = model(inputs)
  File "/workspace/conda_env/gpt4roi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/hy/code/gpt4roi/gpt4roi.py", line 153, in get_output
    output_ids = self.model.generate(
  File "/workspace/conda_env/gpt4roi/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/conda_env/gpt4roi/lib/python3.10/site-packages/transformers/generation/utils.py", line 1485, in generate
    return self.sample(
  File "/workspace/conda_env/gpt4roi/lib/python3.10/site-packages/transformers/generation/utils.py", line 2562, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

I also tried merging the debug weights you provided and then loading the model for inference; the same error occurred.
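For anyone who runs into the same error: `torch.multinomial` fails like this when the softmaxed logits contain `inf`/`nan`, which usually means the weights being loaded are broken (for example, a delta that was never merged onto the correct base, or half-precision overflow). A quick sanity check is to look for non-finite values in the checkpoint. This is only a minimal sketch under my own assumptions: the path is a placeholder, and `AutoModelForCausalLM` should be replaced with the model class defined in this repo if the architecture is custom.

import torch
from transformers import AutoModelForCausalLM

# Placeholder path to the merged checkpoint; adjust to your local layout.
ckpt_path = '/path/to/merged/gpt4roi'

# Load on CPU in fp32 so the check is not confused by half-precision overflow.
model = AutoModelForCausalLM.from_pretrained(ckpt_path, torch_dtype=torch.float32)

# A healthy checkpoint should contain only finite values in every parameter.
bad = [name for name, p in model.named_parameters() if not torch.isfinite(p).all()]
print('non-finite parameters:', bad if bad else 'none')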

Hi, it seems there is a misunderstanding about the purpose of merging the weights. The weights are merged to obtain the actual LLaVA weights (the delta is released this way because of the LLaMA-1 license), which are then used to train GPT4RoI for both stages. It appears that you are loading and training on the delta weights and only merging them with LLaMA afterwards, which is an incorrect process.
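To make the intended order concrete, here is a rough sketch of what a delta merge does: the released delta is added on top of the original LLaMA weights to recover the full checkpoint, and training then starts from that merged checkpoint. This only illustrates the idea and is not the repo's actual merge script; the paths and the helper function are hypothetical, and real merge scripts also handle details such as vocabulary-size differences in the embedding layers and sharded checkpoints.

import torch

def merge_delta(base_state, delta_state):
    # Illustrative only: add delta weights onto the base LLaMA weights.
    merged = {}
    for name, delta in delta_state.items():
        if name in base_state:
            # Parameters shared with the base model are released as differences.
            merged[name] = base_state[name].float() + delta.float()
        else:
            # Parameters that do not exist in the base model are released as-is.
            merged[name] = delta.float()
    return merged

# Hypothetical single-file checkpoints; sharded checkpoints need extra handling.
base_state = torch.load('llama-7b/pytorch_model.bin', map_location='cpu')
delta_state = torch.load('gpt4roi-delta/pytorch_model.bin', map_location='cpu')
torch.save(merge_delta(base_state, delta_state), 'gpt4roi-merged/pytorch_model.bin')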

By the way, the debug weight I provided earlier is a merged GPT4RoI weight for stage 2, which you can load and fine-tune directly.
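For reference, once a checkpoint is already merged (like the stage-2 debug weight), it can be loaded directly with no further delta handling. A minimal sketch, assuming a Hugging Face-format checkpoint; the path is a placeholder and the model class should be swapped for the one defined in this repo if needed.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path to the already-merged stage-2 checkpoint.
ckpt = '/path/to/gpt4roi-stage2-merged'

tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16).cuda()
model.eval()
# Fine-tuning or inference starts from this checkpoint directly; no delta merge step.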

@jshilong Got it. Thanks for your reply.