SafeAILab / EAGLE

Official Implementation of EAGLE-1 and EAGLE-2

Home Page: https://arxiv.org/pdf/2406.16858

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:7)

Ishiki-Iroha opened this issue

Hello, I have reproduced the evaluation with vicuna-7b and llama-2-7b-chat on 8x L40S, and the results are quite impressive (speed and speed0 are the decoding throughput with and without EAGLE; ratio = speed / speed0):

vicuna-7b-v1.3:
speed 85.4037544546903
speed0 28.96434457451452
ratio 2.9485823245534903
llama-2-7b-chat:
speed 83.18835319185268
speed0 29.447255352096626
ratio 2.824995137821214

However, when I tried llama-2-13b-chat, I encountered the following problem:

# python3 -m evaluation.gen_ea_answer_llama2chat --base-model-path /mnt/data3/LLaMa2-13B-chat-hf/LLaMa2-13B-chat-hf/ --ea-model-path /mnt/data3/models/EAGLE-llama2-chat-13B/ --model-id llama-2-13B-ea
Output to data/mt_bench/model_answer/llama-2-13B-ea-temperature-1.0.jsonl
Loading checkpoint shards: 100%|██████████████████████| 3/3 [00:22<00:00,  7.42s/it]
Check model training state: False
CUDA VISIBLE DEVICES: 0,1,2,3,4,5,6,7
Traceback (most recent call last):
  File "/home/condaenv/.python_libs/conda_env/eagle/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/condaenv/.python_libs/conda_env/eagle/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/xxx/EAGLE/evaluation/gen_ea_answer_llama2chat.py", line 477, in <module>
    run_eval(
  File "/home/xxx/EAGLE/evaluation/gen_ea_answer_llama2chat.py", line 150, in run_eval
    get_answers_func(
  File "/home/condaenv/.python_libs/conda_env/eagle/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/xxx/EAGLE/evaluation/gen_ea_answer_llama2chat.py", line 239, in get_model_answers
    output_ids, new_token, idx = ea_forward(
  File "/home/xxx/EAGLE/evaluation/gen_ea_answer_llama2chat.py", line 63, in ea_forward
    tree_logits, logits,hidden_state,sample_token = initialize_tree(
  File "/home/xxx/EAGLE/model/utils.py", line 164, in initialize_tree
    tree_logits, outputs, logits,hidden_state,sample_token = model(
  File "/home/condaenv/.python_libs/conda_env/eagle/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/xxx/EAGLE/model/ea_model.py", line 143, in forward
    ea_logits = self.ea_layer.topK_genrate(hidden_states, input_ids, self.base_model.lm_head, logits_processor)
  File "/home/condaenv/.python_libs/conda_env/eagle/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/xxx/EAGLE/model/cnets.py", line 830, in topK_genrate
    select_index=topk_index[self.tree_buffer['tree_indices'][i]]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:7)
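
For context, PyTorch raises this RuntimeError whenever a CUDA tensor is indexed with an index tensor that lives on a different CUDA device. Here, topk_index apparently ends up on cuda:7 while the cached self.tree_buffer['tree_indices'] stays on the device it was created on. A minimal sketch that reproduces the error (assuming at least two CUDA devices; the shapes and device ids are placeholders, not the actual EAGLE tensors):

import torch

# Indexing a tensor with indices that live on a different CUDA device
# triggers the same RuntimeError as in the traceback above.
data = torch.randn(10, device="cuda:1")         # stand-in for topk_index
idx = torch.tensor([0, 2, 4], device="cuda:0")  # stand-in for tree_indices

try:
    data[idx]
except RuntimeError as e:
    print(e)  # indices should be either on cpu or on the same device as the indexed tensor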

In contrast, the following baseline command runs without any problem:

# python3 -m evaluation.gen_baseline_answer_llama2chat --base-model-path /mnt/data3/LLaMa2-13B-chat-hf/LLaMa2-13B-chat-hf/ --ea-model-path /mnt/data3/models/EAGLE-llama2-chat-13B/ --model-id llama-2-13B-base

When I run on a single L40S, there is no problem (a sketch of how multi-GPU sharding can cause this follows the numbers below). When you tested the 13B model, did you use multiple cards or a single card? I want to make sure I understand the testing conditions.
llama-2-13b-chat on 1 L40S, result:
speed 67.24270358594262
speed0 22.81789898647827
ratio 2.9469279194280853
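
If the evaluation script loads the base model with device_map="auto" (a standard Hugging Face from_pretrained option; whether the script actually does this is my assumption, not a confirmed detail), the 13B checkpoint is sharded across all visible GPUs, which would explain why the bug only shows up in the multi-GPU run. A sketch of how to inspect the resulting placement, reusing the model path from the commands above:

import torch
from transformers import AutoModelForCausalLM

# With device_map="auto", layers are spread across all visible GPUs, so the
# last decoder layers (and the hidden states they produce) can land on cuda:7.
model = AutoModelForCausalLM.from_pretrained(
    "/mnt/data3/LLaMa2-13B-chat-hf/LLaMa2-13B-chat-hf/",
    torch_dtype=torch.float16,
    device_map="auto",
)
print(model.hf_device_map)  # per-module device placement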

Did you use multiple cards or a single card?

We conducted the tests using 2x RTX 3090.

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:7)

Thank you for identifying this bug, it has now been fixed.
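
For anyone still on an older commit, a minimal sketch of the device-alignment fix (hypothetical; the actual patch may differ): move the small index tensor onto the device of the tensor it indexes before the advanced-indexing step, e.g. something along the lines of topk_index[self.tree_buffer['tree_indices'][i].to(topk_index.device)]. As a self-contained helper:

import torch

def select_topk(topk_index: torch.Tensor, tree_indices: torch.Tensor) -> torch.Tensor:
    # Align the (small) index tensor with the data tensor's device before
    # indexing; moving the indices is far cheaper than moving the top-k
    # data itself across GPUs.
    return topk_index[tree_indices.to(topk_index.device)]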