intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.


all-in-one benchmark llama-3-8b-instruct issue with version 2.1.0b1

Fred-cell opened this issue · comments

commented

With batch=1 and 1024-512 (input/output tokens), it hung as below:
THE MYSTERY OF THE CITY](9781441125608_epub_itb-ch5.xhtml)
The man's journey took him to the heart of the city, where he discovered a hidden underground chamber filled with ancient artifacts and mysterious symbols. He spent hours studying the symbols, trying to decipher their meaning and unlock the secrets they held.
As he delved deeper into the chamber, he began to uncover a hidden history of the city, one that was shrouded in mystery and secrecy. He discovered that the city was built on an ancient site, one that was said to hold the power of the gods.
The man's journey took him to the city's ancient temple, where he discovered a hidden chamber filled with ancient artifacts and mysterious symbols. He spent hours studying the symbols, trying to decipher their meaning and unlock the secrets they held.
As he delved deeper into the chamber, he began to uncover a hidden history of the city, one that was shrouded in mystery and secrecy. He discovered that the city was built on an ancient site, one that was said to hold the power of the gods.
The man's journey took him to the city's ancient temple, where he discovered a hidden chamber filled with ancient artifacts and mysterious symbols. He spent hours studying the symbols, trying to decipher their meaning and unlock the secrets they held.
As he delved deeper into the chamber, he began to uncover a hidden history of the city, one that was shrouded in mystery and secrecy. He discovered that the city was built on an ancient site, one that was said to hold the power of the gods.
The man's journey took him to the city's ancient temple, where he discovered a hidden chamber filled with ancient artifacts and mysterious symbols. He spent hours studying the symbols, trying to decipher their meaning and unlock the secrets they held.
As he delved deeper into the chamber, he began to uncover a hidden history of the city, one that was shrouded in mystery and secrecy. He discovered that the city was built on an ancient site, one that was said to hold the power of the gods.
The man's journey took him to the city's ancient temple, where he discovered a hidden chamber filled with ancient artifacts and mysterious
2024-05-27 23:12:00,670 - WARNING - The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
2024-05-27 23:12:00,670 - WARNING - Setting pad_token_id to eos_token_id:128001 for open-end generation.
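The two warnings above are easy to silence by passing an explicit `attention_mask` and `pad_token_id` to `generate()`. A minimal sketch of building the mask by hand (the token ids here are made up for illustration; the pad id 128001 is the `eos_token_id` reported in the log):

```python
# Build an attention mask (1 = real token, 0 = padding) from input ids.
PAD_ID = 128001  # eos_token_id reused as pad, per the warning in the log

input_ids = [128000, 9906, 1917, PAD_ID, PAD_ID]  # example ids, right-padded
attention_mask = [0 if tok == PAD_ID else 1 for tok in input_ids]

# Then pass both explicitly, e.g.:
#   model.generate(input_ids, attention_mask=attention_mask, pad_token_id=PAD_ID)
print(attention_mask)
```

In practice the tokenizer returns this mask directly (`tokenizer(..., return_tensors="pt")` yields both `input_ids` and `attention_mask`); the point is simply to forward it to `generate()` instead of letting the model guess.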

commented

For llama2-7b-chat-hf, the error is as below:
Traceback (most recent call last):
File "/home/intel/LLM/ipex-llm/python/llm/dev/benchmark/all-in-one/run.py", line 1835, in
File "/home/intel/LLM/ipex-llm/python/llm/dev/benchmark/all-in-one/run.py", line 75, in run_model
File "/home/intel/LLM/ipex-llm/python/llm/dev/benchmark/all-in-one/run.py", line 484, in run_transformer_int4_gpu
OSError: [Errno 24] Too many open files: '/home/intel/LLM/ipex-llm/python/llm/dev/benchmark/all-in-one/transformer_int4_gpu-results-2024-05-27.csv'

"Too many open files" is a common Linux limit: if your `ulimit -n` is 1024, raise the maximum number of open files to 65536.
See our FAQ:
https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/FAQ/faq.html#too-many-open-files
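The limit can also be raised from inside the benchmark process itself via the standard-library `resource` module, which avoids editing shell configuration. A sketch (65536 is the value suggested above; raising the hard limit itself still requires root or `/etc/security/limits.conf`):

```python
import resource

# Query the current soft/hard limits on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# Raise the soft limit toward 65536, capped at the hard limit
# (only root can raise the hard limit itself).
target = min(65536, hard)
if soft < target:
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))

print(resource.getrlimit(resource.RLIMIT_NOFILE))
```

Equivalently, `ulimit -n 65536` in the shell before launching `run.py` has the same effect for that session.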

Llama3 1k-512 batch=4 hangs in Fred's environment after two trials of the all-in-one benchmark. Memory is not fully used (13G/16G).

  • Some test case results:
    Llama3 1k-512 batch=3 runs normally.
    Llama3 32-32 batch=4 runs normally.
    Llama2-7b 1k-512 batch=4 runs normally.
    Qwen1.5-7b 1k-512 batch=4 hits a similar problem to Llama3.

  • Printed logs when running Llama3 1k-512 batch=4 show that in the third trial the inference is not actually hanging, but running very slowly.

  • Tested on the DDR4 server arc01: Llama3 1k-512 batch=4 runs normally.