intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.

Home Page: https://ipex-llm.readthedocs.io

transformers 4.38.1 gives bad llama3 performance on MTL iGPU

Cbaoj opened this issue · comments

commented

I'm running llama3 inference on an MTL Core Ultra 7 1003H iGPU on Ubuntu 22.04. I followed https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama3 and used generate.py. The complete script is:

source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1
export BIGDL_LLM_XMX_DISABLED=1

python ./generate.py --repo-id-or-model-path 'meta-llama/Meta-Llama-3-8B-Instruct' --prompt "some-1024-token-length-input" --n-predict 128
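
For context, the time reported by generate.py is essentially the elapsed time of a single warm generate() call. A minimal sketch of that measurement, assuming the ipex_llm low-bit AutoModelForCausalLM API and the XPU device used in the linked example (the prompt string below is just a placeholder), looks like this:

import time
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM  # low-bit optimized AutoModel from ipex-llm

model_path = 'meta-llama/Meta-Llama-3-8B-Instruct'
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,   # 4-bit weight-only quantization
                                             optimize_model=True,
                                             use_cache=True).to('xpu')
tokenizer = AutoTokenizer.from_pretrained(model_path)

prompt = "some-1024-token-length-input"  # placeholder, same as the --prompt argument above
input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to('xpu')

with torch.inference_mode():
    model.generate(input_ids, max_new_tokens=128)   # warm-up run, excluded from timing
    torch.xpu.synchronize()
    start = time.time()
    output = model.generate(input_ids, max_new_tokens=128)
    torch.xpu.synchronize()
    print(f'Inference time: {time.time() - start} s')

Comparing the two transformers versions with an identical script like this keeps everything else (quantization, prompt length, n-predict) constant, so the gap should come from the transformers-side code paths that ipex-llm optimizes.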

I got very different performance on two transformers versions.
#transformers==4.37
Inference time: 11.824079990386963 s
#transformers==4.38.1
Inference time: 16.150665760040283 s

Can you help me understand why transformers 4.38.1 gives much worse performance?

My ipex-llm version is 2.1.0b20240521, oneAPI 2024.1.

Hi @Cbaoj, you may use transformers==4.38.2 to get better performance; we are working on optimizing Llama model performance on 4.38.x.
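
(Not part of the suggested steps, just a quick sanity check: if several transformers versions are installed in the environment, you can confirm which one the benchmark actually picks up before re-running generate.py.)

import transformers
print(transformers.__version__)   # should print 4.38.2 after upgrading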

commented

Thanks @sgwhat. 4.38.2 gives ~20% better performance than 4.38.1, but it is still worse than 4.37. Looking forward to your full optimization.