intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.

Home Page: https://ipex-llm.readthedocs.io

transformers 4.38.1 gives bad llama3 performance on MTL iGPU

Cbaoj opened this issue · comments

commented

I'm running llama3 inference on an MTL Core Ultra 7 1003H iGPU on Ubuntu 22.04. I followed https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama3 and used generate.py. The complete script is:

source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1
export BIGDL_LLM_XMX_DISABLED=1

python ./generate.py --repo-id-or-model-path 'meta-llama/Meta-Llama-3-8B-Instruct' --prompt "some-1024-token-length-input" --n-predict 128
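
For context, the time reported by generate.py is essentially the elapsed time of a single warm generate() call. A minimal sketch of that measurement, assuming the ipex_llm low-bit AutoModelForCausalLM API and the XPU device used in the linked example (the prompt string below is just a placeholder), looks like this:

import time
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM  # low-bit optimized AutoModel from ipex-llm

model_path = 'meta-llama/Meta-Llama-3-8B-Instruct'
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,   # 4-bit weight-only quantization
                                             optimize_model=True,
                                             use_cache=True).to('xpu')
tokenizer = AutoTokenizer.from_pretrained(model_path)

prompt = "some-1024-token-length-input"  # placeholder, same as the --prompt argument above
input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to('xpu')

with torch.inference_mode():
    model.generate(input_ids, max_new_tokens=128)   # warm-up run, excluded from timing
    torch.xpu.synchronize()
    start = time.time()
    output = model.generate(input_ids, max_new_tokens=128)
    torch.xpu.synchronize()
    print(f'Inference time: {time.time() - start} s')

Comparing the two transformers versions with an identical script like this keeps everything else (quantization, prompt length, n-predict) constant, so the gap should come from the transformers-side code paths that ipex-llm optimizes.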

I got very different performance on two transformers versions.
#transformers==4.37
Inference time: 11.824079990386963 s
#transformers==4.38.1
Inference time: 16.150665760040283 s

Can you help me understand why transformers 4.38.1 gives much worse performance?

My ipex-llm version is 2.1.0b20240521, oneAPI 2024.1.

Hi @Cbaoj, you may use transformers==4.38.2 to get better performance; we are working on optimizing Llama model performance on 4.38.x.
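
(Not part of the suggested steps, just a quick sanity check: if several transformers versions are installed in the environment, you can confirm which one the benchmark actually picks up before re-running generate.py.)

import transformers
print(transformers.__version__)   # should print 4.38.2 after upgrading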

commented

Thanks @sgwhat. 4.38.2 gives ~20% better performance than 4.38.1, but it is still worse than 4.37. Looking forward to your full optimization.