intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.


all-in-one benchmark issue on version 2.1.0b1 for llama3-8b-instruct with 128-2048

Fred-cell opened this issue

The last iteration is very slow for llama3-8b-instruct with fp8.
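For context, the 128-2048 in the title denotes an input/output token pair from the all-in-one benchmark config. Below is a minimal sketch, not the project's all-in-one harness, of how one might time each generation pass to spot a slow final iteration; it assumes an Intel GPU (xpu) build of ipex-llm, and the model path, prompt construction, and iteration count are illustrative.

```python
import time

import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

# Assumption: model weights are available locally or via the Hub under this id.
MODEL_PATH = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    load_in_low_bit="fp8",   # fp8 weight quantization, as in the report
    trust_remote_code=True,
)
model = model.half().to("xpu")  # run on an Intel GPU

prompt = "Hello " * 128  # roughly 128 input tokens, matching the 128-2048 pair
inputs = tokenizer(prompt, return_tensors="pt").to("xpu")

with torch.inference_mode():
    for i in range(5):  # time each iteration to see whether the last one stalls
        torch.xpu.synchronize()
        start = time.perf_counter()
        model.generate(**inputs, max_new_tokens=2048)
        torch.xpu.synchronize()
        print(f"iteration {i}: {time.perf_counter() - start:.2f}s")
```

If the last iteration is consistently slower, comparing the timings with and without fp8 quantization would help isolate whether the slowdown is specific to the fp8 path.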

We cannot reproduce the issue in the environment provided by the user; we will sync on the details offline.

After syncing with the user offline, we still cannot reproduce the issue in the user's environment. We will keep monitoring for similar behavior; closing the issue for now.