intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.

[Windows] Qwen1.5-7B performance optimization

juan-OY opened this issue

During operation, the application summarizes replies to multiple long inputs. We would like the first-token and rest-token latency to be further optimized for the Qwen1.5-7B model. A rough measurement setup is sketched below.
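A minimal sketch of how such a workload might be run and timed with ipex-llm's Transformers-style API on an Intel GPU. The model id `Qwen/Qwen1.5-7B-Chat`, the `sym_int4` precision, and the 128-token generation length are illustrative assumptions, not part of the issue; parameter names can differ across ipex-llm versions.

```python
# Sketch: load Qwen1.5-7B with ipex-llm low-bit optimization on an Intel GPU ("xpu")
# and roughly separate first-token latency from per-token decode latency.
import time
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "Qwen/Qwen1.5-7B-Chat"          # assumed model id / local path
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_low_bit="sym_int4",              # 4-bit weight quantization
    trust_remote_code=True,
)
model = model.half().to("xpu")               # run on the Intel iGPU / Arc GPU
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

prompt = "请总结以下内容:..."               # long summarization input, as in the issue
inputs = tokenizer(prompt, return_tensors="pt").to("xpu")

with torch.inference_mode():
    # warm-up run so one-time kernel compilation does not skew the measurement
    model.generate(**inputs, max_new_tokens=1)

    t0 = time.time()
    model.generate(**inputs, max_new_tokens=1)           # approximates first-token latency
    t1 = time.time()
    out = model.generate(**inputs, max_new_tokens=128)   # first token + 127 rest tokens
    t2 = time.time()

first_token = t1 - t0
rest_per_token = (t2 - t1 - first_token) / 127           # rough per-token decode time
print(f"first token: {first_token:.3f} s, rest token: {rest_per_token * 1000:.1f} ms/token")
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The timing here is deliberately coarse (it re-runs the prefill for the 128-token pass); it is only meant to give comparable first-token vs. rest-token numbers before and after any optimization.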