intel / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, DeepSeek, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, DeepSpeed, Axolotl, etc.

Repository from Github: https://github.com/intel/ipex-llm

Cannot reach the expected tokens/s when running on GPU

cyskdlx opened this issue

Machine configuration: 14th-gen Core i9;
Memory: 64 GB;
GPUs: two Intel Arc A770, 16 GB VRAM each;
SSD: 500 GB

Problem description: during single-user interactive inference, the GPUs are handling the workload but throughput peaks at only about 15 tokens/s. Log attached below:

outputlog.txt

Running deepseek-r1-distill-qwen-32b with vLLM on two Arc A770 16 GB discrete GPUs, single-user interactive inference should in theory reach about 30 tokens/s.
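For context, a minimal sketch of how such a two-GPU vLLM deployment is typically launched via vLLM's OpenAI-compatible server (the model path, port, and context length below are assumptions; the ipex-llm docker image may ship its own start script with additional XPU-specific options):

# Sketch: serve the model across both A770s with tensor parallelism of 2.
# Model path, port, and --max-model-len are placeholder assumptions.
python -m vllm.entrypoints.openai.api_server \
  --model /llm/models/DeepSeek-R1-Distill-Qwen-32B \
  --tensor-parallel-size 2 \
  --max-model-len 4096 \
  --port 8000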

Which Docker image are you using?

Closed
Run the following on the host, outside the container:
sudo xpu-smi config -d x -t 0 --frequencyrange 2400,2400
then enter the container and run model inference.
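For reference, a minimal sketch of locking the GPU core frequency on both cards before starting the container (the device IDs 0 and 1 are assumptions; list the actual IDs with xpu-smi discovery):

# Lock the GPU frequency range to 2400 MHz on tile 0 of each A770.
# Device IDs below are assumptions; verify with: xpu-smi discovery
sudo xpu-smi config -d 0 -t 0 --frequencyrange 2400,2400
sudo xpu-smi config -d 1 -t 0 --frequencyrange 2400,2400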