vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

https://docs.vllm.ai

vllm-project/vllm Issues

[RFC]: Add control panel support for vLLM
Updated a day ago · 1 comment
[Bug]: llava inference result is wrong !
Closed a day ago · 23 comments
[Doc]: Why is the PA kernel time cost in the decode phase optimized after turning on Prefix Caching?
Closed a day ago · 2 comments
[Bug]: Unexpected Special Tokens in prompt_logprobs Output for Llama3 Prompt
Updated a day ago · 4 comments
[Bug]: Shape error encountered in speculative decoding when `enable_lora=True`
Updated a day ago
[Feature]: add local_files_only parameter
Closed a day ago · 3 comments
Qwen1.5-14B-Chat-GPTQ-Int4: quantization is not fully optimized yet. The speed can be slower than non-quantized models.
Closed a day ago · 1 comment
[Feature]: how to make vllm support a model which is not in the support list
Closed a day ago · 2 comments
[Bug]: Llama 3 - Out of memory - RTX 4060 TI
Closed a day ago · 1 comment
[Feature]: Health check for restart policy
Updated a day ago · 1 comment
[Doc]: can't find serving_embedding.py
Closed a day ago · 2 comments
[Bug]: `logprobs` is not compatible with the OpenAI spec
Updated a day ago · 1 comment
[Usage]: distributed inference with kuberay
Updated a day ago
Benchmark: benchmark_throughput and benchmark_latency should be able to write output to JSON file.
Closed a day ago · 1 comment
[Misc]: a question about chunked-prefill in flash-attn backends
Updated 2 days ago
[Misc]: How to Load an Already Instantiated Hugging Face Model into vLLM for Inference?
Closed 2 days ago · 1 comment
[Bug]: No CUDA GPUs are available on 'CPU' use
Updated 2 days ago
[Bug]: Qwen1.5-72B L20x8 latest vLLM TPOT slower than v0.4.0.post, 48ms vs 39ms, why?
Updated 2 days ago · 7 comments
[Usage]: How to determine how many concurrent requests can be supported in an acceptable time duration with demo api server?
Updated 2 days ago
[Misc]: Assertion with no description in vllm with DeepSeekMath 7b model, why, how to fix?
Updated 2 days ago
[Usage]: Seems nn.module definition may affect the output tokens. Don't know the reason.
Updated 2 days ago · 3 comments
Can I still use FP8 E5M2 KV Cache if my GPU capability is less than 8.9?
Closed 2 days ago · 1 comment
[Feature]: Support the OpenAI Batch Chat Completions file format
Closed 2 days ago
[Feature]: CI: Test on NVLink-enabled machine
Updated 2 days ago · 10 comments
[New Model]: Google's Paligemma family of models
Updated 2 days ago · 1 comment
[Bug]: Cache operations are not supported for Neuron backend.
Closed 2 days ago · 1 comment
[Feature]: Build and publish Neuron docker image
Updated 3 days ago
[Bug]: Running vllm docker image with neuron fails
Updated 3 days ago
[Usage]: convert llava-v1.5-7b to liuhaotian/llava-v1.5-7b-hf format
Closed 3 days ago · 3 comments
[Usage]: how to use run in mixed mode CPU/GPU (device_map="auto")
Updated 3 days ago
[Usage]: Passing image to the vllm api endpoint
Updated 3 days ago · 1 comment
[Bug]: llava, output is truncated, not fully displayed
Closed 3 days ago · 1 comment
[Performance]: Qwen 7b chat model, under 128 concurrency, the CPU utilization rate is 100%, and the GPU SM utilization rate is only about 60%-75%. Is it a CPU bottleneck?
Updated 3 days ago · 1 comment
[Installation]: Stuck for two hours during the installation of vllm
Closed 5 days ago · 5 comments
[Usage]: How to use tensor-parallel-size argument when deploy Llama3-8b with AsyncLLMEngine
Updated 3 days ago
[Bug]: deploy Phi-3-mini-128k-instruct AssertionError
Updated 3 days ago · 4 comments
[Feature]: rope_scaling for qwen2
Updated 3 days ago
[Performance]: Will memcpy happen with distributed kv caches while decoding ?
Updated 3 days ago
[Performance]: how to test tensorrt-llm serving correctly
Closed 3 days ago · 1 comment
[Bug]: Async engine hangs with 0.4.* releases
Updated 3 days ago · 3 comments
[Performance]: Deepseek-v2 support
Closed 3 days ago · 1 comment
Remove EOS token before passing the tokenized input to model
Updated 3 days ago
[Bug]: 'ArgumentHelper' has no attribute 'enable_prefix_caching'
Closed 4 days ago
[Bug]: ModelRegistry.load_model_cls() circular import error on llama-llava
Updated 4 days ago
[Doc]: Doc for using tensorizer_uri with LLM is incorrect
Updated 4 days ago · 1 comment
[Bug]: RAM OOM Error Loading 480GB MoE Model Despite Fix in PR #1395
Updated 5 days ago · 1 comment
[Bug]: multi-gpu for baichuan2-13B-Chat benchmark_serving
Updated 5 days ago
[Usage]: How to change the batch size when testing the throughput of VLLM by running benchmark_throughput
Updated 5 days ago
[Feature]: Host CPU Docker image on Docker Hub
Updated 6 days ago
[Feature]: could paged_attention_v1 support parameter 'attn_bias'
Updated 7 days ago