P40 with USE_FLASH_ATTENTION=False
ltm920716 opened this issue · comments
System Info
Linux k8s-node2 6.5.0-41-generic #41~22.04.2-Ubuntu SMP PREEMPT_DYNAMIC Mon Jun 3 11:32:55 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
NVIDIA-SMI 535.171.04 Driver Version: 535.171.04
P40
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
sudo nerdctl run -it --gpus '"device=1"' -e USE_FLASH_ATTENTION=False -p 8080:80 -v /data/models:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id /data/Meta-Llama-3-8B-Instruct
Log:
2024-06-20T10:13:43.100144Z INFO text_generation_launcher: Args {
model_id: "/data/Meta-Llama-3-8B-Instruct",
revision: None,
validation_workers: 2,
sharded: None,
num_shard: None,
quantize: None,
speculate: None,
dtype: None,
trust_remote_code: false,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: None,
max_input_length: None,
max_total_tokens: None,
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: None,
max_batch_total_tokens: None,
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: None,
hostname: "0.0.0.0",
port: 80,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: Some(
"/data",
),
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
cors_allow_origin: [],
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
}
2024-06-20T10:13:43.100237Z INFO text_generation_launcher: Model supports up to 8192 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using `--max-batch-prefill-tokens=8242 --max-total-tokens=8192 --max-input-tokens=8191`.
2024-06-20T10:13:43.100246Z INFO text_generation_launcher: Default `max_input_tokens` to 4095
2024-06-20T10:13:43.100250Z INFO text_generation_launcher: Default `max_total_tokens` to 4096
2024-06-20T10:13:43.100252Z INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
2024-06-20T10:13:43.100257Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-06-20T10:13:43.100359Z INFO download: text_generation_launcher: Starting download process.
2024-06-20T10:13:46.674563Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-06-20T10:13:47.304041Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-06-20T10:13:47.304397Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-06-20T10:13:51.055838Z WARN text_generation_launcher: Could not import Flash Attention enabled models: `USE_FLASH_ATTENTION` is false.
2024-06-20T10:13:52.009743Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 71, in serve
from text_generation_server import server
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 17, in <module>
from text_generation_server.models.pali_gemma import PaliGemmaBatch
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/pali_gemma.py", line 5, in <module>
from text_generation_server.models.vlm_causal_lm import (
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/vlm_causal_lm.py", line 14, in <module>
from text_generation_server.models.flash_mistral import (
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 18, in <module>
from text_generation_server.models.custom_modeling.flash_mistral_modeling import (
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 30, in <module>
from text_generation_server.utils import paged_attention, flash_attn
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/flash_attn.py", line 13, in <module>
raise ImportError("`USE_FLASH_ATTENTION` is false.")
ImportError: `USE_FLASH_ATTENTION` is false.
rank=0
2024-06-20T10:13:52.107484Z ERROR text_generation_launcher: Shard 0 failed to start
2024-06-20T10:13:52.107511Z INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
ImportError: `USE_FLASH_ATTENTION` is false.
Expected behavior
The server should run normally on the P40 with flash attention disabled.
Help please, thanks!
Hey @ltm920716, thanks for the report!
I believe this was fixed by PR #1986 from @Narsil.
With that fix, you'll get this warning instead of a crash:
WARN text_generation_launcher: Could not import Flash Attention enabled models: `USE_FLASH_ATTENTION` is false.
I believe the fix hasn't shipped in a tagged release yet, but you can use the Docker image built from that PR's commit:
ghcr.io/huggingface/text-generation-inference:sha-06edde9
The resulting command would be:
sudo nerdctl run -it --gpus '"device=1"' -e USE_FLASH_ATTENTION=False -p 8080:80 -v /data/models:/data ghcr.io/huggingface/text-generation-inference:sha-06edde9 --model-id /data/Meta-Llama-3-8B-Instruct
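For context, the crash above happens because the import chain (`server.py` → `pali_gemma.py` → … → `flash_attn.py`) pulls in the flash-attention modules unconditionally, so the `ImportError` raised when `USE_FLASH_ATTENTION` is false takes down the whole shard instead of triggering a fallback. Below is a minimal sketch of the guarded-import pattern that turns this into a warning; the function names are hypothetical stand-ins, not TGI's actual code:

```python
import os
import warnings

def load_flash_attention():
    """Mimic flash_attn.py: refuse to load when explicitly disabled."""
    if os.environ.get("USE_FLASH_ATTENTION", "true").lower() == "false":
        raise ImportError("`USE_FLASH_ATTENTION` is false.")
    # Stand-in for the real flash-attention kernel import.
    return "flash-attention kernels"

def load_attention_backend():
    """Guarded import: warn and fall back instead of crashing the shard."""
    try:
        return load_flash_attention()
    except ImportError as err:
        warnings.warn(
            f"Could not import Flash Attention enabled models: {err}"
        )
        # Stand-in for the non-flash (CausalLM) model classes.
        return "non-flash fallback"

if __name__ == "__main__":
    os.environ["USE_FLASH_ATTENTION"] = "False"
    print(load_attention_backend())  # falls back instead of raising
```

The point of the pattern is that the `ImportError` is caught at the one place that decides which backend to use, so disabling flash attention on pre-Ampere cards like the P40 degrades gracefully.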
Thanks a lot, I'll give it a try!